Table 1 Corpus summary

From: Evaluation of different machine learning approaches and input text representations for multilingual classification of tweets for disease surveillance in the social web

Language #Speakers (millions) #Territories spoken Language family #Tweets #Retained Yield Split: % Positive Split: % Negative
English 510  > 50 Indo-European 13,004 0.57 0.43
French 270  > 30 Indo-European 510 399 0.78 0.33 0.67
German 220  > 10 Indo-European 547 394 0.73 0.36 0.64
Spanish 420  > 20 Indo-European 523 329 0.65 0.67 0.33
Arabic 255  > 30 Afro-Asiatic 503 283 0.57 0.30 0.70
Japanese 127 1 Japonic 553 394 0.72 0.28 0.72