Table 1 Corpus summary

From: Evaluation of different machine learning approaches and input text representations for multilingual classification of tweets for disease surveillance in the social web

| Language | #Speakers (millions) | #Territories spoken | Language family | #Tweets | #Retained | Yield | Split: % Positive | Split: % Negative |
|---|---|---|---|---|---|---|---|---|
| English | 510 | >50 | Indo-European | 13,004 | – | – | 0.57 | 0.43 |
| French | 270 | >30 | Indo-European | 510 | 399 | 0.78 | 0.33 | 0.67 |
| German | 220 | >10 | Indo-European | 547 | 394 | 0.73 | 0.36 | 0.64 |
| Spanish | 420 | >20 | Indo-European | 523 | 329 | 0.65 | 0.67 | 0.33 |
| Arabic | 255 | >30 | Afro-Asiatic | 503 | 283 | 0.57 | 0.30 | 0.70 |
| Japanese | 127 | 1 | Japonic | 553 | 394 | 0.72 | 0.28 | 0.72 |