Evaluation of different machine learning approaches and input text representations for multilingual classification of tweets for disease surveillance in the social web

Journal of Big Data

Table 1 Corpus summary

Language	#Speakers (millions)	#Territories spoken	Language family	#Tweets	#Retained	Yield	Split: % Positive	Split: % Negative
English	510	> 50	Indo-European	13,004	–	–	0.57	0.43
French	270	> 30	Indo-European	510	399	0.78	0.33	0.67
German	220	> 10	Indo-European	547	394	0.73	0.36	0.64
Spanish	420	> 20	Indo-European	523	329	0.65	0.67	0.33
Arabic	255	> 30	Afro-Asiatic	503	283	0.57	0.30	0.70
Japanese	127	1	Japonic	553	394	0.72	0.28	0.72