Investigating the impact of pre-processing techniques and pre-trained word embeddings in detecting Arabic health information on social media

Journal of Big Data

Table 5 Accuracy (in percentages) of each of the pre-processing techniques used for the extracted tweets

No.	Techniques used	MNB	Logistic regression	LinearSVC	KNN
	Baseline models	86.0	84.0	84.0	77.6
1	Remove non-Arabic letters	85.4 −	83.6 −	82.8 −	76.7 −
2	Remove numbers	85.5 −	82.9 −	84.3	77.4 −
3	Remove usernames, external links, and hashtags	85.2 −	83.2 −	83.4 −	78.1 +
4	Remove punctuation	86.0	84.0	84.0	77.6
5	Remove diacritics	86.0	83.6	83.8 −	76.6 −
6	Remove repeated characters	86.4 +	84.3 +	84.9 +	79.2 +
7	Remove duplicate letters	86.0	83.2 −	84.1 +	78.7 +
8	Remove Kashida	86.3 +	83.8 −	84.6 +	78.0 +
9	Replace أ,إ, and آ with ا	85.8 −	83.6 −	84.1 +	77.4 −
10	Replace ى with ي	86.7 +	84.0	84.6 +	77.9 +
11	Replace ي and ئ with ى	86.8 +	84.0	84.2 +	78.0 +
12	Replace ىء and ئ with ي	86.0	83.0 −	84.3 +	77.8 +
13	Replace ؤ and ئ with ء	85.8 −	83.8 −	83.9	77.7 +
14	Replace ئ with ى	86.0	84.0	84.3 +	77.6
15	Replace ة with ه	86.7 +	83.8 −	84.8 +	77.1 −
16	Replace چ with ج	86.0	84.0	84.0	77.6
17	Replace ڤ with ف	86.0	82.8 +	84.0	77.6
18	Replace ءى and ءي with ئ	86.0	84.0	84.0	77.6
19	Replace ص with س	85.7 −	83.7 −	83.6 −	78.0 +
20	Replace ض with ظ	86.0	83.6 −	84.0	77.9 +
21	Replace ؤ with و	85.8 −	82.8 −	84.2 +	77.6
22	Replace كـ with ك	86.0	84.0	84.2 +	77.6
23	Remove stop words	85.2 −	84.4 +	83.4 −	76.6 −
24	Light stemming	86.6 +	85.3 +	86.2 +	79.1 +
25	Root stemming	84.4 −	85.2 +	85.1 +	77.8 +
26	Lemmatization	86.7 +	86.2 +	86.5 +	80.1 +

A plus sign (+) indicates that the technique improved the F₁-score of the baseline model; a minus sign (−) indicates that the technique decreased the F₁-score; cells without a sign indicate that the technique had no impact on the F₁-score of the algorithm.
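
To make a few of the character-level techniques in Table 5 concrete, the following is a minimal Python sketch of techniques 5 (remove diacritics), 8 (remove Kashida), 9 (replace أ, إ, and آ with ا), 10 (replace ى with ي), and 15 (replace ة with ه). The function names, the `normalize` helper, and the exact Unicode range used for diacritics are assumptions made here for illustration; they are not taken from the paper, and the authors' implementation may differ.

```python
import re

# Assumption: "diacritics" means the tashkeel marks U+064B..U+0652
# (fathatan through sukun) plus the superscript alef U+0670.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
KASHIDA = "\u0640"  # tatweel, the elongation character


def remove_diacritics(text: str) -> str:
    """Technique 5: strip short-vowel and related marks."""
    return DIACRITICS.sub("", text)


def remove_kashida(text: str) -> str:
    """Technique 8: strip the elongation character."""
    return text.replace(KASHIDA, "")


def normalize_alef(text: str) -> str:
    """Technique 9: map alef with hamza above/below or madda to bare alef."""
    return re.sub("[\u0623\u0625\u0622]", "\u0627", text)


def normalize_alef_maqsura(text: str) -> str:
    """Technique 10: map alef maqsura (U+0649) to ya (U+064A)."""
    return text.replace("\u0649", "\u064A")


def normalize_ta_marbuta(text: str) -> str:
    """Technique 15: map ta marbuta (U+0629) to ha (U+0647)."""
    return text.replace("\u0629", "\u0647")


def normalize(text: str, steps) -> str:
    """Hypothetical helper: apply the chosen techniques in the given order."""
    for step in steps:
        text = step(text)
    return text
```

Each function corresponds to one row of the table, so a single technique can be switched on in isolation and compared against the baseline, or several can be chained by passing a list such as `[remove_diacritics, remove_kashida, normalize_alef]` to `normalize`.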