
Table 5 Accuracy (in percentages) of each of the pre-processing techniques used for the extracted tweets

From: Investigating the impact of pre-processing techniques and pre-trained word embeddings in detecting Arabic health information on social media

 

| No. | Techniques used | MNB | Logistic regression | LinearSVC | KNN |
|---|---|---|---|---|---|
| | Baseline models | 86.0 | 84.0 | 84.0 | 77.6 |
| 1 | Remove non-Arabic letters | 85.4 − | 83.6 − | 82.8 − | 76.7 − |
| 2 | Remove numbers | 85.5 − | 82.9 − | 84.3 | 77.4 − |
| 3 | Remove usernames, external links, and hashtags | 85.2 − | 83.2 − | 83.4 − | 78.1 + |
| 4 | Remove punctuation | 86.0 | 84.0 | 84.0 | 77.6 |
| 5 | Remove diacritics | 86.0 | 83.6 | 83.8 − | 76.6 − |
| 6 | Remove repeated characters | 86.4 + | 84.3 + | 84.9 + | 79.2 + |
| 7 | Remove duplicate letters | 86.0 | 83.2 − | 84.1 + | 78.7 + |
| 8 | Remove Kashida | 86.3 + | 83.8 − | 84.6 + | 78.0 + |
| 9 | Replace أ, إ, and آ with ا | 85.8 − | 83.6 − | 84.1 + | 77.4 − |
| 10 | Replace ى with ي | 86.7 + | 84.0 | 84.6 + | 77.9 + |
| 11 | Replace ي and ئ with ى | 86.8 + | 84.0 | 84.2 + | 78.0 + |
| 12 | Replace ىء and ئ with ي | 86.0 | 83.0 − | 84.3 + | 77.8 + |
| 13 | Replace ؤ and ئ with ء | 85.8 − | 83.8 − | 83.9 | 77.7 + |
| 14 | Replace ئ with ى | 86.0 | 84.0 | 84.3 + | 77.6 |
| 15 | Replace ة with ه | 86.7 + | 83.8 − | 84.8 + | 77.1 − |
| 16 | Replace چ with ج | 86.0 | 84.0 | 84.0 | 77.6 |
| 17 | Replace ڤ with ف | 86.0 | 82.8 + | 84.0 | 77.6 |
| 18 | Replace ءى and ءي with ئ | 86.0 | 84.0 | 84.0 | 77.6 |
| 19 | Replace ص with س | 85.7 − | 83.7 − | 83.6 − | 78.0 + |
| 20 | Replace ض with ظ | 86.0 | 83.6 − | 84.0 | 77.9 + |
| 21 | Replace ؤ with و | 85.8 − | 82.8 − | 84.2 + | 77.6 |
| 22 | Replace كـ with ك | 86.0 | 84.0 | 84.2 + | 77.6 |
| 23 | Remove stop words | 85.2 − | 84.4 + | 83.4 − | 76.6 − |
| 24 | Light stemming | 86.6 + | 85.3 + | 86.2 + | 79.1 + |
| 25 | Root stemming | 84.4 − | 85.2 + | 85.1 + | 77.8 + |
| 26 | Lemmatization | 86.7 + | 86.2 + | 86.5 + | 80.1 + |

A plus sign (+) indicates that the technique improved the F1-score of the baseline model; a minus sign (−) indicates that the technique decreased the F1-score; cells without a sign indicate that the technique had no impact on the F1-score of the algorithm.
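
To make a few of the character-level techniques in the table concrete, the following is a minimal Python sketch of rows 5, 6, 8, 9, 10, and 15. It is not the authors' implementation: the function names are hypothetical, the diacritics range (U+064B to U+0652 plus U+0670) is an assumed definition of "diacritics", and collapsing runs of three or more identical characters to a single character is only one possible reading of "remove repeated characters". Each function applies a single technique, mirroring the table, where every technique is evaluated on its own against the baseline.

```python
import re

# Assumption: "diacritics" means the Arabic tashkeel marks U+064B..U+0652
# plus the superscript alef U+0670.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

# Assumption: a "repeated character" is any character occurring three or
# more times in a row; the run is collapsed to a single occurrence.
_REPEATS = re.compile(r"(.)\1{2,}")

_KASHIDA = "\u0640"  # tatweel, the elongation character


def remove_diacritics(text: str) -> str:
    """Row 5: strip Arabic diacritic marks."""
    return _DIACRITICS.sub("", text)


def remove_repeated_characters(text: str) -> str:
    """Row 6: collapse runs of 3+ identical characters to one."""
    return _REPEATS.sub(r"\1", text)


def remove_kashida(text: str) -> str:
    """Row 8: delete the tatweel (kashida) character."""
    return text.replace(_KASHIDA, "")


def normalize_alef(text: str) -> str:
    """Row 9: map alef with hamza above/below and alef madda to bare alef."""
    return re.sub("[\u0623\u0625\u0622]", "\u0627", text)


def alef_maqsura_to_ya(text: str) -> str:
    """Row 10: replace alef maqsura (U+0649) with ya (U+064A)."""
    return text.replace("\u0649", "\u064A")


def ta_marbuta_to_ha(text: str) -> str:
    """Row 15: replace ta marbuta (U+0629) with ha (U+0647)."""
    return text.replace("\u0629", "\u0647")
```

Any one of these functions could be applied to each tweet before feature extraction, and the classifier's score then compared with the baseline row to obtain the +/− marker for that technique.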