Pre-trained word embedding | Corpus size | Sources | Techniques | Availability | Pre-processing |
---|---|---|---|---|---|
fastText [45] | 400 million tokens from Wikipedia and 24 terabytes of raw text from Common Crawl | Common Crawl and Wikipedia | CBOW with subword (character n-gram) information | Open | Tokenization only |
AraVec [34] | 66.9 million tweets and 320,636 Wikipedia documents | Twitter and Wikipedia | CBOW and Skip-Gram with unigram and n-gram features | Open | Remove non-Arabic letters; replace ة with ه; normalize alef; remove duplicates; normalize mentions, URLs, and emojis |
Mazajak [46] | 250 million tweets | Twitter | CBOW and Skip-Gram with different n-grams | Open | Remove URLs, tashkeel, emojis, and punctuation |
ArWordVec [43] | 55 million tweets | Twitter | CBOW and Skip-Gram | Open | Normalize mentions and URLs; remove tashkeel and punctuation; normalize alef variants to bare alef; replace ى with ي, ؤ with ء, ئ with ء, and ة with ه |
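
The Techniques column distinguishes CBOW, Skip-Gram, and fastText's subword extension. The following is a minimal gensim sketch of the three variants; the toy corpus and hyperparameters are illustrative only and are not those of the cited models:

```python
from gensim.models import Word2Vec, FastText

# Toy tokenized corpus (the cited models use millions of tweets or Wikipedia articles).
corpus = [["مرحبا", "بالعالم"], ["تعلم", "تمثيل", "الكلمات"]]

# CBOW (sg=0) vs. Skip-Gram (sg=1), the two architectures used by
# AraVec, Mazajak, and ArWordVec.
cbow = Word2Vec(corpus, vector_size=100, window=5, sg=0, min_count=1)
skipgram = Word2Vec(corpus, vector_size=100, window=5, sg=1, min_count=1)

# fastText's subword technique: character n-grams (here 3-6) let the model
# compose vectors for out-of-vocabulary words from their character pieces.
ft = FastText(corpus, vector_size=100, sg=0, min_count=1, min_n=3, max_n=6)
vec = ft.wv["مرحبا"]  # look up a word vector
```

The practical consequence of the subword variant is that a query for a misspelled or unseen Arabic word still returns a vector, whereas plain CBOW/Skip-Gram models raise a key error.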
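To make the Pre-processing column concrete, here is a minimal Python sketch of the normalization steps shared by AraVec and ArWordVec. The placeholder tokens `URL` and `USER` and the exact regular expressions are assumptions for illustration, not the papers' released code:

```python
import re

TASHKEEL = re.compile(r"[\u064B-\u0652]")        # Arabic diacritics (fathatan ... sukun)
URL      = re.compile(r"https?://\S+|www\.\S+")
MENTION  = re.compile(r"@\w+")

def normalize_arabic(text: str) -> str:
    """Sketch of the tweet-normalization steps listed in the table."""
    text = URL.sub(" URL ", text)       # normalize URLs to a placeholder token (assumed)
    text = MENTION.sub(" USER ", text)  # normalize @mentions to a placeholder (assumed)
    text = TASHKEEL.sub("", text)       # remove tashkeel
    text = re.sub("[إأآ]", "ا", text)   # normalize alef variants to bare alef
    text = text.replace("ى", "ي")       # replace alef maqsura with ya
    text = text.replace("ؤ", "ء")       # replace waw hamza with hamza
    text = text.replace("ئ", "ء")       # replace ya hamza with hamza
    text = text.replace("ة", "ه")       # replace ta marbuta with ha
    return text

print(normalize_arabic("الدّراسةُ مفيدة https://t.co/x @user"))
```

Such normalization collapses orthographic variants into a single surface form, which shrinks the vocabulary and pools training counts for what is effectively the same word.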