
Table 1 Pre-trained word embedding models

From: Investigating the impact of pre-processing techniques and pre-trained word embeddings in detecting Arabic health information on social media

| Pre-trained word embedding | Training data | Sources | Techniques | Availability | Pre-processing |
|---|---|---|---|---|---|
| fastText [45] | 400 million tokens from Wikipedia and 24 terabytes of raw text from Common Crawl | Common Crawl and Wikipedia | CBOW with subword (character n-gram) information | Open | Tokenization only |
| AraVec [34] | 66.9 million tweets and 320,636 Wikipedia documents | Twitter and Wikipedia | CBOW and Skip-Gram with unigram and n-gram models | Open | Remove non-Arabic letters; replace ة with ه; normalize alef; remove duplicates; normalize mentions, URLs, and emojis |
| Mazajak [46] | 250 million tweets | Twitter | CBOW and Skip-Gram with different n-grams | Open | Remove URLs, tashkeel, emojis, and punctuation |
| ArWordVec [43] | 55 million tweets | Twitter | CBOW and Skip-Gram | Open | Normalize mentions and URLs; remove tashkeel and punctuation; normalize bare alef; replace ى with ي, ؤ with ء, ئ with ء, and ة with ه |
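The normalization steps in the Pre-processing column follow a common recipe for Arabic social-media text. The Python sketch below illustrates those steps (URL and mention removal, tashkeel stripping, alef/ta-marbuta/hamza unification); the regular expressions, character map, and function name are illustrative assumptions, not the code released with any of the cited models.

```python
import re

# Arabic diacritics (tashkeel) plus tatweel
TASHKEEL = re.compile(r"[\u0617-\u061A\u064B-\u0652\u0640]")
URL = re.compile(r"https?://\S+|www\.\S+")
MENTION = re.compile(r"@\w+")

# Character-level normalizations of the kind listed in Table 1 (assumed mapping)
CHAR_MAP = str.maketrans({
    "أ": "ا", "إ": "ا", "آ": "ا",  # normalize alef variants to bare alef
    "ة": "ه",                      # ta marbuta -> ha
    "ى": "ي",                      # alef maqsura -> ya
    "ؤ": "ء",                      # waw with hamza -> hamza
    "ئ": "ء",                      # ya with hamza -> hamza
})

def normalize_arabic(text: str) -> str:
    """Apply the kind of normalization described in the Pre-processing column."""
    text = URL.sub(" ", text)        # drop URLs
    text = MENTION.sub(" ", text)    # drop @mentions
    text = TASHKEEL.sub("", text)    # strip diacritics and tatweel
    text = text.translate(CHAR_MAP)  # unify letter variants
    # keep Arabic letters and whitespace only (cf. "remove non-Arabic letters")
    text = re.sub(r"[^\u0621-\u064A\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize_arabic("الصحّة مهمة جداً @user http://example.com"))
```

Note that each embedding model applies only a subset of these steps, so text should be normalized to match the conventions of the specific pre-trained model being used.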
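All four models are openly released, so they can be loaded with standard tooling. A minimal sketch using gensim is shown below, assuming the vectors have already been downloaded locally; the file names are hypothetical placeholders, and the exact loading call depends on the format each project distributes (Facebook's native fastText binaries versus gensim/word2vec files).

```python
from gensim.models import KeyedVectors
from gensim.models.fasttext import load_facebook_vectors

# Hypothetical local paths; actual file names depend on the release you download.
FASTTEXT_PATH = "cc.ar.300.bin"                    # fastText Arabic vectors (Common Crawl + Wikipedia)
TWEET_VECTORS_PATH = "arabic_tweets_cbow_300.kv"   # e.g. a tweet-trained model saved as gensim KeyedVectors

# fastText: character n-gram subwords give vectors even for out-of-vocabulary tokens.
ft = load_facebook_vectors(FASTTEXT_PATH)
oov_vector = ft["الصحه"]

# Word-level CBOW/Skip-Gram models (AraVec, Mazajak, ArWordVec): plain lookup tables
# over the training vocabulary, queried after the same normalization used at training time.
kv = KeyedVectors.load(TWEET_VECTORS_PATH)
print(kv.most_similar("الصحه", topn=5))
```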