Pre-trained word embedding | Corpus size | Sources | Techniques | Availability | Pre-processing |
---|---|---|---|---|---|
fastText [45] | 400 million tokens from Wikipedia and 24 terabytes of raw text from Common Crawl | Common Crawl and Wikipedia | CBOW with subword (character n-gram) information | Open | Tokenization only |
AraVec [34] | 66.9 million tweets and 320,636 Wikipedia documents | Twitter and Wikipedia | CBOW and Skip-Gram with unigram and n-gram features | Open | Remove non-Arabic letters; replace ة with ه; normalize alef; remove duplicates; normalize mentions, URLs, and emojis |
Mazajak [46] | 250 million tweets | Twitter | CBOW and Skip-Gram with different n-grams | Open | Remove URLs, tashkeel, emojis, and punctuation |
ArWordVec [43] | 55 million tweets | Twitter | CBOW and Skip-Gram | Open | Normalize mentions and URLs; remove tashkeel and punctuation; normalize alef variants to bare alef; replace ى with ي, ؤ with ء, ئ with ء, and ة with ه |
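
The Techniques column distinguishes CBOW, Skip-Gram, and fastText's subword extension. The following is a minimal gensim sketch of the three variants; the toy corpus and hyperparameters are illustrative only and are not those of the cited models:

```python
from gensim.models import Word2Vec, FastText

# Toy tokenized corpus (the cited models use millions of tweets or Wikipedia articles).
corpus = [["مرحبا", "بالعالم"], ["تعلم", "تمثيل", "الكلمات"]]

# CBOW (sg=0) vs. Skip-Gram (sg=1), the two architectures used by
# AraVec, Mazajak, and ArWordVec.
cbow = Word2Vec(corpus, vector_size=100, window=5, sg=0, min_count=1)
skipgram = Word2Vec(corpus, vector_size=100, window=5, sg=1, min_count=1)

# fastText's subword technique: character n-grams (here 3-6) let the model
# compose vectors for out-of-vocabulary words from their character pieces.
ft = FastText(corpus, vector_size=100, sg=0, min_count=1, min_n=3, max_n=6)
vec = ft.wv["مرحبا"]  # look up a word vector
```

The practical consequence of the subword variant is that a query for a misspelled or unseen Arabic word still returns a vector, whereas plain CBOW/Skip-Gram models raise a key error.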
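To make the Pre-processing column concrete, here is a minimal Python sketch of the normalization steps shared by AraVec and ArWordVec. The placeholder tokens `URL` and `USER` and the exact regular expressions are assumptions for illustration, not the papers' released code:

```python
import re

TASHKEEL = re.compile(r"[\u064B-\u0652]")        # Arabic diacritics (fathatan ... sukun)
URL      = re.compile(r"https?://\S+|www\.\S+")
MENTION  = re.compile(r"@\w+")

def normalize_arabic(text: str) -> str:
    """Sketch of the tweet-normalization steps listed in the table."""
    text = URL.sub(" URL ", text)       # normalize URLs to a placeholder token (assumed)
    text = MENTION.sub(" USER ", text)  # normalize @mentions to a placeholder (assumed)
    text = TASHKEEL.sub("", text)       # remove tashkeel
    text = re.sub("[إأآ]", "ا", text)   # normalize alef variants to bare alef
    text = text.replace("ى", "ي")       # replace alef maqsura with ya
    text = text.replace("ؤ", "ء")       # replace waw hamza with hamza
    text = text.replace("ئ", "ء")       # replace ya hamza with hamza
    text = text.replace("ة", "ه")       # replace ta marbuta with ha
    return text

print(normalize_arabic("الدّراسةُ مفيدة https://t.co/x @user"))
```

Such normalization collapses orthographic variants into a single surface form, which shrinks the vocabulary and pools training counts for what is effectively the same word.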