Table 4 Pre-trained model features

From: Text based personality prediction from multiple social media data sources using pre-trained language model and model averaging

| Pre-trained model | Description | Max sequence length used |
|---|---|---|
| BERT | Trained on a very large corpus combining text from Wikipedia (2.5 billion words) and a book corpus of 800 million words. The architecture consists of 12 encoder layers, 768 hidden units, and 12 attention heads [13]. | 512 |
| RoBERTa | An extension of BERT that adds to BERT's 16 GB of Wikipedia and book data further corpora, including the CommonCrawl News dataset (63 million articles, 76 GB), a web text corpus (38 GB), and Stories from Common Crawl (31 GB). The architecture is the same as BERT's [27]. | 512 |
| XLNet | Another development of BERT. It introduces permutation language modeling, in which all tokens are predicted, but in random order; BERT's masked language model instead predicts only the 15% of tokens that are masked. The number of layers, hidden units, and attention heads remains the same as in BERT [40]. | 512 |
| Total pre-trained model features | | 1536 |
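The table lists three encoders that each use 768 hidden units, whose pooled features feed separate classifiers combined by model averaging. Below is a minimal sketch of that averaging step, with randomly generated placeholder vectors standing in for the actual pre-trained model features and classifier weights (the variable names and the five-class output are illustrative assumptions, not the paper's exact pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder stand-ins for pooled, per-text features from each encoder.
# Each model in the table uses 768 hidden units, so one pooled
# representation per text is a 768-dimensional vector.
HIDDEN = 768
bert_feat = rng.normal(size=HIDDEN)
roberta_feat = rng.normal(size=HIDDEN)
xlnet_feat = rng.normal(size=HIDDEN)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Model averaging: each per-encoder classifier (random weights here,
# purely illustrative) produces class probabilities; the ensemble
# output is their element-wise mean.
N_CLASSES = 5  # e.g., one output per personality trait (assumption)
probs = []
for feat in (bert_feat, roberta_feat, xlnet_feat):
    W = rng.normal(size=(N_CLASSES, HIDDEN)) * 0.01
    probs.append(softmax(W @ feat))
avg_prob = np.mean(probs, axis=0)

print(avg_prob.shape)  # one averaged probability vector over 5 classes
```

Because each per-model output is a probability distribution, their mean is also a valid distribution, which is what makes simple averaging a safe ensemble rule here.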