Text based personality prediction from multiple social media data sources using pre-trained language model and model averaging

The ever-increasing social media users has dramatically contributed to significant growth as far as the volume of online information is concerned. Often, the contents that these users put in social media can give valuable insights on their personalities (e.g., in terms of predicting job satisfaction, specific preferences, as well as the success of professional and romantic relationship) and getting it without the hassle of taking formal personality test. Termed personality prediction, the process involves extracting the digital content into features and mapping it according to a personality model. Owing to its simplicity and proven capability, a well-known personality model, called the big five personality traits, has often been adopted in the literature as the de facto standard for personality assessment. To date, there are many algorithms that can be used to extract embedded contextualized word from textual data for personality prediction system; some of them are based on ensembled model and deep learning. Although useful, existing algorithms such as RNN and LSTM suffers from the following limitations. Firstly, these algorithms take a long time to train the model owing to its sequential inputs. Secondly, these algorithms also lack the ability to capture the true (semantic) meaning of words; therefore, the context is slightly lost. To address these aforementioned limitations, this paper introduces a new prediction using multi model deep learning architecture combined with multiple pre-trained language model such as BERT, RoBERTa, and XLNet as features extraction method on social media data sources. Finally, the system takes the decision based on model averaging to make prediction. Unlike earlier work which adopts a single social media data with open and close vocabulary extraction method, the proposed work uses multiple social media data sources namely Facebook and Twitter and produce a predictive model for each trait using bidirectional context feature combine with extraction method. Our experience with the proposed work has been encouraging as it has outperformed similar existing works in the literature. More precisely, our results achieve a maximum accuracy of 86.2% and 0.912 f1 measure score on the Facebook dataset; 88.5% accuracy and 0.882 f1 measure score on the Twitter dataset.


Introduction
In recent years, information growth has proliferated in accelerating pace in line with the advent of social media especially in the form of textual data types. According to the Social Media Trend report published in [39], there are 3.8 billion active users of social media in the world as of January 2020, with a projected increase of 9.2% of users each year. Often, people use social media to express themselves on certain issues related to their lives and family well beings, psychology, financial issues, interaction with societies and environment, as well as politics. In some cases, these expressions can be used to characterize the individual behavior and personality. In fact, earlier studies (e.g. [4,11,16,18,24,25]) demonstrate that there is a strong correlation between user personalities and their online behavior on social media. Some examples of applications that can take advantage from the user personality information include recruitment systems, personal counseling systems, online marketing, personal recommendation systems, and bank credit scoring systems to name a few [5,12].
Owing to the inherent ambiguities of natural languages, developing an effective personality prediction model based on the textual message that user shares on social media can be a painstakingly difficult task. Dealing with these ambiguities, much progress has been achieved in the field of Natural Language Processing (NLP). To-date, NLP has enabled computers to understand words or sentences written in human language [6]. Linguistic concepts such as part-of-speech (nouns, verbs, adjectives) and grammatical structures are usually used in NLP [35]. Apart from part-of-speech and structural grammar, NLP is also able to deal with the anaphors and ambiguities that often arise in a language via knowledge representations such as a dictionary of words and their meanings, sentence structure, grammar rules, and other information such as synonyms or abbreviations [24].
Automatic personality prediction has become a widely discussed topic for researchers in the NLP community. References [13,29] shows that personality can be defined as a pattern of influence or personality used to characterize unique individuals. The existing personality prediction exploited deep learning and machine learning algorithm along with open vocabulary feature extraction to improve classification accuracy. However, this approach has limitations to extract contextual features in the sentence due to limitedness of computational algorithm and out of vocabulary problem by using pre-defined corpus. Moreover, the small number dataset used in building personality prediction system especially using deep learning algorithm to be the main obstacle to maximize the model performance [5,29,37]. Addressing the aforementioned issues, this paper proposes a multi model deep learning architecture which build on top of different pre-trained language model called Bidirectional Encoder from Transformer (BERT), A Robustly Optimized BERT Pretraining Approach (RoBERTa), and XLNet a Generalized Autoregressive Pretraining for Language Understanding to capture the contextual meaning of a text-based data from social media. Later, the text-based data will be added with another additional NLP Features such as Sentiment Analysis, Term Frequency-Inverse Gravity Moment (TF-IGM), and National Research Council (NRC) Emotion Lexicon Database as features to build a multi model deep learning architecture to predict the personality traits. The contributions of this work can be summarized as follows: • We proposed a multi model deep learning architecture with a pre-trained language model BERT, RoBERTa, and XLNet; along with additional NLP Features (sentiment analysis, TF-IGM, NRC emotion lexicon database) as features extraction method for personality prediction system. • Unlike the other approach, we also proposed combining multiple sources of social media data to increase the number of datasets for better classification. • We evaluate the performance of the model built and compare it with other previous studies algorithm that give the best performance in predicting personality. • We show that our methods enable to produce better performance compare to the previous study in predicting personality traits.

Related works
Personality prediction using Facebook and Twitter dataset is not new. For example, research conducted by [18,37,38,40] used an open-source Facebook personality dataset called MyPersonality which consists of 250 users with their status data and traits, and maps to big five personality model. Prevalent feature extraction method called Linguistic Inquiry and Word Count (LIWC), which is a linguistic analytical tool that helps in analyzing quantitative texts and provides a calculation number of words that have the meaning of categories based on a psychological dictionary is used as the main feature extraction method. The use of these analytical tools is increasingly popular as can be seen from the use of these methods in line with research conducted in the last two years. In addition, Social Network Analysis (SNA) is a technique in analyzing social structures that arise from a combination of people in a specific population and the interactions that occur Fsoftin that population. These features are available in the MyPersonality dataset. However, there are differences for the research carried out [26], use of a dictionary called Structured Programming for Linguistic Cue Extraction (SPLICE) as a method for performing feature extraction. By using the dictionary features such as positive or negative evaluation of the speaker, a value for the complexity and readability of a text can be generated. After that, the collected features will be compared with the two approaches using machine learning algorithms such as Support Vector Machine (SVM), Linear Discriminant Analysis (LDA) and deep learning architecture such as Convolutional Neural Network (CNN). However, the resulting performance is still low in several personality models, namely in the range of 60%-75% accuracy score. This is caused by due to small number of dataset used in this study to capture much more contextual information in creating generalized model. Meanwhile, another study using Twitter dataset in Bahasa which were carried out by [3,19,31] used a different algorithm in building the personality prediction model. In feature extraction methods, the researcher was assessing the tendency user choice of words by using n-gram and LIWC. The implementation of close vocabulary method such as Term Frequency-Inverse Document Frequency (TF-IDF) is also applied in this study to show relationship between the main keywords discuss in their social media status data with their personality. This method used to filter out the least important words in the document and select the main topic in the sentences [9]. Another reliable open vocabulary feature extraction method called National Research Council (NRC) emotion lexicon database was also introduced in the previous study. This corpus created by National Research Council Canada with about 14,000 words in English along with the association of these word associations with eight common emotions, namely anger, fear, anticipation, trust, surprise, sadness, joy, and disgust and the sentiments of each of these words which can be positive or negative [15]. Moreover, to apply such open vocabulary feature extraction method, the dataset is translated into English first before the feature extraction method is carried out. The algorithm used in these experiments is the ensemble method approach, namely stacking, boosting, and bagging, resulting in increased accuracy from each previous experiment. From this experiment a high accuracy value of around 97.9% is generated using this Twitter dataset. However, the researcher state that there are biases in the result due to the extremely small size of the dataset after sampling. The approach used can result in removing the contextual meaning of the sentence contained in the social media data.
In terms of the latest technology, the use of deep learning has been widely applied to improve performance in predicting a person's personality. As in the experiment conducted by [10,17,23,41] using another dataset, namely personality Café, where there are differences of personality modeling, called Myers Briggs Type Indicator (MBTI) approach. The deep learning architectures such as Long Short-Term Memory (LSTM) and Transformer capable in producing a high performance of MBTI personality modeling with the maximum accuracy of 86.9%. Moreover, the use of pre-trained embedding has begun to be widely used in research to detect personality traits whereas in research [20,22] use pre-trained model such as BERT and RoBERTa as feature embedding to create the model for Big Five personality model. The result from this research shows 67% and 60% accuracy. The obstacle of the two studies lies in the small amount of data so that additional data is needed for further research development. On the other hand, the use of pre-trained models is used to solve other NLP problems such as text based emoticon classification [2] and toxic comment classification [30] using RoBERTa and XLNet shows improvement in accuracy when added with other NLP Features such as TF-IDF and sentiment analysis.
Compared to all these approaches, this research has concentrated to capture personality of a person from multiple social media data Facebook and Twitter through a combination of deep learning architectures with model averaging. Researcher also used NLP features as additional features to deep learning architecture, obtained from psycho-linguistic and basic linguistic features.

Methodology
By focusing on utilizing social media data Facebook and Twitter this research will be carried out in three stages, namely initiation, model development, and model evaluation. The details related to each stage can be seen in Fig. 1.
At the initiation stage, data collection was carried out to increase the amount of Twitter data that had been collected from previous studies [3,19,27]. Twitter data that has been collected manually will be annotated with the help of a psychological expert to define the personality of each Twitter user. On the other hand, the Facebook dataset will use an open-source dataset called MyPersonality.
Each dataset used in this study uses the Big Five personality traits modeling approach to classify a person's personality. Each of these personalities is related to several characteristics that a person will have. Table 1 describes some of the characteristics of each dimension.
A person can have two types of values in each dimension. If someone who has a high personality dimension will be represented by the number one, and other will be represented by the number zero in the dataset. This label will later become a variable predictor in the model to be built. Furthermore, all data that have been collected will be preprocessed separately due to differences in the language used in each dataset. The results of the preprocessed data will be carried out by the feature extraction and feature selection process before entering the model building stage. Each personality based on five personality traits will be made a model that aims to predict each personality.

Data
The first dataset called the MyPersonality dataset, which consists of 250 users with a total of 9917 statuses. This dataset is collected through a Facebook app in 2007, allowing users to participate in psychological research by filling in the personality questionnaire [7]. The second dataset is an expanded dataset from previous research done by [27], which is a manually collected twitter data in Bahasa Indonesia. In this extended version, the dataset is added with new manually collected data resulting in a total of 502  Quiet, emotionally controlled, comfortable, self-controlled users with 46,238 statuses. The addition of this twitter data was collected using the Twitter API. The same as previous research, the collected data will be annotated by psychology expert.
In addition, all datasets will be separated into three types, namely training set, test set, and validation set where the ratio of the distribution is 70% training set and 15% test and 15% validation. The data distribution for each dataset can be seen in Tables 2 and 3.

Preprocessing
All datasets will be preprocessed before feature extraction is carried out. The main purpose of preprocessing is to maximize the extracted features, hence more contextual features generated and normalized both datasets, since Twitter dataset written in Bahasa while Facebook dataset in English. The flow of the initial processes for both datasets is illustrated in Fig. 2.
In general, the two datasets are carried out the same preprocess. All data that has been collected will be removed from the use of URLs, symbols, and emoticons contained in social media status. Next, expanding a contraction in the sentence such as the use of I've become I have. After that, each sentence will be normalized by changing it to lowercase. Furthermore, any stopwords and clitics will be removed to prevent ambiguities. This list of words then will be processed using stemming function to normalize words by removing affixes to make sure that the resulting form is a known word in a dictionary. This preprocessing is carried out using the help of the NLTK library, which provides several linguistic functions to assist in cleansing social media status data such as tokenization, stemming, and stopwords dictionary. However, there will be an additional step during Twitter data preprocess, which is the translation process from Bahasa to English. In this  research this process is done by using Google translate API in translating Twitter status data.

Feature extraction
Researchers have recognized a novel combination of features which are profoundly compelling in aggression classification when applied in addition to the features obtained from the deep learning classifier at the classification layer. In this study, the researchers divide the feature extraction method into two types, pre-trained model features and statistical features. As mention before the use of pre-trained model include BERT, RoBERTa, and XLNet. These pre-trained models are different from language representation modeling in general, where this architecture is designed to do the initial modeling of two-way representations in the unlabeled text by combining the context of each token in sentences from left to right and from right to left on each layer [33]. For these predefined models to be able to extract the context from a sentence, several preparations must be made to meet the necessary requirements. The following Fig. 3 visualize the step of feature extraction process using pre-trained models.
First, an example of social media status will be added by using a special token at the beginning and end of the sentence, namely [CLS] (stands for classification) and [SEP] (stands for separation). The purpose of these tokens is to serve as an input representation for classification tasks and to separate a pair of input texts respectively. Next, each word in a sentence will be tokenized which later become a sequence of words token. The tokenization is done using a method called WordPiece tokenization. This is a data-driven tokenization method that aims to achieve a balance between vocabulary size and out-of-vocab words. Each word that has been tokenized will be mapped with a WordPiece vocabulary. Each of pre-trained model has their own corpus dimension size for example BERT consist of 30,522 words, RoBERTa 50,257 words and XLNet 320,000 words. Each of these words represent in a 768 fixed dimension in a vector representation. In the example Fig. 3 the example social media status will be converted into a 12 tokens representation with each token consist of 768 lengths which is called token embedding. Before continuing to model building process this embedding will be added with another embedding layers called segment embedding and positional embedding to provide more contextual meaning to the model. The segment embeddings layer only has two vector representations. The first vector (index Fig. 3 Pre-trained models feature extraction 0) is assigned to all tokens included in the first input while the last vector (index 1) is assigned to all tokens included in the second input. If the input consists of only one input sentence, then the segment embedded will only be the corresponding vector with index zero from the segment embeddings table. On the other hand, positional embedding layer is designed as a lookup table of sizes (n, 768) where n represent the number of length sentences. The first row is a vector representation of any word in the first position, the second row is a vector representation of each word in the second position and so on. The combination of those three embeddings called input embedding, act as a solution to overcome the limitations of architectures deep learning other such as the RNN which cannot capture sequence information, the combination of the three embeddings makes pre-trained model adaptable to NLP problems [12]. Table 4 describe list of pre-trained model along with maximum sequence length used to build the model, and references used to obtain them.
In terms of statistical features, this research uses different approaches compare to the previous research [37] which use TF-IDF as term weighting factor, instead TF-IGM is introduced in research. TF-IGM combine a new statistical model to precisely measure the weight of each class in text. The weight states the importance or word contribution to the class of documents. Furthermore, this method is able to separate label classes in a textual data, especially for data that has more than one label. Hence, this method is very suitable for use in personality prediction which allows a person to have more than one personality. The TF-IGM value can be calculated by looking for the TF value and the IGM value. TF represents the weight of a word, where how many words appear in a document. Meanwhile, IGM is useful for measuring the strength of a word in distinguishing between one class and another. The calculation of TF and IGM values can be described in the following formula: represent an adjustable coefficient, which use to keep the relative balance between the global and local factors in the weight of a term. Moreover, TF-IGM value will range from zero to 1. Each word in a document will be counted with a TF-IGM value then the words will be sorted according to the largest value. Words that have great value will be used as a feature for making classification models because it can be assumed that these words contain the important meaning of a document with a specific class label. Lastly, the use of semantic analysis and NRC emotion lexicon as correlating features in predicting characteristics of a person as were also used in this study. Both methods use the open vocabulary approach, which require a predefined corpus in finding the contextual feature from a text data. The Table 5 describe list of features, and references used to obtain them and the number of features.

Model prediction
Deep learning method has become more and more popular in recent periods, where several related studies use neural network architectures such as CNN and LSTM in making the best model for personality prediction systems. However, in this study, the multi model deep learning architecture was introduced by combining the statistical based text feature and a predefined model feature to improve the performance in predicting a personality of a person. In this research five classifiers will be made for this personality prediction system where each classifier represents the personality from the big five personality traits model. Figure 4 represent the model architecture that was build.
Each of input embedding extracted from the pre-trained model will be feed into a selfattention mechanism. Self-attention allows the models to associate each word in the input, to other words. It is also possible that the model learns that words structured in  this pattern are typically a question so respond appropriately. To achieve self-attention, it feed the input into three distinct fully connected layers to create the query, key, and value vectors. The queries (Q) vector and keys (K) vector undergo a dot product matrix multiplication to produce a score matrix. The score matrix determines how much focus should a word be put on other words. So, each word will have a score that corresponds to other words in the time-step. The higher the score the more focus. This is how the queries are mapped to the keys. Next, the scores get scaled down by dividing with the square root of dimension query and keys (d_x). This process is to prevent exploding effect on the value hence, allowing for more stable gradient. After that, the softmax function is used to scaled down score to get attention weights and give an output in form of probability between zero and one. This function receives input of output matrix from scaled function (x_i) and sum of data inside the matrix (x_j). Applying this function makes the higher score get heighten and lower score depressed, therefore allowing the model to be more confident about which word to attend to. The softmax function and scaled down function denoted by the following formula.
Finally, the attention weights will be multiplied with the value vector to get output vector. A higher softmax score will make the model learns that the higher value of the words means more important. Lower scores will eliminate irrelevant words. This final value will be concatenate to the original positional input embedding which is called a residual connection. The output of the residual connection will be combined with statistical NLP features with total of 623 features and inserted to a feed forward neural network. Inside each neural network will consist of three connected layers with alternating Rectified Linear Unit (ReLU) activation function and batch normalization in between them.

Fig. 4 Proposed Model Architecture
Moreover, a dropout function also applied in order to reduce overfitting and generalization error. Dropout deactivates the neurons randomly at each training step instead of training the data on the original network. In the next iteration of the training step, the hidden neurons which are deactivated by dropout changes because of its probabilistic behavior. Lastly, the output from the feedforward will be included in the averaging model function.
According to previous deep learning literature [21,28], the unweighted averaging might be a reasonable ensemble for similar base learners of comparable performance. In this research the model averaging (unweighted) can be calculated by combining the softmax probabilities from three different classifications model. The mean class the probability is calculated as follow: where K is the number of classes, and y is the predicted label for a sentence. For loss function, a cross entropy loss denoted by the following formula was used.
where y is the actual label value and p is the predicted personality of from a sentence. To maximize the performance of the model built, the parameter tuning process will be carried out. Grid search method will be used to perform repeated searches in finding optimal parameters that will produce the maximum level of predictive performance. Some of the parameters to be modified are the batch size, epoch, and learning rate.

Evaluation metric
The results of the model that have been created will be evaluated using several metric measurement approaches as follows: a. F1 Measure Measurement metric of a model which combines the average values of precision and recall producing score by considering a classification error. These measurement metrics are best used when false negative and false positive values are important. In the prediction of personality, the false positive and false negative values are considered to reduce predictive errors because if the predictions are wrong, maybe someone can be placed incompatible with their personality.  Useful as a measure of the performance of a model, however this measurement focuses on the total data that is precisely predicted, namely true positive and true negative. This measurement is good for class distribution on balanced data. Based on previous research, many use this measurement as evaluation metric. Therefore, to compare the results of research with previous studies, this metric is used.

Experiment
To compare the methodology that has been proposed previously, a comparison will be made by comparing to the other architecture with several feature extraction methods that are able to produce the best personality predictions in previous studies using the same big five personality model. Therefore, this research will be divided into several experimental scenarios where each scenario will use different algorithms along with feature extraction and feature selection method applied on two different datasets which are Facebook, and Twitter from to determine the performance of the proposed system. Table 4 shows the breakdown combination of each scenario (Table 6). From the design scenario proposed, there are three types of model will be compare. First by using only pre-trained model (BERT, RoBERTa, XLNet). Second, with an addition of NLP statistical features which consist of NRC Lexicon Database, TF-IGM, and Sentiment Analysi. Lastly, the proposed architecture with model averaging from the three classifiers and addition of NLP statistical features. Each deep learning architectures will be tuned with different batch size and learning size on both social media datasets Facebook and Twitter.

Result and discussions
All predictive models that have been created will be evaluated using the accuracy and f1 measure metric approach. The results of the model evaluation can be seen in the table below. Table 7 is the result of the evaluation using the Facebook dataset, which shows that the highest accuracy produced from each trait dominated by the proposed model which used model averaging method and NLP statistical features. The first and second highest accuracy is produced in the Openness personality model with 86.17% accuracy and 0.912 f1 measure score, Neuroticism trait with an accuracy of 78.21% and f1 measure score 0.709. However, in terms of Agreeableness trait the highest accuracy and f1 measure score of other personality models is created by using the XLNet with addition of NLP features, which is found in Agreeableness personalities with accuracy values of 72.33% and 0.701. However, the resulting value differs slightly for about 0.74% and 0.011 from the proposed model in terms of both metrics. By looking at all the mean accuracy of the experiments on each algorithm, it was found that the proposed model architecture (11) Accuracy = True Positive+True Negative True Positive + False Positive + True Negative + False Negative . has the highest average accuracy and f1 score, which is 77.34% and 0.749 for personality prediction system using Facebook dataset. Furthermore, Table 8 defines the results of the evaluation using the Twitter dataset, which was collected manually. Similar to the previous results, the use of the proposed model architecture produces the best accuracy along with f1 measure score and dominates the highest performance from the five personality models. It shows the accuracy of 88.49% in the conscientiousness personality became the model with the highest accuracy, followed by extraversion personality, which was 81.17% and neuroticism was 75.08%. Differs from the previous results, the use of BERT with NLP features surpasses the results of the proposed model for the Agreeableness personality models with total accuracy of 72.33%. On the other hand, although the highest accuracy generated in conscientiousness trait is high, the f1 measure score gives a low result with value for only 0.652, this value is caused by the model tends to predict the low dominant trait rather  than the high dominant trait therefore causing affect in precision and recall value. However, the proposed model architecture still give the highest results in an average accuracy and f1 score across other algorithms with a value of 77.34% and 0.760. Next, Table 9 shows the results of the combination of parameters from the proposed deep learning model architecture which shows the best accuracy and f1 measure in performance. The model parameter tuning was carried out using tenfold cross validation. In determining the batch size and learning rate, the validation data mentioned in the previous section are used. From the result it is shown that for the model used to predict Openness, Extraversion, and Agreeableness required the batch size which is 16 while the other remaining traits required 32 batch size to get the optimum result. As for learning rate, all traits except Neuroticism use 3.00E−05 for the optimum performance. While the rest used 1.00E−5 for the optimum performance.
Finally, Tables 10 and 11 represent the comparative experimental results for the proposed method in this paper with respect to the state-of-the-art. To compare the overall system performance, some research uses the average accuracy and the other average f1-measure. The top 4 models given in Tables 10 and 11 are the best performing models for Facebook dataset and Twitter dataset with Big Five personality model as the label  respectively. By analyzing these values, it can be concluded that the proposed deep learning architecture is able to provide the best model performance for in terms of the overall average performance of accuracy and f1-measure among all of the approaches. Moreover, the results also state that all the classifier with NLP features performs better compare to the individual pre-trained model features. This means providing NLP features will increase the model performance in predicting personality traits.

Conclusion
This research shows the comparison of different feature extraction method along with different algorithm approach in building personality prediction system for multiple social media data sources. Through this experiment, the proposed deep learning architecture approach with BERT, RoBERTa, XLNet as pre-trained language model, NLP  statistical features and model averaging outperform on most personality model builds by producing the highest accuracy of 86.17% and f1 measure score 0.912 on Facebook dataset and 88.49% accuracy and 0.882 f1 measure score on the Twitter dataset. Moreover, an addition of NLP statistical features such as TF-IGM, sentiment analysis, and NRC lexicon database contributed significantly to the personality prediction system on both datasets, since it can increase the model performance compare to only using pre-trained model as extraction features. Future development of this experiment may utilize the use of larger training and testing dataset. Furthermore, another comparison approaches such as implementing another pre-trained model such as ALBERT which is A Lite BERT for Self-supervised Learning of Language Representation, DistilBERT, and BigBird may also be a possible candidate to increase accuracy in the personality prediction system.