The algorithm we propose improves upon previous work [17] and consists of four phases (see Fig. 4): data collection, text and user processing, tweet classification with machine learning, and opinion modeling. While the first two phases are standard practice in the literature, tweet classification by means of machine learning has been adopted only recently [17, 18], since classifying millions or even billions of datapoints by hand is impossible. Opinion modeling, the core of our election prediction model, attempts to capture people’s opinion through time, in near real time, by means of a social network. To improve upon [17], we consider the cumulative opinion of people and define three prediction models based on different assumptions about the loyalty classes of users towards candidates, homophily measures and re-weighting scenarios of the raw data. Below we explain each phase, highlighting the steps that make our full-fledged AI predictor a good candidate substitute for traditional pollster methods.
Data collection
By means of the Twitter public APIs, we continuously collected tweets up to election day (from March 1, 2019 until October 27, 2019), filtered according to the following queries (corresponding to the candidates’ names and handles in the 2019 Argentina primary election): Alberto AND Fernández, alferdez, CFK, CFKArgentina, Kirchner, mauriciomacri, Macri, Pichetto, MiguelPichetto, Lavagna. Only tweets in Spanish were selected. Figure 5a shows the daily volume of tweets collected (brown line), while Fig. 5b shows the daily number of users (green line). In blue we report the daily number of tweets/users that are classified, i.e. users who posted at least one classified tweet. Users are classified with machine learning (explained below) as supporters of Macri (Fig. 5d, red line) if the majority of their daily tweets are classified in favor of Macri (Fig. 5c, red line), or as supporters of Fernández in the opposite case (blue lines in Fig. 5c and d). Hereafter we use FF to indicate the Fernández-Fernández formula and MP to refer to the Macri-Pichetto formula (the outgoing president/vice-president ticket).
The activity of tweets/users peaks on August 11, 2019, the day of the primary election. In the period from March to October, we collected a daily average of 282,811 tweets posted by a daily average of 84,062 unique users. Each day we classified 75% of these tweets and \(\sim\) 76% of the users (see Additional file 1: Tables S1 and S2). In total, by the end of October we had collected around 110 million tweets broadcast by 6.3 million users. This amount of tweets is unprecedented, and is especially relevant considering that Argentina has one of the highest rates of tweets per capita in the world.
User and text processing
Below we explain the tasks that need to be applied to the raw data before any analysis is performed.
Bot detection
The identification of software that automatically injects information into the Twitter system is of fundamental importance to discern between “fake” (bot) and “genuine” users, the latter representing the real voters.
Following [18], a good strategy is to extract the name of the Twitter client used to post each tweet from its source field and keep only tweets posted from an official Twitter client. Third-party clients represent a variety of applications, from applications mainly used by professionals to automate some tasks (e.g. dlvrit.com) to manually programmed bots. This simple method separates tweets that have not been automated from those automated by bots and, contrary to more sophisticated methods, scales very easily to large datasets.
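The client-based filter can be sketched as follows. We assume the client name has already been extracted from each tweet's source field; the whitelist of official clients below is illustrative, not necessarily the exact list used in [18]:

```python
# Sketch of the client-based bot filter described above. The whitelist
# is illustrative; tweets from any other client are treated as automated.
OFFICIAL_CLIENTS = {
    "Twitter for iPhone",
    "Twitter for Android",
    "Twitter Web App",
    "Twitter Web Client",
}

def split_by_automation(tweets):
    """Partition tweets into (genuine, automated) by posting client."""
    genuine = [t for t in tweets if t.get("source_name") in OFFICIAL_CLIENTS]
    automated = [t for t in tweets if t.get("source_name") not in OFFICIAL_CLIENTS]
    return genuine, automated
```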
Figure 6a and b show the daily number of tweets posted by bots and the daily volume of bots, respectively. Figure 6c and d show the daily volume of classified tweets/bots. The daily average number of bots between March and October 2019 is 732, with an average overall daily activity of 2243 tweets. On average, 1617 tweets from bots are classified per day, coming from 560 classified bots. As for “genuine” users, a bot is classified if it shares at least one classified tweet. In the entire dataset we found around 20,000 bots, which posted 538,350 tweets. Note that even though we classified the bots, they are not used for the final prediction since they do not correspond to real voters.
Text standardization
Stop word removal and word tokenization are common practice in data mining and natural language processing (NLP) [39, 40]. In addition, we replace every URL with the token “URL”: URLs usually point to resources that help determine the opinion of the tweet, so we keep them as tokens rather than discarding them.
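A minimal sketch of this standardization step; the stop-word list is a tiny illustrative subset, not the one used in the actual pipeline:

```python
import re

# Toy standardization: lowercase, replace every URL with the token "URL",
# tokenize, and drop stop words. The stop-word list is illustrative only.
STOP_WORDS = {"de", "la", "el", "que", "y", "a", "en"}
URL_RE = re.compile(r"https?://\S+")

def standardize(text):
    text = URL_RE.sub("URL", text.lower())  # keep each URL as one token
    tokens = re.findall(r"[a-záéíóúñü#@]+|URL", text)
    return [t for t in tokens if t not in STOP_WORDS]
```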
Tweet classification
To build the training set we analyze the hashtags in Twitter. Users continuously label their tweets with hashtags, short tags that directly transmit the user’s feeling/opinion towards a topic. We manually labeled the top hashtags used in the dataset in August 2019 (see Additional file 1: Table S3; hashtags after August are reported in Additional file 1: Table S4). They are classified as pro Macri, pro Fernández or pro third-party candidate, depending on whom they support (by third party we refer to the supporters of Lavagna, Espert and other secondary candidates). We also consider hashtags that are negative towards one candidate as pro the other candidate, when this is obvious; for instance, #Macriescaos is pro-Fernández.
Hashtag co-occurrence network
In order to check the quality of the hashtag classification, we build the hashtag co-occurrence network H(V, E) and statistically validate its edges [17, 41]. In the co-occurrence network the vertices \(v \in V\) represent hashtags, and an edge \(e_{ij}\) is drawn between \(v_i\) and \(v_j\) if they appear together in a tweet. We test the statistical significance of each edge \(e_{ij}\) by computing the probability \(p_{ij}\) (the p-value under the null hypothesis) of observing the corresponding number of co-occurrences by chance, knowing only the numbers of occurrences \(c_i\) and \(c_j\) of the vertices \(v_{i}\) and \(v_{j}\) and the total number of tweets N. Figure 7 shows the validated network, where we keep only the edges with p-value \(p<10^{-7}\). The blue community contains the hashtags in favor of Fernández, the red community those in favor of Macri, and the green one (a very small group) those in favor of the third candidate. A look at the types of hashtags reveals a first difference between the supporters: those in favor of Cristina Kirchner are much more passionate than the followers of Macri. For example, Kirchner-type hashtags are #FuerzaCristina, #Nestorvuelve and #Nestorpudo, or they are very negative towards Macri, as #NuncamasMacri. On the other hand, Macri’s group is smaller and less passionate, with hashtags like #Cambiemos or #MM2019 (see Fig. 8), while support for the third candidate has not gained traction and its electoral base on Twitter is very small.
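One common way to compute such a p-value, sketched here under a hypergeometric null model (the exact null model of the validation follows [17, 41]), treats the \(c_j\) occurrences of one hashtag as placed at random among the N tweets and asks how likely at least the observed number of co-occurrences is:

```python
from math import comb

def cooccurrence_pvalue(n_tweets, c_i, c_j, c_ij):
    """Probability of observing at least c_ij co-occurrences of two
    hashtags appearing in c_i and c_j of n_tweets tweets, under a
    hypergeometric null model (occurrences placed at random)."""
    total = comb(n_tweets, c_j)
    return sum(
        comb(c_i, k) * comb(n_tweets - c_i, c_j - k)
        for k in range(c_ij, min(c_i, c_j) + 1)
    ) / total
```

An edge would then be kept only if this p-value falls below the \(10^{-7}\) threshold.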
In principle, counting the users and tweets according to the hashtags they use would predict the victory of Fernández over Macri. However, this conclusion would be based on only \(\sim\) 10,000 users (those expressing their opinion through hashtags). In order to get the opinion of all the users, we train a machine learning model that classifies each tweet as FF (Fernández-Fernández), MP (Macri-Pichetto) or Third party. We use the previous set of opinion-expressing hashtags to build a training set of labeled tweets, which is used in turn to train a machine learning classifier. We use all the tweets (before August) that contain at least one of the classified hashtags to train the model. When a tweet contains more than one hashtag, we consider it only if all the hashtags are in favor of the same candidate. The use of hashtags that explicitly express an opinion in a tweet represents a “cost” in terms of self-exposure by Twitter users [42], and therefore allows one to select tweets that clearly state support for or opposition to the candidates. The training set consists of 228,133 tweets, i.e. 0.33% of the total amount of collected tweets and \(\sim\)90% of the hand-classified tweets (253,482 tweets). In order to find the best classifier we tested five classification models: Logistic Regression (LR) with \(L_2\) regularization, Support Vector Machine (SVM), Naive Bayes (NB), Random Forest (RF) and Decision Tree (DT). All models are validated on the remaining 10% of the classified tweets (25,349). Table 4 shows the results. Logistic Regression performs better than the other models, with an average group accuracy of 83%; recall and F1-score are also 83%.
The Support Vector Machine is the second best classifier, with an average accuracy of 81%, followed by the Naive Bayes with an average accuracy of 79.5%, the Random Forest and the Decision Tree.
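A sketch of the winning pipeline using scikit-learn (an assumed implementation; the text does not name a library). TfidfVectorizer stands in for the text features, and LogisticRegression applies an \(L_2\) penalty by default:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_tweet_classifier(texts, labels):
    """Fit a text classifier on hashtag-labeled tweets. LogisticRegression
    uses L2 regularization by default, matching the model described above."""
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(texts, labels)
    return clf
```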
We recall that the logistic regression assigns to each tweet a probability p of belonging to a class. In our case this probability goes to one if the tweet supports Macri and to zero if it supports Fernández. As shown in Fig. 9, the distribution of p contains two peaks, one on the left and one on the right, separated by a plateau. This is an encouraging result, since it proves the efficacy of the model in discerning between the two classes. We classify a tweet as in favor of Macri if \(p \ge 0.66\) and in favor of Fernández if \(p \le 0.33\); tweets with a value of p in the plateau are left unclassified, meaning that they do not contain sufficient information to be assigned to either camp. According to this rule, on average we classify 211,229 genuine tweets and 1617 tweets from bots per day (see Additional file 1: Tables S1, S2).
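The thresholding rule can be written as a small helper; the label strings are illustrative:

```python
def classify_tweet(p, low=0.33, high=0.66):
    """Map the classifier probability p (near 1 for Macri, near 0 for
    Fernández) to a label; the plateau in between stays unclassified."""
    if p >= high:
        return "MP"
    if p <= low:
        return "FF"
    return "unclassified"
```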
Opinion modeling
We can infer users’ opinion from the majority of the tweets they post. Let \(n_{F,t}\) be the number of tweets posted by a given user at time t in favor of Fernández and let \(n_{M,t}\) be those supporting Macri. We define an instantaneous opinion over a window of length w and a cumulative average opinion as follows. In the first case, a user is classified as a supporter of Fernández (at a given day \(t=d\)) when
$$\begin{aligned} \sum _{t=d-w+1}^{d}n_{F,t} > \sum _{t=d-w+1}^{d}n_{M,t} \end{aligned}$$
(1)
i.e. if the majority of the tweets posted in the last w days were in favor of Fernández. The user is classified as a supporter of Macri if
$$\begin{aligned} \sum _{t=d-w+1}^{d}n_{F,t} < \sum _{t=d-w+1}^{d}n_{M,t} \end{aligned}$$
(2)
If none of the previous conditions is met, i.e. if
$$\begin{aligned} \sum _{t=d-w+1}^{d}n_{F,t} = \sum _{t=d-w+1}^{d}n_{M,t} \end{aligned}$$
(3)
then the user is classified as undecided. Note that when \(w=1\) we have the ‘most’ instantaneous prediction, that is, a prediction based on what people think on the last day. This instantaneous prediction model was used in previous work [17] to match the results of the AI model to the aggregate of polls from the New York Times in the 2016 US election, with excellent results. However, this predictor did not match the results of the electoral college, which required stratification by states. Thus, we further develop the AI model of [17] to add other predictors beyond the instantaneous measures.
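Eqs. (1)-(3) amount to a windowed majority vote over daily counts, which can be sketched as follows (0-indexed day lists are an implementation choice):

```python
def window_opinion(n_F, n_M, d, w):
    """Classify a user at day d from the daily counts n_F[t], n_M[t] of
    tweets for Fernández and Macri over the last w days (Eqs. 1-3)."""
    f = sum(n_F[max(0, d - w + 1): d + 1])
    m = sum(n_M[max(0, d - w + 1): d + 1])
    if f > m:
        return "FF"
    if m > f:
        return "MP"
    return "undecided"
```

The cumulative opinion discussed next is the special case of a window covering the whole observation period, i.e. `window_opinion(n_F, n_M, d, d + 1)`.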
Traditional poll data collection is an instantaneous prediction with a value of w that can go from a few days up to a few weeks, i.e. the time needed to collect the poll data; this corresponds roughly to our instantaneous measurement above. However, the fact that we are able to track the same user over long periods of time on Twitter allows us to extend the window of observation as far as we want, and thus to define a new measure that we call the ‘cumulative opinion’. The cumulative opinion in our model is defined by extending w to the initial date of collection for every observation time d, i.e., \(w=d\). Thus the cumulative opinion considers the opinion of a user based on all the tweets he/she posted from time \(t=0\) up to the observation time d. That is, our prediction is longitudinal, as we are able to follow the opinion of the same user over the entire observation period of several months. In terms of traditional poll methods, a cumulative opinion would be obtained in a group panel collecting, for each respondent in the sample and for each day starting from \(t=0\), her/his preference towards a candidate. This possibility, which would require an unimaginable amount of effort and time for traditional poll methods, is quite straightforward when it comes to social networks and big-data analyses.
We start by investigating the instantaneous response of the users in a fixed window of time. Figure 10a shows the Twitter supporter dynamics over time obtained with a window average of \(w=14\) days. Users are classified as MP (Macri-Pichetto, in red), FF (Fernández-Fernández, in blue) or Others (in green). Figure 10b shows the supporter dynamics (thick lines) compared with the Elypsis prediction from their polls (thin dashed lines), without considering the undecided users in the normalization. In the same plots we also report the official results for both the primary and the general elections. The comparison between the two figures reveals an approximate correlation between the Elypsis and AI results for each candidate, valid for these instantaneous measures. However, when comparing candidates, the predictions may sometimes differ: for example, right before the beginning of August, Elypsis favored MP while the AI instantaneous prediction was in favor of FF. Overall, as for the pollsters’ results, window average analyses are representative of the instantaneous sentiment of the people. As we see from the figures, instantaneous opinions are affected by considerable fluctuations [17], which make the prediction unreliable. In Additional file 1: Fig. S1, we compare the window average opinion with other pollsters (Real Time Data, Management & Fit, Opinaia, Giacobbe and Elypsis). An interpolation (thin lines) shows trends similar to the AI-model window average, stressing that the conclusions drawn so far are more general than the single comparison with Elypsis. In fact, in [17] we showed that the instantaneous predictions of the AI model follow quite closely the aggregation of polls from the New York Times’s ‘The Upshot’; yet they do not reproduce the results of the general election in the electoral college, which further requires a segmentation by states where proper prediction of rural and non-rural areas becomes key.
Considering the cumulative opinion, not the instantaneous one, of each user is crucial to correctly predict the elections.
Thus, we next study the opinion of each user by considering the cumulative number of tweets over the entire period of observation to classify the voting intention (Model 0). This cumulative approach takes into consideration all the tweets of each user since the first time they enter the dataset and bases the voting intention on all of them. Such an approach can only be implemented with Twitter and not with traditional polls, except for short times and particular cases, as done by Elypsis before and after the PASO.
Figure 11 shows the prediction of the model using the cumulative opinion of the users from March 1 until a few days before the general election. This approach captures the election results well and, in particular, the huge gap between the candidates, both for the primary election and for the general election (vertical lines, from left to right). While low precision is of secondary importance when the difference between the opponents is large, it plays a central role when they have close shares of supporters: in an almost perfectly balanced situation, the change of mind of just a few people may flip the final outcome. If, on the one hand, a cumulative approach does reduce the fluctuations in the signal, it is, on the other hand, less sensitive to sudden changes of opinion: a person can support a candidate until a few days before the election and then change her/his mind because of particular events. Such scenarios can be taken into account only by a model based on cumulative analyses that captures the degree of loyalty of people towards the candidates over time. Differently from traditional surveys, the real-time data processing that underlies our AI algorithm makes it possible to take this scenario into consideration. To understand how different re-weighting scenarios affect the results, below we introduce different loyalty classes of users towards the candidates and then define several models matching the criteria previously discussed. These loyalty classes can be defined when we consider the cumulative opinion in a longitudinal study and cannot be investigated by traditional polls.
Loyalty classes
We define five classes of loyalty for users who support candidate c. Here we consider the FF supporters, but the definitions below apply similarly to the other candidates.
-
Ultra Loyal (UL): users who always tweet only for the same candidate, for example
$$\begin{aligned} \frac{\sum _{t=T_0}^T n_{F,t}}{\sum _{t=T_0}^T \left( n_{F,t}+n_{M,t}+n_{U,t}\right) } = 1\mathrm{,} \end{aligned}$$
(4)
where \(n_{c,t}\) indicates the number of tweets that a given user posts in favor of c, with \(c\in\) \(\lbrace\)M (Macri), F (Fernández), U (Unclassified)\(\rbrace\).
Differently from the ultra loyal users, who continuously post in favor of a candidate, the other classes take into consideration a possible change of opinion. In order to detect sudden shifts of opinion we focus on the classification of the last k tweets posted by each user. We define:
-
Loyal FF \(\rightarrow\) FF: a user who is FF, since the majority of her/his tweets are for FF, and who also supported FF in the last k tweets; mathematically,
$$\begin{aligned} \sum _{t=T_0}^{d}n_{F,t}> \sum _{t=T_0}^{d}n_{M,t} \ AND \sum _{i=N-k+1}^{N} n_{F,i} > \sum _{i=N-k+1}^{N} \left( n_{M,i}+n_{U,i}\right) \end{aligned}$$
(5)
where N is the total number of tweets posted by the user and \(n_{c,i}\) is 1 if the i-th tweet of the user is classified as supporting c, and 0 otherwise.
-
Loyal FF \(\rightarrow\) MP: users who are FF by the total cumulative count but have tweeted for MP in the last k tweets; in formula,
$$\begin{aligned} \sum _{t=T_0}^{d}n_{F,t}> \sum _{t=T_0}^{d}n_{M,t} \ AND \sum _{i=N-k+1}^{N} n_{M,i} > \sum _{i=N-k+1}^{N} \left( n_{F,i}+n_{U,i}\right) \end{aligned}$$
(6)
-
Loyal FF \(\rightarrow\) TP: users who are FF by the total cumulative count but supported the third party in the last k tweets, i.e.
$$\begin{aligned} \sum _{t=T_0}^{d}n_{F,t}> \sum _{t=T_0}^{d}n_{M,t} \ AND \sum _{i=N-k+1}^{N} n_{U,i} > \sum _{i=N-k+1}^{N} \left( n_{F,i}+n_{M,i}\right) \end{aligned}$$
(7)
-
Loyal FF \(\rightarrow\) Undecided: all other individuals classified as FF but not included above.
Recall that unclassified refers to all those users who do not have any classified tweet. Figure 12 shows the cumulative prediction for each class, with \(T_0\) = March 1, 2019 and \(k=10\). The Ultra Loyal class for Fernández (FF) represents \(\sim\) 33% of the population, while only \(\sim\)20% of the population is Ultra Loyal towards Macri (MP). Loyal MP\(\rightarrow\)MP and Loyal FF\(\rightarrow\)FF each represent between 8% and 13% of the studied Twitter population. The percentage of undecided users is around 8%, as is the third-party percentage, while the other classes are close to 1-2%. In the next section we use these classes to define a better election predictor than Model 0.
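The class assignment for an FF supporter can be sketched as follows, assuming the input is the chronological list of the user's classified-tweet labels ('F', 'M', 'U'):

```python
def loyalty_class(labels, k=10):
    """Loyalty class of an FF supporter, given the chronological labels
    ('F', 'M', 'U') of all her/his classified tweets (Eqs. 4-7)."""
    n_F, n_M = labels.count("F"), labels.count("M")
    if n_F <= n_M:
        raise ValueError("not an FF supporter by cumulative count")
    if n_F == len(labels):          # every tweet ever is for FF
        return "Ultra Loyal"
    last = labels[-k:]              # the user's last k tweets
    f, m, u = last.count("F"), last.count("M"), last.count("U")
    if f > m + u:
        return "Loyal FF->FF"
    if m > f + u:
        return "Loyal FF->MP"
    if u > f + m:
        return "Loyal FF->TP"
    return "Loyal FF->Undecided"
```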
AI models based on loyalty classes
The loyalty classes introduced above are one of the main differences from the Twitter-based studies of [17]. Here, we use the machine learning classifier (logistic regression) to define the loyalty of a user, not to make predictions directly. We then make the predictions by grouping supporters as follows:
-
Fernández supporters: all those users who are Ultra loyal FF, Loyal FF\(\rightarrow\)FF, Loyal FF\(\rightarrow\)MP, Loyal FF \(\rightarrow\) Undecided.
-
Macri supporters: all those users who are ultra loyal MP, Loyal MP\(\rightarrow\)MP, Loyal MP\(\rightarrow\)FF, Loyal MP\(\rightarrow\)Undecided.
In each group we put those users whose activity over time makes us almost sure about whom they support. However, as we saw in the previous section, undecided users may play a central role in a scenario where a few percentage points can flip the final result. Furthermore, understanding unclassified users (i.e. those users who do not have any classified tweet) will also improve the final statistics. In order to take into account all the reasonable scenarios, we define three different models (starting from the classification into Fernández and Macri supporters above) and validate them against the final results of the election. Table 5 summarizes the details of each model.
Model 1: All the users belonging to one of the following classes are grouped in the Third Party: Undecided\(\rightarrow\)MP, Undecided\(\rightarrow\)FF, Undecided\(\rightarrow\)Undecided and Unclassified.
Model 2: Instead of simply grouping the undecided into a third party, we use network homophily to infer their political orientation. A user is classified as MP(Undecided) if the majority of her/his neighbors in the undirected retweet network support Macri; the same definition applies to the other cases. In this model, FF(Undecided) users are considered supporters of Fernández and MP(Undecided) users supporters of Macri, while Undecided(Undecided) users and the Unclassified belong to the Third Party (or Others). We remind the reader that the Unclassified users are those who only tweeted unclassified tweets (those with 0.33 < p < 0.66), while Undecided users are those who satisfy Eq. (3).
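The homophily step is a majority vote over retweet-network neighbors; a sketch assuming an adjacency dict and a dict of known FF/MP labels:

```python
def homophily_label(user, neighbors, labels):
    """Infer an Undecided user's leaning from the majority label of
    her/his neighbors in the undirected retweet network."""
    ff = sum(1 for v in neighbors.get(user, ()) if labels.get(v) == "FF")
    mp = sum(1 for v in neighbors.get(user, ()) if labels.get(v) == "MP")
    if ff > mp:
        return "FF"
    if mp > ff:
        return "MP"
    return "Undecided"   # ties and isolated users stay undecided
```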
Model 3: While collecting tweets we also analyze the users’ profiles, including the profile picture and the location information. First, users outside Argentina are removed from the dataset. Face-analysis tools based on state-of-the-art facial recognition algorithms have been validated and applied in the context of profile pictures [43, 44]. We obtain age and gender information for more than 400 thousand users; see the population distribution of Twitter users in Fig. 13. In this model, we adjust the results of Model 2 for each candidate by re-weighting the Twitter population to the Census data by
$$\begin{aligned} \sum _{gender,age} \frac{N_{u_{c}}(gender, age)}{N_{u}(gender, age)} \frac{N_{\mathrm{Census}}(gender,age)}{N_{\mathrm{Census}}} \end{aligned}$$
(8)
where \(N_{u_{c}}(gender, age)\) is the number of users of the given gender and age supporting candidate c, \(N_{u}(gender, age)\) is the total number of classified users of that gender and age, and \(N_{\mathrm{Census}}(gender,age)\) is the number of people in that gender and age category according to the Census data, with \(c\in \{\mathrm{FF}, \mathrm{MP}, \mathrm{Third\ party}\}\), \(gender \in \{\mathrm{female}, \mathrm{male}\}\) and \(age \in \{[16, 30], [31, 50], [51, 65], [65, \infty )\}\).
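Eq. (8) can be evaluated cell by cell; a sketch assuming the counts are stored in dicts keyed by (gender, age) cells:

```python
def reweighted_share(n_uc, n_u, n_census, total_census):
    """Census-reweighted support for one candidate (Eq. 8). n_uc, n_u and
    n_census map (gender, age) cells to the candidate's supporters, to all
    classified users, and to the Census population, respectively."""
    return sum(
        (n_uc[cell] / n_u[cell]) * (n_census[cell] / total_census)
        for cell in n_uc
    )
```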
In the next section we employ these three models and compare their performance on the 2019 Argentina election.