Why polls fail to predict elections


Abstract
In the past decade we have witnessed the failure of traditional polls in predicting presidential election outcomes across the world. To understand the reasons behind these failures we analyze the raw data of a trusted pollster which failed to predict, along with the rest of the pollsters, the surprising 2019 presidential election in Argentina, which led to a major market collapse in that country. Analysis of the raw and re-weighted data from longitudinal surveys performed before and after the elections reveals clear biases (beyond well-known low response rates) related to mis-representation of the population and, most importantly, to social-desirability biases, i.e., the tendency of respondents to hide their intention to vote for controversial candidates. We then propose a longitudinal opinion tracking method based on big-data analytics from social media, machine learning, and network theory that overcomes the limits of traditional polls. The model achieves accurate results in the 2019 Argentina elections, predicting the overwhelming victory of the candidate Alberto Fernández over the president Mauricio Macri; a result that none of the traditional pollsters in the country was able to predict. Beyond predicting political elections, the framework we propose is more general and can be used to discover trends in society; for instance, what people think about economics, education or climate change.

Traditional polling methods [1] using random digit dial phone interviews, opt-in samples of online surveys, and interactive voice response are failing to predict election outcomes across the world [2][3][4]. The failure of traditional surveys has also been widely discussed in the press [5] and in the specialized literature [4]. For instance, the victory of Donald Trump in the 2016 US presidential election came as a shock to many, as none of the pollsters and political journalists and pundits, including those in Trump's campaign, could predict this victory [4,6].
The reasons for the failure of pollsters to predict elections are believed to be many [4,6].
The first reason is that response rates to traditionally conducted surveys have decreased, and it is becoming increasingly difficult to get people's opinion [4,7]. Response rates in telephone polls with live interviewers continue to decline, recently reaching a lower limit of 6% [7]. Response rates could be even lower for other methodologies, like internet polling or interactive voice response. Compounding the declining response rates is the concomitant problem of mis-representation of the survey samples. That is, the sample surveyed by pollsters does not represent the demographic distributions of the general population. This problem is ameliorated by re-weighting the survey sample according to the general demographics of the population in a process called sample-balancing or raking [8,9].
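To make the raking step concrete, the following is a minimal sketch of iterative proportional fitting on two demographic margins (age and gender). The sample counts and census margins are hypothetical numbers for illustration only, not Elypsis or Census data.

```python
import numpy as np

def rake(sample_counts, row_margin, col_margin, n_iter=1000, tol=1e-6):
    """Iterative proportional fitting (raking): rescale a 2-D sample
    cross-tabulation until its margins match known population margins."""
    w = sample_counts.astype(float)
    for _ in range(n_iter):
        # Scale rows to the target row margin (e.g. age groups).
        w *= (row_margin / w.sum(axis=1))[:, None]
        # Scale columns to the target column margin (e.g. gender).
        w *= (col_margin / w.sum(axis=0))[None, :]
        if np.allclose(w.sum(axis=1), row_margin, atol=tol):
            break
    # Per-respondent weight = raked cell total / raw cell count.
    return w / sample_counts

# Hypothetical survey: rows = age groups (16-30, 31-50, 51+), cols = gender.
sample = np.array([[50, 40],     # young respondents undersampled
                   [300, 280],
                   [330, 300]])  # older respondents oversampled
age_margin = np.array([0.30, 0.40, 0.30]) * sample.sum()   # census shares
gender_margin = np.array([0.49, 0.51]) * sample.sum()

weights = rake(sample, age_margin, gender_margin)
print(weights)  # > 1 in underrepresented cells, < 1 in oversampled ones
```

As discussed below, such weights can repair moderate imbalances, but when a cell is drastically undersampled a few respondents carry very large weights and small errors in those cells propagate to the final estimate.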
However, in countries where voting is not obligatory, re-weighting a sample to the general population demographics (obtained from the Census Bureau [1,4]) fails, since the general population demographics do not necessarily match the demographics of the voter turnout: it is impossible to predict which demographic groups will turn out at the voting station. Thus, if an underrepresented group in the polls, 'the hidden vote', decides to vote on election day, the re-weighting fails, leading to highly inaccurate results. This issue is believed to be one of the major reasons for the generalized failure of pollsters to predict the triumph of Trump in the 2016 US presidential election, where groups generally defined as 'white voters without college degree' mostly voted for Trump but were undersampled by all pollsters. Even with this historical information at hand, which supposedly allowed pollsters to resample their surveys more carefully, pollsters again under-predicted the support for Trump in the subsequent 2020 presidential election in some states, or under-predicted the voter turnout supporting Biden in newly created battleground states like Georgia, USA [10]. The inability to predict voter turnout accurately enough to deal with sampling mis-representation might render the pollsters obsolete.
While there is increasing evidence [4,7] that the non-response and mis-representation biases might be the reason that polls are not accurately matching election results, these may not be the only problems of traditional polling methods. Traditional surveys in heavily polarized campaigns are affected by social-desirability biases (also called the Bradley effect) [11,12], i.e., the tendency of subjects to give socially desirable responses instead of choosing responses that reflect their true feelings. For instance, survey respondents may not tell the truth about their intention to support controversial candidates, which could expose them to social ostracism. Respondents may feel under pressure to provide 'politically correct' answers, producing results highly biased towards the candidate accepted by the media, to the detriment of the controversial one. This mechanism is also believed to have been in place in the massive failure of pollsters to predict the 2016 US election, as Trump voters could generally go undetected or even lie to traditional pollsters. We will show below that it was also the major reason for the pollsters' failure to predict the 2019 Argentina election, which also involved a controversial candidate (Cristina Fernández) who was heavily under-predicted in traditional polls. Furthermore, polls are not able to detect sudden changes of opinion due to particular events or circumstances, since the process of collecting opinions is time consuming. All these peculiarities together make it impossible for traditional polls to correctly predict the results of elections.
Monitoring social networks [13,14] represents an alternative for capturing people's opinions, since it overcomes the low-response-rate problem and is less susceptible to social-desirability biases [11,12]. Indeed, social media users continuously express their political preferences in online discussions without being exposed to direct questions. One of the most studied social networks is the microblogging platform Twitter [15][16][17][18]. Twitter-based works generally consist of three main steps: data collection, data processing and data analysis.
The collection of tweets is often based on the public API of Twitter. It is common practice to collect tweets by filtering according to specific queries, for example the names of the candidates in the case of elections [17]. Data processing includes all those techniques which aim to guarantee the credibility of the Twitter dataset, for example bot detection and spam removal [18]. Data analysis, the core of all these studies, can be simplified into four main approaches: volume analysis, sentiment analysis, network analysis and artificial intelligence (AI) [13].
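As an illustration of the collection step, the sketch below filters the public streaming API by candidate names using the tweepy library (v4 syntax). The credentials are placeholders, and the v1.1 streaming endpoint it relies on has since been retired by Twitter/X, so this is a sketch of the approach rather than the actual pipeline of any of the cited studies.

```python
import tweepy

class ElectionStream(tweepy.Stream):
    """Collect tweets matching candidate-name queries."""
    def on_status(self, status):
        # Keep the fields later processing stages need:
        # text, author, posting client (used for bot filtering), timestamp.
        record = {
            "text": status.text,
            "user_id": status.user.id_str,
            "source": status.source,
            "created_at": str(status.created_at),
        }
        print(record)  # in practice, append to persistent storage

# Placeholder credentials: real API keys are required to run this.
stream = ElectionStream("CONSUMER_KEY", "CONSUMER_SECRET",
                        "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
# Example query terms: the candidate names, as described above.
stream.filter(track=["Macri", "Fernandez", "Fernández"], languages=["es"])
```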
Scholars used the number of mentions of a party or a candidate in order to forecast the result of the 2009 German parliament election [19]. While their technique has attracted much criticism [20], their work inspired many other researchers. Gaurav et al. [21] proposed a model, based on the number of times the name of a candidate is mentioned in tweets prior to elections, to predict the winner of three presidential elections held in Latin America (Venezuela, Paraguay, Ecuador) from February to April 2013. The works of Lui et al. [22] and Bermingham [23] are also based on volumetric analyses. Ceron et al. [24] performed a sentiment analysis study on tweets to check the popularity of political candidates in the Italian parliamentary election of 2011 and in the French presidential election of 2012. Caldarelli et al. [25] used the derivative of the volume to forecast the results of Italian elections. Singh et al. [26] employed sentiment analysis to predict the victory of Trump in the election of 2016. The same author proposed a method [27] based on sentiment analysis and machine learning on historical data to predict the number of seats that contesting parties were likely to win in the Punjab election of 2017. Other works [28,29] used social network analyses in order to identify the position of a party in the online community by measuring its centrality; the most supported parties are in general those with a higher centrality. Bovet et al. [17,18] used machine learning to classify the political opinions expressed in tweets and to separate genuine users from automated 'bot' accounts [18,30]. By virtue of this, the great challenge of algorithms and AI is to discover and interpret real data from 'junk data' that could lead to accurate predictions of electoral or opinion trends. A crucial limitation of social media based methods is also the mis-representation bias. While social media solves the low response rates by 'surveying' millions of users with non-intrusive methods, this large number of respondents might not represent, again, the demographics of the voting population. Thus, the opinions of Twitter users may not be representative of the entire population [31], and re-sampling methods need to be used, importing along with them the same problems that plagued the traditional polls.
In this work we first investigate why traditional polls fail to predict elections. We focus on the results of the recent primary presidential election in Argentina in August 2019 and the subsequent presidential election in October 2019, which represent a classic example of a massive failure of trusted pollsters in predicting a polarized electorate; a failure which, in this case, also led to massive market collapses in the country, since investors largely bet on the pollsters' predictions.
This study is possible thanks to exclusive access to the raw data of longitudinal surveys conducted by one of the most reliable pollsters in Argentina, Elypsis [32]. The analyzed data include the original responses of subjects before the re-weighting for sampling bias was performed and the subsequent results obtained after re-weighting. More importantly, the data include a longitudinal study on the same 1,900 respondents before and after the election, which allows us to precisely study the social-desirability bias when the same voter changes the response after the result of the election is known. This represents a unique opportunity to discover why the traditional polls have failed, as pollsters do not normally share their raw data before re-weighting or sample-balancing [8,9], and few studies have been performed on the same respondents before and after an election. The raw data of this pollster firm have been obtained by exclusive arrangement with the pollster responsible for conducting the polls of Elypsis (co-author Luciano Cohan), who has since founded his own company (Seido).
We find that a poor demographic representation, combined with the inconsistency of respondents' opinions before and after the elections, are the main reasons for the polls' failure.
We find a large mis-representation of the sample in the surveys as compared with the voting population, which in Argentina is the general population since voting is obligatory and voter turnout is quite high, above 80%. Even after re-weighting, this large sample bias produces highly inaccurate results, since important segments of society are highly underrepresented in the polls. Beyond this sampling problem, the main problem we find is a clear tendency of the respondents to not tell the truth about their preference for a candidate (Fernández) who was controversial and a heavy underdog in all polls and the media. This social-desirability bias was the main culprit for the failure of the polls.
To overcome these problems, we propose an AI model to predict electoral trends using opinions extracted from social media like Twitter. Using the machine learning approach first developed in [17], we uncover political and electoral trends without directly asking people what they think, but by predicting and interpreting the enormous amount of data they produce in online social media [17,18,33,34]. Thus, this big-data analysis overcomes the low-response-rate problem. By re-weighting the Twitter population to the Census data, we match the statistics of our sample to the statistics of the real population (given by the Census Bureau) [31], thus minimizing the sampling bias of Twitter. Since social media users freely express their opinions and our methods are not interventionist, the data are, in principle, free of social-desirability bias. The real-time data processing which underlies our AI algorithm allows us to detect sudden changes of opinion, and therefore different loyalty classes towards each candidate.
We will show that a cumulative longitudinal analysis tracking users over time, performed on the loyalty classes to the candidates, considerably improves the previous results of [17], which were based on instantaneous predictions. Instantaneous predictions, like pollsters' predictions, are subject to high fluctuations which undermine the reliability of the prediction itself. Instead, here we show that taking into account the cumulative opinions of users over a long period of time produces a reliable predictor of people's opinion. These improvements allow us to obtain an accurate prediction in a difficult election, one which eluded all pollsters in Argentina.
Thus, we validate the algorithm on the primary and general elections in Argentina. Our results in this particular case show that AI can capture public opinion more precisely and more efficiently than traditional polls.

I. WHY ARE POLLSTERS FAILING TO PREDICT ELECTIONS?
The events leading up to the recent primary election in Argentina are a telling example of the failure of the polling industry [32,35,36]. On the primary election day of August 11, 2019 (called PASO in Spanish: Primarias, Abiertas, Simultáneas y Obligatorias; in English: Open, Simultaneous, and Obligatory Primaries), none of the pollsters in the country predicted the wide 16% margin of presidential candidate Alberto Fernández (AF) over the president Mauricio Macri (MM). [We clarify that primaries in Argentina are obligatory, happening for all political parties at the same time, and the two main parties presented only one candidate each, thus transforming the primaries into a de-facto presidential contest.] Figure 1 shows the comparison between the official results (in red), our prediction (in blue, Model 3 explained below) and the polling average, computed as the average of the top five most trusted pollsters in Argentina [32,36], i.e. Real Time Data, Management & Fit, Opinaia, Giacobbe and Elypsis (in green). Macri was clearly defeated by Fernández by more than 16%, a result captured by our predictions. While the average pollster predicted Fernández with a slight advantage in the primary, the estimated percentages of the two candidates were, in general, very close, on some occasions reaching a difference of just one percentage point [37].
Elypsis in particular predicted that Macri would win by one percentage point [38]. This virtual tie predicted by the pollsters was largely considered to be a win for the incumbent candidate Macri, since he was supposed to gain all the votes left by the third-party options in the subsequent presidential election and eventually win the election in a runoff.
Below we analyze the raw data of one of the most reliable polls, Elypsis (trusted especially by the president Macri and international investors [32,37,39,40]), which, like all the pollsters, failed to predict the large gap between the two candidates in the primary elections. To deal with the mis-representation problem, pollsters adjust their raw results to population benchmark distributions given by the Census Bureaus [1,4] by weighting the raw data (sample-balancing or raking [8,9]). The poll sample is weighted so it matches the population on a set of relevant demographic or political variables, for instance age, gender, location and other socio-economic variables, like education level or income. Studies of the effectiveness of various weighting schemes suggest they reduce some (30 to 60%) of the error introduced by the biased sample, see [1]. However, when the raw data distribution is as drastically under/over sampled as in the Elypsis case, a small error in the most representative groups propagates to produce inaccurate results.
As discussed above, mis-representation is not the only problem which traditional pollster methods face. Next, we analyze the longitudinal data taken on the same 1,900 respondents by Elypsis before and after the elections to investigate the social-desirability bias. We start with Fig. 3a, showing the Elypsis respondent distributions after PASO (notice that these respondents differ from those of the previous survey, which is why the age distribution changes with respect to the previous figure). By comparing Fig. 2a before PASO with Fig. 3a after PASO, we first notice a change in the voter distributions. Younger groups are better represented after the election when compared to Fig. 2a, although the data are still highly biased towards older generations. This implies that younger groups were, at least, more prone to answer the polls after the election than before.
Surprisingly, the female group with ages between 30 and 50 years voted for Fernández, as indicated by the post-PASO polls, while before the PASO they responded mainly in favor of Macri. The male group of the same age shows a similar behavior, even if less pronounced.
Let us note that, according to Fig. 2b, the groups of females/males between 30 and 50 years old are the most represented in the Census data and therefore may have a higher impact on the final result. These results can only be explained by admitting that voters did not tell the truth. This is further corroborated by this unique longitudinal panel, as seen in Table I, revealing that people lied and hid their true voting intentions from the pollsters before the elections.
More specifically, when comparing "Who are you going to vote for in the PASO" with "Who did you vote for in the PASO" (using the same sampling and post-stratification methodology as in the pre-PASO survey), we find that about 18% of the people did not disclose their true vote, and the hidden vote was biased.
• 91% of those who said "I will vote for Fernández" did so, but only 83% in the case of Macri, who lost 6% to AF.
• "Secondary candidates" voters were much more volatile, Only 56% of those who said that they were going to vote for (third candidate) Lavagna disclosed their true vote, and 54%, 53% and 59% in the case of other candidates Del Caño, Espert and Gomez Centurion respectively.
• Alberto Fernández got almost 19% of the votes of those who chose a secondary candidate in the pre-PASO poll, while Mauricio Macri got only 9%.
• Alberto Fernández received 46% of the votes of those who answered "Blank, Null or Unknown" before the PASO.
But who hid, or did not disclose, their real vote? We find no significant difference between men and women or between education levels, but we see a clear pattern in the age demographics.
33% of those between 16 and 30 years changed their vote with respect to their pre-PASO answer, versus only 13%, 10% and 14% of those between 31 and 50, 51 and 65, and more than 65 years old, respectively; see Table II.
What did those who did not disclose their vote think about the candidates? Were they "closeted Kirchnerists" (the party of AF and CFK) or did they bridge the gap between Macri and Fernández?
"Regular" images of Cristina Fernández, Macri and Alberto Fernández were much lower among those that did reveal their vote than among those who did not, see Table III. Those who hid their votes look more nonpolarized, with a "Regular" image -No positive nor negative -of 21% on average, vs 6%/10% of those who revealed the vote. CFK's negative image is higher than MM (48% vs 38%) in "non-revealers" and the opposite hold in the revealers (43% vs 50%). 35% of the "non-revealers" did not have (or hid) their opinion of Alberto Fernandez vs. 8% in the revealers.
This combined information sheds light on the PASO results and the polls' consensus miss. In the PASO, AF was able to capture votes from all the candidates and to seduce voters from within the gap: "moderate" voters who had a negative image of both CFK and MM. He succeeded in positioning himself as the "third candidate" bridging the gap, something that was not fully captured by the polls, or that was decided at the last minute. This feature is most striking among young people, who may both have more "volatile" opinions and be less prone to reveal them to traditional polls. This hidden-vote factor can by itself explain as much as a 10% difference between the "ex-ante" forecasts and the real results. Thus, the failure of standard poll methods may be related not only to a bias in the sampling but also to the extraction of "true" information from the surveyed people.
Understanding why people lie is not the topic of this work, even if, according to the literature, the reasons could be many and related to desirability bias. On one hand, participants may rush through the surveys to obtain their rewards and not respond thoughtfully [4]. On the other hand, social-desirability bias [11,12], i.e. the tendency of survey respondents to answer questions in a manner that will be viewed favorably by others [4,12], is another reason for people to hide their preference for controversial candidates like CFK, which leads to biased results.
In view of how the above issues of low response rate, mis-representation and social-desirability bias/lies (which in the case of Elypsis biased mostly the younger respondents) undermined the predictions for the Argentinian primary elections, we next search for a suitable replacement using sampling methods for the modern era of big-data science. In this scenario, a good candidate to substitute traditional polls is social media (Twitter in our study), which solves in one shot both the low response rate (millions of people express their political preferences on the microblogging platform) and the social-desirability biases. This is because social media users do not answer any questions, but freely express their ideas on a social media platform. However, one may argue that Twitter is generally biased towards young people, thus providing a biased sample. Thus, proper re-weighting of the data is needed, although the effects of re-weighting are expected to be less pronounced than in the polls of Elypsis. Below we introduce an AI model that builds on previous work in [17], combining machine learning, network theory and big-data analytic techniques, that is able to overcome the problems presented so far and that correctly predicted the outcome of the 2019 Argentina primary and general elections.

II. METHODOLOGY
The algorithm we propose improves upon previous work from [17] and consists of four phases (see Fig. 4): data collection, text and user processing, tweet classification with machine learning, and opinion modeling. While the first two phases are standard practice in the literature, tweet classification by means of ML models has been adopted only recently [17,18], given the impossibility of classifying millions and millions of data points by hand. Opinion modeling, the core of our election prediction model, is an attempt to capture people's opinion through time by means of a social network. To improve upon [17], we consider the cumulative opinion of people and define five prediction models based on different assumptions on the loyalty classes of users to candidates, homophily measures and re-weighting scenarios of the raw data. Below we explain each phase, highlighting the steps that make our full-fledged AI predictor a good candidate to substitute for traditional pollster methods.

Data collection. Figure 5a shows the daily volume of tweets collected (brown line), while Fig. 5b shows the daily number of users (green line). In blue we report the daily number of tweets/users which are classified, i.e. users who posted at least one classified tweet. Users are classified with machine learning as supporters of Macri (Fig. 5d, red line) if the majority of their daily tweets are classified in favor of Macri (Fig. 5c, red line), and as supporters of Fernández in the opposite case (blue lines in Figs. 5c and 5d). Hereafter we use FF to indicate the Fernández-Fernández formula and MP to refer to the Macri-Pichetto formula (the ticket of the outgoing president and his running mate).
The activity of tweets/users shows a peak on August 11, 2019, the day of the primary election. In the period from March to October, we collected a daily average of 282,811 tweets posted by a daily average of 84,062 unique users. We daily classified 75% of these tweets and ∼76% of the users (see Table VIII and Table VII in the Supplementary Information). In total, by the end of October we had collected around 110 million tweets broadcast by 6.3 million users. This large amount of collected tweets has no precedent, and is relevant considering that Argentina is one of the countries with the most tweets per capita in the world.
User and text processing. Below we explain the tasks that need to be applied to the raw data before any analysis is performed.
Bot detection. The identification of software that automatically injects information into the Twitter system is of fundamental importance to discern between "fake" and "genuine" users [30], the latter representing the real voters. Following [17], a good strategy is to extract the name of the Twitter client used to post each tweet from its source field and keep only tweets originating from an official Twitter client (see Fig. 6a).

Text standardization. Stop-word removal and word tokenization are common practice in data mining and natural language processing (NLP) [43,44]. We keep URLs as tokens, replacing each URL by the single token "URL", since URLs usually point to resources that determine the opinion of the tweet.
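A minimal sketch of these two processing steps follows. The list of official clients and the Spanish stop-word list are illustrative assumptions, not the exact lists used in this work.

```python
import re

# Tweets posted from clients outside this set are treated as potential
# bot/automated traffic, following the strategy described above [17].
OFFICIAL_CLIENTS = {
    "Twitter for iPhone", "Twitter for Android", "Twitter for iPad",
    "Twitter Web App", "Twitter Web Client",
}

# Truncated illustrative stop-word list.
SPANISH_STOPWORDS = {"de", "la", "que", "el", "en", "y", "a", "los",
                     "las", "un", "una", "por", "con", "no", "se"}

def is_genuine(tweet):
    """Keep only tweets posted from an official Twitter client."""
    return tweet["source"] in OFFICIAL_CLIENTS

def standardize(text):
    """Lowercase, map every URL to the single token 'URL',
    tokenize, and drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "URL", text)
    tokens = re.findall(r"[#@]?\w+", text)
    return [t for t in tokens if t not in SPANISH_STOPWORDS]

tweet = {"source": "Twitter for Android",
         "text": "Vamos con #MM2019 https://t.co/abc"}
if is_genuine(tweet):
    print(standardize(tweet["text"]))  # ['vamos', '#mm2019', 'URL']
```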
Tweet classification. To build the training set we analyze the hashtags on Twitter. Users continuously label their tweets with hashtags, short tags that directly convey the user's feeling/opinion towards a topic. We hand-labeled the top hashtags used in the dataset (see Table IX in the Supplementary Information). They are classified as pro M(acri), F(ernández) or T(hird party) candidate, depending on who they support (by Third party we refer to the supporters of Lavagna, Espert and other secondary candidates).
Hashtag co-occurrence network. In order to check the quality of the classified hashtags, we build the hashtag co-occurrence network H(V, E) and statistically validate its edges [17,45]. In the co-occurrence network the set of vertices $v \in V$ represents hashtags, and an edge $e_{ij}$ is drawn between $v_i$ and $v_j$ if they appear together in a tweet.
We test the statistical significance of each edge $e_{ij}$ by computing the probability $p_{ij}$ (the p-value of the null hypothesis) of observing the corresponding number of co-occurrences by chance, knowing only the numbers of occurrences $c_i$ and $c_j$ of the vertices $v_i$ and $v_j$ and the total number of tweets $N$. Fig. 7 shows the validated network, where we keep only those edges with a p-value $p < 10^{-7}$. The blue community contains the hashtags in favor of Fernández, the red community those in favor of Macri, and the green one (a very small group) those in favor of the Third candidate. A look at the types of hashtags reveals the first differences among the supporters. Those in favor of Cristina Kirchner are much more passionate than the followers of Macri. For example, Kirchner-type hashtags are #FuerzaCristina, #Nestorvuelva, #Nestorpudo, or they are very negative towards Macri, as #NuncamasMacri. On the other hand, Macri's group is smaller and less passionate, with hashtags like #Cambiemos or #MM2019 (see Fig. 8), while support for the third candidate has not gained traction and its electoral base on Twitter is very small.
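A compact implementation of this edge validation tests each co-occurring pair against a hypergeometric null model that preserves only $c_i$, $c_j$ and $N$. This null is our reading of the description above (the exact test of [17,45] may differ in detail), and the corpus is a toy example.

```python
from collections import Counter
from itertools import combinations
from scipy.stats import hypergeom

# Toy corpus: each element is the set of hashtags appearing in one tweet.
tweets = [{"#fuerzacristina", "#nuncamasmacri"},
          {"#fuerzacristina", "#nestorvuelva"},
          {"#mm2019", "#cambiemos"},
          {"#fuerzacristina", "#nuncamasmacri"},
          {"#mm2019"}] * 200          # repeat to get meaningful counts

N = len(tweets)                                        # total tweets
occ = Counter(h for t in tweets for h in t)            # occurrences c_i
cooc = Counter(frozenset(p)                            # co-occurrences
               for t in tweets for p in combinations(sorted(t), 2))

P_THRESHOLD = 1e-7
validated_edges = []
for pair, k in cooc.items():
    i, j = tuple(pair)
    # p-value: probability of observing >= k co-occurrences by chance,
    # knowing only c_i, c_j and N (hypergeometric survival function).
    p = hypergeom.sf(k - 1, N, occ[i], occ[j])
    if p < P_THRESHOLD:
        validated_edges.append((i, j, k, p))

for edge in validated_edges:
    print(edge)
```

Edges that survive the threshold form the validated network of Fig. 7, in which the pro-Fernández, pro-Macri and third-party hashtag communities emerge.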
In principle, counting the users and tweets according to the hashtags they use would predict the victory of Fernández over Macri. However, this conclusion would be based on only ∼10,000 users (those expressing their opinion through hashtags). In order to get the opinion of all the users, we train a machine learning model that classifies each tweet as AF, MM or Third party. (In what follows we also refer to the formulas FF for Fernández-Fernández and MP for Macri-Pichetto, the final formulas in the presidential contest.) We use the previous set of opinion-expressing hashtags to build a set of labeled tweets, which are used in turn to train a machine learning classifier. We use all the tweets (before August) which contain at least one of the classified hashtags to train the model. In the case of more than one hashtag in a tweet, we consider it only if all the hashtags are in favor of the same candidate. The use of hashtags that explicitly express an opinion in a tweet represents a "cost" in terms of self-exposure by Twitter users [46] and therefore allows one to select tweets that clearly state support for or opposition to the candidates. The training set consists of 228,133 tweets, i.e. 0.33% of the total amount of collected tweets and ∼90% of the hand-classified tweets (253,482 tweets). In order to find the best classifier we tested five different classification models: logistic regression (LR) with $L_2$ regularization, the support vector machine (SVM), the Naive Bayes method (NB), the Random Forest (RF) and the Decision Tree (DT). All these models are validated on the remaining 10% of the classified tweets (25,349). Table IV shows the results for the models. The logistic regression performs better than the other models, with an average group accuracy equal to 83%; recall and F1-score are also equal to 83%. The Support Vector Machine comes second, with an average accuracy of 81%, followed by the Naive Bayes with an average accuracy of 79.5%, the Random Forest and the Decision Tree.
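As a sketch of this training step, the snippet below fits a logistic regression with $L_2$ regularization on TF-IDF features and evaluates it on a held-out 10% split. The labeled tweets are invented stand-ins for the hashtag-labeled training set, and TF-IDF is an assumption: the feature representation is not specified in the text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical tweets labeled via their classified hashtags.
texts = ["fuerza cristina vamos", "nunca mas macri", "vamos cambiemos",
         "macri presidente cambiemos", "alberto fernandez presidente",
         "con lavagna consenso federal"] * 50
labels = ["FF", "FF", "MP", "MP", "FF", "TP"] * 50

# 90% training / 10% validation, as in the text.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.1, random_state=0)

# Logistic regression with L2 regularization, the best model in Table IV.
model = make_pipeline(TfidfVectorizer(),
                      LogisticRegression(penalty="l2", max_iter=1000))
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```

The same pipeline accepts any of the other four classifiers as a drop-in replacement for the final step, which is how the comparison in Table IV can be reproduced.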
We recall that the logistic regression assigns to each tweet a probability $p$ of belonging to a class. In our case this probability goes to one if the tweet supports Macri, and to zero if it supports Fernández. As shown in Fig. 9, the distribution of $p$ contains two peaks, one on the left and one on the right, divided by a plateau. This is an encouraging result, since it demonstrates the efficacy of the model in discerning between the two classes. We classify a tweet as in favor of Macri if $p \geq 0.66$ and in favor of Fernández if $p \leq 0.33$. Tweets with a value of $p$ in the plateau are instead left unclassified, meaning that the tweet does not contain sufficient information to be classified in either camp. According to this rule, on average we classify 211,229 genuine tweets and 1,617 "fake" tweets per day (see Table VIII and Table VII in the Supplementary Information).
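The decision rule is then a simple thresholding of the classifier's probability:

```python
def classify_tweet(p_macri):
    """Map the classifier probability to a discrete opinion:
    p >= 0.66 -> MP, p <= 0.33 -> FF, plateau -> unclassified."""
    if p_macri >= 0.66:
        return "MP"
    if p_macri <= 0.33:
        return "FF"
    return None  # not enough information in the tweet

print([classify_tweet(p) for p in (0.9, 0.5, 0.1)])  # ['MP', None, 'FF']
```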
Opinion modeling. We can infer users' opinions from the majority of the tweets they post. Let $n_{t,F}$ be the number of tweets posted by a given user at time $t$ in favor of Fernández and let $n_{t,M}$ be those supporting Macri. We define an instantaneous opinion over a window of length $w$ and a cumulative average opinion as follows. In the first case, a user is classified as a supporter of Fernández at a given day $t = d$ if $\sum_{t=d-w}^{d} n_{t,F} > \sum_{t=d-w}^{d} n_{t,M}$, and as a supporter of Macri if the opposite inequality holds; in the cumulative case the sums run from the first day of data collection up to $t = d$.

We start by investigating the instantaneous response of the users in a fixed window of time. Figure 10a shows the Twitter supporter dynamics over time obtained with a window average of $w = 14$ days. Users are classified as MP (in red), FF (in blue) or Others (in green). Figure 10b shows the supporter dynamics (thick lines) compared with the Elypsis prediction (thin dashed lines), without considering the undecided users in the normalization.
In the same plots we also report the official results of both the primary and the general elections.
The comparison between the two pictures reveals an approximate correlation between the Elypsis and the AI results for each candidate. However, the predictions for the candidates may sometimes differ: for example, right before the beginning of August, Elypsis gave MP as the favorite while the AI instantaneous prediction was in favor of FF. Overall, as with the pollsters' results, window-average analyses are representative of the instantaneous sentiment of the people. As we see from the figures, instantaneous opinions are affected by considerable fluctuations [17] which make the prediction unreliable. Indeed, a user can support a candidate until a few days before the election and then change her/his mind because of some particular event. This and other possibilities can be taken into account only by a model based on cumulative analyses that is still able to capture the degree of loyalty of people towards the candidates over time. Differently from traditional surveys, the real-time data processing that underlies our AI algorithm makes it possible to take this scenario into consideration. To understand how different re-weighting scenarios affect the results, below we introduce different loyalty classes of users towards the candidates and then define several models matching the criteria previously discussed. These loyalty classes can only be defined when we consider the cumulative opinion in a longitudinal study, and cannot be investigated by traditional polls.
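Both opinions can be computed per user from the daily counts. A minimal sketch, restricted to the two main candidates and with hypothetical activity:

```python
def classify_user(daily_counts, day, window=None):
    """Classify a user as 'FF' or 'MP' from their classified tweets.

    daily_counts maps day -> (n_F, n_M).  window=None gives the
    cumulative opinion from the first day; window=w gives the
    instantaneous opinion over the last w days (w = 14 in Fig. 10).
    """
    start = 0 if window is None else day - window + 1
    n_F = sum(c[0] for t, c in daily_counts.items() if start <= t <= day)
    n_M = sum(c[1] for t, c in daily_counts.items() if start <= t <= day)
    if n_F > n_M:
        return "FF"
    if n_M > n_F:
        return "MP"
    return "undecided"

# Hypothetical user: mostly pro-FF overall, pro-MP in the last days.
user = {1: (3, 0), 5: (2, 1), 10: (0, 2), 11: (0, 1)}
print(classify_user(user, day=11, window=2))  # 'MP'  (instantaneous)
print(classify_user(user, day=11))            # 'FF'  (cumulative)
```

The instability of the instantaneous classification for users with sparse activity is precisely the source of the fluctuations discussed above.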
Loyalty classes. We define 5 classes of loyalty for users. Here we consider the MP supporters, but the definitions below apply similarly to the other candidates.
• Ultra Loyal (UL): users who always tweet only for the same candidate, namely $\frac{\sum_{t=T_0}^{T} n_{M,t}}{\sum_{t=T_0}^{T} (n_{M,t} + n_{F,t} + n_{T,t})} = 1$, where $n_{x,t}$ indicates the number of tweets that a given user posts in favor of $x$, with $x \in \{$Macri, Fernández, Third party$\}$.
Differently from the ultra loyal users, who continuously post in favor of one candidate, the other classes take into consideration a possible change of opinion. In order to detect sudden twists of opinion we focus on the classification of the last $k$ tweets posted by the user. We define:

• Loyal MP → MP: a user who is MP, since the majority of their tweets are for MP, and who also supported MP in the last $k$ tweets; mathematically, $\sum_{n=N-k}^{N} n_{M,n} > \sum_{n=N-k}^{N} (n_{F,n} + n_{T,n})$, where $N$ is the total number of tweets posted by the user.
• Loyal MP → FF: users who are MP by the total cumulative count but have tweeted for FF in the last $k$ tweets; in formula, $\sum_{n=N-k}^{N} n_{F,n} > \sum_{n=N-k}^{N} (n_{M,n} + n_{T,n})$.

• Loyal MP → TP: users supporting the third party in the last $k$ tweets, i.e. $\sum_{n=N-k}^{N} n_{T,n} > \sum_{n=N-k}^{N} (n_{M,n} + n_{F,n})$.
• Loyal MP → Undecided: all other individuals classified as MP but not included above.
Let us recall that unclassified refers to all those users who do not have any classified tweet. Fig. 12 shows the evolution of the loyalty classes: the percentage of undecided users is around 8%, similar to the third-party percentage, while the other classes are close to 1 or 2%. In the next section we use these classes in order to define a better predictor.
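A sketch of the loyalty-class assignment following these definitions; the value of $k$ and the handling of ties are assumptions of this illustration.

```python
def loyalty_class(tweet_labels, k=10):
    """Assign a loyalty class from a user's sequence of classified
    tweets (each 'M', 'F' or 'T', oldest first)."""
    base = max("MFT", key=tweet_labels.count)      # cumulative majority
    if tweet_labels.count(base) == len(tweet_labels):
        return f"ultra loyal {base}"
    recent = tweet_labels[-k:]                     # last k tweets
    counts = {c: recent.count(c) for c in "MFT"}
    for c in "MFT":
        # Strict majority of the last k tweets supports candidate c.
        if counts[c] > sum(counts.values()) - counts[c]:
            return f"loyal {base} -> {c}"
    return f"loyal {base} -> undecided"

print(loyalty_class(list("MMMMMMM")))            # ultra loyal M
print(loyalty_class(list("MMMMMMMMFFF"), k=4))   # loyal M -> F
print(loyalty_class(list("MMMMMFFTT"), k=4))     # loyal M -> undecided
```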
AI Models. The loyalty classes introduced so far are one of the main differences from other Twitter-based studies: we use the machine learning classifier (here, the logistic regression) to define the loyalty of a user, not to make predictions directly. We do that by grouping supporters as follows:

• Fernández supporters: all those users who are ultra loyal FF, loyal FF→FF, loyal FF→MP, or loyal FF→Undecided.
• Macri supporters: all those users who are ultra loyal MP, loyal MP→MP, loyal MP→FF, or loyal MP→Undecided.
In each group we put those users whose support we are almost sure of because of their activity over time. However, as we saw in the previous section, the undecided may play a central role in a scenario where a few percentage points can flip the final result. Furthermore, understanding the unclassified users (i.e. those users who do not have any classified tweet) will also improve the final statistics. In order to take into account all the reasonable scenarios, we define three different models (starting from the classification into Fernández and Macri supporters above) and validate them against the final results of the election (see Table V). In the next section we compare the performances of these models on the Argentina election.
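Since Table V is not reproduced here, the sketch below only illustrates the general mechanism: how different treatments of the undecided class produce different model variants. The class counts and the two reallocation rules are assumptions for illustration, not the exact definitions of the models.

```python
def vote_shares(class_counts, undecided="drop"):
    """Turn supporter/undecided counts into predicted vote shares."""
    decided = {c: n for c, n in class_counts.items() if c != "undecided"}
    total_decided = sum(decided.values())
    n_undecided = class_counts.get("undecided", 0)
    if undecided == "drop":
        # Variant: undecided users excluded from the normalization.
        return {c: n / total_decided for c, n in decided.items()}
    if undecided == "equal":
        # Variant: undecided users split equally among the candidates.
        total = total_decided + n_undecided
        return {c: (n + n_undecided / len(decided)) / total
                for c, n in decided.items()}
    raise ValueError(undecided)

# Hypothetical supporter counts aggregated from the loyalty classes.
counts = {"FF": 47000, "MP": 32000, "TP": 8000, "undecided": 7000}
print(vote_shares(counts, "drop"))
print(vote_shares(counts, "equal"))
```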

III. AI-BASED FORECAST FOR THE ARGENTINIAN ELECTION
The models introduced so far allow us to define the daily supporters of each candidate according to their retweet activity. Indeed, supporters are defined not simply according to the classification of the majority of their retweets, but on the basis of the loyalty classes they belong to. As for the simple tweet classification, we can define for each model an instantaneous (window-average) and a cumulative (average) opinion. For this reason, here we directly focus on the cumulative prediction for the models introduced in the previous section. Table VI reports the cumulative predictions of each model compared with the official results. We define the mean absolute error of model $i$ as $\mathrm{MAE}_i = \frac{1}{3}\sum_{c}|x_c - y_c|$, where the sum runs over the three candidate groups $c$, $x_c$ is the share predicted by the model and $y_c$ is the official result.
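Computing the MAE is straightforward; the shares below are hypothetical numbers for illustration, not the values reported in Table VI.

```python
def mae(predicted, official):
    """Mean absolute error over the candidate groups."""
    return sum(abs(predicted[c] - official[c])
               for c in official) / len(official)

# Hypothetical vote shares (fractions of the total vote).
official  = {"FF": 0.49, "MP": 0.33, "TP": 0.18}
predicted = {"FF": 0.47, "MP": 0.34, "TP": 0.19}
print(f"MAE = {mae(predicted, official):.3f}")  # 0.013
```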
The results of our analyses show that AI applied to big data can be used to successfully understand people's opinions over time. The possibility of following the opinions of the same people through time, and therefore the chance of defining loyalty classes, is a fundamental step in order to make good predictions. AI both yields the percentage of supporters of each candidate and reveals what is behind these numbers, giving an idea of people's sentiments. This is of particular importance when one of the candidates is a controversial politician who can generate different feelings, leading to strong polarization and biased responses to pollsters, who are no longer trusted by the great majority of people.
We expect that in the future traditional surveys may be incrementally replaced by these new non-intrusive methods. AI is a thermometer that provides the key to predicting not only elections but also the great trends that develop at local and global levels. We have shown how AI allows us to synthesize the opinions of millions of people, including those silent majorities of hidden voters who would not be heard otherwise. We must not ignore that people are tired of answering surveys. AI can then deduce, predict, interpret and understand what people want to express.