Inferring the votes in a new political landscape. The case of the 2019 Spanish Presidential elections.

The avalanche of personal and social data circulating in Online Social Networks over the past 10 years has attracted a great deal of interest from scholars and practitioners who seek to analyse not only their value, but also their limits. Predicting election results using Twitter data is an example of how data can directly influence the political domain, and it also serves as an appealing research topic. This article aims to predict the results of the 2019 Spanish Presidential election and the voting share of each candidate, using Twitter. The method combines sentiment analysis and volume information and compares the performance of five Machine Learning algorithms. Several data scrutiny uncertainties arose that hindered the prediction of the outcome. Consequently, the method develops a political lexicon-based framework to measure the sentiments of online users. Indeed, an accurate understanding of the contextual content of the tweets posted was vital in this work. Our results correctly ranked the candidates and determined the winner by means of a better prediction of votes than official research institutes.


Introduction
For more than a decade now, with the emergence of Internet 2.0, users have been able to generate their own content and share it publicly with relative ease. This boom has witnessed a huge surge in the popularity of Online Social Networks (OSN), in particular the Twitter microblogging platform, which allows its users to share text messages of up to 140 characters with their family, friends and followers (Bode & Dalrymple, 2016). More than 500 million messages, commonly known as tweets, are posted every day (Dietrich & Juelich, 2018; Marozzo & Bessi, 2018). Given that the main reason behind these publications is to express the opinion of the users, they have proven to be of great interest for analysis (Buccoliero et al., 2020; Mehta et al., 2020; Silva et al., 2020).
This huge amount of content circulating in OSN has attracted the attention of marketing agencies that seek to capitalize on the behaviour of clients (current or future) to adjust their online campaigns, and even to use the social content (number of likes, retweets, etc.) to predict consumer behaviour in the real world (Rathor et al., 2018; Volkova et al., 2015). Recently, this type of analysis has jumped into the field of politics in an attempt to predict the results of election campaigns by monitoring the interaction between candidates and voters (Awais et al., 2019; Jaidka et al., 2019; Shmargad & Sanchez, 2020). The fact that more and more people are posting on OSN has led researchers and journalists to believe that a collective feeling is present in the OSN, which can be listened to, captured and analysed (Budiharto & Meiliana, 2018; Cury, 2019).
Advances in big data coupled with OSN have also been regarded by the scientific community as an opportunity to mine online social interactions in a more sophisticated way than ever before. Bello-Orgaz et al. (2016) coined the term Social Big Data for the first time as follows: 'Those processes and methods that are designed to provide sensitive and relevant knowledge to any user or company from social media data sources when data sources can be characterised by their different formats and contents, their very large size, and the online or streamed generation of information'.
In the particular case of Twitter, a few scholars have identified possible correlations between its activity (such as the frequency or volume of terms) and general election outcomes. Others are still sceptical about the rigour of these studies and the reproducibility of their results (Gayo-Avello, 2012). The debate is still incipient and, moreover, there is no consensus about the method to apply. Consequently, different approaches have been developed to predict voting outcomes. The first method is directly linked to the volumes and nature of the tweets, e.g. retweets, supporters, likes, etc. (Heredia et al., 2018). The second concerns sentiment analysis techniques, which compute positive or negative sentiment in the online posts towards candidates or political parties (Le et al., 2017). Finally, more innovative methods analyse the networks of social media users who actively support political parties or candidates (Shin et al., 2017).
OSN are new means of constructing the reputation of candidates and the strategies of social media campaigns. Nevertheless, in the context of using this information, the question of trustworthiness is paramount (Abu-Salih et al., 2019). Data credibility may vary depending on the reputation of the data producer or his/her intention to create noise, fake or bad news. The domain of the users also needs to be verified to ensure the tweets can be properly counted for the specific issue studied. This manuscript presents an approach to capture, analyse and predict the voting shares of the 2019 Spanish Presidential elections, taking into account an analysis of the sentiments expressed by the users of Twitter.
The novelty of our paper is to develop a specific Spanish political lexicon to classify online messages. To the best of our knowledge, no extended numerical comparison of different inferring methods has been conducted to predict votes using a trained political vocabulary (Gayo-Avello, 2013).
The experiments carried out use five machine learning techniques for the election candidates. Our best model predicts the winner of the election and the ranking of the remaining candidates.
Considering how the electoral cyberspace is articulated in Spain, several possible spaces could have been considered for the analysis of a political campaign: Facebook, Twitter, Instagram or Pinterest. In this manuscript, Twitter is chosen for two reasons. Firstly, it is an abundant medium for researchers to take the pulse of public opinion, leveraging the vast volume of content published by its users compared to other social media (200 billion tweets a year)[1]. Secondly, it facilitates easy data collection by providing API access to the Twitter world, unlike other leading OSN such as Facebook or LinkedIn, which do not.
In the race for the presidency, the political context in 2019 was historic. Indeed, five candidates were competing in one of the most important Spanish political elections, including the incumbent socialist President Sanchez, who was seeking re-election. The other candidates were Rivera for the centre-right party, Iglesias for the far-left party, Casado for the right party and Abascal for the far-right party (called "Vox"). The outstanding level of participation (75.7%) shows the importance of this election, due to two principal factors.
On the one hand, there was a context of deep political divide over the question of the Catalan independence referendum; on the other, the emergence of the far-right party, whose manifesto included controversial issues such as the end of legal abortion and a recentralisation of national governance around Madrid, the capital of Spain. The fact that the "Vox" party was competing in this type of election for the first time challenges the standard system of poll forecasting, which is usually based on the results obtained in the previous elections (Huberty, 2015). Voting participation was about 9 points higher than in the previous election and the highest turnout of the 21st century. The presidential election was held under a one-round voting system and took place on April 28th, 2019.
The key contributions of this paper are three-fold: to develop a specific framework to classify the sentiment of online users in the context of a political Presidential election; to implement and benchmark five machine learning methods to determine the optimal technique to infer the voting share; and to predict the election results and the ranking of the candidates. The remainder of this paper is organized as follows. Section II reviews the literature on the use of social media to infer elections, while Section III explains the methodology used to analyse sentiment from tweets. In Section IV, the different algorithms are described and analysed, followed by the Results section (Section V). Section VI concludes and proposes further lines of research to extend and improve this paper.

Related work
OSN are channels for politicians to construct their digital image and reputation (Grimaldi, 2019). They provide important platforms for the interaction and mobilization of voters before an election (Buccoliero et al., 2020). Amongst other online networks, the Twitter microblogging platform allows its users to express their thoughts, feelings and opinions in the form of text. Users can post on a wide variety of issues, and it is a space that enables data extraction on public opinion using linguistic programming models. Machine Learning (ML) techniques are then commonly used to determine whether a message posted by a user expresses a positive or negative sentiment (Mehta et al., 2020; Patel & Chhinkaniwala, 2019; Verma et al., 2019). These methods of analysis are known as Sentiment Analysis (SA). Le et al. (2017) claim that not all potential voters express their opinion on Twitter and that, consequently, the sample of data collected is not representative of the population. A real challenge resides in the capacity to manage and extract useful knowledge from online media data sources. They add that these networks easily allow manipulation by propagandists who only add noise to the real message. Indeed, Auletta et al. (2020) show how a heuristic is able to add links inside a social network with the objective of promoting a specific candidature. Their results reveal the extraordinary power of online media and the potential risks which emerge from the ability of political or marketing digital experts to control them. This risk represents a great threat, as their power can be used to influence undecided voters and consequently affect results.
To improve the accuracy of SA results, Bansal & Srivastava (2019) suggest widening the analysis to include emoji sentiments as part of the study. Other studies propose verifying the consistency of the messages conveyed by crossing information at post, user and domain level (Abu-Salih et al., 2019a).
The trustworthiness of social media data is indeed paramount. To this end, Abu-Salih et al. (2019) developed a credibility framework incorporating semantic analysis and temporal factors to measure the domain-based credibility of users. They further suggest a method to discover influential domain-based users, rank them and identify "super-users", i.e. influencers of a specific online campaign.
The inferring performance depends largely on the capacity to process and analyse users' information. It is also directly linked to the capability to discard users' messages that are classified as spam, i.e. unsolicited and repeated junk messages (Abu-Salih et al., 2020; Abu-Salih et al., 2018; Abu-Salih et al., 2019b). These tweets usually come from bots and have a malicious intention to create rumours and chaos (Shin et al., 2017). Shmargad & Sanchez (2020) compare the number of followers/enthusiasts of each candidate on Twitter for the 2014 U.S. Senate and the 2016 Congressional elections. They conclude that, despite spending less money overall on their campaigns, candidates with greater indirect influence on Twitter (i.e. more likes or retweets) obtain better results than their rivals and have smaller vote gaps with their respective national candidate (i.e. Hillary Clinton or Donald Trump). OSN have become a much cheaper alternative to traditional channels (TV, radio, etc.) for politicians with few resources to start a campaign and build a name for themselves.
Prior to the 2016 presidential election, several studies (Avnit, 2009; Cha & Gummadi, 2010) stated that the number of followers (which they termed indegree) was a sign that revealed the popularity of a user, but was not necessarily related to his/her ability to influence voters. Their study was summarised by the slogan of the 'fallacy of one million followers', which means that users follow others out of politeness, or simply become followers of those who also follow them, but hardly read the tweets posted by those they follow.
However, the 2016 race for the White House showed that traditional media and the Twitter platform are critical mechanisms that influence voters (Morris, 2018). Twitter, by conveying candidates' messages about their ideas, policies, and future actions, establishes fundamentals that resonate with the public as much as those sent by the traditional media. Consequently, there is growing interest in the creation of textual or linguistic classifiers to determine whether opinions expressed by individuals can be considered favourable or unfavourable indicators of voting behaviour (Awais et al., 2019; Buccoliero et al., 2020; Silva et al., 2020). The methodological approaches used so far have mainly been based on a lexicon of specific terms or on pre-tagged texts. Nevertheless, they have not achieved the level of accuracy obtained in press articles written by journalists on topics such as business or consumer goods (Rathor et al., 2018). To this end, a specific lexicon for Spanish elections is developed in this paper.

Grimaldi (2019) demonstrates that the number of information items, conversations and messages circulating on Twitter in the 2019 Spanish Presidential elections grew as the election campaign period progressed, showing peaks in line with relevant events occurring during, or affecting, the campaign. The volume of these posts fell after the election voting day. The SA technique needs to adapt to this evolution, incorporating for instance new hashtags which were not present at the beginning of the campaign (Grimaldi, 2019). The fact that campaigns are primarily manifested through the Twitter medium before standard general channels like TV or radio raises the need to continue investigating the best approach to analyse the data available in these networks and reveal the messages conveyed. This article deals with this controversial line of research.
It aims at answering, first, whether it is possible to infer the voting share of each candidate in an election by using Twitter and, secondly, whether this medium is more reliable than standard public opinion polls in a new political landscape (Shmargad & Sanchez, 2020).
[1] http://www.dsayce.com/social-media/10-billions-tweets Last visit 17/4/2020

Methodology
This section presents the different steps of the SA methodology. In particular, in the last sub-section, the methodology used to create a new political lexicon is detailed. The description of the six ML techniques used for SA and to infer the voting intention for each candidate is given in the next section.
The SA method is related to Natural Language Processing (NLP), computational linguistics, and text mining (Manning et al., 2002). In sentiment classification, which is also presented in this section, Logistic Regression (LR) is used to determine whether a tweet expresses a positive or negative sentiment.
The process of data classification consists of two phases: the training phase and the prediction phase, as shown in Figure 1.

1-Data collection
With the Twitter Streaming API[1], all the posts containing one or more hashtags related to the presidential election are collected every day, as follows:

Election general information: #28A; #28Abril; #Vota; #Vota28A; #EleccionesGenerales; #EspanaVaciada

Political party: #ValorSeguro; #Casado; #LaEspañaQueQuieres; #PedroSanchez; #LaHistoriaLaEscribesTu; #PabloIglesias; #VOX; #SantiagoAbascal; #VamosCiudadanos; #AlbertRivera

The tweets are collected in real time, and their popularity statistics are gathered 24 hours after publication. The aim is to get statistics related to the popularity of the tweet, such as (a) the number of shares, which shows how many users shared the tweet with their followers; (b) the number of likes, which shows how many users liked the tweet; and (c) the number of retweets, which shows how many times the tweet was re-broadcast on the medium.
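As an illustration, the hashtag-based filtering described above can be sketched as follows. This is a minimal Python sketch, not the authors' actual collector: the stream is simulated with a hardcoded list, only a subset of the tracked hashtags is included, and no real Twitter API call is made.

```python
# Minimal sketch of the hashtag filter: the stream is simulated here rather
# than read from the Twitter Streaming API, and only some tracked tags appear.
ELECTION_TAGS = {"#28a", "#28abril", "#vota", "#vota28a", "#espanavaciada"}
PARTY_TAGS = {"#valorseguro", "#casado", "#pedrosanchez", "#pabloiglesias",
              "#vox", "#santiagoabascal", "#vamosciudadanos", "#albertrivera"}

def matches_campaign(text):
    """True if the post contains at least one tracked hashtag (case-insensitive)."""
    tokens = {t.lower().rstrip(".,;!?") for t in text.split()}
    return not tokens.isdisjoint(ELECTION_TAGS | PARTY_TAGS)

stream = [
    "Gran mitin hoy #Vota28A",            # tracked: election tag
    "El tiempo en Madrid es estupendo",   # not tracked
    "#VOX presenta su programa",          # tracked: party tag
]
collected = [t for t in stream if matches_campaign(t)]
```

In a real collector the same predicate would be applied to each incoming post from the streaming connection.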
Between April 12th and April 28th, 1,170,000 tweets were posted, and Table 1 shows the amount of data collected in terms of users and tweets per candidate. When working with the Twitter API, it was necessary to take certain limitations into account: the rate limit of the interface only allows 450 requests every quarter of an hour.
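The 450-requests-per-15-minutes constraint translates into an average budget of one request every two seconds. A small sketch (with hypothetical helper names, not taken from the paper) of how a collector could pace itself:

```python
# Pacing a collector under Twitter's limit of 450 requests per 15-minute window.
WINDOW_SECONDS = 15 * 60
MAX_REQUESTS = 450

def min_interval():
    """Minimum average spacing (in seconds) between requests to respect the quota."""
    return WINDOW_SECONDS / MAX_REQUESTS

def requests_allowed(elapsed_seconds):
    """Requests the quota permits after `elapsed_seconds` of the current window."""
    return min(MAX_REQUESTS, int(elapsed_seconds / min_interval()))
```

Sleeping `min_interval()` seconds between calls keeps the collector safely inside the quota.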

2-Data preprocessing
The dataset resulting from data collection is a collection of tweets. In order to apply SA techniques, the data must go through a four-step process, known as a pipeline, in which the input of each step is the output of the previous one.
The first step is text cleaning. A tweet is a complex object with many properties, but what matters is the message that the user writes in the form of text. Thus, URLs and emails are deleted, and necessary functional words of the Twitter lexicon such as \RT, \HO, \HT, as well as unknown characters, are replaced by their closest ASCII variant, using the R Unicode library. Each document is then transformed into a list of words delimited by blanks or punctuation, called tokens. During this step, the information about the author of the tweet and the date of publication is lost, but the reference to the object is kept in order to link the result with the original tweet.
The second step is lexical normalization. It removes the functional words which do not have a clear referential semantics, such as articles, pronouns and prepositions ("the", "an", "and", "of", etc.), as well as numbers and dates. These are usually called stop words. Uppercase letters are transformed to lowercase, which allows the algorithm to consider words such as "Big" and "big" identical and, consequently, as two counts of the same word. It also collapses any repeated sequences of letters in words (such as "greaaaat", "mooorning", etc.), reverting them back to their original form.
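The two preprocessing steps above can be sketched as follows. The paper used R; this is an illustrative Python equivalent with a toy Spanish stop-word list, not the authors' pipeline.

```python
import re
import unicodedata

# Toy stop-word list for illustration only; a real pipeline would use a full one.
SPANISH_STOPWORDS = {"el", "la", "los", "las", "un", "una", "de", "y", "que", "en"}

def clean(text):
    """Step 1, text cleaning: drop URLs and emails, strip accents to ASCII."""
    text = re.sub(r"https?://\S+|\S+@\S+", "", text)
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

def normalize(text):
    """Step 2, lexical normalization: lowercase, tokenize, drop stop words and
    digits, collapse letters repeated 3+ times (e.g. 'greaaaat' -> 'great')."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [re.sub(r"(.)\1{2,}", r"\1", t) for t in tokens]
    return [t for t in tokens if t not in SPANISH_STOPWORDS]
```

For example, `normalize(clean("Holaaa amigos!! vota https://t.es/x el 28 de abril"))` yields the token list `["hola", "amigos", "vota", "abril"]`.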

3-Feature extraction
The next step is feature extraction, applying the Term Frequency (TF) and Inverse Document Frequency (IDF) R package. The Vector Space Model (VSM) is one of the most commonly used methods for text retrieval and for representing a document as a vector of terms (Letsche & Berry, 1997; McCarey et al., 2006). It consists of transforming each token sequence (i.e. tweet text) into a feature vector in a Euclidean space: t = (t1, t2, …, ts) ∈ R^s, where s is the size of the vocabulary, which is built by considering the unigrams in the collection (i.e. the training set) (Manning & Raghavan, 2009). Stacking these vectors builds a sparse matrix, also called the Corpus (C).
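A minimal TF-IDF construction over tokenized tweets might look like this (illustrative Python; the paper used an R package). Raw term count is used as TF and log(N/df) as IDF, one common convention among several.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build one TF-IDF vector per document over the corpus vocabulary (unigrams).
    TF = term count in the document; IDF = log(N / document frequency)."""
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    rows = []
    for d in docs:
        tf = Counter(d)
        rows.append([tf[w] * math.log(n / df[w]) for w in vocab])
    return vocab, rows

# Three toy tokenized tweets; rows of X are sparse in practice.
docs = [["vota", "sanchez"], ["vota", "casado"], ["abascal"]]
vocab, X = tfidf_matrix(docs)
```

Terms appearing in every document get IDF log(1) = 0, so ubiquitous words are automatically down-weighted.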

4-New polarity political lexicon
An SA system is highly sensitive to the domain from which the data used for training are extracted, and it can obtain poor results if the training dataset is not political (Abu-Salih et al., 2018). Due to the lack of sentiment lexicons for non-English languages, the creation of a new polarity lexicon for the Spanish political event was decided, drawing on two different sources. The starting point is the dataset created by Molina-González et al. (2013) for Spanish tweets. Then, a random sample of 1,000 tweets is chosen as the training set. In order to label a tweet as either positive, negative or neutral, eight volunteers in groups of two (i.e. 4 groups) are asked to assign a label to each tweet according to the sentiment they feel it conveys. The evaluators have access to the full tweet, which is tagged according to its global polarity, indicating whether the text expresses a positive, negative or neutral sentiment. To prevent ambiguity, testers have to assess the tone of some tweets after screening the other tweets of the account. Moreover, if there is no agreement inside a group, a third independent volunteer from another group is asked to help decide. Volunteers agreed on 80% of tweets, which is in line with (Villena et al., 2015), and determined which additional words had to be included in the lexicon. In total, 187 words are added based on this manual testing and inserted into the final vocabulary.

Machine Learning Techniques
This section presents a detailed description of the ML techniques used to analyse the collected data, search for patterns and create estimated voting predictions. In particular, the settings used to train the models are explained in detail in the hyper-parameter and evaluation criteria subsections.
The choice of these algorithms is motivated by a review of the existing literature that compares their features in terms of regression method (linear versus non-linear), complexity, performance, accuracy, memory and processing requirements, along with the time required to train the classifier (Caruana et al., 2006; Mehta et al., 2020).

Logistic Regression
The estimation method of a linear regression consists of determining the coefficients which minimize the sum of squared deviations of the observed values of Y from the values predicted by the model.
Two principles distinguish Logistic Regression (LR) from linear regression. The first is that the outcome variable in LR is binary or dichotomous. In any regression problem, the key quantity is the conditional mean, expressed as E(Y|x), where Y denotes the outcome variable and x denotes a specific value of the independent variable. In the linear regression model, an observation of the outcome variable may be expressed as y = E(Y|x) + ε. The quantity ε is called the error and expresses a deviation from the conditional mean. The most common assumption for a linear regression is that ε follows a normal distribution with mean zero and a variance that is constant across levels of the independent variable. In logistic regression, by contrast, the binomial distribution is the statistical distribution of the errors on which the analysis is based (Hosmer et al., 2013). This is the second difference.
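The contrast can be made concrete: in LR the conditional mean E(Y|x) is forced into (0, 1) through the logistic (sigmoid) function, unlike the unbounded linear predictor. A small sketch with hypothetical coefficients (not fitted on the paper's data):

```python
import math

def sigmoid(z):
    """Logistic function: maps the linear predictor to E(Y|x) in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, beta0, beta1):
    """Conditional mean E(Y|x) for a one-predictor logistic regression."""
    return sigmoid(beta0 + beta1 * x)
```

Whatever the value of x, the prediction stays a valid probability, which is what makes LR suitable for the binary positive/negative outcome.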
The SA system classifies a given vector as either positive, negative or neutral by adding the number of positive words and subtracting the number of negative words. It assigns only one class label to the tweet. This type of analysis makes sense if each tweet expresses a single opinion. It may seem like a limitation, but in practice it works well, since users usually focus on a single topic in each tweet. In other contexts, or if this message-length limitation did not exist, it would certainly be necessary to consider more complex systems that allow a more granular analysis (Bode & Dalrymple, 2016).
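The classification rule just described, labelling by the signed count of lexicon words, can be sketched as follows (with invented lexicon entries, not the 187-word lexicon built for the paper):

```python
# Invented lexicon entries for illustration only.
POSITIVE = {"bueno", "excelente", "ganar"}
NEGATIVE = {"malo", "corrupto", "perder"}

def polarity(tokens):
    """Label a tokenized tweet by the signed count of lexicon words:
    positive words count +1, negative words count -1, ties are neutral."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A tweet containing two positive words and one negative word therefore ends up labelled positive, and a tweet with no lexicon words at all is neutral.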

Linear Discriminant Analysis
Linear Discriminant Analysis (LDA), also known as Fisher Discriminant Analysis, is very similar to Logistic Regression (Welling, 2007). It is a method used to find a linear combination of features that separates different classes of objects. LDA searches for those vectors in the underlying space that best discriminate among classes (rather than those that best describe the data). LDA is optimal when the independent variables are normally distributed, which is indeed an important constraint of the model. Mathematically, over all the samples of all classes, two measures are defined: 1) the within-class scatter matrix (see Equation 1 in the Supplementary Files); and 2) the between-class scatter matrix (see Equation 2 in the Supplementary Files). LDA can then be formulated as an optimization problem: find a set of linear combinations (with coefficients w) that maximizes the ratio of the between-class scatter to the within-class scatter. One way to do this is given by a generalized eigenvalue problem (see Equation 3 in the Supplementary Files).
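For readers without access to the supplementary files, the standard Fisher formulation that Equations 1-3 presumably follow is, for classes $C_i$ ($i = 1, \dots, c$) with $n_i$ samples each, class means $\mu_i$ and global mean $\mu$:

```latex
% Within-class scatter matrix (Equation 1)
S_W = \sum_{i=1}^{c} \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^{T}

% Between-class scatter matrix (Equation 2)
S_B = \sum_{i=1}^{c} n_i \, (\mu_i - \mu)(\mu_i - \mu)^{T}

% Generalized eigenvalue problem maximizing the scatter ratio (Equation 3)
S_B \, w = \lambda \, S_W \, w,
\qquad
w^{*} = \arg\max_{w} \frac{w^{T} S_B \, w}{w^{T} S_W \, w}
```

The eigenvectors $w$ associated with the largest eigenvalues $\lambda$ give the discriminant directions.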

Classification and Regression Trees
Classification and Regression Trees (CART) is an umbrella term used to refer to decision or regression trees (Breiman et al., 1984). When the target variable takes a discrete set of values, decision trees are called classification trees. A classification tree is a flow-chart structure where each fork represents a conjunction of features that leads to class labels (also called the leaves). Since each branch can be seen as a split on a predictor variable, each end node contains a prediction for the outcome variable. When the target variable takes continuous values (i.e. real numbers, such as a voting prediction), decision trees are called regression trees. The decision tree continues to expand, with additional nodes being repeatedly inserted, until the stopping criterion is met; in other words, it stops when the predefined number of iterations is reached or a reasonable prediction is achieved.
k-Nearest Neighbours

k-Nearest Neighbour (kNN) classification determines local decision boundaries. The technique assigns any object to the majority class of its k closest neighbours. The main constraint of this method is that the results may change if k changes. kNN is also a 'lazy learner', since all computations are deferred until function evaluation. In this study, kNN regression is used to estimate continuous variables. This algorithm uses an average of the k nearest neighbours, weighted by the inverse of their distance.
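An inverse-distance-weighted kNN regression, as described above, can be sketched in a few lines (illustrative Python on 1-D features with toy data; the study used R):

```python
def knn_regress(train, query, k=3):
    """Inverse-distance-weighted kNN regression on (feature, target) pairs:
    the estimate is the weighted average of the k nearest targets."""
    by_dist = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    num = den = 0.0
    for x, y in by_dist:
        d = abs(x - query)
        if d == 0:
            return y  # exact feature match: return its target directly
        w = 1.0 / d
        num += w * y
        den += w
    return num / den

# Toy training pairs (feature value, target value).
train = [(1.0, 10.0), (2.0, 20.0), (10.0, 100.0)]
```

A query halfway between two equally distant neighbours simply averages their targets, while closer neighbours dominate otherwise.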

Support Vector Machines
Vapnik (1995) first introduced Support Vector Machines (SVM) to solve two-class recognition problems. The general idea is to find the decision surface that maximizes the margin between data points, i.e. between classes. A good separation is achieved by the hyperplane that is furthest from the nearest training data point of any class (the so-called functional margin). In general, the larger the margin, the lower the generalization error of the classifier. In the case of originally non-linearly-separable data points, the original data vectors are mapped to a higher-dimensional space to achieve linear separability.

Random Forest
The Random Forest (RF) is an algorithm that randomly creates and merges multiple decision trees into one "forest". The goal is not to rely on a single learning model, but rather on a collection of decision models, to improve accuracy and reduce variance. The primary difference between this approach and the standard decision tree algorithms is that the root-node splits are generated randomly. RF corrects the recurrent problem of decision trees overfitting the training set, i.e. having low bias but very high variance. RF generates many predictors, each using regression or classification trees. It reduces instability by averaging multiple decision trees trained on different parts of the same dataset. It then produces a final prediction which usually improves on the prediction performance of a standard decision tree.

Hyper-parameter settings
According to Jungherr (2016), model performance tends to fluctuate depending on the methods employed, even if the impact of the parameters is hard to know. Heredia et al. (2018) add that models based on specific counts of tweets that only mention a single party outperform others that combine SA with volumetric information. Therefore, a correct setting of the hyper-parameters helps to limit negative effects like over- or under-fitting. In an SVM, kNN or RF method respectively, a higher C, k, or max-depth may cause the model to misclassify less, but is much more likely to overfit. Appendix 1 presents the selected hyper-parameter settings. The experiments for this study were carried out using the fit.models library in the R software. The evaluation criteria for the ML techniques are presented in the next sub-section. Since the errors are squared before they are averaged, the main difference between the RMSE and the MAE is that the RMSE gives a relatively high weight to large errors. This means it is more useful when large errors are particularly undesirable (Wang & Lu, 2018).
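The difference between the two error measures can be illustrated with hypothetical vote-share predictions (the numbers below are invented for illustration, not the paper's results):

```python
import math

def mae(actual, pred):
    """Mean Absolute Error: average magnitude of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def rmse(actual, pred):
    """Root Mean Squared Error: squaring inflates the weight of large errors."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))

actual = [28.0, 17.0, 16.0, 14.0, 10.0]   # hypothetical true vote shares (%)
steady = [27.0, 18.0, 15.0, 15.0, 9.0]    # five errors of 1 point each
spiky  = [28.0, 17.0, 16.0, 14.0, 5.0]    # one error of 5 points
```

Both prediction sets have the same MAE (1.0), but the spiky one has a much larger RMSE, which is exactly the "high weight to large errors" property discussed above.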
Jaidka et al. (2019) state that such a consideration is important when minor political parties end up becoming more active on social media platforms than leading parties. In this study, large and small parties are equally active (Table 1), so the five models are benchmarked using MAE results.
Precision, Recall, and F-score are used to measure classification performance. The formulas applied are, respectively, Equations 6 and 7 (Manning et al., 2002) (see Equations 6 and 7 in the Supplementary Files). Precision is the fraction of retrieved instances that are relevant. Recall is the fraction of relevant instances that are retrieved. In binary classification, recall is also known as sensitivity, i.e. the probability that a relevant document is retrieved by the query. The F-score balances precision and recall, measuring the accuracy of the experiments. It is the weighted harmonic mean of precision and recall (see Equation 8 in the Supplementary Files).
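As a concrete reference, the three measures computed from true-positive (tp), false-positive (fp) and false-negative (fn) counts:

```python
def precision(tp, fp):
    """Fraction of retrieved instances that are relevant."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of relevant instances that are retrieved (sensitivity)."""
    return tp / (tp + fn)

def f_score(tp, fp, fn):
    """Balanced F1: the harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)
```

With 8 true positives, 2 false positives and 8 false negatives, precision is 0.8, recall is 0.5, and the F1-score lands between the two at 8/13 ≈ 0.615, pulled toward the lower of the pair as a harmonic mean always is.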

Results
This section presents the main results of the study. The performance of the sentiment classifier and of the five inferring models is shown, followed by a comparison between them. Finally, a prediction of the votes using the best ML technique is presented in the last sub-section.

Linear Discriminant Analysis classifier
For the lexicon-based sentiment classifier, the data are split and the cross-validation process is repeated for k iterations. k is optimized by increasing it gradually while seeking the best classifier accuracy. The optimization process (k=3) is presented in Graph 1.
Graph 1: Optimizing k for the sentiment analysis classifier

The system achieves a macro-averaged F1-score of 92.65% and an accuracy of 92.70% on the test set. These similar scores also confirm that there is neither over- nor under-fitting of the model.

The five Machine Learning inferring models
The independent variables used by the inference method and computed on a daily basis are: 1) Day tweet volume: the sum of tweets on that day mentioning candidates.
2) Day unique tweet volume: the sum of tweets on that day that mention only one candidate.
3) Day Twitter user number: the sum of different Twitter users with at least one tweet mentioning a candidate.
4) Day unique Twitter user number: the sum of different Twitter accounts whose posts only mention a candidate.
5) Positive or negative tweet volume: the sum of positive or negative posts that mention a candidate.
6) Positive or negative-based Twitter user number: the sum of different Twitter users with at least one positive or negative post mentioning a candidate.

7) Sentiment score per tweet.
The features are normalized by applying a moving-average smoothing technique over a window of the past seven days. The polling data are considered the dependent variable of our model.
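The seven-day moving-average smoothing can be sketched as follows (illustrative Python; using shorter windows for the first days of the series is an assumption, since the paper does not state how the start of the campaign is handled):

```python
def smooth(series, window=7):
    """Moving-average smoothing over the past `window` observations,
    falling back to whatever history exists at the start of the series."""
    out = []
    for i in range(len(series)):
        lo = max(0, i - window + 1)
        out.append(sum(series[lo:i + 1]) / (i + 1 - lo))
    return out
```

Applied to a daily feature such as tweet volume, this damps one-day spikes (e.g. around debates) while preserving the campaign-long trend.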

Comparison of the results
The voting-intention inference is approached as a multiple regression analysis. In this way, several regression models are built to infer the vote for each candidate in the electoral campaign, using the aggregated polling as the output variable of the models. In total, five models are built, i.e. one model per candidate. The five different algorithms, LDA, CART, kNN, SVM and RF, are then evaluated.
The performance results of the five algorithms are analysed for one candidate: Pedro Sanchez (Graph 2).
The complete results by model and candidate are shown in Table 2. The empirical results demonstrate that kNN, with the lowest MAE, produces the best predictions and outperforms the rest of the algorithms on the five models. LDA, with the highest MAE, has the worst performance.
Graph 2: Performance of the five models applied to the Pedro Sanchez candidacy

Inference results from the kNN best model

Using the predict function in the R package, the 2019 Presidential voting shares obtained with the best model (kNN) are shown in Table 3. The first column of Table 3 shows the official results of the 2019 Spanish Presidential election, and the third one the voting-intention inference based on the best kNN method. The second column presents the last polling results, taken five days before the election in compliance with the Spanish law regulating presidential election campaigns. The results are discussed in the next section.

Discussion
The significant number of undecided voters at the beginning of the electoral campaign and the political conflict between regions (especially between Catalonia and Madrid) presented us with an interesting challenge in predicting the outcome of the Spanish presidential election. The results of this study show that the method a) correctly ranks the candidates and determines the winner, b) gives a better prediction of the winner's (Sanchez's) voting share than the previous and definitive polls, and c) provides a prediction equivalent to the last poll for the candidate Abascal, the "surprise" of the scrutiny, since his party "Vox" was competing for the first time in these elections.

Algorithm comparison with the state of the art
Presidential campaign prediction using Twitter data is a much-discussed research topic and the number of studies is still growing (Awais et al., 2019; Bansal & Srivastava, 2019; Heredia et al., 2018; Jaidka et al., 2019; McGregor et al., 2017; Verma et al., 2019). There is, however, no real benchmark of performance (Gayo-Avello, 2013) to identify the best method for determining the winner of an election, or the voting share, with minimum error. Nevertheless, works from other research domains such as medicine, engineering or finance highlight calibration as a remarkably effective method for obtaining better performance (Caruana & Niculescu-Mizil, 2006). They also conclude that LR performs better on smaller data sets and worse on larger ones (Perlich & Simonoff, 2003). The inference results of this study corroborate these findings. Indeed, LDA, with the poorest MAE, has the lowest required training time of all the algorithms, and the data set, comprising 17 days of campaign and 1,170,000 tweets, can be considered large.

Inference results comparison with the state of the art
When predicting vote rates, Gayo-Avello (2013) confirms that MAEs are normally used to measure accuracy, while noting that it is difficult to compare MAEs between campaigns. That said, an MAE between 1% and 2% can be considered the mark of a good prediction model. Consequently, the MAE of the kNN algorithm positions this paper in the upper quadrant of Twitter prediction performances (Gayo-Avello, 2013).
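As a concrete illustration of the MAE metric against that 1%-2% baseline, the computation is straightforward; the vote shares below are purely hypothetical, not the paper's figures:

```python
def mae(predicted, actual):
    """Mean absolute error between predicted and official vote shares (%)."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical example: predicted vs official shares for five candidates.
predicted = [29.0, 16.0, 16.5, 14.0, 10.5]
official  = [28.7, 16.7, 15.9, 14.3, 10.3]
print(mae(predicted, official))  # well under the 2% baseline
```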

Undecided voters
The results of this paper are less accurate than those of the research institutes in estimating the voting share of three candidates: Casado, Rivera and Iglesias, who are positioned in the centre of the political spectrum. In the following lines, we suggest additional ways to improve the present model. Future work should analyse whether a framework capturing undecided voters could improve the inference results by discarding inconsistent tweets injected into the prediction algorithms. Nevertheless, the gain could be limited, since the collected tweets contain hashtags that were carefully selected to be strictly related to the Spanish presidential election domain analysed (e.g. #28A, #28Abril, #EleccionesGenerales).

Spam detection
Many presidential candidates hire companies to create robot accounts to follow them. These anomalous users, or spammers, post tweets supporting them, with the objective of creating positive or negative rumours. Such messages and Twitter accounts do not represent general public opinion and have to be discarded. In this manuscript, the detection of spam or irrelevant posts is realised by implementing a supervised ML method, an RF algorithm, using the following features: previously identified spammer users, age of the user account, retweet rate and number of favourites. More recently, Abu-Salih et al. (2019b) have included a time-based semantic analysis at domain and user levels; their technique provides a way to rank tweeters' credibility. Consequently, further work could analyse whether this novel approach improves the results of the cleansing phase in detecting untrustworthy users, i.e. spammers.
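A minimal sketch of such an RF spam classifier follows, using the features named above (account age, retweet rate, number of favourites). The accounts, labels and decision thresholds are entirely synthetic and for illustration only:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 200
# Toy features mirroring those in the text: account age (days),
# retweet rate and number of favourites; label 1 = known spammer.
account_age  = rng.integers(1, 2000, n)
retweet_rate = rng.random(n)
favourites   = rng.integers(0, 5000, n)
X = np.column_stack([account_age, retweet_rate, favourites])
# Invented labelling rule: young accounts with high retweet rates
# are marked as spammers, standing in for manually identified ones.
y = ((account_age < 200) & (retweet_rate > 0.6)).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
flagged = clf.predict(X)
print(f"{flagged.sum()} accounts flagged as spam out of {n}")
```

Tweets from flagged accounts would then be dropped before the feature-building step.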

Computational time
In the ML techniques, the data are split in a stratified way and the cross-validation process is repeated over k iterations. This method is more precise but has the disadvantage of slowing down ML computation times. Ahmadvand et al. (2019) suggest sampling-based computing techniques as an alternative, adding that this usually yields the desired quality of result when resources such as time, cost or energy are limited. Another option is progressive computation, i.e. progressively pruning a large data space while searching for results (Ahmadvand & Goudarzi, 2017). The processing times here were affordable, but these methods could be used if scholars decide to extend the data-collection period or the number of hashtags and obtain a larger dataset.
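The trade-off between repeated cross-validation and sampling can be illustrated as follows. The data are synthetic, and plain `RepeatedKFold` stands in for the stratified splitting mentioned above, since stratification strictly applies to discrete targets:

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

rng = np.random.default_rng(2)
X = rng.random((300, 7))
y = rng.random(300)

model = RandomForestRegressor(n_estimators=20, random_state=0)

# Repeated k-fold: more precise error estimates, but k * repeats fits.
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
start = time.perf_counter()
cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
full_time = time.perf_counter() - start

# Sampling alternative: validate on a random subsample when resources
# (time, cost, energy) are limited.
idx = rng.choice(len(X), size=100, replace=False)
start = time.perf_counter()
cross_val_score(model, X[idx], y[idx], cv=5,
                scoring="neg_mean_absolute_error")
sample_time = time.perf_counter() - start
print(f"full: {full_time:.2f}s, subsample: {sample_time:.2f}s")
```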

Conclusion And Future Works
This study has achieved its objectives: to develop a fast, cheap and reliable tool for understanding online public sentiment and predicting the Spanish Presidential election results. First, it develops a method which correctly ranks the candidates and determines the winner (Sanchez). Second, it benchmarks five different ML inference algorithms, and the best model gives a better prediction of the elected candidate's votes than the official research institutes. Finally, it builds a political lexicon-based framework to measure the sentiments of online users.
For digital marketers and political campaigners, it represents an important tool, alongside traditional standard surveys, to gauge their candidate's strategy and correlate it with potential voting shares. In this line, the dataset was created by mining Twitter for the last 17 days of the campaign; future works could extend it and create an automated framework which collects data over a longer period. Indeed, predicting election results is a continuous process and requires analysis over months or years. So, while developing predictive models from social media, time could be an important factor in assessing sustained preference for a political party. It could also offset possible campaign incidents, such as a candidate's popularity suddenly skyrocketing.
Scientifically speaking, this study has implemented and compared five machine learning methods to determine the optimal technique, which was then used to infer the voting share. It obtains an estimation error situated in the upper quadrant of Twitter prediction performances. It has also leveraged a lexicon specific to the Spanish political context. However, we believe two additional lines of research could improve this work. On the one hand, future studies should analyse and propose weighting factors to include the 'inaudible' voice, i.e. the part of the population that does not express its opinion on social media, in the statistical balance. On the other hand, they should find a way to process demographic data (e.g., sex, age and geographic location) in the method so as to be on a par with statistical sampling methods. The final objective is to reduce the response bias and obtain a more representative vision of the electorate.