Predicting referendum results in the Big Data Era

In addressing the challenge of Big Data Analytics, the analysis of online search traffic data for analyzing and predicting human behavior has been of notable significance. Over the last decade, since the establishment of the most popular such tool, Google Trends, the use of online data has proven valuable in various research fields, including (but not limited to) medicine, economics, politics, the environment, and behavior. In the field of politics, given that poll agencies have not always approximated voting intentions and results well over the past years, it is imperative to find new methods of predicting election and referendum outcomes. This paper presents a methodology for predicting referendum results using Google Trends, a method applied and verified on six separate occasions: the 2014 Scottish Referendum, the 2015 Greek Referendum, the 2016 UK Referendum, the 2016 Hungarian Referendum, the 2016 Italian Referendum, and the 2017 Turkish Referendum. Said referendums were of importance for the respective countries and for the EU, and received wide international attention. Google Trends has been empirically verified to be a tool that can accurately measure behavioral changes, as it takes into account the users’ revealed and not the stated preferences. Thus we argue that, in the time of intelligence excess, Google Trends can well address the analysis of social changes that the Internet brings.

According to recent research [14], an increased use of online data in polls and in measuring voting intention can be observed. As Internet penetration is growing significantly in western countries and the significance of the Internet cannot be doubted [29], the monitoring and analysis of online search traffic data can prove valuable in nowcasting election and referendum races. Burnap et al. [30] suggest that changes in online behavior can be accurately depicted in Internet data, while political campaigners increasingly use social media and online platforms [31]. Up to this point, the field of predicting referendum results using online search traffic data has not been much explored, though Google Trends' data have been employed to predict the results of the 2015 Greek Referendum [14].
Finding new ways of predicting the outcome of elections and referendums is crucial, as poll agencies have not always succeeded in the recent past in approximating voting intentions and results. During the short pre-voting period of the critical 2015 Greek Referendum, we monitored the Google queries on the YES and NO keywords, so as to predict the voting intentions and the referendum outcome. Our results approximated the official referendum outcome better than the poll agencies did, the latter suggesting that the difference between YES and NO was very small, or that YES was to win the race [14]. This method has also been applied to five other referendum races that concerned crucial EU or constitutional matters and received international attention over the past years, namely the 2014 Scottish Referendum, the 2016 UK Referendum, the 2016 Hungarian Referendum, the 2016 Italian Referendum, and the 2017 Turkish Referendum.
This paper presents the methodology for predicting and nowcasting referendums with the use of online search traffic data, a methodology tested and validated on six occasions. This methodology aims at becoming a point of reference for the next generation of polling, while showing that Google Trends is a valid tool for nowcasting and prediction, with great potential. The rest of the paper is structured as follows: The "Research methodology" section describes the method of predicting referendum results, while the "Results" section presents the results for six referendums held over the past years (2014 to 2017) in the European Continent. The "Discussion" section discusses the results, while the "Conclusions" section presents our conclusions and suggestions for future research.

Research methodology
Google Trends allows the monitoring of the change in the online interest in a term in a country or region over a selected time frame, e.g. a range of years, 1 year, 90 days, 30 days, 7 days, 4 h, 1 h, or a specified time period. It also allows multiple relative comparisons of a term across different regions, or of various terms in one region, while a further feature offers the opportunity to compare different terms in different regions. Data, depending on the time frame selected, are monthly, weekly, hourly, or even at 1-min intervals.
Data from Google Trends are downloaded online in '.csv' format and are normalized over the selected period, so as to allow easy comparison between the respective examined terms. The adjustment of the data, as reported by Google, is as follows: "Search results are proportionate to the time and location of a query: Each data point is divided by the total searches of the geography and time range it represents, to compare relative popularity. Otherwise places with the most search volume would always be ranked highest. The resulting numbers are then scaled on a range of 0 to 100 based on a topic's proportion to all searches on all topics. Different regions that show the same number of searches for a term will not always have the same total search volumes." [32].
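The adjustment quoted above can be illustrated with a minimal sketch. Google does not publish raw search counts, so the inputs below are hypothetical, and the function name `normalize_trends` is ours; the rescaling so that the period with the highest share maps to 100 is an assumption about how the 0 to 100 range is obtained, not a documented formula.

```python
def normalize_trends(term_counts, total_counts):
    """Scale relative popularity to a 0-100 range.

    term_counts[i]  : hypothetical raw searches for the term in period i
    total_counts[i] : hypothetical raw searches for all terms in period i
    """
    if len(term_counts) != len(total_counts):
        raise ValueError("both series must cover the same periods")
    # Step 1: relative popularity = share of all searches in each period.
    shares = [t / tot if tot else 0.0
              for t, tot in zip(term_counts, total_counts)]
    peak = max(shares, default=0.0)
    if peak == 0.0:
        return [0 for _ in shares]
    # Step 2 (assumption): rescale so the highest share maps to 100.
    return [round(100 * s / peak) for s in shares]


# Hypothetical counts: the term is searched most often in period 2 in
# absolute terms and also holds the largest share there.
scaled = normalize_trends([10, 40, 20], [1000, 2000, 4000])
```

Note that the third period has more raw searches than the first (20 against 10) but a lower scaled value, because its share of all searches is smaller; this is the effect the quoted passage describes for regions with different total search volumes.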
In general, data from Google can be accurate for measuring the public's interest if the terms are carefully selected [33]. Preis et al. [5], for example, in developing the Future Orientation Index, used only Arabic numerals as the selected keywords, as they are universal, so spelling or translation errors would not produce bias. In a referendum, it is imperative that the terms selected are the ones that give an accurate result. Thus the wording of the referendum plays a significant role in predicting the outcome. In most cases, the question of a referendum is answered with a simple YES or NO, though, for example, in the 2016 UK Referendum, the wording did not allow for this, making this race more complicated to predict. In this case, the available answers were 'Remain a member of the European Union' and 'Leave the European Union', so we selected the terms "Remain" and "Leave" in order to predict the result. For the rest of the examined cases, we used the translation of the YES and NO keywords in each respective language, that is 'YES' and 'NO' in Scotland, 'NAI' and 'OXI' in Greece, 'IGEN' and 'NEM' in Hungary, 'SI' and 'NO' in Italy, and 'EVET' and 'HAYIR' in Turkey. Note that the keyword choice is not case sensitive in Google Trends.
For predicting the referendums, we mainly use weekly and monthly datasets (at daily intervals), apart from the case of Greece, where daily and hourly data were used due to the very short pre-voting period of 9 days. Apart from the 2014 Scottish Referendum and the 2015 Greek Referendum, data were downloaded from the field 'Campaigns and elections' in Google Trends, which eliminates noisy data, i.e. hits that are not attributed to the examined event. If data in this category are not sufficient due to low interest in the subject (as, for example, in the Hungarian Referendum), the search can be widened to include hits in the field 'Politics'.
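As an illustration of working with the downloaded files, the sketch below reads a '.csv' export into one list of hits per keyword. The file layout it expects (an optional category line, a blank line, a header row, then one row per period, with very low interest written as "<1") is an assumption about the export format, and the sample values are hypothetical rather than data from this study.

```python
import csv
import io


def read_trends_csv(text):
    """Parse an exported Trends file into {column name: [hits, ...]}."""
    lines = [line for line in text.splitlines() if line.strip()]
    # Assumption: the first line containing a comma is the column header;
    # anything before it (e.g. a "Category: ..." line) is preamble.
    start = next(i for i, line in enumerate(lines) if "," in line)
    rows = csv.reader(io.StringIO("\n".join(lines[start:])))
    header = next(rows)
    series = {name: [] for name in header[1:]}
    for row in rows:
        for name, cell in zip(header[1:], row[1:]):
            # Assumption: "<1" marks very low interest; count it as 0.
            series[name].append(0.0 if cell.strip() == "<1" else float(cell))
    return series


# Hypothetical export for two keywords over two days.
sample = """Category: Campaigns & elections

Day,remain: (United Kingdom),leave: (United Kingdom)
2016-06-21,45,55
2016-06-22,<1,60
"""
hits = read_trends_csv(sample)
```

Treating "<1" as 0 is a simplification; when many values fall below 1, the data are probably too sparse for this kind of analysis, which is the situation described for the 'Campaigns and elections' field in Hungary.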
The time-frames for the data used in this study are as follows: For the i-th set of data, i.e. monthly, weekly, 4-h, or hourly, the hits' averages for the YES (and Remain) keyword, $Y_{t_i}$, and for the NO (and Leave) keyword, $N_{t_i}$, are calculated, and are then percentized for YES and NO, respectively, with

$$Y_i = \frac{Y_{t_i}}{Y_{t_i} + N_{t_i}} \times 100, \qquad N_i = \frac{N_{t_i}}{Y_{t_i} + N_{t_i}} \times 100.$$
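The percentizing step can be sketched in a few lines: average the hits of each keyword over the chosen dataset, then express each average as a share of their sum, so that the two percentages add up to 100. The function name `percentize` and the input values are illustrative assumptions, not data from this study.

```python
from statistics import mean


def percentize(yes_hits, no_hits):
    """Return (YES %, NO %) from the average hits of each keyword."""
    y_avg, n_avg = mean(yes_hits), mean(no_hits)
    total = y_avg + n_avg
    if total == 0:
        raise ValueError("no search interest recorded for either keyword")
    return 100 * y_avg / total, 100 * n_avg / total


# Hypothetical daily hits for one week of a pre-voting period.
yes_week = [30, 35, 40, 45, 50, 40, 40]
no_week = [70, 65, 60, 55, 50, 60, 60]
yes_pct, no_pct = percentize(yes_week, no_week)
```

Passing single-element lists gives the percentized hits for one day or one hour, which corresponds to the per-day figures reported alongside the weekly and monthly averages.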

Scottish independence referendum (2014)
On September 18th 2014, the Scottish people were asked to vote on whether they wished for Scotland to remain part of the United Kingdom or to become an independent country. 84.6% of the eligible voting population showed up to cast their vote, with 55.3% voting for remaining in the UK, i.e. voting NO, while YES received 44.7% [34].
Most polls during the last month before the race suggested that NO was in the lead by about 5%, not counting the undecided [35][36][37][38][39][40], though the undecided percentage varied. Using data from Google Trends for the last week before the referendum race, i.e. from September 10th to 17th (Fig. 1a), the average percentized hits for YES were at 44.07%, while the NO hits were at 55.93%. Though for the last month before the race, i.e. from August 17th to September 17th, the percentized average of the NO hits was at 64.47% (Fig. 1b), the percentized daily NO hits for the last two days before the race, i.e. on the 15th and 16th, were at 53.17% and 56.10%, respectively. As is evident, Google Trends data allowed a close approximation of the final referendum result, with the percentized average of the hits for the last week before the referendum being almost the same as the official result.

Greek bailout referendum (2015)
In 2015, Mavragani and Tsagarakis [14] approximated well the voting intentions and results of the Greek Referendum of July 5th, a crucial referendum on the rescue plan proposed by the EU [41] that received wide national and international media attention [42]. Despite the very short pre-voting period of 9 days, by applying the proposed method we accurately predicted that NO was in the lead, even though official voting intention polls suggested that YES and NO were very close to one another, even at a 0.5% difference [43]. The NO vote officially received a total of 61.31% [44]. The percentized hits for NO in Google Trends from June 27th (20:00) to July 4th (20:00) were at 55.99%, while the NO hits from Saturday the 4th (20:00) to Sunday the 5th (20:00) were at 58.20%, as reported by Mavragani and Tsagarakis [14]. The poll agencies' result predictions for the NO vote varied: 54.50% [45], 54% [46], and 49% [47]. It is evident that the method of using percentized Google Trends' data better approximated the official results, which shows great potential in predicting and nowcasting referendum voting intentions and results.
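The comparison in this paragraph can be made explicit as absolute errors against the official NO share. The short sketch below uses only the percentages quoted above; the labels and variable names are ours.

```python
# Official NO share and the estimates quoted in the text (all in %).
official_no = 61.31
estimates = {
    "Google Trends, last week": 55.99,
    "Google Trends, last day": 58.20,
    "Poll [45]": 54.50,
    "Poll [46]": 54.00,
    "Poll [47]": 49.00,
}

# Absolute error of each estimate, in percentage points.
errors = {name: round(abs(value - official_no), 2)
          for name, value in estimates.items()}

# Estimates ordered from closest to furthest from the official result.
ranked = sorted(errors, key=errors.get)

for name in ranked:
    print(f"{name}: off by {errors[name]:.2f} points")
```

Both Google Trends estimates rank ahead of the three poll predictions, with the last-day figure closest at 3.11 points below the official result.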

UK European Union membership referendum (2016)
The 2016 UK Referendum on whether the UK was to remain a member of or leave the European Union was held on June 23rd. The opinion polls during the pre-voting period were contradictory, with several results suggesting that Remain was in the lead [48,49], while others suggested that Leave was ahead [50]. The UK Referendum received wide international attention during the pre-voting period, both from the media [51,52] and from the scientific community [53][54][55]. Online polls suggested that Leave was leading, while traditional ones suggested a head-to-head race [56].
The official results put 'Leave' at 51.9% and 'Remain' at 48.1% [57]. Analyzing online search queries for the month before the Referendum, we observed that there was a rise in Remain during the last days before June 23rd, but Leave was above Remain at almost all points, as shown in Fig. 2, which consists of the percentized hits in the Remain and Leave keywords (a) from May 24th to June 21st, and (b) from June 16th to June 23rd.
A shift in favor of the Remain camp was observed following the murder of Jo Cox, a Labour Party MP and supporter of Remain, on the 16th [58]. As shown in Fig. 2a, the only point where Remain is ahead of Leave in Google Trends is the day after Cox's murder, at 52.63%, indicating that what the polls suggested was also immediately depicted in online searches. For the last day of the Referendum, the percentized averages of the monthly Google Trends' data put Remain at 48.19% and Leave at 51.81%, as shown in Fig. 2, while the averages of the daily percentized hits for the month before the race put Leave (NO) at 60.65%. The average percentized Remain (YES) and Leave (NO) hits for the week before were at 45.04% and 54.96%, respectively, while the percentized hits for Remain and Leave on the last day were at 47.37% and 52.63%, respectively.
Contrary to what many polls on voting intentions, and the result prediction published at the closing of the ballot boxes, suggested, Google Trends clearly showed that 'Leave' was in the lead and was going to win the race. Figure 3 shows the comparisons for the term 'Leave' in Google Trends for the last 2 days of the referendum, i.e. the 22nd and 23rd, from the monthly (May 24th to June 23rd) and weekly (June 16th to 23rd) datasets, and the weekly percentized average based on Google Trends, compared to the official results and the poll agencies' predictions in descending order.
Fig. 3 Comparisons of the official UK Referendum results with Google Trends, and poll agencies

Hungarian migrant quota referendum (2016)
The Hungarian referendum on immigration policy addresses a different methodological question than the rest: how is a low turnout on the day of the race depicted in online searches during the pre-voting period? While monitoring the Hungarian Referendum, apart from once again providing a good approximation of the outcome, what was of notable significance was the low volume of searches by the citizens of Hungary for the YES and NO (IGEN and NEM, respectively) keywords. This was later reflected on the day of the Referendum race, with more than 55% of the eligible voting population not showing up to vote [59]. Figure 4 shows that in the 'Campaigns and elections' field of search in Google Trends, the data based on the hits in the YES and NO searches are not sufficient to provide results. This indicates that the Hungarian voters were not interested enough in the Referendum to search for it online. Based on the very low turnout on the day of the Referendum race on October 2nd 2016, only 44.04% [59], we conclude that this low turnout is also depicted in Google queries.
Thus, in order to approximate the voting predictions and referendum outcome, we widened the search on the YES and NO keywords to include all hits related to politics, i.e. in the 'Politics' field in Google Trends, of which 'Campaigns and elections' is a subfield (Fig. 5). It is evident that this method of predicting referendums has been valid in closely approximating the official results, which gave NO 98.36% [59], with a 95.90% percentized average for the last week before the race. We once again see the poll agencies not providing accurate results, giving NO 64% [60], 70% [61], and 78% [62] at several points during the last month before the race, with only the exit poll closely approximating the results, giving NO 95% [63].

Italian constitutional referendum (2016)
The Italian Constitutional Referendum took place on December 4th 2016, becoming a highly discussed referendum race, as it brought to light the overall euroskepticism evident in the EU countries over the past few years. As the referendum wording allowed for a YES/NO answer, the selected terms for analysis were 'SI' and 'NO', translating from Italian into 'YES' and 'NO', respectively. The official Referendum results put NO at 59.12% and YES at 40.88%, with the turnout being 65.47% [64].
Figure 6 depicts the percentized hits for the YES and NO keywords (a) over the trimester before the referendum race, i.e. from September 3rd to November 30th, and (b) from October 28th to November 11th (up to a week before the race). The monthly averages for September put YES at 54.28% and NO at 45.72%; in October, YES and NO were at 48% and 52%, respectively, while for November, YES was at 47.39% and NO at 52.61%. The monthly averages for YES and NO from October 28th to November 26th are 47.70% and 52.30%, respectively. Figure 7 shows the percentized hits on the YES and NO keywords for the week from November 19th to November 26th and for the week from November 26th to December 3rd (hourly data).
For the week starting November 19th, the averages of the percentized hits for YES and NO were 49.21% and 50.79%, respectively, while for the week starting November 26th, the percentized hits were 49.57% and 50.43%, respectively. Figure 8 compares the official results of the referendum race with the predictions using Google Trends and the poll agencies' [65] predictions during the pre-voting period. What is observed is that Google Trends' best approximation was calculated for the last month before the race, at 53.05%. Though these approximations are some points below the actual result for NO, many poll agencies also predicted the NO vote to be somewhere between 49% and 55.50%, as shown in Fig. 8.
The predictions using data from Google Trends at some points gave better and at some points worse predictions of the results compared to traditional poll agencies. Despite the difference in percentage from the official result, we see that Google Trends data managed to approximate the result in a similar manner as poll agencies, and on the right side of the result, i.e. NO.

Turkish constitutional referendum (2017)
The President of Turkey, Recep Tayyip Erdogan, announced that a referendum was to take place in order for the people of Turkey to decide on whether or not they agreed with a set of 18 proposed constitutional amendments.
This Referendum, following the attempted coup of July 2016, was of high national and international significance, as the Turkish people were also to vote for or against the newly proposed Presidential System. The wording of this referendum allowed for a simple YES or NO answer. Given the conflicts between Turkey and several European countries, it is of significance to examine the online search queries in the YES and NO keywords in EU countries that have a large population of Turks allowed to vote in the respective country. Thus, in order to approximate the Turkish Referendum results, we selected the countries with the largest populations of Turks eligible to vote and with a turnout of more than 100,000 voters, i.e. Germany and France, where the Turkish population was eligible to vote until April 9th [66], and the Netherlands, voting on April 5th [67]. Figure 9 depicts the percentized hits' averages of Google Trends' data for the last week and month of the pre-voting periods compared to the official results for YES [68] in Germany (653,502 voters), France (140,741 voters), and the Netherlands (116,543 voters).
The final official results of the Referendum put YES at 51.18% and NO at 48.82%, while the overall overseas results, including border gates, put YES at 59.09% [68]. In Turkey, with a turnout of 85.46% on the day of the race [69], opinion poll results were contradictory over which option, YES or NO, was in the lead during the month before the race, with YES being put at 46.25% [70], 56.5% [71], 44.47% [72], 59.4% [73], 59% [74], 60.8% [75], and 46.1% [76]. What is evident from the above is that this referendum's outcome was hard to predict. The Turkish Referendum also resulted in national and international dispute, as ballot papers not bearing the official seal were deemed valid in the vote count [69].
The difficulty of predicting this referendum race in Turkey was also apparent in the NO searches in the country during the pre-voting period. Though Google Trends' data were useful in approximating the referendum results in the three examined European countries with a Turkish population eligible to vote, the volumes of the online search queries for 'HAYIR' (translating into NO) in Turkey were at times extremely low. During the pre-voting period, EVET was far ahead of HAYIR, at some points even reaching very high percentages. Though we cannot argue that this is a result of data bias or manipulation, Internet monitoring and censorship have been suggested to be an issue of high significance in Turkey [77]. To elaborate, NO supporters in Turkey were viewed as "siding with the coup-plotters" [78], and in some cases were prosecuted or fired [79]. Thus it could be the case that many were afraid to openly express their opinion, which would explain why the poll agencies' voting intention results were so diverse. Internet penetration in Turkey is only 53.7% [80], and the country scores significantly low in freedom of the press, with 65 (the worst being 100) in the 'Press Freedom' index in 2015 [81] and 71 in 2016 [80]; it is categorized as a 'Not Free' country and has been continuously declining since 2010.
Censorship and restriction of freedoms have been reported in general, especially after the attempted coup in July 2016, when social media (such as Facebook, Twitter, YouTube, WhatsApp, and Instagram) were reported blocked, censored, or restricted on various occasions [82][83][84]. Thus the 2017 Turkish Constitutional Referendum provides an excellent example of a limitation of this method of predicting referendum results, as the analysis of online search queries cannot be applied in regions with low freedom of speech and media or with governmental Internet monitoring. The results in this study closely approximated the respective official referendum results, in some cases better than traditional polls. Based on the results of this paper and the analysis of online search queries in general, what is observed is that we have entered an era in which the Internet has brought significant changes in terms of monitoring, analyzing, and predicting human behavior.

Discussion
In the EU, many referendums about EU matters have been conducted in the past, though during the last two decades the NO option seems to be becoming more popular [85]. This is not always an actual answer to the referendum question, but can be a means of expressing dissatisfaction with the respective government [86]. Lately, poll agencies more often than not fail to approximate referendum and election outcomes well. Therefore, new methods of predicting voting intentions and results have been examined [87][88][89], with online surveys on the rise.
The collection of referendums in this paper is notable for two reasons. First, said referendums were significant in terms of policy and EU matters. Second, they present cases that will assist future research using online queries as a polling tool, dealing with various special circumstances that arose. To elaborate, the very short pre-voting period in Greece was a good example of how to nowcast a referendum. In this case, the data had to be downloaded in very short time frames, i.e. every hour, every 4 h, and every day, so as to extract robust results. In Hungary, the low turnout and interest in the referendum were depicted in the searches in Google during the pre-voting period. Furthermore, what was notable in the 2016 UK Referendum was the effect of the murder of Labour MP Jo Cox, where opinion shifted towards the Remain camp in the couple of days after the murder. The UK Referendum was also significant in terms of the wording, which did not allow for a simple YES/NO answer; thus the selection of keywords for monitoring the interest was not trivial. The most interesting case, though, was that of the Turkish Referendum, in a country where Internet restrictions are a serious issue. In Turkey, Google Trends' data were not useful in predicting the result. Despite that, they provided good approximations for the three EU countries where a significantly large Turkish population voted, i.e. France, Germany, and the Netherlands. We thus conclude that this method cannot be applied in regions with low levels of freedom of media, Internet, and speech, or with low Internet penetration.
Fig. 10 Comparisons of the official final results with weekly (Google) predictions by country
Table 1 consists of the averages of the weekly and monthly percentized hits for YES and NO (Remain-Leave for the UK) for the examined referendum races, and their respective statistical significance for mean comparisons.
Such comparisons can indicate beforehand the side of the outcome, i.e. whether the public leans towards YES or NO, while the proximity to the final percentage increases as the time span approaches the ballot closing time, as discussed above.
As is evident, the results for the Scottish and the UK referendums exhibit high statistical significance on the final outcome. Regarding the Turkish Referendum, all examined comparisons are on the same side as the official results and are statistically significant, with the exception of the weekly results for France. Regarding the Italian referendum results, two of the compared datasets are statistically significant, while the rest do not statistically prove the 'NO' response.
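The text does not state which test underlies the "statistical significance for mean comparisons". Purely as an illustration of how such a comparison between the YES and NO series could be run, and not as the procedure used in this study, the sketch below applies an exact paired sign-flip permutation test to the daily differences. The function name and the daily hits are hypothetical.

```python
from itertools import product


def paired_permutation_pvalue(yes_hits, no_hits):
    """Two-sided exact sign-flip test on paired daily differences.

    Under the null hypothesis that YES and NO attract equal interest,
    the sign of each daily difference is arbitrary, so every sign
    pattern is equally likely. Enumerating all 2**n patterns is only
    practical for short series (a week or so).
    """
    if len(yes_hits) != len(no_hits) or not yes_hits:
        raise ValueError("series must be non-empty and of equal length")
    diffs = [n - y for y, n in zip(yes_hits, no_hits)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            extreme += 1
    return extreme / total


# Hypothetical week in which NO is ahead on every single day.
yes_week = [40, 42, 45, 41, 44, 43, 46]
no_week = [60, 58, 55, 59, 56, 57, 54]
p_value = paired_permutation_pvalue(yes_week, no_week)
```

With seven days and NO ahead on each of them, only the all-positive and all-negative sign patterns are as extreme as the observed one, giving p = 2/128, i.e. about 0.016. A series in which the lead changes hands from day to day would give a much larger p-value, which is the kind of situation described above for some of the Italian datasets.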
As is suggested in Fig. 10, which depicts the comparisons of the official results in each examined country with the predictions using Google Trends' data for the last week of the pre-voting period, the analysis of online search queries is a valid method of approximating the result of a referendum and the voting intentions. Note that in the Greek and the Hungarian Referendums, our results using Google Trends' data were the closest approximation to the final result, better than the poll agencies' predictions using traditional methods. Also note that for the UK Referendum, the result refers to the percentized hits of the last day of the weekly time series.
As we validated that the results of using Google Trends for voting intentions and result predictions are accurate and, in many cases, better than official polls, we argue that this method will be adopted in the near future by poll agencies and political researchers in this field, as it tackles the economic risk, the cost, and the uncertainty barriers of poll taking.
However, some limitations do exist. First, this method cannot be applied in regions where the use of the Internet is restricted, in those scoring low in freedom of the press, or in those with low Internet penetration, where data manipulation, i.e. intentional googling of a certain outcome, is possible. Furthermore, the sample cannot be proven representative. Not all Internet users vote and not all voters use the Internet to search for the respective referendum keywords; thus not every hit can be linked to referendum voting, and in no way is it implied that such a one-to-one correspondence exists. In addition, a limitation of the tool is that data retrieved at different points in time for the same time-frame have been observed to vary slightly. Despite the above, many studies in various subjects have shown that Google data can indeed be used to analyze or predict behavioral variations [4,5,8,13,16,23,26,90,91,92] and that empirical relationships exist between online search traffic data and human behavior [93,94].

Conclusions
In this paper, we presented a novel methodology for predicting voting intentions and referendum results using online search traffic data from Google, which was tested and validated in six referendum races. Said referendums, i.e. the 2014 Scottish Referendum, the 2015 Greek Referendum, the 2016 UK Referendum, the 2016 Hungarian Referendum, the 2016 Italian Referendum, and the 2017 Turkish Referendum, were of significance for the European Union and received wide national and international attention. Employing data from Google Trends, we estimated the respective referendum results using data on the (translated) YES and NO keywords. Our results exhibited good performance and, in some cases, were more accurate than official polls.
Since online behavioral changes can be measured by online data [30], the potential benefit of using Google Trends' data as a poll taking method is high, especially as traditional polls did not always manage to accurately predict the outcome and, in many cases, were not even on the winning side of the results. Google Trends has been shown to be a credible means of examining behavioral changes, as it uses the revealed and not the stated preferences of the users. Thus in regions where the Internet is widely accessible and not restricted, online data are useful in analyzing and predicting human behavior in many research topics.
This method of poll taking, based on empirical relationships, could be of interest to political researchers, as it is a valid analyzer of human behavior, indicating that Internet data can give insight into behavioral variations towards political matters and election races. Future research on the subject could focus on developing more sophisticated models using online data, as well as on the combination of various online sources, or the combination of online with traditional survey data. Overall, the above suggest that monitoring Internet data in general, and Google Trends data in particular, will be a polling tool to address the challenge of Big Data in the future.