In our last post, my colleague Ben Pritchard referred to some pioneering predictions of the Brexit result made by Dr Xuxin Mao, an Invennt intern. They turned out to be wrong. Not to be disheartened, Dr Mao has reviewed his original work and uncovered the reasons for the poor prediction. He has also updated his methodology to improve future predictions.
Big Data Brexit Prediction with Updated Information
The methodology for the Brexit prediction is based on statistical modelling, behavioural economics, natural language processing and Big Data analytics. Xuxin used the Topic Retrieved, Uncovered and Structurally Tested (TRUST) framework (Figure 1) to generate solid models and robust forecasts by retrieving useful information from Internet Big Data. He uncovered key decision-making factors, and tested these factors against other available data in an advanced statistical model. The TRUST framework has been used to successfully predict the 2014 Scottish referendum, the 2015 UK general election and the 2016 Scottish parliament election. It has also helped to measure construction output and prices at ONS and UCL, and to predict life insurance demand at L’Institut Europlace de Finance for Groupam.
The first part of the TRUST approach relies on text mining a very large database of print newspapers and their web-based counterparts, using sophisticated algorithms to identify the topics that motivate voters. The results are summarised in Table 1 for various periods of the campaign. Xuxin found that EU immigration emerged as a key issue from 22 May to 11 June, and then again from 19 June, the same periods when the Leave side was gaining momentum in the polls and Remain was trailing. While David Cameron and economy-related topics were key topics in nearly all weeks, Boris Johnson and the Labour Party also frequently attracted voters’ attention.
Table 1: Text-Mined Topics during the EU Referendum Campaign Period
|Period||UK Economy||EU Trade||Single Market||EU Immigration||David Cameron||Boris Johnson||Labour Party|
|15 Apr-14 May||Yes||Yes||Yes||No||Yes||No||No|
|15 May-21 May||Yes||No||Yes||No||Yes||Yes||No|
|22 May-28 May||Yes||Yes||Yes||Yes||Yes||Yes||No|
|29 May-4 Jun||Yes||Yes||Yes||Yes||Yes||No||Yes|
|5 Jun-11 Jun||Yes||Yes||Yes||Yes||Yes||Yes||Yes|
|12 Jun-18 Jun||Yes||Yes||Yes||No||Yes||No||Yes|
|From 19 Jun||No||No||Yes||Yes||Yes||Yes||No|
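To make Table 1 easier to query, the sketch below encodes its Yes/No flags as a small Python structure and counts how many of the seven periods each topic appeared in. The variable and function names are illustrative only and are not part of the TRUST framework; the rows simply follow the order of the table.

```python
# Topic flags from Table 1, one row per campaign period, in table order.
# 1 = the topic was text-mined as a key issue in that period, 0 = it was not.
TOPICS = ["UK Economy", "EU Trade", "Single Market", "EU Immigration",
          "David Cameron", "Boris Johnson", "Labour Party"]

TABLE_1 = [
    [1, 1, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 0, 1],
    [0, 0, 1, 1, 1, 1, 0],
]


def periods_per_topic(rows, topics):
    """Count, for each topic, the number of periods in which it was flagged."""
    counts = [sum(column) for column in zip(*rows)]
    return dict(zip(topics, counts))


topic_counts = periods_per_topic(TABLE_1, TOPICS)
for topic, n in topic_counts.items():
    print(f"{topic}: {n} of {len(TABLE_1)} periods")
```

Read this way, David Cameron and the Single Market are flagged in every period, while EU immigration is flagged in four of the seven.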
Figure 2 Web Search Interest: EU Immigration (15 May-20 June 2016)
Figure 3 Web Search Interest: David Cameron (15 May-20 June 2016)
Figure 4 Web Search Interest: UK Economy (15 May-20 June 2016)
From Table 1 and Figure 2, Xuxin found that web search interest in immigration rose when voters were most engaged with the issue. There were two such periods. The first ran from 22 May to 14 June 2016. It ended 2 weeks before the referendum, days before the Jo Cox tragedy and the UKIP poster event. After a dip in interest between 15 June and 18 June, interest in EU immigration returned in the week of the referendum: the UK web search index for EU immigration rose from 36 to 81, which the model estimates cost the Remain side 2.7% and boosted the Leave camp by an impressive 3.5%.
From 19 June, the UK web search index for David Cameron rose from 10 to 24, which reduced the Remain vote by 1% and increased the Leave vote by 0.7%. Meanwhile, interest in the UK economy in the week of the referendum did not rise as fast as the other important themes (from 67 to 86.4), which boosted the Remain camp by only 0.2%. In sum, Remain lost 3.5% in the last week while the Leave camp gained 3.8%.
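As a quick check on the arithmetic, the Remain figure can be reproduced by adding up the three itemised effects quoted above. This is only a tally of the numbers in the text, not the statistical model itself, and the dictionary labels are illustrative. The Leave total is not reproduced here because the text does not itemise the economy effect on the Leave vote.

```python
# Itemised last-week effects on the Remain vote, in percentage points,
# as quoted in the text (negative = loss for Remain).
remain_effects = {
    "EU immigration searches (36 -> 81)": -2.7,
    "David Cameron searches (10 -> 24)": -1.0,
    "UK economy searches (67 -> 86.4)": 0.2,
}

# Round to one decimal place to avoid floating-point noise in the sum.
remain_net = round(sum(remain_effects.values()), 1)
print(f"Net Remain change: {remain_net:+.1f} points")
```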
Finally, Xuxin used his statistical model to calculate the predicted outcomes for the referendum. Reported in Table 3, they show a clear win for Leave, with a mean poll of 43.3% for Remain against Leave’s 48.6%. Had we followed the data after our first report, we could have predicted the final Brexit result.
Table 3: Projecting Referendum Voting Results
|Measure||Remain||Leave|
|Mean Voting Intention Rate||43.3%||48.6%|
|Swing Votes Range||0-4.2%||0-3.6%|
|Final Rate Range||45.3%-50.6%||49.4%-54.7%|
|Final Mean Rate||48%||52%|
Note: The predictions are based on the data available on 20 June 2016.
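The figures in Table 3 are consistent with the final mean rate being the midpoint of each final rate range. The post does not say how the mean was derived, so the sketch below treats that as an assumption. Taking the first column as Remain and the second as Leave is inferred from the accompanying text.

```python
# Final rate ranges from Table 3, in percent (low, high).
# Column order (Remain, then Leave) is inferred from the text.
final_ranges = {"Remain": (45.3, 50.6), "Leave": (49.4, 54.7)}


def midpoint(low, high):
    """Midpoint of a range; assumed here to be how the final mean is obtained."""
    return (low + high) / 2


midpoints = {side: midpoint(*bounds) for side, bounds in final_ranges.items()}
for side, value in midpoints.items():
    print(f"{side}: midpoint {value:.2f}%, rounded {round(value)}%")
```

The two ranges are also complementary: the low end of one plus the high end of the other comes to 100%.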
Where did we go wrong with the initial predictions?
In essence, Xuxin has shown that his methodology works but that there was a swing from Remain to Leave in the last days of the campaign. The lesson learned is that the final prediction should have been made on the most up-to-date data. This is really a resources and process efficiency issue. With further automation, we at Invennt can see that near-real-time predictions can be made and swings tracked as they happen. This is an exciting prospect for psephologists, and also important in many other applications where real-time big data mining and interpretation are valuable.
If you wish to read Xuxin’s original report, you can find it on his personal website here
The web search interest data is based on Google Trends data between 15 April and 20 June 2016, presented on a [0, 100] scale. The index for a particular term is its search volume as a percentage of the largest single-day search volume during the whole period. The larger the index, the greater the demand for information on that term and the more searches made for it.
The calculations of the effects of Immigration, Cameron and Economy are all based on Table 2 of the blog.
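As a rough illustration of the [0, 100] search index described in the note on Google Trends data, the sketch below scales a series of daily search volumes by the peak day. The volumes are hypothetical, the function name is illustrative, and this follows the description in the note rather than Google's exact method.

```python
def search_interest_index(daily_volumes):
    """Scale daily search volumes so that the busiest day equals 100."""
    peak = max(daily_volumes)
    if peak <= 0:
        raise ValueError("need at least one day with positive search volume")
    return [100 * volume / peak for volume in daily_volumes]


# Hypothetical daily search counts, not real Google Trends data.
hypothetical_volumes = [120, 300, 150, 600, 450]
index = search_interest_index(hypothetical_volumes)
print(index)
```

On this scale, a move such as 36 to 81 means daily searches went from about a third of the peak day's volume to about four fifths of it.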