Using Prediction Markets and Twitter to Predict a Swine Flu Pandemic
Forecast (2009)
Available from www.iccs.inf.ed.ac.uk
Abstract
We explore the hypothesis that social media such as Twitter encodes the belief of a large number of people about some concrete statement about the world. Here, these beliefs are aggregated using a Prediction Market specifically concerning the possibility of a Swine Flu Pandemic in 2009. Using a regression framework, we are able to show that simple features extracted from Tweets can reduce the error associated with modelling these beliefs. Our approach is also shown to outperform some baseline methods based purely on time-series information from the Market.
Joshua Ritterman, School of Informatics, University of Edinburgh, j.ritterman@sms.ed.ac.uk
Miles Osborne, School of Informatics, University of Edinburgh, miles@inf.ed.ac.uk
Ewan Klein, School of Informatics, University of Edinburgh, ewan@inf.ed.ac.uk

July 30, 2009

1 Introduction

Prediction markets are mechanisms for aggregating beliefs about concrete outcomes in the world. They are structured as a betting exchange or peer-to-peer gambling system, where participants bet amongst themselves as to the outcome of specific world events, such as who will win the 2009 US election, or which Hollywood movie will achieve the highest box office totals. The prediction market organizer creates "shares" in an event occurring. People can then buy and sell these shares at a price determined by the market. In this way, the market drives the price of a share to the mean belief of traders, which is interpreted as the probability of the event occurring (Gjerstad, 2006; Wolfers and Zitzewitz, 2004). In fact, it has been shown that this type of market system is able to generate an optimal global solution to the prediction problem, one that is better than that of any individual expert (Watkins, 2007). It is also considerably less expensive than alternative methods, such as hiring analysts for an expert opinion on the outcome of an event, or conducting a poll.
For this reason, prediction markets have become a major area of interest to governments, corporations and academics over the last few years.

In this paper we explore the hypothesis that we can extract useful information from social media and that modeling this information will yield better results than a model constructed with information from the prediction market in isolation. We have collected almost 50 million Twitter posts (Tweets) over roughly a two month period. We present a method that uses this Twitter data to forecast the closing price of a prediction market, thus showing that we can explicitly model changes in belief, as represented in a Prediction Market, from beliefs implicitly represented in Tweets.

2 The Task

Since prediction markets are considered to be broadly analogous to public opinion, a model that is able to forecast such markets would be a valuable supplement to opinion polling and market research. We focus on using the Hubdub online prediction market¹ to model public belief about the possibility that the H1N1 (Swine Flu) virus will become a pandemic. On April 10th 2009, just after news about the virus became public, Hubdub posted the following question:

Will Influenza A (H1N1) (aka "swine flu") grow into a pandemic in 2009 as feared?

By modelling this market, we can thereby model public belief about the event in question.

Unlike newswire, Twitter goes beyond factual information in providing a wealth of information about public opinion on a topic. Tweets contain rumor, commentary, opinion, and even jokes (cf. Table 1). When news about H1N1 first broke, Twitter was highly active with posts about the spread of the flu, and in fact was considered by CNN to be overreacting (CNN, 2009). However, in the weeks that followed, mainstream news coverage and Twitter activity relating to the flu subsided until a pandemic was declared on June 11th 2009.
During this same time period, we collected and stored Twitter posts, and will make use of this source of data for our forecasts.

¹ www.hubdub.com

Table 1: Sample of H1N1-related Tweets.

Date    Tweet
26 Apr  ya, Im over the Swine flu Tweets. Eat, drink and be merry cuz tomorrow itll be something else killing us.
29 Apr  Whuh oh, the swine flu's Patient Zero in Mexico was flanked by U.S. owned pig farms.
29 Apr  No Americans have died from the swine flu, yet, but every yr 36K Americans die from the regular flu.
11 May  22 confirmed cases of H1N1 Flu in Pima County. The flu appears to be similar to seasonal flu in its impact. Take regular precautions.
12 May  Free bottle of hand sanitizer at work today! No swine flu for me!
19 May  Health UN to discuss swine flu vaccine: UN chief Ban Ki-moon is to meet top pharmaceutical firms to discus..
19 May  New T-Shirt in Harajuku: "For Beautiful H1N1 Pandemic Life." I'm off work with sore throat, fever... - shld I buy one?
29 May  thanks to tylenol for reducing my fever... now i'm shedding layers and turning on the AC

3 Related work

There have been a number of studies of the effects of news on financial markets. Koppel and Shtrimberg (2004) and Devitt and Ahmad (2007) attempted to use the movement of stock markets as training data to automatically label the sentiment of news articles, implying a relation between news sentiment and stock price movements. They had some limited success with this approach, finding that it was easier to detect and label negative stories than positive ones.

Pennock et al. (2000) discussed the relation between artificial markets such as prediction markets and external events, looking at whether the Hollywood Stock Exchange could accurately predict how movies would fare in the real market place. They concluded that in this case, prediction markets were a good indicator of real world events.

Lerman et al. (2008) analysed newswire text to forecast the values of the Iowa Electronic Market for the 2004 US elections. They chose four sets of features: bag-of-words features based on unigram counts; "news focus" features that track the relative change in unigram feature counts over the preceding 3 days; features for counting sentences that mention predefined named entities such as "Bush", "Kerry" and "Iraq"; and finally features that label named entities according to their dependency relations.
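The "news focus" idea described above, a feature that tracks how a unigram's count changes relative to the preceding 3 days, can be illustrated with a minimal sketch. The exact formula used by Lerman et al. (2008) is not given here, so the smoothed relative change below, the whitespace tokenizer, and the names `unigram_counts`, `news_focus` and `smoothing` are all assumptions made for illustration. The example documents are invented.

```python
from collections import Counter


def unigram_counts(texts):
    """Count lowercase, whitespace-separated tokens across one day's texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def news_focus(daily_counts, day, window=3, smoothing=1.0):
    """Relative change of each unigram's count on `day` versus its mean
    count over the preceding `window` days (hypothetical formula).

    `smoothing` avoids division by zero for words unseen in the window.
    """
    if day < window:
        raise ValueError("not enough preceding days for the window")
    today = daily_counts[day]
    previous = daily_counts[day - window:day]
    vocab = set(today)
    for counts in previous:
        vocab.update(counts)
    features = {}
    for word in vocab:
        mean_prev = sum(counts[word] for counts in previous) / window
        features[word] = (today[word] - mean_prev) / (mean_prev + smoothing)
    return features


# Invented toy data: one short "document" per day.
days = [
    ["swine flu spreads"],
    ["flu cases rise"],
    ["flu vaccine talks"],
    ["pandemic declared flu flu"],
]
daily = [unigram_counts(texts) for texts in days]
focus = news_focus(daily, day=3)
for word in sorted(focus, key=focus.get, reverse=True):
    print(f"{word}: {focus[word]:+.2f}")
```

In this toy run, a word that appears for the first time on the final day receives a larger score than a word that was already frequent, which is the kind of shift in attention such a feature is meant to capture.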
Each of the resulting models was combined with a simple internal market feature. For each day, a logistic regression classifier was trained on the features extracted from about 20 newspaper articles to label the day as closing up or down. If a person were to buy and sell on the recommendation of this system using the best feature combination (news focus + dependency), they would have on average profited about 12 dollars per share over the course of the elections. The Lerman et al. (2008) study differs from the current work in two critical ways. First, it only attempts to classify a day as being up or down, whereas we forecast the closing price for the day. Second, we use much more data to build our models: whereas Lerman et al. (2008) use about 20 newspaper articles per day, we use almost 1 million Tweets per day.

Closely related to the question we are examining, Google developed a system to predict seasonal flu activity based on search queries (Ginsberg et al., 2008). The Google Flu Trends system counts search terms that indicate influenza-like illness activity. They found that there is a strong correlation between these types of search terms and actual influenza infection rates. This correlation was actually a more timely indicator of influenza activity than the traditional surveillance systems used by the US Centers for Disease Control and Prevention (CDC) and the European Influenza Surveillance Scheme (EISS). The CDC and EISS both use virological and clinical data as well as physician visits to make influenza forecasts. Using online query data, the Google system was able to predict influenza rates 1 to 2 weeks ahead of the publication of the CDC's US Influenza Sentinel Provider Surveillance Network. Our study looks at this same topic but from a different point of view: while Google Flu Trends forecasts influenza infection rates, we forecast public perceptions of a single influenza outbreak.
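The notion that one series can be a "more timely indicator" of another, as described above for search queries and infection rates, can be made concrete with a lagged Pearson correlation. This is only an illustrative sketch, not the method of Ginsberg et al. (2008); the function names and the two toy series are assumptions, and the numbers are invented rather than taken from any real data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def lagged_correlation(signal, target, lag):
    """Correlate signal[t] with target[t + lag] for a non-negative lag."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag == 0:
        return pearson(signal, target)
    return pearson(signal[:-lag], target[lag:])


# Invented weekly series: the target repeats the signal two weeks later.
queries = [1, 2, 4, 8, 4, 2, 1, 1]
cases = [0, 0, 1, 2, 4, 8, 4, 2]

for lag in range(4):
    print(lag, round(lagged_correlation(queries, cases, lag), 3))
```

If the correlation peaks at a positive lag, the signal series tends to move before the target series by that many time steps, which is what a leading indicator looks like in this simplified setting.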
It is important to forecast the actual infection rate, but it is also useful to forecast public perception of the outbreak, since this gives policy makers insight into the public's mood and fears, as well as valuable marketing data to companies making healthcare products. In addition, the Google system makes use of a proprietary corpus of search terms, whereas we use publicly available social media to make our forecast.

4 Approach

4.1 Data

Our corpus consists of Twitter posts that were collected on a daily basis by a crawler from the beginning of April 2009. The data for this experiment is a subset of the corpus, consisting of all Tweets collected during the period April 10th to June 11th. This subcorpus contains 48 million Tweets, an average of about 1 million Tweets per day (see Table 4.1).

4.2 Classification System

In order to forecast the future prices of the prediction market, we decided to use the Support Vector Machine algorithm to carry out regression. This algorithm was chosen since it can be trained rapidly and can handle a large feature vector. libSVM (Chang and Lin, 2001) was chosen as the implementation of the Support Vector Machine regression (SVR) algorithm. In order to make a market forecast for




