Forecasting the belief of the population: Prediction Markets, Social Media & Swine Flu
Abstract
The belief of the population is very useful information but is hard to measure. Methods such as voting and polling are both expensive and slow to run. Recently prediction markets have become a popular method to aggregate information and beliefs from the population using the market price as the mean belief. The problem with these is that they have to be used directly by the population, which limits their spread, and have to be set up for specific questions, which limits their application. We propose a novel solution to aggregate the belief of the population indirectly from social media services and overcome these problems. In particular we focus on blogs and Twitter posts which form a very noisy web-scale text collection. We extracted the beliefs by using statistical text analysis on posts which we aggregated using linear regression. We used the recent swine flu outbreak as a novel example to evaluate our models on the belief that it would turn into a pandemic. We found that it was possible to extract the belief of the population from social media and that aggregating the beliefs by linear regression performed comparably to prediction markets. Twitter outperformed blog posts, showing it was more informative for this problem. Our forecast model outperformed a strong baseline for two weeks and performed better than random for five days. We concluded that it was possible to use social media as a source of the belief of the population but that we needed to remove the dependency on prediction markets to allow the proposed advantages to be achieved.
Daniel Kristopher Harvey
Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2009
I would like to thank Miles Osborne for the constant enthusiasm and helpful suggestions
in meetings and for answering any questions I had; Sasa and Josh who were a
great help with starting out processing the data and hunting for its whereabouts; my
family for supporting me throughout and keeping me motivated when it was hard;
Jo for listening to my excitement and rants about this project and keeping me sane
throughout the hard work; Friends who gave me many reasons to relax when I needed
and finally everyone in Appleton Tower labs for the 12 weeks of fun and somewhere I
will miss.
I declare that this thesis was composed by myself, that the work contained herein is
my own except where explicitly stated otherwise in the text, and that this work has not
been submitted for any other degree or professional qualification except as specified.
(Daniel Kristopher Harvey)
1 Introduction
2 Background
2.1 The Wisdom of Crowds
2.2 Prediction Markets
2.3 Social Media
3 Related Work
4 Text to Beliefs
4.1 Text Regression
4.2 Raw Text to Features
4.3 Prediction Market Prices
4.4 Training and Evaluation
5 Forecasting Model
5.1 Posts and Markets Analysis
5.2 Feature Exploration
5.3 Summary
6 Forecasting Distance
6.1 Forecast Performance
6.2 Generalised Features
6.3 Performance Estimation
6.4 Summary
7 Discussion
7.1 Conclusion
7.2 Future Work
A.1 Text keywords
B Market Predictions
B.1 Market A
B.2 Market B
C Results for Market B
D Error Variance Graphs
Bibliography
Introduction
We use the belief of a population in society in many different ways, such as voting
for candidates in democracies, valuing companies through stock markets
and deciding which products are good and bad. These ways of measuring the belief of
a population have been established for centuries and developed as our societies have
evolved and needed a way to cooperate and solve large-scale problems. There are other
areas that would also benefit from this information such as in health, specifically in the
area of disease epidemics where health advisers need to know the belief of the population to help
react to outbreaks. For this to be beneficial in new areas, such as disease outbreaks, a
new way to aggregate the belief of the population needs to be developed.
Disease epidemics, such as influenza, spread rapidly around the world making them
hard to predict and manage. Influenza epidemics occur annually affecting between
5% and 15% of the population [27]; pandemics of new strains have also appeared
irregularly about three times a century over the last 300 years [39]. The most recent
was the outbreak of influenza A/H1N1, more commonly known as swine flu, which
started to spread at the end of 2008 and was first detected in April 2009. It was declared
a pandemic within a few months on 11th June 2009 by the World Health Organisation
(WHO). On top of managing the spread and cases of an epidemic, the public need to
know how to deal with it given their current beliefs. These beliefs will be constantly
changing and would need to be monitored in real-time to see how they are affected by
new information.
The main influence on a population’s belief is from the media, which is a very wide-
reaching term encompassing communication channels of information in any form. The
main influential forms today are TV, radio, newspapers and the internet. Corporations
in media have an interest in keeping their reading or viewing figures high and so report
on big issues that the public are interested in, potentially over-reporting on stories for
longer than they actually need. To differentiate themselves from other media sources
they push the truth as far as they can, making it harder for the population to know the
real truth. This makes it hard to know what the population’s real belief is and to shift
it from what they believe to what is true.
The largest change in media recently has been the growth of the internet which
has allowed information and news to spread rapidly around the world for all to see.
In particular, social media sites are among the most used on the internet,
making up 10% of users’ time online in 2009 and increasing rapidly. They are
also the fastest growing sector on the internet which is currently third behind search,
community portals and software developers in 2009 [8]. Social media allows people
to discuss current issues both rationally and irrationally in the public eye and comment
on all published content in the news and user-created content such as blogs.
It was assumed initially that the internet would bring a long tail distribution [1] so
people would have a wide range of beliefs about a wide range of issues. This theory has
since been analysed from usage data which has shown that even though the distribution
is a long tail, it is very top heavy causing the distribution to be swayed by popularity
[23] [11]. Research into the dynamics of social media has also found that it is more
likely to make beliefs that were extreme and improbable far more popular, causing
them to move up the long tail; these have been labelled as black swan events [6]. The
black swan is a term which was popularised in a book by Taleb [37], in which he also
states that the internet makes the amplification of the improbable far more likely.
These two problems of the media pushing the truth to extremes, and causing ex-
treme beliefs to be amplified, even if they seem improbable, mean that inaccurate ar-
ticles and beliefs can easily spread the wrong message and misinformation to a lot of
the population. This can quite easily cause more panic than is appropriate making the
problem harder to manage than is otherwise needed. This has been labelled as an in-
formation epidemic leading to a field called infodemiology which tries to identify the
knowledge gap between what experts know and what most of the population believes
[12]. This is what health advisers need to take into account when they decide how to
react to disease epidemics using advertising campaigns and other ways of informing
and influencing the belief of the population.
It has been shown for problems like this, where the solution requires the use of a
vast range of information, that aggregating the solutions from a wide range of people
in the population is more likely to achieve a better solution than a group of experts. A
group of health advisers trying to work out on their own how to advise the population
would therefore do better to aggregate the belief
of the population, as is done in other areas of our society, rather than relying on their
own intuition. The problem with this is that it requires a cheap and real-time solution
to measure the belief of the population.
One solution to this is prediction markets which, similarly to stock markets, find the
overall probability from the belief of a wide range of the population and have recently
been subject to thorough research. The problem with these is that they have to be
set up for a specific question and are used directly by people, which makes them less
suitable to use on a wide range of topics. Instead of having to directly interact with the
population we propose a novel solution which analyses user-created content in social
media and extracts information to similarly aggregate the belief of the population.
We extracted the beliefs from posts on two social media services, blogs and Twitter,
using features built by statistical text analysis. The beliefs were aggregated using a
linear regression model with prediction market prices as the target value. We used a web-scale
collection of blog and Twitter posts that was very noisy in comparison to collections
used by previous text regression papers. To evaluate the model we forecast the price of
prediction markets ahead of the training window and compared this to strong baselines.
The recent swine flu outbreak was used as a relevant example by forecasting the public
belief that the outbreak would become a pandemic and comparing this to related
prediction markets. We focused our research on the novel problem of how well our
models could forecast the belief away from a period of training data – we called this
the forecast distance. From the evaluation of our models we drew the following
findings:
• Text analysis can extract the belief from social media posts.
• Linear regression can aggregate the beliefs comparably to prediction markets.
• Our models on Twitter and blog collections outperformed a strong baseline.
• Twitter outperformed blogs using keyword expansion and multi-day counts.
• Our models outperformed the moving average up to a two week forecast dis-
tance.
• Our models performed better than random up to a five day forecast distance.
• Prediction market dependency reduces the advantages from our aggregation method.
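The pipeline summarised above (text features, linear regression against a market price, and evaluation at a forecast distance) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the keyword list, the single keyword-fraction feature, the toy posts and prices, and the function names are hypothetical and are not the features or data used in this thesis. It only shows the shape of the idea: fit on a training window, predict the price a given number of days after that window, and compare against a moving-average baseline.

```python
from statistics import mean

# Hypothetical keywords, chosen only for this illustration.
KEYWORDS = ("pandemic", "outbreak")


def keyword_fraction(posts):
    """Fraction of a day's posts that mention any keyword (one simple feature)."""
    if not posts:
        return 0.0
    hits = sum(any(k in p.lower() for k in KEYWORDS) for p in posts)
    return hits / len(posts)


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single feature."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        return y_bar, 0.0  # constant feature: fall back to the mean price
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b * x_bar, b


def forecast(daily_posts, prices, train_days, distance):
    """Train on the first `train_days` days, predict `distance` days after them."""
    xs = [keyword_fraction(p) for p in daily_posts[:train_days]]
    a, b = fit_line(xs, prices[:train_days])
    target_day = train_days - 1 + distance
    return a + b * keyword_fraction(daily_posts[target_day])


def moving_average(prices, train_days, window=3):
    """Baseline: mean of the last `window` prices in the training period."""
    return mean(prices[train_days - window:train_days])


# Toy data (invented): one list of posts and one market price per day.
daily_posts = [
    ["swine flu pandemic fears", "lunch was good"],
    ["nothing new", "weather"],
    ["pandemic declared?", "outbreak spreading"],
    ["outbreak news", "football"],
]
prices = [0.5, 0.3, 0.7, 0.5]

predicted = forecast(daily_posts, prices, train_days=3, distance=1)
baseline = moving_average(prices, train_days=3)
```

Increasing `distance` while keeping `train_days` fixed is what varying the forecast distance means in this sketch; the error of `predicted` against the true price can then be compared with the error of `baseline` at each distance.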
We explain these findings through our research and results in the following chapters
as outlined below.
• Chapter 2 discusses the background theory behind wise crowds, prediction
markets and social media which form the base of the work in this thesis.
• Chapter 3 reviews work related to this thesis on text mining for classification
and regression problems.
• Chapter 4 discusses how we set up our solution and the implementation of
the framework we used to evaluate it.
• Chapter 5 explores the analysis of the collections and the features we developed
to model this problem.
• Chapter 6 evaluates the performance of the features we developed with
a varying forecast distance and discusses the feature generalisation methods we
tried.
• Chapter 7 concludes the findings of the thesis and discusses future directions
that our work has highlighted.
Background
In this chapter we discuss why it is better to use the belief of the population rather than
the intuition of a few experts. This leads into the use of prediction markets to aggregate
information from a population and finally why social media could be used as another
form of aggregation.
2.1 The Wisdom of Crowds
When solving a problem you need to take into account as much information as you
can to produce the best possible answer. For a small group of health advisers deciding
how best to advise the population on new disease epidemics they would need to know
as much as they can, including the current beliefs of the population, to give the best
advice. As a group they will have their combined knowledge and intuition about the
population’s beliefs, which would be better than most individuals. But it has been
shown in many scenarios that, for problems like this, aggregating the solutions from
a wide range of people is better than getting a group of experts to agree on a single
solution.
This observation has been researched and studied in a range of different academic
disciplines from sociology to economics and has recently been highlighted in “The
Wisdom of Crowds” by Surowiecki [36]. The book contains a number of examples
showing where the aggregation of knowledge from a large group is better than the solution
from a small group or single expert. One of these examples is from the show
“Who Wants to Be a Millionaire?” where the “ask the audience” option chooses the correct
answer 91% of the time compared to 65% for “phone a friend”. Another is the disappearance of the
US submarine Scorpion in 1968 where, instead of getting a small group
to come up with a solution to its location, a large diverse group of mathematicians,
submarine specialists and salvage men were asked for their own solutions. The combi-
nation of all their solutions was found to be 200 yards from where the submarine was
finally found. A larger-scale example was shown by the Challenger disaster in 1986,
when the shuttle blew up on launch but it was not immediately known why. The stocks of four
major companies involved with design and construction all decreased, with three down
3% and the fourth down 12%, apparently singled out by the markets as the reason
for the disaster. Six months later it was found that the fourth company was responsible
for the disaster, so the aggregation of information and intuition from a wide range of
people through the stock markets had found the reason long before investigators did.
From research into the wisdom of crowds Surowiecki highlighted four properties
that a population needs to be successful in cognition problems such as the examples
given above and the problem we are solving with the belief of the population. These
four properties are Diversity where the crowd needs to have a wide range of knowl-
edge, interpretations and opinions; Independence where an individual’s solution is not
solely determined by others; Decentralisation where individuals’ solutions come about
separately without direction as a whole; and Aggregation where there is some method
to turn the individual solutions into a collective one.
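The Aggregation property can be made concrete with a small sketch. It assumes the simplest possible aggregation method, the arithmetic mean, and uses made-up guesses; neither is taken from Surowiecki's examples. For any set of guesses, the error of the mean can never exceed the average error of the individuals, and when guesses fall on both sides of the truth (a diverse crowd) the errors partly cancel and the mean does strictly better.

```python
from statistics import mean


def crowd_estimate(guesses):
    """Aggregate individual guesses into one collective estimate (simple mean)."""
    return mean(guesses)


def crowd_vs_individuals(guesses, truth):
    """Return (error of the crowd mean, average error of the individuals)."""
    crowd_error = abs(crowd_estimate(guesses) - truth)
    individual_error = mean(abs(g - truth) for g in guesses)
    return crowd_error, individual_error


# Made-up guesses scattered on both sides of a true value of 100.
guesses = [70, 85, 95, 110, 130, 140]
crowd_error, individual_error = crowd_vs_individuals(guesses, truth=100)
```

Here the individual guesses are off by between 5 and 40, while their mean is off by only 5. If every guess erred in the same direction, for example because the individuals were not independent, the two errors would be equal and aggregation would bring no gain.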
If health advisers could aggregate beliefs and knowledge from a diverse and in-
dependent sample of the population in a decentralised way then they would be able
to gather better information to help them advise the population on disease epidemics.
Three ways of aggregating the beliefs of the population used in society today are vot-
ing, polls and stock markets. Voting is used widely in democracy to decide on new
laws and who will have power in a given area. It is good for these tasks but is very
expensive and time consuming to set up and run, and only answers a very small set
of important topics each time. There is also a lag of a few days between voting and the
results and as soon as the voting is concluded, the results will be out of date, so there
is no way to get real-time updates with it. Polling overcomes some of these problems by
sampling a subset of the population to ask a question; this makes polls cheaper and allows
them to be run more frequently but they are still limited to a few questions and are still
too expensive to become more abundant on less important topics. Stock markets are
the third way which address a slightly different domain of question, in setting a value
rather than making a choice, but they still aggregate the belief of the population in a
far cheaper way than with voting or polls. They are also updated in real-time for a far
larger range of questions (or companies) which would make them far more useful to
aggregate the belief of the population on a wider range of problems.
Stock markets on their own only set the value of a company so they would not be
good to measure the belief of the population for health advisers but there has been a
lot of research into prediction markets as an information aggregation method which
was also suggested by Surowiecki [36] as a solution for problems like this. Prediction
markets, or information markets, use the model of the stock markets to find the proba-
bility of different outcomes for a given question. These would be useful to find a value
for the belief of the population from the population itself, rather than a small group of
experts.
2.2 Prediction Markets
Prediction markets were born out of the idea of allowing students to have real expe-
rience on trading markets at the University of Iowa. In 1988 their business school
created the Iowa Electronic Markets (IEM) where students could buy up to $500 in
contracts relating to the outcome of questions, mostly on politics and financial topics.
For every question they created a winner-takes-all contract market where a $1 contract
on a given answer to a question would cost $p, where p was the market’s expectation of
the probability that the answer would be correct. If the answer becomes true then you
would get $1 for every contract you had, and nothing if it was false. So for example if
the question was “Will there be a pandemic of H1N1 Influenza before July 1st” then if
the market currently had an expectation that the probability of there being a pandemic
was 0.6, it would cost $0.60 for a $1 contract. This forms the basis of how prediction
markets are used to aggregate information from the population. Traders, who could be
anyone, have an incentive to reveal their information as they can earn money trading
on the market, allowing their knowledge to be aggregated with others.
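The arithmetic of a winner-takes-all contract can be sketched as follows. The function names are hypothetical, and the expected-profit calculation assumes a risk-neutral trader, which is a simplification: it ignores the risk aversion of real traders. The sketch shows why a trader whose own probability estimate differs from the market price has an incentive to trade, which is how their information reaches the price.

```python
def contract_cost(price, n_contracts):
    """Cost of buying n winner-takes-all contracts at the market price p."""
    return price * n_contracts


def payout(n_contracts, outcome_true):
    """Each contract pays $1 if the answer turns out true, nothing otherwise."""
    return float(n_contracts) if outcome_true else 0.0


def expected_profit(price, belief, n_contracts=1):
    """Expected profit for a risk-neutral trader whose probability estimate is `belief`."""
    return (belief - price) * n_contracts


# The example from the text: the market expects a pandemic with probability 0.6.
cost = contract_cost(0.6, 10)        # ten $1 contracts at $0.60 each
win = payout(10, outcome_true=True)  # paid out if the pandemic is declared
edge = expected_profit(0.6, belief=0.75, n_contracts=10)
```

A trader who believes the probability is 0.75 expects to gain about $0.15 per contract by buying at $0.60, and a trader who believes it is lower than 0.6 expects to lose by buying, so in this simplified view buying pressure exists only while the price is below what traders believe.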
Along with winner-takes-all contracts there are also index and spread contracts [42]. An
index contract is where the amount that the contract pays depends on an external continuous
value, such as the number of votes a candidate gets. Finally, a spread contract is
where the contract is based on the threshold of an event occurring, such as the num-
ber of votes a candidate gets or the value of a stock market. The price of a spread
contract is fixed but the spread of values in the contract can change. These different
types of markets allow different types of information to be found about the probability
distribution of the markets.
Initial work on prediction markets stated that the price of the contracts relates to
the expected value of the probability for the outcome of the contract. Analysis of this
statement found that to interpret this from a theoretically sound point of view would
require the trader’s risk aversion model [24]. The risk aversion model for a trader is the
way in which they justify the risk of their belief being correct against the cost of the
contract; this would vary for different traders based upon values such as their wealth.
If a trader’s wealth was needed to interpret the expected probability from the markets’
prices it would be very hard to use them in practice. It has been shown by Wolfers and
Zitzewitz that there is strong evidence for traders having heterogeneous beliefs so that
the average belief in a market is a useful estimate of the probability [43]. Based on this
theory it is possible to interpret the expectation of the probability of an outcome as the
mean belief of the traders. This holds unless there is an unusual risk aversion, such
as when the price tends towards 0 or 1 when the truth of the outcome may already be
known.
One of the problems with prediction markets is that under many laws they are con-
sidered betting which means they are heavily monitored and regulated which makes
setting up the money markets hard and costly. The Iowa Electronic Markets got over
this by making a deal with the government regulators that it would be used for edu-
cation and research and had a $500 cap, but no one else is able to make such a deal
now. To get over this problem there also exist markets where people play with virtual
money to be ranked against other users; these have grown in number recently as they
can be very easily run online and as such attract many more users. Economists thought
that only money markets would work to aggregate information accurately due to having
a financial incentive. Comparison between play and money markets on the same ques-
tions has found that neither was systematically more accurate than the other over 208
experiments [34]. This meant it was more the motivation than the money allowing the
markets to function. This shows that it is possible to use play markets online to make
prediction markets far more accessible as long as there is some motivation for people
to trade. There were also advantages to play markets where the wealth traders have
is accumulated from continual good predictions, whereas in money markets wealth
could have been from elsewhere. This made a difference as the risk aversion model
that traders use was thought to depend on their wealth so play markets would have a
better basis for weighting predictions by the contract prices than money markets.
There are two categories of questions: ones with discrete and ones with continuous
answers. This affects how the performance of prediction markets can be evaluated
because a discrete market, such as “Will it rain tomorrow?”, will always have a
50% chance of being 100% wrong, whereas a continuous market, such as what
value a stock market will close at tomorrow, will only be wrong by the percentage
difference. This would cause reliability problems when trying to predict discrete events and
act upon the outcome of the markets, but in this thesis we are only using the real-time
price of the markets as the belief of the population and not the overall outcome to make
a decision.
The theory that prediction markets perform well in the aggregation of information
stems from the efficient market hypothesis from economics which states that a
completely efficient market would be the best predictor of an event and no other
information could improve on its prediction [42]. This is really only a theory and no
market can ever be truly efficient; assuming so can be devastating, as past stock market
crashes have shown. Even with a certain level of efficiency it is possible for prediction
markets to perform well at aggregating information as has been shown in a number of
real-world markets.
The research into real-world prediction markets initially focused on the IEM for
predicting presidential elections. It was found that over past presidential elections the
prediction markets had an average error of 1.5% whilst polls had 2.1%, so prediction
markets could predict the outcome of the elections with more accuracy than polling
[2]. A market called the Hollywood Stock Exchange has also been found to
accurately predict weekend box-office earnings and to predict the Oscar winners about
as accurately as an expert panel can [28]. Prediction markets have also started to be
used inside companies to predict product sales such as in Hewlett-Packard where they
could predict printer sales using an internal market better than their current process
[5]. More recently prediction markets have started to be used for health applications
with one of the first being predicting trends in influenza [31]. In this work the authors focused
on aggregating the information from a wide range of lab scientists around the world,
combining their current work and past intuition on how influenza trends occur. They found that
the prediction markets could predict influenza activity two to four weeks in advance of
reports and more accurately than is possible using historic data. In this thesis we used
two online prediction markets on swine flu questions which are labelled market A and
B:
A : What’s next for the swine flu alert level: 4 or 6?
B : Will Influenza A (H1N1) (aka “swine flu”) grow into a pandemic in 2009 as
feared?
The price trend from these two markets showed a strong correlation between key news
events and the price of the market over time. This can be seen in figure 2.1 where the
key news events in the list below can clearly be mapped on to the peaks and troughs
of the trend line for the price. The remaining movements in the price trends are due to
either information that may not have been widely known or the inherent uncertainty in
the information in the prediction market.
Labels for news events in figure 2.1:
A: 27/04/2009, Alert level 4 by the World Health Organisation
B: 30/04/2009, Alert level 5 by the World Health Organisation
C: 03/05/2009, Swine flu reported in decline in Mexico
D: 04/05/2009, UK school closed due to swine flu and UN secretary
general “no present plan to raise alert to level six”
E: 10/05/2009, First case confirmed in China.
F: 18/05/2009, Sudden rise in cases in Japan
G: 22/05/2009, Mexico lifts all restrictions from swine flu
H: 26/05/2009, Sudden rise in cases in Australia
I: 01/06/2009, First cases in 5 new countries, Luxembourg, Ukraine,
Nicaragua, Egypt and Bermuda
J: 11/06/2009, Pandemic announced by the World Health Organisation
The previous analysis and the research done here show that there is real-world evidence
for the ability of prediction markets to aggregate information,
both generally and for the specific application of health that we are investigating here.
Online prediction markets also satisfy the four properties for the population to be
wise. These are: Diversity from a wide range of users on the internet; Independence as
the users are dispersed and do not influence each other; Decentralisation as the way the
users come up with their predictions is not coordinated or organised; and Aggregation
through the price of contracts in the markets. So there is a sound basis on which we can
use the expectations from prediction markets as a value of the belief of the population.
One problem with prediction markets is that they still require people to interact with
them directly which limits the fraction of the population that uses them. Individual
markets also have to be set up for each question being answered which limits the
number of questions they can answer. Recently social media has become a dominant
place for people to voice their beliefs; as a wide range of beliefs are debated there, it could
be another source for the aggregation of beliefs that overcomes these problems.
2.3 Social Media
Social media are rooted in user-generated content which has had a massive boost in
the last few years through free online services. These services allow communication
between people on the web in a variety of ways such as broadcasting content publicly
or to a closed group of contacts and sometimes a mix of the two. The content comes
in many different forms such as larger articles in blogs, short messages on Twitter
and Facebook, and now many more media formats such as video on YouTube are
starting to become very popular. On top of this almost every item on these services
can be commented on which adds even more content being produced by users. We
are focusing on text-based content in this thesis, specifically in the form of blogs and
messages on Twitter; we will refer to all the content on these services as posts from
now on.
Blog is a term derived from web log which is where people write about anything
they want with a series of posts. They tend to be on specific topics but also contain
more general posts about thoughts, experiences and beliefs that people have in society
and the world around them. The quality of blogs ranges from professional journalists
all the way down to people posting about their daily lives, thus creating a wide range
of posts with varying quality and authority on subjects. For example the types of posts
range from mentioning swine flu unrelated to health:
Look Out! Hes Behind You, Hes Got Swine Flu
There are several rules that follow an impending pandemic: first comes
break out, then comes media freak out, and finally a mention in a hit rap
song so you can rock out.
Mike Skinner aka The Streets has a brand new hit that is catchy enough to
spread faster than this Swine Flu virus itself. It seems clear that Skinner is
voicing what happens when people are uninformed and only listen to the
mass hysteria spread by the media.
. . .
to medical advice and information:
H1N1 Swine Flu Prevention Tips and Precautions
H1N1 influenza is a pandemic viral disease caused by type A influenza
viruses. The disease can spread from human to human and so we must
take appropriate preventive measures personally.
Promote Personal Hygiene
the population in posts. Something else unique to Twitter is that posts are limited to
140 characters; this is because Twitter wanted to integrate its system with text messages,
which are limited to 160 characters, so the post can be 140 characters and the username
20 characters. The limit in text messages came from a combination of the limits of the
technology when it was invented and some research into the fact that most statements
or questions are less than 160 characters and postcards (and telegrams at the time) were
usually less than 150 [25]. For Twitter this limit means that people are constrained to
posting only short messages, which causes them to condense and summarise what they are
posting about. It also causes them to post more frequently as it takes far less time to
post than with blogs. Twitter has been shown recently to report on current events faster
than traditional media so it should be a more timely source of information, even though
it might be a lot more noisy compared to traditional media. As with blogs the range of
posts on Twitter is very wide but they are a lot shorter and more focused. For example,
the following posts highlight these differences, with each bullet being a separate Twitter
post. Some of them are related to the key news events and others are just on the topic
of swine flu.
• whats the symptoms of swine flu?
• well, I haven’t exactly started going ’oink’ yet, so maybe it’s not swine flu :o)
but tx so much 4 caring LOL :o)))
• and no it’s not the Swine flu, it’s the hangover after drinking too much yesterday.
I shouldn’t have mixed.
• WHO Raises Pandemic Flu Alert Level To Phase 5 http://ow.ly/4rlr
• “WHO boosts pandemic alert level to 5”, because 10 more people got sick?
Lmao
• WHO has raised the pandemic alert level to 5. Now is the time to prepare. See
our tips & video on www.redcross.org #swineflu (via @RedCross)
• Japan Swine Flu Cases Top 100; Schools Shut, Workers Sent Home http://tinyurl.com/pakfcm
• W.H.O. May Raise Alert Level as Swine Flu Cases Leap in Japan: The increas-
ing number of swine flu cases in Japan.. http://twurl.nl/8meiv8
• With no new cases in a week, Mexico City ends swine flu alert http://u.mavrev.com/4ze2
• @RicoLovesMAC Mexico!? Do they know about the swine flu? Or is it sorted
and cured now over there?
• WHO raises HiN1 flu level to 6: Pandemic http://bit.ly/mLmQn
• funny how the swine flu is now officially a full-blown global pandemic, but no
one seems worried anymore since the media isn’t covering it.
• so the #who has declared #swineflu a pandemic. but what does that mean, ex-
actly? should i be more concerned today than i was yesterday?
It can be seen that some of the posts are just comments which link to other sites on
the web and a lot of the posts are also questions forming parts of conversations. These
show one of the key differences from blogs as Twitter is more conversational leading
to a different type of content. So blogs and Twitter highlight two of the main different
types of social media content that exist today.
These two examples from social media show that these services contain a lot of
data of which some is potentially informative for the current belief of the population,
although this is encoded as natural language. The services also satisfy the four proper-
ties needed for the population to be wise. These are: Diversity as there is a wide range
of people posting their beliefs; Independence as users are spread out and so do not
influence each other; Decentralised as what and how people post is not coordinated
at all; and Aggregation of the individual beliefs through text mining which we inves-
tigated in this thesis. If aggregation through text mining is possible to achieve then
there are many benefits to extracting the belief from social media over prediction mar-
kets and other methods mentioned above. As people do not need to interact directly
with the aggregation system more people’s beliefs can be aggregated together than in
prediction markets, and other methods mentioned, producing a more accurate representation
of the belief of the population. The data is also general and not gathered for
a specific question to be answered; therefore it would be possible to use social media
as a source of information for a wide range of beliefs. One problem with an indirect
system is that there is no direct motivation to people posting for the aggregation of a
given type of information; this means that there is no incentive to be truthful and post
useful information. Text mining is the process of extracting information from
natural language text. This type of problem has been widely researched in a range of
domains inside information retrieval. In this thesis we are focusing on the problem of
predicting a continuous value as the information to be extracted; this is known as text
regression.
relative to the previous days to take into account news staying around in the headlines.
They also tried to capture the entity phrases using keywords to focus the unigrams in
sentences and finally used natural language processing techniques to find words that a
set of keywords depended on in a parse tree, which allowed them to find entity depen-
dency outside single sentences. They also used market price history as a feature in the
model to allow for market trends over time. They found that the dependency parses
produced the best performance over all the features and that this performed better than
their baseline. One problem with using NLP to produce features is that it takes more
CPU time; in their experiments they used 50 news articles per day, but scaling up to a
web-sized collection would not be feasible.
Other papers have tried to focus on extracting the sentiment in text and turn this
into a polarity on a belief to use for classification. Kroha et al. looked at forecasting
stock markets from business news and focused on long-term trends [20]. They created
features by finding positive and negative words using complex regular expressions as a
filter to count frequencies. They then used these with a Naive Bayes classifier to predict
the direction of movement in the market. They found that this method did not work
well due to having far larger numbers of positive than negative words and concluded
that they could not predict long-term trends in stock markets reliably. Similarly Kim
and Hovy looked at predicting election results based upon opinions on the web [18].
They gathered data from a web forum and used the party that each user supported to
build supervised data for the system. They used features which extracted the valence of
each mention of a party in posts and compared this to unigrams. They used a support
vector machine to train the model and classify test data. They found that their sentiment
model outperformed the baselines and unigram features.
Recently there has been interest in extracting related information about disease
surveillance using regression but from text in web search logs rather than text content
collections. This was first looked at by Eysenbach where they indirectly tracked search
terms by using clicks on adverts using Google Adwords on Canadian Google search
results [13]. The adverts were for searches with the “flu” or “flu symptoms” keywords
and advertised giving information about influenza. The click data was fitted to a linear
regression model to predict reports from both doctors and labs. They found the data
correlated better than reports from doctors but not as well as the lab reports. Similar
work was done by Polgreen et al. but using data directly from Yahoo search logs over
a 4-year period between 2004 and 2008 [30]. They tracked the frequency of searches
containing hand-picked terms related to influenza. This was done by using the fraction
of searches containing these terms over the total number of searches. The frequency
was calculated over a week of searches and they estimated the location within the US
from IP addresses. A linear regression model was then fitted to the data with a target
of the lab reports and influenza-related mortality rate. They found that using a week of
searches produced the strongest correlation with data from the labs. They also fitted a
model for each of the nine US census regions and found that even though there was still
a correlation it was not statistically significant enough over all nine regions to be reliable.
The latest research with search logs was done at Google by Ginsberg et al., who
used five years of search logs covering the period from 2003 to 2008 [14]. Rather than
hand picking influenza-related search terms as in [30] they designed an automated
method to pick them. This was done by calculating how well each unique search
term could predict the number of influenza reports from doctors for each region of the
US. The variation of the search terms over the regions was also taken into account
to reward terms which showed a stronger variation as it was unlikely all the regions
would be changing at the same rate. They found that the top 45 queries produced the
best fit so they used these as the parameters of the model. They then fitted a linear
model to weekly frequencies of these terms. They tested their model in the 2007–08
flu season and found that each week they could accurately predict the reports from the
CDC. These papers showed that it is possible to extract health related information from
internet datasets.
One of the first papers to use text regression was by Yang and Chute [44]. They did
not use regression as the goal of their paper but used the weights from linear regres-
sion to find weights for important keywords on a topic. This was done using a linear
least squares fit to produce the weights which they then used to find the importance of
words. They found that their method showed improvement over all other methods they
tested showing that text regression works as a way to extract information from natural
language. Blei and McAuliffe looked at predicting movie ratings and website popularity
from reviews and text descriptions [3]. They did this by modifying latent Dirichlet
allocation (LDA) to allow it to be trained on supervised continuous target values which
allowed the resulting supervised LDA (sLDA) model to be used for regression prob-
lems. They then used the sLDA model to select latent features from the text unigrams
optimising for the regression task. They compared this to using a normal LDA model
with a linear regression model and lasso, which is a discriminative regularisation re-
gression technique. They found that sLDA performed better than the standard LDA
method and lasso on both of the data sets.
There has also been some recent work using linear regression to directly predict
a continuous value. Schumaker and Chen looked into predicting the value of stock
markets using financial news [33]. They investigated a bag-of-words model with uni-
grams, noun phrases, named entities and proper nouns to use as features. They found
that proper nouns gave the best performance on their data set so they evaluated these
using support vector regression (SVR) to train a model on past news. They found
that their model improved over the baseline of the price momentum of the shares and
that combining their model with the baseline improved the performance even more.
One problem with their work, similarly to [21], was that they used cross-validation to
train the model, which was not valid with time-series data. Kogan et al. looked into
another financial application, predicting risk from financial reports using text
regression [19]. They compared three types of unigram features: tf, tf.idf and
log(tf). They trained their model using SVR with a linear kernel to compute the
weights using data for the previous five years. They found that most of their features
came to within 5% of the baseline but only one gave improved performance; when
combined with the baseline, performance improved for all the features.
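As an illustration of the three unigram weightings named above, the following is a minimal Python sketch using common textbook definitions. The function name `term_weights`, the use of log(1 + tf) to keep zero counts at zero, and the idf form log(N / df) are assumptions for illustration and are not necessarily the exact forms used in [19].

```python
import math
from collections import Counter

def term_weights(docs):
    """Return tf, tf.idf and log(1 + tf) weights for each tokenised document."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({
            "tf": dict(tf),
            "tfidf": {t: c * math.log(n_docs / df[t]) for t, c in tf.items()},
            "logtf": {t: math.log(1 + c) for t, c in tf.items()},
        })
    return weighted

# Two hypothetical tokenised documents.
docs = [["swine", "flu", "flu"], ["flu", "pandemic"]]
weights = term_weights(docs)
```

A term that occurs in every document, such as "flu" here, receives a tf.idf weight of zero under this definition, which is one reason the choice of weighting can matter for regression.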
Text to Beliefs
In this chapter we lay out the framework upon which we tested and evaluated our work.
We give details about the motivation behind the solutions and methods involved in all
parts of the framework and summarise our implementation. We also give an overview
of the datasets that we used and how they were obtained.
4.1 Text Regression
The core problem in this thesis is to obtain a real valued time-series from a dated
corpus of text documents. To work towards a solution we approached this as a linear
regression problem which allowed us to train a model which maps from a feature
vector, describing the posts each day, to a real value, quantifying the belief of the
population. In a linear regression problem we learn a weighted linear
combination of features, giving a function f(x) which maps a feature vector
x to a real value. Given the weight vector w and bias term w_0 the problem is defined
as:
f(x) = w \cdot x + w_0    (4.1)
where we want to learn the weight vector w given a set of training vectors {xi} and
target values {yi}. The weight vector is found by minimising an error function over the
set of training vectors and target values. Once we have a weight vector from training
the model we can use equation 4.1 to predict real values given a new feature vector.
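To make equation 4.1 concrete, the sketch below evaluates the linear function for a single day. The weights, offset and feature values are hypothetical numbers chosen for illustration; in the thesis the weights come from training, not from hand-picked values.

```python
def predict(w, w0, x):
    """Equation 4.1: f(x) = w . x + w0 for one day's feature vector x."""
    return sum(wj * xj for wj, xj in zip(w, x)) + w0

# Hypothetical weights, offset and a three-feature day.
w = [0.5, -1.0, 2.0]
w0 = 10.0
x = [1.0, 0.0, 2.0]
prediction = predict(w, w0, x)
```

Each feature contributes its value scaled by its weight, so a large positive weight marks a term whose frequency pushes the predicted belief up.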
To find the weights for a given training set we chose to use support vector regres-
sion with a linear kernel. We chose to use SVR because we have a problem with a
very high dimensional feature space which support vector machines have been shown
to perform well on for text classification [17] and specifically with regression [19].
the posts and prediction market we needed to choose a time-resolution to convert these
to discrete time periods of feature vectors and targets. We chose to use a day resolution
as this was adequate to describe the changes in prediction markets over time.
4.2 Raw Text to Features
The feature vectors x are sets of real-valued components {f_j} where each value f_j is
the strength of that component j for a given day. To produce feature vectors from text
we needed to convert the raw text to real values; this was done by using frequency
counts over the terms in each post in a range of different ways that we explore in
chapter 5. Before this was possible to do we needed to process the raw data into a
more consistent form.
The raw data for blog and Twitter posts were gathered from the web using a web
crawler that was seeded with an initial list of feeds for blogs and usernames for Twitter.
The crawler gathered new posts every hour from each source and looked for new feeds
and usernames to crawl in the future. The main content of each post was extracted and
written to a file for each crawl with the date, URL and content along with any tags that
were present in blog posts. This formed the raw data that we started with, and for this
thesis we used data from the start of April until June 11th. In its raw compressed
form the data was 5.5GB for Twitter and 26GB for blogs, so due to the size
of the datasets we could only use simple processing functions to pre-process the data
before extracting features.
Two key problems with the data were unwanted tokens such as html tags and the
date order, as they were stored by the crawled date not the posted date. To make the
data more consistent we processed each post in the raw data to clean it up and then
wrote them out to a compressed file for the day they were posted. We processed all
the data from the first date we needed to the latest available crawl date to make
sure we had all the posts available between the dates we needed; any crawls before the
start date would not contain posts we needed. This meant that we had a file of clean
posts for each day for both Twitter and blogs that we could more easily extract features
from. To strip the html tags we initially tried to parse each blog post using a range of
html parsers to extract the text but found this did not work well as they were slow and
due to the number of blog posts some contained malformed html that caused errors.
So instead we stripped all html tags using a simple regular expression; this might have
been less accurate than a parser but was quicker for the large data set and it did not
Property                      Blogs     Twitter
Size (GB)                     16        4
Total posts (million)         10.6      50.0
Day average posts             203,054   954,920
Total words (million)         2,603     654
Day average words (million)   50.0      12.6
Post average words            246       13
Table 4.1: Data collections summary
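The cleaning step described in this section, stripping html tags with a regular expression and regrouping posts by the day they were posted instead of the day they were crawled, can be sketched roughly as follows. The regular expression, the function names and the example posts are assumptions for illustration; the thesis does not give the exact expression that was used.

```python
import re
from collections import defaultdict
from datetime import datetime

TAG_RE = re.compile(r"<[^>]+>")   # crude: anything between angle brackets
SPACE_RE = re.compile(r"\s+")

def clean_post(raw_html):
    """Strip html tags with a regular expression and collapse whitespace."""
    text = TAG_RE.sub(" ", raw_html)
    return SPACE_RE.sub(" ", text).strip()

def group_by_posted_day(posts):
    """Group cleaned posts by the day they were posted, not crawled."""
    by_day = defaultdict(list)
    for posted_at, raw_html in posts:
        by_day[posted_at.date()].append(clean_post(raw_html))
    return dict(by_day)

# Hypothetical (posted time, raw content) pairs in crawl order.
posts = [
    (datetime(2009, 4, 28, 9, 30), "<p>WHO raises <b>alert</b> level</p>"),
    (datetime(2009, 4, 27, 22, 5), "whats the symptoms of swine flu?"),
    (datetime(2009, 4, 28, 17, 0), "<div>Schools shut<br/>in Japan</div>"),
]
days = group_by_posted_day(posts)
```

A pattern like this does not handle comments, scripts or angle brackets inside attribute values, which is the accuracy trade-off against a full parser that the text mentions.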
of tokens in the data and may involve further text processing. The framework had
a module for each type of feature we used, which was called for each day to extract
the features from posts in that time period and return them along with the number of
posts. As the number of posts varied each day, the feature counts were normalised by
the total number of posts that day using equation 4.5. The features were also filtered
by a threshold for a given number of counts each day; from preliminary experiments
we set the threshold at 20 but also investigated how performance varied with this for
given features in chapter 5.
f_j = \frac{\text{term frequency}}{\text{total posts}}    (4.5)
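As a rough illustration of equation 4.5 together with the daily count threshold, the sketch below builds unigram features for one day. Whitespace tokenisation, lower-casing, the function name `daily_features` and treating the threshold as "at least `min_count` occurrences" are assumptions made for illustration.

```python
from collections import Counter

def daily_features(posts, min_count=20):
    """Unigram features for one day, normalised by the number of posts (eq. 4.5)."""
    counts = Counter(token for post in posts for token in post.lower().split())
    total_posts = len(posts)
    features = {
        term: count / total_posts
        for term, count in counts.items()
        if count >= min_count          # drop terms below the daily threshold
    }
    return features, total_posts

# Tiny hypothetical day; the threshold is lowered so that something survives.
posts = ["swine flu alert", "flu season", "no flu here"]
features, n_posts = daily_features(posts, min_count=2)
```

Dividing by the number of posts makes days with very different posting volumes comparable, which matters because the volume spiked around the WHO announcements.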
Linear regression also needed a uniform vector space over all the days, so the compo-
nents of the feature vectors were combined to produce a uniform feature space. The
feature vectors were expanded into this space adding components of zero value which
they did not have before. The feature vectors were also filtered through a threshold
on the number of non-zero days to be combined into the uniform space; from prelimi-
nary experiments we set this threshold at 2. This was to make sure that the space had
features that varied over multiple days, as a feature with only a single day non-zero
value would not be any use in regression. To use these feature vectors in SVR they
also needed to be standardised so the values over each component had a mean of 0
and standard deviation of 1; this is so that the range of different features in the space
will be comparable, allowing the SVR to perform better. This was done by finding the
mean µ and standard deviation σ of each component then standardising each value by
equation 4.6. The resulting features generated for each day could be used along with
target data for training and evaluation with SVR and were written out to a file to be
used in experiments.
Figure 4.2: Feature generation framework
\tilde{f}_j = \frac{f_j - \mu}{\sigma}    (4.6)
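The expansion into a uniform feature space, the filter on the number of non-zero days and the standardisation step can be sketched as below. This is an illustrative sketch only: the function name `build_matrix`, the use of the population standard deviation and the handling of constant columns are assumptions, and each value is divided by the standard deviation so that every component ends up with mean 0 and standard deviation 1.

```python
from statistics import mean, pstdev

def build_matrix(daily, min_nonzero_days=2):
    """Expand per-day feature dicts into one uniform space, then standardise."""
    # Keep only features that are non-zero on enough days.
    terms = sorted(
        t for t in set().union(*daily)
        if sum(1 for day in daily if day.get(t, 0.0) != 0.0) >= min_nonzero_days
    )
    # Missing components become zero so every day has the same dimensions.
    rows = [[day.get(t, 0.0) for t in terms] for day in daily]
    for j in range(len(terms)):
        column = [row[j] for row in rows]
        mu, sigma = mean(column), pstdev(column)
        for row in rows:
            # Standardise each value; a constant column is mapped to zero.
            row[j] = (row[j] - mu) / sigma if sigma > 0 else 0.0
    return terms, rows

# Three hypothetical days of normalised term frequencies.
daily = [{"flu": 0.2, "pig": 0.1}, {"flu": 0.4}, {"flu": 0.6, "rap": 0.3}]
terms, rows = build_matrix(daily)
```

Here "pig" and "rap" are each non-zero on only one day, so only "flu" survives the filter.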
This framework was created in Python to allow it to be prototyped quickly and to be
more flexible to load modules for different features. Due to the fact that each day was
independent we also processed multiple days in parallel to speed up feature generation,
with the final stage to produce the feature vectors done in series; the parallel processing
used the Parallel Python module. A diagram of the feature generation framework is shown
in figure 4.2.
4.3 Prediction Market Prices
Prediction market prices were used as the target values for regression. We used two
similar questions related to the belief that H1N1 influenza will turn into a pandemic
from the online market site Hubdub. On the site they have markets with discrete ques-
tions which can have a yes or no answer or multiple outcomes. Users start off initially
with H$1000 (Hubdub dollars) when they register with the site and get H$20 for each
day they visit the site, giving them incentives to trade and to return to the site in the
future. The markets use contracts that are similar to winner-takes-all but with the price the
contract pays depending on the final percentage for the answer that was correct. For
example, if you bought an H$1 contract for an answer that had a probability in the market
of 0.4, and this was the correct outcome with a final probability of 0.8, then the price
you get is:
H$1+(0.8−0.4)×H$1 = H$1.4
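The worked example above can be written as a small function. This sketch generalises from the single example in the text, so the function name `hubdub_payout` and the assumption that the same rule applies to other stakes and probabilities are illustrative only and not confirmed by the source.

```python
def hubdub_payout(stake, price_at_purchase, final_price):
    """Payout for a contract on the correct outcome, following the worked example."""
    return stake + (final_price - price_at_purchase) * stake

# The example from the text: an H$1 contract bought at 0.4 that closes at 0.8.
payout = hubdub_payout(stake=1.0, price_at_purchase=0.4, final_price=0.8)
```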
The price of the markets is calculated using Hanson’s market scoring rule [15], which is
used to decide how to combine forecasts from the contracts traders had taken out.
Hanson’s rule is based upon logarithmic market scoring, which he found has a number of
desirable properties for prediction markets and is one of the most commonly adopted.
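For readers unfamiliar with the logarithmic market scoring rule, the sketch below shows its standard cost function and price formula. The liquidity parameter `b`, the share quantities and the function names are hypothetical; the configuration Hubdub actually used is not given in the text, so this is a generic illustration and not a description of the site's implementation.

```python
import math

def lmsr_cost(quantities, b):
    """LMSR cost function C(q) = b * log(sum_i exp(q_i / b))."""
    m = max(q / b for q in quantities)      # subtract the max for numerical stability
    return b * (m + math.log(sum(math.exp(q / b - m) for q in quantities)))

def lmsr_prices(quantities, b):
    """Instantaneous prices p_i = exp(q_i / b) / sum_j exp(q_j / b)."""
    m = max(q / b for q in quantities)
    weights = [math.exp(q / b - m) for q in quantities]
    total = sum(weights)
    return [w / total for w in weights]

b = 100.0                                   # hypothetical liquidity parameter
before = [0.0, 0.0]                         # shares outstanding for yes / no
after = [20.0, 0.0]                         # a trader buys 20 "yes" shares
trade_cost = lmsr_cost(after, b) - lmsr_cost(before, b)
yes_price = lmsr_prices(after, b)[0]
```

The prices always sum to one, so they can be read as the market's current probabilities, and buying shares in an outcome raises its price by an amount controlled by `b`.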
The market prices were accessible from Hubdub via a web service API which allowed
us to get an XML file containing the price changes for every time a new forecast
was added to the market. To use these as targets with regression they needed converting
to discrete time periods, as with the feature vectors. To do this we grouped the price
changes into day time-frames and found the average of the price over each given day to
use for the target values. To process the XML file we created a Python script to fetch the
data for the desired market and then produce the target values for each day the market
was running; this then wrote out the target values to a file to be used for regression.
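The conversion from individual price changes to daily targets can be sketched as follows. The input is assumed to be a list of (timestamp, price) pairs already parsed from the market data, since the format of the Hubdub file is not described here, and an unweighted average per calendar day is likewise an assumption.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def daily_targets(price_changes):
    """Average the market price over each calendar day."""
    by_day = defaultdict(list)
    for timestamp, price in price_changes:
        by_day[timestamp.date()].append(price)
    return {day: mean(prices) for day, prices in sorted(by_day.items())}

# Hypothetical price changes for two days of a market.
changes = [
    (datetime(2009, 5, 1, 9, 0), 40.0),
    (datetime(2009, 5, 1, 18, 0), 50.0),
    (datetime(2009, 5, 2, 12, 0), 60.0),
]
targets = daily_targets(changes)
```

Days on which no forecast was made do not appear in the result, so a real pipeline would need to decide how to fill them, for example by carrying the last price forward.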
The two questions that we used from Hubdub are shown below and are labelled
market A and B.
A : What’s next for the swine flu alert level: 4 or 6?
B : Will Influenza A (H1N1) (aka “swine flu”) grow into a pandemic in 2009 as
feared?
The relation between them was that alert level 6 is the level of a pandemic in the WHO
alert levels, so the answer level 6 in market A and the answer yes in market B are asking
the same question.
Even though prediction markets are known to be valid it is not known how many
forecasts they need to make reliable predictions. To find out how these two markets
differ we analysed the market data over time – the statistics can be seen in table 4.2.
It has been found that around 20 active traders can produce reliable results from pre-
diction markets [7] [26] and on our markets if only around 10% of the traders gave
multiple forecasts then they would have made at least 15 trades each, or two per week
over the duration of the markets. So both the markets we used had a good number of
active traders giving good evidence that they were reliable. Still the difficulty in know-
ing how valid the markets are can be seen here, as market A had a higher volume but
fewer forecasts and market B had a lower volume but more forecasts, so it was hard
to know which was more reliable than the other given they were over a similar time
period. For this reason we used both of these markets in experiments and evaluation in
this thesis.
Market                    A             B
Hubdub Id                 39738         40354
Duration (days)           48            42
Total Volume              H$1,699,722   H$1,207,542
Total Forecasts           1193          2147
Total Traders             328           457
Day Average Volume        H$35,411      H$28,751
Day Average Forecasts     24.85         51.12
Table 4.2: Prediction markets statistics
4.4 Training and Evaluation
To evaluate the model we had to split the data into a training and test set; due to using
time-series data it was not possible to uniformly split the data randomly into two sets
as we had to preserve the date order of the data. This also meant it was not possible to
use techniques such as cross-validation to use more data for testing as this would not
preserve the date order as before. Instead we used a method similar to cross-validation
but preserving the date order. To do this we defined a single day on which to test each
iteration and defined the training set as all the days, or a fixed window of days, up to
the day before we were testing. We then iterated around every possible day to use for
testing and averaged the error over all the iterations to use as the final error for a given
model. To predict into the future away from the training data we left a window of a
number of days between the training window and the test day, the rest of the method
remained the same. We implemented this in Python and used the libSVM [4] module
to run SVR on each of the training and test iterations.
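The date-preserving evaluation scheme described above can be sketched as a generator of training and test indices. The function name, the parameter names and the choice to skip test days without a full training window are assumptions for illustration; `gap=0` corresponds to testing on the day immediately after the training data.

```python
def walk_forward_splits(n_days, window=None, gap=0, min_train=1):
    """Yield (train_indices, test_index) pairs that preserve date order."""
    for test in range(n_days):
        train_end = test - gap            # exclusive end of the training days
        if train_end < min_train:
            continue
        train_start = 0 if window is None else max(0, train_end - window)
        if window is not None and train_end - train_start < window:
            continue                      # not enough history for a full window
        yield list(range(train_start, train_end)), test

# Six days, a three-day training window and one skipped day before the test day.
splits = list(walk_forward_splits(n_days=6, window=3, gap=1))
```

A model would be trained on each list of training days and tested on the paired day, with the errors averaged over all pairs to give the final error for that model.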
To compare the prediction market prices to the predicted prices we used the mean
squared error (MSE) which is the most common way to compare real-valued time-
series models. The MSE is the squared difference between the market price and the
predicted price for each day summed over all the days predicted; this is shown in
equation 4.7 with f (xi) as the predicted price and yi as the market price.
MSE = \sum_{i=1}^{n} (f(x_i) - y_i)^2    (4.7)
To evaluate the models we produced baselines for comparison. The baseline for
predicting one day after the training window was a two-day moving average over the
prediction market, using the average of the two previous days’ prices as the prediction
for the next day as shown in equation 4.8:
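The baseline described here averages the prices of the two preceding days. A minimal sketch of that baseline, together with the error measure of equation 4.7 as it is written (squared differences summed over the predicted days), is given below; the function names and the example prices are hypothetical.

```python
def moving_average_baseline(prices, span=2):
    """Predict each day as the mean of the previous `span` days' prices."""
    return {
        t: sum(prices[t - span:t]) / span
        for t in range(span, len(prices))
    }

def squared_error(predicted, actual):
    """Equation 4.7 as written: squared differences summed over the days."""
    return sum((p - y) ** 2 for p, y in zip(predicted, actual))

prices = [40.0, 50.0, 60.0, 55.0]          # hypothetical daily market prices
baseline = moving_average_baseline(prices)
days = sorted(baseline)
error = squared_error([baseline[t] for t in days], [prices[t] for t in days])
```

Dividing the returned sum by the number of predicted days would give a per-day mean, which is the more conventional reading of the name MSE.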
was available in the posts to be extracted. As with the prediction markets we plot-
ted the frequency of a few hand-picked keywords relating to the H1N1 outbreak over
both the blog and Twitter posts on the same time period as the prediction markets to
see what information they contained; this can be seen in appendix A.1. This showed
that there was a sudden outburst of posts on both blogs and Twitter when the WHO
announced the alert levels 4 and 5; after this the frequency was considerably lower,
with the only other large peak being when the pandemic (alert level 6) was announced.
Between these two points in time there is some variation that correlates with events
from the prediction markets, but as was highlighted before some of the keywords have
a frequency peak for both peaks and troughs in the prediction markets.
5.2 Feature Exploration
The analysis of the prediction markets and posts showed why the keyword features
we initially used for blogs and Twitter posts, which just counted the frequency of
the list of hand-picked keywords, did not perform well. This was because the list of
keywords did not extract much differentiating information from the posts and none
to differentiate between peaks and troughs from events. For example “flu” could be
used in both positive and negative contexts in phrases in the posts. To try to extract
more information we used unigram features over all Twitter posts and found that this
improved performance over keyword features but was still not close to the baseline.
The improved performance with unigrams showed it was extracting more information
but the reason it was still not as good compared to the baseline can be seen by looking
at the fraction of posts that contained the keywords we used. Over Twitter posts the
fraction of posts that contained “flu” peaked at 2.5% and most of the time it remained
around a factor of 10 lower at 0.2%. Blogs were quite different as they peaked at 97%
for “flu” and remained at a higher fraction of around 4% for the rest of the time. So
most of the posts in the blog and Twitter collections which were used to construct the
unigrams had nothing to do with the epidemic. This meant that the unigram features
contained a lot of noise so the signal we wanted the SVR to pick out was being lost.
To try to overcome the problem of noise from irrelevant posts we used the keywords
to filter posts that were used to create the unigrams. This would remove some of the
noise from unrelated posts by excluding them. Using this feature on the Twitter posts
we found that it improved performance on market B but not market A, so even though
we were filtering some noise out, it was not an ideal way to do so. A summary of the
Period (days)   Market A   Market B
1               111.90     108.31
2               105.39     70.54
3               101.76     84.09
4               100.78     67.68
5               97.93      59.64
6               96.94      68.18
7               91.58      69.61
Table 5.3: Twitter time-series errors
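The keyword filtering of posts before counting unigrams, described earlier in this section, can be sketched as follows. The keyword list, whitespace tokenisation and the function name are assumptions for illustration; the actual hand-picked keywords are not listed in this part of the text.

```python
from collections import Counter

KEYWORDS = {"flu", "h1n1", "pandemic"}     # hypothetical keyword list

def filtered_unigrams(posts, keywords=KEYWORDS):
    """Count unigrams only over posts that mention at least one keyword."""
    counts = Counter()
    kept = 0
    for post in posts:
        tokens = post.lower().split()
        if keywords.intersection(tokens):
            counts.update(tokens)
            kept += 1
    return counts, kept

posts = ["Swine flu alert raised", "hangover after drinking", "H1N1 flu tips"]
counts, kept = filtered_unigrams(posts)
```

Posts that never mention a keyword contribute nothing to the counts, which removes unrelated vocabulary at the cost of discarding relevant posts that use other wording.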
time-series information in the Twitter posts we achieved the lowest error out of all the
features we had tried.
So far in all the experiments we used all the available data up to the day we were
testing on to build the model; we did not know how the amount of data we used for
training affected the performance of the model. This could have affected the performance
because data that was too old would not be relevant to new days, and too
little data would not build a general enough model to forecast days ahead of the training
window. To see how this affected the performance we trained a model on different
window sizes of training data to get an error for each. Another problem related to this
was overfitting which could have been caused by using too little or too much data, as
with the window size, which would cause the model to only generalise information
near to the day we were predicting or fit too well using more data. To investigate
overfitting at the same time we also calculated the training error on the same model
by using each day of training data to test with as well as train; the resulting two plots
can be seen in figure 5.1. To check for overfitting we looked for points where the training
error decreased but the test error increased; however, this was hard to interpret as
it was not known in which direction along the number of days the overfitting would
happen. There did seem likely to be overfitting when using fewer than six to
seven days of data, due to a larger decrease in the test error once the training error had
increased. Using more than six to seven days of data appeared less likely
to be overfitting, as the training error stayed more consistent and the test error increased
mostly linearly. The window size therefore had to lie in the region that was
not overfitting, which was seven days or more, so the best number of days of training data
was taken to be seven. We used the seven-day training window to evaluate the best
Feature                           Source    Label   Market A   Market B
Baseline                          -         -       72.16      87
Keyword Expansion (5-day count)   Twitter   A       43.58      63.64
Keyword Expansion                 Blogs     B       75.33      80.35
Tags                              Blogs     C       71.14      80.69
Table 5.4: Final features performance
happen if this was used as the target in the regression model. The results from this
can be seen alongside the results for markets A and B in table 5.2. It was found that
the combined market gave a lower error on all of the features and also gave a lower
baseline error. As can be seen in figure 2.1, the averaged market price varied less which
made the moving average predictions better. This meant that the lower error from the
features could also be due to the lower variance in the prices rather than less noise.
As this was hard to interpret without further research, which was outside of the scope
of this project, we continued to compare the performance on both markets in all our
experiments.
5.3 Summary
In this chapter we explored a range of features to extract information about the belief
of a population from text data in blog and Twitter posts. The final performance of the
best features we developed, trained on a seven-day window, is shown in table 5.4. The
keyword expansion feature performed best using a five-day count for Twitter posts and
just one day for blog posts, both of which outperformed the baseline on both markets.
Even though the tags were not directly from text data in the posts, they performed best
for blogs so are shown here. These three features are labelled A, B and C as shown in
table 5.4 to be referenced in the rest of this thesis.
We found that Twitter posts had the lowest error in forecasting on both of the
markets whilst using the five day count. This showed that Twitter was a better source
of information using the types of features we had here. So using simple features on
a larger number of small posts works better than using simple features on a smaller
number of large posts.
There was also a small lag in all the predictions made that can be seen in the
graphs of predictions over time in appendix B. The lag is from the delay between
Forecasting Distance
In this chapter we explore forecasting the belief of the population multiple days ahead
of the training window. This will lead into exploring methods to generalise the fea-
tures for the model. Finally we evaluate how well the features we developed perform
multiple days ahead.
6.1 Forecast Performance
In chapter 5 we predicted a value for the price of the prediction market to forecast the
price for the day after the training data to evaluate performance. This did not take into
account how general the model was for forecasting the value of belief multiple days
away from the training window, or the forecast distance. This is what we will explore
in this chapter.
The aim of this thesis was to see if it was possible to extract the belief of the
population from social media posts to see if it would be helpful to use in the health
sector. To do this we created a linear regression model which relied upon prediction
markets for the target values. This meant that the models created so far can extract
information about the belief of the population, but are limited to the periods in time
that the prediction markets are running. To use our model outside of the time periods
prediction markets are run, and to not rely on them, we would need to know how well
the models we created generalise to predicting multiple days away from the training
data.
We first evaluated how well the features we created in chapter 5 performed on the
seven-day window we used in that chapter. These three features, labelled A, B and
C, were summarised in chapter 5. We trained a model on seven days of
data and then made predictions with a forecast distance of 1 to 28 days. The results can be
seen in figure 6.1 for market A; because the results for market B are very similar, they
are shown in figure C.1 in the appendices. These showed a roughly linear relationship
between the forecast distance and the error, with a few peaks and troughs, for both
markets. The peaks and troughs in the graph were due to different sections of the
prediction markets having a different set of values for the seven-day window, which
made some windows fit better than others and caused the drops in error as the forecast
distance was varied. This can be seen because the moving average baseline displays the same
pattern as the features in both graphs.
[Plot: MSE against Prediction Ahead (days); series: Feature A, Feature B, Feature C, MA Baseline, Random Baseline]
Figure 6.1: Performance of varying the forecast distance on market A
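The evaluation described above can be sketched as a sliding-window loop. This is a minimal illustration rather than the code used in the thesis: the function names, the `train` callback and the definition of the moving average baseline (the mean price of the training window) are all assumptions made for the example.

```python
from statistics import mean


def moving_average_model(features, prices):
    """Assumed baseline: always predict the mean price of the training window."""
    level = mean(prices)
    return lambda feature_value: level


def mse_by_distance(feature, price, train, window=7, max_distance=28):
    """Mean squared error of the forecasts from `train` for each forecast distance."""
    results = {}
    for n in range(1, max_distance + 1):
        squared_errors = []
        # Slide the training window over every position that still leaves a target day.
        for start in range(len(price) - window - n + 1):
            end = start + window      # index of the first day after the window
            target = end + n - 1      # the day that is n days ahead of the window
            predict = train(feature[start:end], price[start:end])
            squared_errors.append((predict(feature[target]) - price[target]) ** 2)
        if squared_errors:
            results[n] = mean(squared_errors)
    return results
```

A regression model would be plugged in through `train`, which receives the window's feature values and prices and returns a function that maps a feature value to a predicted price. Running the same loop with `moving_average_model` gives the baseline curve to compare against.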
At a forecast distance of 15 days features A and B performed about the same as
the moving average baseline, but beyond this the moving average performed
better. Feature C outperformed the moving average baseline until
around 15 days, when the performance was similar. Features A and B, along with the
moving average baseline, performed better than random until about five days ahead, and
feature C performed better than random until 11 days ahead.
We ran further experiments using a window size between 7 and 17 days to see if
the window size affected the performance of the features when varying the forecast
distance. We found that there was some variation between different window sizes, but they
all gave approximately the same performance at a low forecast distance and gave
worse performance the larger the window and the larger the forecast distance. This
resulted in a seven-day window still being the best to use for training and evaluation;
this is the window size we used for the work in the rest of this chapter.
Even though the performance looked good for the features, the error at five days
ahead was just below 300 which was about 17 percent off on average at predicting the
value of the prediction market; this was about five times worse than predicting one
day ahead. To improve this performance the features needed to be more generalised to
perform better on a wider range of unseen data away from the training set.
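The step from an MSE of just under 300 to being about 17 off on average can be checked with a one-line calculation. The sketch below assumes that the quoted figure is the square root of the MSE and that the market price is on a 0 to 100 scale, so that price points can be read as percentages; the function name is illustrative.

```python
import math


def typical_error(mse):
    """Root mean squared error, in the same units as the market price."""
    return math.sqrt(mse)


five_day_error = typical_error(300)   # MSE of just under 300, from the text
print(round(five_day_error, 1))       # 17.3
```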
6.2 Generalised Features
The problem we had was that the features created for predicting the next day ahead
in chapter 5 were not general enough to perform well when predicting with a larger
forecast distance. To improve the performance we needed features which would cap-
ture more generalised information from the training data to predict days ahead better
than just days very close to the same period as the training data. We explored a few
different features on the Twitter data set to try to overcome this problem.
We started with stemming to try to get around any changes in the way words were
used over time, but keeping the general meaning of the words in each of the features.
We used the keyword expansion feature as a base to start from and stemmed both
the keywords and the words in each post. The Porter2 algorithm from the snowball
stemmer [32] was used for stemming using the Python binding inside our processing
framework. We found that this improved performance over the normal keyword expan-
sion with a forecast distance of about five days but performed worse further ahead –
this can be seen in figure 6.2. This showed that stemming improved the generalisation
of the features a small amount in the short term but not in the long term.
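The effect of stemming on keyword expansion can be illustrated with a small sketch. It assumes that keyword expansion pairs each keyword found in a post with the other words in that post, which is inferred from feature names such as “flu mexico”. The `toy_stem` function is a crude stand-in for the Porter2 stemmer and is not the Snowball implementation; any real stemmer could be passed in through the `stem` parameter.

```python
def toy_stem(word):
    """Crude suffix stripper, a stand-in for a real Porter2 stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expanded_features(post, keywords, stem=toy_stem):
    """Pair each stemmed keyword found in the post with every other stemmed word."""
    stems = [stem(word) for word in post.lower().split()]
    keyword_stems = {stem(keyword) for keyword in keywords}
    features = set()
    for key in keyword_stems.intersection(stems):
        for other in stems:
            if other != key:
                features.add(f"{key} {other}")
    return features
```

With this sketch, the posts "flu spreading" and "Flu spreads" both produce the single feature "flu spread", so a change in word form over time no longer creates a new feature.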
To try to improve performance in both the long and short term we focused on a
range of features which generalised the meaning of the text by mapping words to a
higher semantic meaning. This would hopefully produce more generalised features
by reducing the feature space to a small set of more general features. The first of
these was named entity recognition (NER) which used clues in the text such as part
of speech tagging (POS) to tag each word as a person, organisation, location, time,
money, percent, date or other. These provided far higher semantic topics to map the
words to and created a smaller subset of features. We used the c&c tagger [10] to tag
each word in the posts with a POS and NER tag and we then used keyword expansion
as before but used the NER tags rather than the unigrams to combine the features. This
produced features such as “flu loc” if the feature was originally “flu mexico”. We
found that this performed worse than the stemming feature for most forecast distances
other than around 17 days, but the error there was far above the random baseline so
this was more likely due to noise than better performance. These results can be seen in
figure 6.2. The performance was worse probably due to the fact that it was using a very
small set of semantic terms to map the words to, which lost a lot of the information
from the posts which would be needed to differentiate between the different trends.
[Plot: MSE against Prediction Ahead (days); series: Keyword Expansion, Stemming, NER, Sentiment, MA Baseline, Random Baseline]
Figure 6.2: Performance of generalised features in market A
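The mapping from unigram features to NER-tag features can be sketched as follows. The `ENTITY_TAGS` dictionary is a hypothetical lookup standing in for the output of a tagger, and dropping words with no entity tag is an assumption; the thesis does not state how words tagged as "other" were handled.

```python
# Hypothetical lookup standing in for the output of an NER tagger.
ENTITY_TAGS = {"mexico": "loc", "japan": "loc", "who": "org", "monday": "date"}


def ner_features(post, keywords, entity_tags=ENTITY_TAGS):
    """Pair each keyword in the post with the entity tags of the other words."""
    words = post.lower().split()
    features = set()
    for key in set(keywords).intersection(words):
        for word in words:
            tag = entity_tags.get(word)
            if word != key and tag is not None:
                features.add(f"{key} {tag}")
    return features
```

Here "flu spreads in mexico" and "flu reaches japan" both collapse to the single feature "flu loc", which shows how much detail is lost when every location maps to one tag.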
To try to improve the performance by keeping more information from the posts,
we tried sentiment analysis on text content in an attempt to classify if each post had
a positive, negative or neutral sentiment towards keywords that were in it. This built
upon the NER features we used before and kept more information from the text by ap-
pending “+”, “#”, or “-” to the feature. This produced features such as “flu location +”
when, for example “mexico” was mentioned along with “flu” in a positive context.
We calculated the sentiment by using a method similar to Wilson et al. [41] and used
the subjectivity lexicon of over 8,000 words from their work. The subjectivity lexi-
con mapped words that affect the sentiment of a phrase to whether each was positive,
negative or neutral; it also contained the POS location these words must appear in.
We used this list to create a sentiment rating for each post that was simply summing
up -1 for each negative word and +1 for each positive word found. If the rating was
positive then we assigned “+”, if it was negative we assigned “-” and if it was neutral
we assigned “#”. This allowed us to efficiently use sentiment analysis on the posts but
would not have been as accurate as dependency parsing which they used in the paper.
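The rating rule described above can be sketched in a few lines. The `LEXICON` dictionary is a tiny hypothetical example, not the subjectivity lexicon of Wilson et al., and the sketch ignores the POS constraint that the real lexicon attaches to each word.

```python
# Tiny hypothetical lexicon: +1 for positive words, -1 for negative words.
LEXICON = {"good": 1, "recover": 1, "safe": 1, "bad": -1, "deadly": -1, "fear": -1}


def sentiment_symbol(post, lexicon=LEXICON):
    """Sum +1/-1 over lexicon words and map the total to "+", "-" or "#"."""
    rating = sum(lexicon.get(word, 0) for word in post.lower().split())
    if rating > 0:
        return "+"
    if rating < 0:
        return "-"
    return "#"


def with_sentiment(feature, post, lexicon=LEXICON):
    """Append the post's sentiment symbol to an existing feature string."""
    return f"{feature} {sentiment_symbol(post, lexicon)}"
```

For example, `with_sentiment("flu loc", "mexico flu patients recover")` gives "flu loc +", while a post with equal numbers of positive and negative words, or none at all, is marked "#".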
Models                  Error Equation
Market A (A, B)         37.08n + 46.22
Market A (C)            16.45n + 54.23
Market B (A, B, C)      40.58n + 55.53
Table 6.2: Equations to estimate the error in forecasts
Models                  Variance Equation
Market A (A, B)         23414.5n − 33470
Market A (C)            13630.17n − 22081.2
Market B                26860.01n − 5912.94
Table 6.3: Equations to estimate the variance in forecasts
error had a linear relationship with the number of days ahead predicted, it was possible
to estimate the error for a given number of days. We created an estimate using a least
squares linear regression fit on the training errors and the forecast distance n. For
market A, features A and B had a slightly different error relation to feature C so we
modelled these separately, and for market B we modelled all three features together.
The equations to estimate the error for these can be seen in table 6.2.
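A least squares line of this kind can be fitted with the standard closed-form solution. The sketch below is illustrative only: the function name is an assumption, and the inputs would be the observed errors at each forecast distance, which are not reproduced here.

```python
def least_squares_line(distances, errors):
    """Return (slope, intercept) minimising the squared residuals of errors on distances."""
    count = len(distances)
    mean_n = sum(distances) / count
    mean_e = sum(errors) / count
    sxx = sum((n - mean_n) ** 2 for n in distances)
    sxy = sum((n - mean_n) * (e - mean_e) for n, e in zip(distances, errors))
    slope = sxy / sxx
    return slope, mean_e - slope * mean_n
```

The returned pair corresponds to the coefficient of n and the constant term in an equation of the form error = slope × n + intercept.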
Along with looking at the average MSE we looked at the variance of the MSE to
see how much the error varied for a given forecast distance. This produced graphs with
turning points in similar places to figure 6.2, which can be seen in appendix D. Equations
were also made for the variance, similarly to the error, so that we could estimate the
variance of the error for a given forecast distance. The matching variance equations to
those we found for the error are shown in table 6.3.
These equations now allowed us to estimate the variance and error for
predicting a number of days ahead of the training set.
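As a worked example, the market A equations from tables 6.2 and 6.3 can be evaluated for a chosen forecast distance. The dictionary layout and function name below are assumptions for illustration; only the coefficients come from the tables.

```python
# (slope, intercept) pairs copied from tables 6.2 and 6.3 for market A.
ERROR_LINES = {"A (A, B)": (37.08, 46.22), "A (C)": (16.45, 54.23)}
VARIANCE_LINES = {"A (A, B)": (23414.5, -33470.0), "A (C)": (13630.17, -22081.2)}


def estimate(lines, model, n):
    """Evaluate slope * n + intercept for the chosen model at forecast distance n."""
    slope, intercept = lines[model]
    return slope * n + intercept


print(round(estimate(ERROR_LINES, "A (A, B)", 5), 2))      # 231.62
print(round(estimate(VARIANCE_LINES, "A (A, B)", 5), 1))   # 83602.5
```

Note that both market A variance lines are negative at n = 1 (for example 23414.5 − 33470 = −10055.5), so the linear variance estimate is only a rough guide at very short forecast distances.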
6.4 Summary
In this chapter we have explored the performance of predicting for a given forecast
distance from the training set on the features from chapter 5 and explored how to
improve on this performance. We found that none of the methods we tried for improving
the prediction performance helped noticeably, and so we evaluated the performance
of the features without modification.
Discussion
7.1 Conclusion
Our aim in this thesis was to find if it was possible to forecast the belief of the
population from social media content to help health advisers make decisions in disease
epidemics. We found that it was possible to extract the belief of the population from
social media content and aggregate the beliefs through linear regression, which showed
that it contained the required information. The information we extracted was of limited
help to advisers because of the limited forecast distance over which our final models
performed well.
We found that both Twitter and blog posts could beat the performance of the hard
baselines but different features were needed to extract relevant information from each.
Twitter posts performed better than blog posts on both of the markets which showed
that Twitter posts were more informative for the way we approached this problem.
Attempts to increase performance on a larger forecast distance, by generalising the
features in an efficient way given the size of the collections, did not work. We also
found both blog and Twitter posts performed similarly at the same forecast distance.
We found that the features we developed allowed us to forecast the value of the predic-
tion market better than a random baseline with a forecast distance of up to five days.
This meant that we were dependent upon the prediction markets to extract the infor-
mation out of the posts and could only predict usefully up to five days after the end
of the prediction markets. The dependence on the prediction markets meant that this
method of aggregating information to help health advisers was no better than directly
using prediction markets as the aggregation method. For social media to be a source
of information to improve on prediction markets then the dependence would need to
News Events
The next section shows key swine flu-related events overlaid on the Twitter
and blog data. The fraction of posts is displayed on a log scale to make it
easier to view. The events listed below are labelled with the corresponding letters on
the graph.
A: 27/04/2009, Alert level 4 by the World Health Organisation
B: 30/04/2009, Alert level 5 by the World Health Organisation
C: 03/05/2009, Swine flu reported in decline in Mexico
D: 04/05/2009, UK school closed due to swine flu and UN secretary general “no
present plan to raise alert to level six”
E: 10/05/2009, First case confirmed in China.
F: 18/05/2009, Sudden rise in cases in Japan
G: 22/05/2009, Mexico lifts all restrictions from swine flu
H: 26/05/2009, Sudden rise in cases in Australia
I: 01/06/2009, First cases in 5 new countries, Luxembourg, Ukraine, Nicaragua,
Egypt and Bermuda
J: 11/06/2009, Pandemic announced by the World Health Organisation
A.1 Text keywords
[Plot: Logarithmic Fraction of Posts against Date (11/04 to 20/06), with events A to J marked; series: Mexico - Twitter, H1N1 - Twitter, Flu - Twitter, Pandemic - Twitter, Influenza - Blogs, Flu - Blogs, Pandemic - Blogs]
Market Predictions
B.1 Market A
[Plot: Market Expectation against Date (25/04 to 13/06); series: Market Price, Moving Average Base Line, Feature A, Feature B, Feature C]
Results for Market B
This appendix contains the results for market B from chapter 6 as they were similar to
the results from market A which were analysed in that chapter.
Figure C.1 shows the performance of varying the forecast distance for the three
features summarised in chapter 5. The shape of the graph is identical to that of market
A but the values differ slightly. Unlike in market A, where it performed better, feature C
performs similarly to the other features in this graph.
[Plot: MSE against Prediction Ahead (days); series: Feature A, Feature B, Feature C, MA Baseline, Random Baseline]
Figure C.1: Performance of varying the forecast distance on market B
Figure C.2 shows the performance of the generalised features for predicting multi-
ple days ahead in chapter 6. This shows the results for market B give a similar shape
to market A but the results have slightly larger variance and larger values.
[Plot: MSE against Prediction Ahead (days); series: Keyword Expansion, Stemming, NER, Sentiment, MA Baseline, Random Baseline]
Figure C.2: Performance of generalised features in market B
Table C.1 shows the performance of the generalised features as averages over the
first 5, 6 to 10, 11 to 15 and 16 to 20 days ahead. As with market A it can be seen
that none of the features performs better than normal keyword expansion. Stemming
performs best out of the three methods tried and sentiment analysis performs
worse than NER.
Feature                    1 to 5    6 to 10   11 to 15   16 to 20
Moving Average Baseline    164.8     361.69    491.96     439.54
Random Baseline            294.34 (one value reported)
Keyword Expansion          188.31    365.41    419.29     382.37
Stemming                   188.54    379.02    447.04     418.57
NER                        260.38    501.25    502.9      355.87
Sentiment                  314.31    545.43    550.88     405.45
Table C.1: Average forecast performance for market B
Error Variance Graphs
This appendix shows graphs of the variance in the error from chapter 6.
Figure D.1 shows the variance in the error for market A and it showed that the
turning points are similar to that of the error graph for this market.
[Plot: axis labelled MSE against Prediction Ahead (days); series: Feature A, Feature B, Feature C, MA Baseline, Random Baseline]
Figure D.1: Variance in the error of varying the forecast distance on market A
Figure D.2 shows the variance in the error for market B and it showed that the
turning points are similar to those of the error graph for this market.
[Plot: axis labelled MSE against Prediction Ahead (days); series: Feature A, Feature B, Feature C, MA Baseline, Random Baseline]
Figure D.2: Variance in the error of varying the forecast distance on market B
[1] C. Anderson. The Long Tail: How Endless Choice is Creating Unlimited De-
mand. Random House Business Books, 2006.
[2] J. Berg, R. Forsythe, F.D. Nelson, and T. Rietz. Results From a Dozen Years
of Election Futures Markets Research. Handbook of Experimental Economic
Results, 1:486–515, 2001.
[3] D.M. Blei and J. McAuliffe. Supervised Topic Models. Advances in Neural
Information Processing Systems, 20:121–128, 2008.
[4] C. Chang and C. Lin. LIBSVM: a Library for Support Vector Machines, 2001.
Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[5] K.Y. Chen and C.R. Plott. Information Aggregation Mechanisms: Concept, De-
sign and Implementation for a Sales Forecasting Problem. In Lee Center Work-
shop, 2002.
[6] V. Chittoor. The long tail and the black swans. http://technology.inc.com/
internet/articles/200906/chittoor.html, 2009.
[7] J.D. Christiansen. Prediction Markets: Practical Experiments in Small Markets
and Behaviours Observed. The Journal of Prediction Markets, 1(1):17–41, 2007.
[8] The Nielsen Company. Global faces and networked places, 2009. Source
http://blog.nielsen.com/nielsenwire/wp-content/uploads/2009/03/
nielsen_globalfaces_mar09.pdf.
[9] Compete.com. Site profile, twitter.com, 2009. http://siteanalytics.
compete.com/twitter.com/.
[10] J. Curran, S. Clark, and J. Bos. Linguistically Motivated Large-scale NLP with
C&C and Boxer. In Annual Meeting-Association for Computational Linguistics,
page 2, 2007.
[11] A. Elberse. Should You Invest in the Long Tail. Harvard Business Review,
86(7/8):88–96, 2008.
[12] G. Eysenbach. Infodemiology: The Epidemiology of (mis) Information. The
American Journal of Medicine, 113(9):763–765, 2002.
[13] G. Eysenbach. Infodemiology: Tracking flu-related Searches on the Web for
Syndromic Surveillance. In AMIA Annual Symposium Proceedings, page 244.
American Medical Informatics Association, 2006.
[14] J. Ginsberg, M.H. Mohebbi, R.S. Patel, L. Brammer, M.S. Smolinski, and L. Bril-
liant. Detecting Influenza Epidemics using Search Engine Query Data. Nature,
457(7232):1012, 2009.
[15] R. Hanson. Combinatorial Information Market Design. Information Systems
Frontiers, 5(1):107–119, 2003.
[16] Y. Huang. Support Vector Machines for Text Categorization Based on
Latent Semantic Indexing. Available at
http://bach.ece.jhu.edu/gert/courses/774/2001/lsa.pdf, 2004.
[17] T. Joachims, C. Nedellec, and C. Rouveirol. Text Categorization with Support
Vector Machines: Learning with Many Relevant Features. In Machine Learning:
ECML-98 10th European Conference on Machine Learning, Chemnitz, Germany.
Springer, 1998.
[18] S.M. Kim and E. Hovy. Crystal: Analyzing Predictive Opinions on the Web. In
Proceedings of the Joint Conference on Empirical Methods in Natural Language
Processing and Computational Natural Language Learning (EMNLP-CoNLL),
2007.
[19] S. Kogan, D. Levin, B.R. Routledge, J.S. Sagi, and N.A. Smith. Predicting Risk
from Financial Reports with Regression. In Proceedings of Human Language
Technologies: The 2009 Annual Conference of the North American Chapter of
the Association for Computational Linguistics, pages 272–280. Association for
Computational Linguistics, 2009.
[20] P. Kroha, R. Baeza-Yates, and B. Krellner. Text Mining of Business News for
Forecasting. In Database and Expert Systems Applications, 2006. DEXA’06.
17th International Conference on, pages 171–175, 2006.
[21] V. Lavrenko, M. Schmill, D. Lawrie, P. Ogilvie, D. Jensen, and J. Allan. Mining
of Concurrent Text and Time Series. In KDD-2000 Workshop on Text Mining,
2000.
[22] K. Lerman, A. Gilder, M. Dredze, and F. Pereira. Reading the
markets: Forecasting Public Opinion of Political Candidates by News Analysis.
In Proceedings of the 22nd International Conference on Computational Linguistics
(Coling 2008), pages 473–480. Coling 2008 Organizing Committee, 2008.
[23] F. Manjoo. Long tails and big heads, 2008. http://www.slate.com/id/
2195151/.
[24] C.F. Manski. Interpreting the Predictions of Prediction Markets. Economics
Letters, 91(3):425–429, 2006.
[25] M. Milian. Why text messages are limited to 160 characters,
2009. http://latimesblogs.latimes.com/technology/2009/05/
invented-text-messaging.html.
[26] University of Iowa. Influenza prediction markets, 2007. http://
fluprediction.uiowa.edu/fluhome/FAQ.html.
[27] World Health Organisation. Influenza fact sheet, 2003. http://www.who.int/
mediacentre/factsheets/fs211/en/.
[28] D.M. Pennock, S. Lawrence, C.L. Giles, and F.Å. Nielsen. The Real Power of
Artificial Markets. Science, 291(5506):987–988, 2001.
[29] D. Peramunetilleke and R.K. Wong. Currency Exchange Rate Forecasting from
News Headlines. In Proceedings of the 13th Australasian database conference-
Volume 5, pages 131–139. Australian Computer Society, Inc. Darlinghurst, Aus-
tralia, Australia, 2002.
[30] P.M. Polgreen, Y. Chen, D.M. Pennock, and F.D. Nelson. Using Internet Searches
for Influenza Surveillance. Clinical Infectious Diseases, 47(11), 2008.
[31] P.M. Polgreen, F.D. Nelson, G.R. Neumann, and R.A. Weinstein. Use of Pre-
diction Markets to Forecast Infectious Disease Activity. Clinical Infectious Dis-
eases, 44(2):272–279, 2006.
[32] M.F. Porter. Snowball Stemmer, 2001. http://snowball.tartarus.org/.
[33] R.P. Schumaker and H. Chen. Evaluating a News-aware Quantitative Trader: The
Effect of Momentum and Contrarian Stock Selection Strategies. Journal of the
American Society for Information Science and Technology, 59(2):247–255, 2008.
[34] E. Servan-Schreiber, J. Wolfers, D.M. Pennock, and B. Galebach. Prediction
Markets: Does Money Matter? Electronic Markets, 14(3):243–251, 2004.
[35] A.J. Smola and B. Scholkopf. A Tutorial on Support Vector Regression. Statistics
and Computing, 14(3):199–222, 2004.
[36] J. Surowiecki. The Wisdom of Crowds: Why the Many are Smarter than the Few
and how Collective Wisdom Shapes Business, Economies, Societies and Nations.
Doubleday, 2004.
[37] N. Taleb. The Black Swan. Random House New York, NY, 2007.
[38] Paul Verna. The blogosphere: A mass movement from grass roots, 2008.
http://www.emarketer.com/Report.aspx?code=emarketer_2000494.
[39] Wikipedia. Influenza pandemic, 2009. [Online; accessed 7-July-2009].
[40] C. Williams. Support Vector Machines, 2008.
[41] T. Wilson, J. Wiebe, and P. Hoffmann. Recognizing contextual polarity in phrase-
level sentiment analysis. In HLT ’05: Proceedings of the conference on Human
Language Technology and Empirical Methods in Natural Language Processing,
pages 347–354. Association for Computational Linguistics, 2005.
[42] J. Wolfers and E. Zitzewitz. Prediction Markets. Journal of Economic Perspec-
tives, 18(2):107–126, 2004.
[43] J. Wolfers and E. Zitzewitz. Interpreting Prediction Market Prices as Probabili-
ties. National Bureau of Economic Research, 91, 2006.
[44] Y. Yang and C.G. Chute. A Linear Least Squares Fit Mapping Method for In-
formation Retrieval from Natural Language Texts. In Proceedings of the 14th
conference on Computational linguistics - Volume 2, pages 447–453. Association
for Computational Linguistics Morristown, NJ, USA, 1992.