RT to Win ! Predicting Message Propagation in Twitter

by Saša Petrović, Miles Osborne, Victor Lavrenko
Artificial Intelligence (2011)

Saša Petrović
10 Crichton Street
Edinburgh EH8 9AB
United Kingdom
sasa.petrovic@ed.ac.uk
Miles Osborne
10 Crichton Street
Edinburgh EH8 9AB
United Kingdom
miles@inf.ed.ac.uk
Victor Lavrenko
10 Crichton Street
Edinburgh EH8 9AB
United Kingdom
vlavrenk@inf.ed.ac.uk
Abstract
Twitter is a very popular way for people to share information on
a bewildering multitude of topics. Tweets are propagated using a
variety of channels: by following users or lists, by searching, or
by retweeting. Of these vectors, retweeting is arguably the most
effective, as it can potentially reach the most people, given its
viral nature. A key task is predicting if a tweet will be
retweeted, and solving this problem furthers our understanding of
message propagation within large user communities. We carry out a
human experiment on the task of deciding whether a tweet will be
retweeted, which shows that the task is possible, as human
performance levels are much above chance. Using a machine learning
approach based on the passive-aggressive algorithm, we are able to
automatically predict retweets as well as humans. Analyzing the
learned model, we find that performance is dominated by social
features, but that tweet features add a substantial boost.
1 Introduction
Twitter is a microblogging service that allows users to post short
messages (at most 140 characters), called tweets, which can then be
read by anyone who has subscribed to the author’s updates. Although
it is a very popular means of communication, with over 100 million
users, many aspects of Twitter remain poorly understood. Here we
focus on the phenomenon of retweeting, or propagating other
people’s posts to one’s followers. Understanding how retweeting
works can provide insight into how information spreads through
large user communities and also has applications in marketing.
We consider the following questions in this paper: (i) is it at
all possible to predict whether something will be retweeted? (ii)
can we build models which automatically predict retweets, and how
well do these models perform? and (iii) what factors contribute
most to predicting retweets (e.g., how well can we predict
retweets without even reading the tweet)?
We first conduct a human experiment showing that the task is
possible, as humans perform significantly better than random
chance. To automatically predict retweets, we use a machine
learning approach based on the passive-aggressive algorithm. We
adapt this algorithm to take tweet creation time into account,
resulting in the best overall model. Finally, we take a look at
how much different features contribute towards predicting a
retweet and find that social features (especially number of
followers and lists) perform very well, but that there is a
substantial gain in using features of the tweet itself. A
comparison of our best model with human performance shows that our
approach does as well as humans on this task.

Copyright © 2011, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
2 Related Work
A very in-depth study of the various aspects of retweeting was
presented by Boyd, Golder, and Lotan (2010). They explicitly
interviewed Twitter users about why they retweet and what they
retweet the most. While Boyd, Golder, and Lotan (2010) provide an
interesting insight into the practice of retweeting, they make no
attempt to actually predict when something will be retweeted.
Suh et al. (2010) conducted a large-scale analysis of factors that
impact retweeting. They found that the numbers of followers and
friends have a lot of impact, while, e.g., the numbers of statuses
and favorites do not. They also train a single-layer perceptron on
a small subset of tweets, but do not use the model for actual
prediction. Instead, they only examine the learned weights of the
model. In this paper, we conduct a much more thorough
investigation of the prediction task, while also putting emphasis
on efficiency and on being able to deploy our approach on live
Twitter data. Zaman et al. (2010) use a collaborative filtering
approach to predict, for a pair of users, whether a tweet written
by one will be retweeted by the other. They use a fairly poor
feature set (IDs of both users, their number of followers, and the
words in the tweet), and the task of pairwise retweet prediction
makes this approach infeasible for a large-scale setting,
especially a streaming one with millions of users.
Other studies of retweets concentrated on analyzing a small number
of very popular tweets and their corresponding retweet networks
(Nagarajan, Purohit, and Sheth 2010), or on predicting information
diffusion by analyzing how tweets on the same topic spread (Yang
and Counts 2010).
3 Retweeting
Retweeting is the action of reposting someone else’s tweet inside
your own message stream, and there are generally two ways to do it
on Twitter. Users can either manually edit the original tweet and
add “RT @userA” (or something similar) to indicate that the
original tweet came from userA, or they can use a retweet button
which does not allow them to change the original tweet. Due to
problems with identifying the connection between the original
tweet and the subsequent manual retweet (people can do this in any
number of ways), we focus upon retweets made using the retweet
button. In this case, the tweet-retweet connection is
unambiguously marked in Twitter’s API.

Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media
The prediction task is as follows. Tweets arrive one at a time,
and for each one we want to predict whether someone will retweet
it. There is one caveat here: Twitter only provides a small sample
of the entire stream through their streaming API. As a result, it
is possible for a tweet to be retweeted without that retweet
appearing in the sample. When constructing our training/testing
set, we will thus incorrectly label that tweet as not being
retweeted. Unfortunately, there is no getting around this problem,
so it should be kept in mind when reading the results.
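The labeling step described above can be illustrated with a minimal sketch. It is not the authors' code: it assumes each sampled tweet is a dictionary with an `id` field and that a button retweet carries the original tweet under a `retweeted_status` field; the function name `label_retweeted` is hypothetical. An original whose retweets all fall outside the sample ends up with the negative label, which is exactly the noise the caveat refers to.

```python
from typing import Dict, Iterable, List, Tuple


def label_retweeted(stream: Iterable[dict]) -> List[Tuple[dict, int]]:
    """Label each original tweet +1 if a button retweet of it is in the sample, else -1."""
    originals: Dict[int, dict] = {}
    retweeted_ids = set()
    for tweet in stream:
        # Assumption: a button retweet embeds the original tweet in this field.
        source = tweet.get("retweeted_status")
        if source is not None:
            retweeted_ids.add(source["id"])
        else:
            originals[tweet["id"]] = tweet
    # Originals whose retweets were not sampled are (wrongly) labeled -1.
    return [(t, 1 if tid in retweeted_ids else -1) for tid, t in originals.items()]


sample = [
    {"id": 1, "text": "hello"},
    {"id": 2, "text": "news"},
    {"id": 3, "text": "RT", "retweeted_status": {"id": 2}},
]
labels = {t["id"]: y for t, y in label_retweeted(sample)}
```

Here tweet 2 is labeled as retweeted because its retweet (tweet 3) appears in the sample, while tweet 1 is labeled as not retweeted.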
Dataset. We run our experiments on a stream of tweets crawled from
the Twitter streaming API¹ throughout October 2010. We gathered a
total of roughly 21 million tweets. We split this set into a
training and a test set using a 90-10 split such that the test set
comprises the last 10% of data (over 2 million tweets) ordered by
time. In total, there were over 24 million unique tokens in the
data.
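A time-ordered 90-10 split of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: each tweet is assumed to carry a sortable `created_at` timestamp, and `chronological_split` is a hypothetical helper name.

```python
from typing import List, Tuple


def chronological_split(tweets: List[dict], test_fraction: float = 0.10) -> Tuple[List[dict], List[dict]]:
    """Sort by creation time and hold out the most recent fraction as the test set."""
    ordered = sorted(tweets, key=lambda t: t["created_at"])
    n_test = int(round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Toy data: ten tweets with shuffled integer timestamps (assumed format).
tweets = [{"id": i, "created_at": ts} for i, ts in enumerate([5, 1, 9, 3, 7, 2, 8, 4, 6, 10])]
train, test = chronological_split(tweets)
```

Splitting by time rather than at random keeps every test tweet later than every training tweet, which matches the streaming setting in which the model only ever sees the past.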
4 Predicting retweets
4.1 Streaming prediction
We now consider the task of automatically predicting retweets. The
tweets come in as a stream of text, and we want to make the
prediction as soon as we see a tweet and then discard it right
away; this is the realistic setting in which an actual system
would be deployed. Such treatment of the task warrants the use of
online learning algorithms, as opposed to traditional batch ones,
which operate on the entire dataset. This is why we use the
passive-aggressive (PA) algorithm of Crammer et al. (2006). PA
maintains a linear decision boundary, and for each new example it
tries to classify it correctly with a certain margin, while
keeping the decision boundary as close as possible to the old one.
The choice of the algorithm is not crucial, and any approach that
is feature-based would be appropriate. There are three variations
of the PA algorithm, depending on the loss function used. In this
paper we use what Crammer et al. (2006) refer to as the PA-II
version; we tried all three versions of the update and PA-II
performed the best. We set the aggressiveness parameter C of the
PA algorithm to 0.01 in our experiments.
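For concreteness, a minimal sketch of a binary PA-II learner is given below. It follows the update from Crammer et al. (2006), in which the step size is the hinge loss divided by the squared norm of the example plus 1/(2C). The sparse dictionary representation, the class name `PassiveAggressiveII`, and the toy features are assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict

SparseVec = Dict[str, float]


def dot(w: SparseVec, x: SparseVec) -> float:
    """Sparse dot product over the features present in x."""
    return sum(w.get(f, 0.0) * v for f, v in x.items())


class PassiveAggressiveII:
    """Online binary classifier with labels in {-1, +1} and the PA-II update."""

    def __init__(self, C: float = 0.01) -> None:
        self.C = C  # aggressiveness parameter
        self.w: SparseVec = {}

    def predict(self, x: SparseVec) -> int:
        return 1 if dot(self.w, x) >= 0 else -1

    def update(self, x: SparseVec, y: int) -> float:
        # Hinge loss with margin 1; no update when the margin is satisfied.
        loss = max(0.0, 1.0 - y * dot(self.w, x))
        if loss > 0.0:
            sq_norm = sum(v * v for v in x.values())
            tau = loss / (sq_norm + 1.0 / (2.0 * self.C))  # PA-II step size
            for f, v in x.items():
                self.w[f] = self.w.get(f, 0.0) + tau * y * v
        return loss


# Toy stream: predict first, then learn from the revealed label, then discard.
model = PassiveAggressiveII(C=0.01)
stream = [({"followers": 1.0, "has_url": 1.0}, 1), ({"is_reply": 1.0}, -1)] * 50
for x, y in stream:
    model.predict(x)
    model.update(x, y)
```

A small C such as 0.01 makes each step conservative, so a single mislabeled tweet moves the decision boundary only slightly.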
4.2 Time-sensitive modelling
Assuming that each time of day can have some specific rules as to
what gets retweeted (maybe tweets containing the word “oil” are
retweeted more in the morning than in the evening), we introduce
separate, local models, each trained only on a particular subset
of the data. We use a local model for every hour of the day,
depending on when a tweet was written. This means that, in
addition to one global model, which is trained on all the data, we
also have 24 local models, each one trained only on those tweets
written in a specific hour. Note that all these models are trained
using the standard PA algorithm. When we have to make a
prediction, we replace the original prediction rule of PA with the
following:

$$\hat{y} = \operatorname{sgn}\left(\langle w_g, x \rangle + \lambda \langle w_l, x \rangle\right), \tag{1}$$

where $w_g$ is the global weight vector, $w_l$ is the local weight
vector for the specific hour the tweet was written in, and all the
weight vectors are $L_2$ normalized. $\lambda$ is a weight which
corresponds to the proportion of training examples that the chosen
local model has seen, i.e., $\lambda = n_l / N$, where $n_l$ is
the total number of training examples that the local model has
seen, and $N$ is the total number of training examples. $\lambda$
encodes our confidence in the local classifier: the more examples
it has seen, the more we trust its judgment. Note that, instead of
fixing $\lambda$, we have also tried using a stacked classifier
trained on $\langle w_g, x \rangle$, $\langle w_l, x \rangle$, and
$\lambda$ as features, but it did not outperform a model with a
fixed $\lambda$. We therefore opt to fix $\lambda$ for the sake of
speed and simplicity.

¹ http://stream.twitter.com/
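The combined prediction rule (1) can be sketched in a few lines. This is an illustrative reading of the formula, not the authors' implementation: weight vectors are assumed to be sparse dictionaries, the function names are hypothetical, and the toy weights and counts are invented for the example.

```python
import math
from typing import Dict, List

SparseVec = Dict[str, float]


def dot(w: SparseVec, x: SparseVec) -> float:
    return sum(w.get(f, 0.0) * v for f, v in x.items())


def l2_normalize(w: SparseVec) -> SparseVec:
    """Return a copy of w scaled to unit L2 norm (unchanged if w is all zeros)."""
    norm = math.sqrt(sum(v * v for v in w.values()))
    return dict(w) if norm == 0.0 else {f: v / norm for f, v in w.items()}


def predict_time_sensitive(x: SparseVec, hour: int, w_global: SparseVec,
                           w_local: List[SparseVec], n_local: List[int],
                           n_total: int) -> int:
    """sgn(<w_g, x> + lambda * <w_l, x>) with lambda = n_l / N."""
    lam = n_local[hour] / n_total if n_total else 0.0
    score = dot(l2_normalize(w_global), x) + lam * dot(l2_normalize(w_local[hour]), x)
    return 1 if score >= 0 else -1


# Toy setup: the global model dislikes "oil", the 8am model likes it.
w_global = {"oil": -0.3, "url": 0.4}
w_local: List[SparseVec] = [{} for _ in range(24)]
n_local = [0] * 24
w_local[8], n_local[8] = {"oil": 2.0}, 700
w_local[20], n_local[20] = {"oil": -1.0}, 300
x = {"oil": 1.0}
morning = predict_time_sensitive(x, 8, w_global, w_local, n_local, 1000)
evening = predict_time_sensitive(x, 20, w_global, w_local, n_local, 1000)
```

In this toy case the morning model has seen 70% of the examples, so its positive vote on “oil” outweighs the global model's negative one; in the evening both models agree and the prediction is negative.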
4.3 Features
We divide the features into two distinct sets: social features
(features related to the author of the tweet), and tweet features
(which encompass various statistics of the tweet itself, along
with the actual text of the tweet).
Social features. We use the following features related to the
author of the tweet: the numbers of followers, friends, statuses,
and favorites, the number of times the user was listed, whether
the user is verified, and whether the user’s language is English.
The numbers of followers and friends have been consistently shown
to be good indicators of retweetability (Suh et al. 2010; Zaman et
al. 2010), whereas the numbers of statuses and favorites were not
found to have a significant impact (Suh et al. 2010). Lists are a
way to organize friends into groups according to some criteria
(e.g., members of family, people who tweet about complexity in
computer science, etc.). If a user is listed many times, i.e.,
many lists follow him, this should mean that he tweets about
things that are interesting to a larger user population, and his
tweets will reach a broader audience. Verification is used by
Twitter mostly to confirm the authenticity of celebrity accounts.
We found that 91% of tweets written by verified users are
retweeted, compared with 6% for tweets where the author is not
verified. This shows that almost anything that celebrities write
will get retweeted, and thus having this feature should improve
performance. Our prior analysis also showed that tweets written in
English are more likely to be retweeted, so we use a binary
feature indicating whether the user’s language is English. We are
not aware of any prior work that analyzes the effect of lists,
verification, and language on retweetability.
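As an illustration, the seven social features listed above could be extracted from a user record as follows. The field names (`followers_count`, `friends_count`, `statuses_count`, `favourites_count`, `listed_count`, `verified`, `lang`) are assumed here and should be checked against the actual data; the paper does not say how count features are scaled, so this sketch passes raw counts through unchanged.

```python
from typing import Dict

# Assumed names of the count fields in a user record.
COUNT_FIELDS = (
    "followers_count",
    "friends_count",
    "statuses_count",
    "favourites_count",
    "listed_count",
)


def social_features(user: dict) -> Dict[str, float]:
    """Map a user record to the seven social features; missing fields count as 0."""
    feats = {name: float(user.get(name, 0)) for name in COUNT_FIELDS}
    feats["verified"] = 1.0 if user.get("verified") else 0.0
    feats["lang_is_en"] = 1.0 if user.get("lang") == "en" else 0.0
    return feats


example = social_features({"followers_count": 120, "verified": True, "lang": "en"})
```

The resulting dictionary can be used directly as a sparse feature vector for a linear online learner.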
Tweet features. We use the following features related to the tweet
itself: the numbers of hashtags, mentions, URLs, and trending
words, the length of the tweet, novelty, whether the tweet is a
reply, and the actual words in the tweet. Hashtags, URLs, and
mentions were already shown by Suh et al. (2010) to have a high
correlation with retweetability. A reply indicates a direct
message from one user to another, so intuitively it should make
the tweet less likely to be retweeted, as it is not directed to a
general audience. Trending topics are a set
