Beyond Trending Topics : Real-World Event Identification on Twitter
Hila Becker
Columbia University
hila@cs.columbia.edu
Mor Naaman
Rutgers University
mor@rutgers.edu
Luis Gravano
Columbia University
gravano@cs.columbia.edu
Abstract
User-contributed messages on social media sites such as
Twitter have emerged as powerful, real-time means of infor-
mation sharing on the Web. These short messages tend to re-
flect a variety of events in real time, making Twitter partic-
ularly well suited as a source of real-time event content. In
this paper, we explore approaches for analyzing the stream
of Twitter messages to distinguish between messages about
real-world events and non-event messages. Our approach re-
lies on a rich family of aggregate statistics of topically sim-
ilar message clusters. Large-scale experiments over millions
of Twitter messages show the effectiveness of our approach
for surfacing real-world event content on Twitter.
1 Introduction
Social media sites (e.g., Twitter, Facebook, and YouTube)
have emerged as powerful means of communication for peo-
ple looking to share and exchange information on a wide va-
riety of real-world events. These events range from popular,
widely known ones (e.g., a concert by a popular music band)
to smaller scale, local events (e.g., a local social gathering, a
protest, or an accident). Short messages posted on social me-
dia sites such as Twitter can typically reflect these events as
they happen. For this reason, the content of such social me-
dia sites is particularly useful for real-time identification of
real-world events and their associated user-contributed mes-
sages, which is the problem that we address in this paper.
Twitter messages reflect useful event information for a
variety of events of different types and scale. These event
messages can provide a set of unique perspectives, regard-
less of the event type (Diakopoulos, Naaman, and Kivran-
Swaine 2010; Yardi and boyd 2010), reflecting the points
of view of users who are interested or participate in an
event. In particular, for unplanned events (e.g., the Iran elec-
tion protests, earthquakes), Twitter users sometimes spread
news prior to the traditional news media (Kwak et al. 2010;
Sakaki, Okazaki, and Matsuo 2010). Even for planned
events (e.g., the 2010 Apple Developers conference), Twitter
users often post messages in anticipation of the event.
Identifying events in real time on Twitter is a challenging
problem, due to the heterogeneity and immense scale of the
data. Twitter users post messages with a variety of content
types, including personal updates and various bits of information
(Naaman, Boase, and Lai 2010). While much of the
content on Twitter is not related to any particular real-world
event, informative event messages nevertheless abound. As
an additional challenge, Twitter messages, by design, contain
little textual information, and often exhibit low quality
(e.g., with typos and ungrammatical sentences).
Copyright © 2011, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
Several research efforts have focused on identifying
events in social media in general, and on Twitter in particular
(Becker, Naaman, and Gravano 2010; Sakaki, Okazaki, and
Matsuo 2010; Sankaranarayanan et al. 2009). Recent work
on Twitter has started to process data as a stream, as it is
produced, but has mainly focused on identifying events of
a particular type (e.g., news events (Sankaranarayanan et al.
2009), earthquakes (Sakaki, Okazaki, and Matsuo 2010)).
Other work identifies the first Twitter message associated
with an event (Petrović, Osborne, and Lavrenko 2010).
Our focus in this work is on online identification of real-
world event content. We identify each event—and its asso-
ciated Twitter messages—using an online clustering tech-
nique that groups together topically similar tweets (Section
3.1). We then compute revealing features for each cluster to
help determine which clusters correspond to events (Section
3.2). We use these features to train a classifier to distinguish
between event and non-event clusters (Section 3.3). We val-
idate the effectiveness of our techniques using a dataset of
over 2.6 million Twitter messages (Section 4) and then dis-
cuss our findings and future work (Section 5).
2 Background and Problem Definition
In this section, we provide an overview of Twitter and then
define the problem that we address in this paper.
2.1 Background: Twitter
Twitter is a popular social media site that allows users to post
short textual messages, or tweets, which are up to 140 characters
long. Twitter users can use a hashtag annotation format
(e.g., #sb45) to indicate what their messages are about
(e.g., “watching Superbowl 45 #sb45”). In addition, Twit-
ter allows several ways for users to converse and interact
by referencing each other in messages using the @ sym-
bol. Twitter currently employs a proprietary algorithm to
display trending topics, consisting of terms and phrases that
exhibit “trending” behavior. While Twitter’s trending topics
sometimes reflect current events, they often include keywords
for popular conversation topics (e.g., “#bieberfever,” “getting
ready”), with no discrimination between the different types
of content.
2.2 Problem Definition
We now define the notion of real-world event in the context
of a Twitter message stream, and provide a definition of the
problem that we address in this paper.
The definition of event has received attention across
fields, from philosophy (Events 2002) to cognitive psychol-
ogy (Zacks and Tversky 2001). In information retrieval, the
concept of event has prominently been studied for event de-
tection in news (Allan 2002). We borrow from this research
to define an event in the context of our work. Specifically,
we define an event as a real-world occurrence e with (1) an
associated time period T_e and (2) a time-ordered stream of
Twitter messages M_e, of substantial volume, discussing the
occurrence and published during time T_e.
According to this definition, events on Twitter include
widely known occurrences such as the presidential inaugu-
ration, and also local or community-specific events such as a
high-school homecoming game or the ICWSM conference.
Non-event content, of course, is prominent on Twitter and
similar systems where people share various types of con-
tent such as personal updates, random thoughts and musings,
opinions, and information (Naaman, Boase, and Lai 2010).
As a challenge, non-event content also includes forms
of Twitter activity that trigger substantial message vol-
ume over specific time periods (Becker, Naaman, and Gra-
vano 2011b), which is a common characteristic of event
content. Examples of such non-event activity are Twitter-
specific conversation topics or memes (e.g., using the hash-
tag #thingsparentssay). Our goal is to differentiate between
messages about real-world events and non-event messages,
where non-event messages include those for “trending” ac-
tivities that are Twitter-centric but do not reflect any real-
world occurrences. We now define our problem, as follows:
Consider a time-ordered stream of Twitter messages
M. At any point in time t, our goal is to identify real-world
events and their associated Twitter messages
present in M and published before time t. Furthermore,
we assume an online setting for our problem, where we
only have access to messages posted before time t.
3 Separating Event and Non-Event Content
We propose to address the event identification problem using
an online clustering and filtering framework. We describe
this framework in detail (Section 3.1), and then discuss the
different types of features that we extract for clusters (Sec-
tion 3.2), as well as the classification model that we use (Sec-
tion 3.3) to separate event and non-event clusters.
3.1 Clustering and Classification Framework
We elected to use an incremental, online clustering algo-
rithm in order to effectively cluster a stream of Twitter mes-
sages in real time. For such a task, we must choose a clus-
tering algorithm that is scalable, and that does not require
a priori knowledge of the number of clusters, since Twitter
messages are constantly evolving and new events get added
to the stream over time. Based on these observations, we
propose using an incremental clustering algorithm with a
threshold parameter that is tuned empirically during a train-
ing phase. Such a clustering algorithm considers each mes-
sage in turn, and determines a suitable cluster assignment
based on the message’s similarity to existing clusters. (See
(Becker, Naaman, and Gravano 2011a) for further details.)
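The single-pass, threshold-based assignment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: the whitespace tokenizer, the raw term-count vectors, the cosine similarity measure, and the default threshold of 0.5 are all hypothetical choices, and the function names are invented for this example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_stream(messages, threshold=0.5):
    """Assign each message, in arrival order, to its most similar cluster.

    A new cluster is opened when no existing centroid reaches `threshold`
    (an assumed value; the text tunes it empirically on training data).
    Each cluster is a dict with a summed term-count `centroid` and the
    `members` (message indices) assigned to it.
    """
    clusters = []
    for i, text in enumerate(messages):
        vec = Counter(text.lower().split())  # assumed whitespace tokenizer
        best, best_sim = None, 0.0
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:
            # Cosine similarity ignores scale, so a summed centroid
            # behaves like a mean centroid here.
            best["centroid"].update(vec)
            best["members"].append(i)
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return clusters
```

Because every message is compared only against the current centroids, the number of clusters never has to be fixed in advance, which is the property the framework relies on.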
To identify all event clusters in the stream, we compute
a variety of revealing features using statistics of the cluster
messages (Section 3.2). Since the clusters constantly evolve
over time, we must periodically update the features for each
cluster and compute features of newly formed clusters. We
subsequently proceed to invoke a classification model (Sec-
tion 3.3) that, given a cluster’s feature representation, de-
cides whether or not the cluster, and its associated messages,
contains event information. With the appropriate choice of
classification model, we can also select the top events in the
stream at any point in time, according to the clusters’ prob-
ability of belonging to the event class.
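Selecting the top events by class probability can be sketched as below. The `event_probability` callable is a hypothetical stand-in for whatever trained classifier is used (Section 3.3); the only assumption is that it maps a cluster's feature vector to a probability of belonging to the event class.

```python
def top_events(cluster_features, event_probability, k=10):
    """Return the ids of the k clusters most likely to be events.

    cluster_features: dict mapping cluster id -> feature vector.
    event_probability: hypothetical callable, feature vector -> float,
        standing in for a trained probabilistic classifier.
    """
    scored = [(event_probability(f), cid) for cid, f in cluster_features.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored[:k]]
```

Since cluster features change as new messages arrive, a ranking of this kind would need to be recomputed after each periodic feature update.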
3.2 Cluster-Level Event Features
We compute features of Twitter message clusters in order
to reveal characteristics that may help detect clusters that
are associated with events. We examine several broad cate-
gories of features that describe different aspects of the clus-
ters we wish to model. Specifically, we consider temporal,
social, topical, and Twitter-centric features. We summarize
these features below. (See (Becker, Naaman, and Gravano
2011a) for further details.)
Temporal Features: The volume of messages for an event
e during the event’s associated time T_e exhibits unique characteristics
(see the definition of event in Section 2.2). To effectively
identify events in our framework, a key challenge
is to capture this temporal behavior with a set of descriptive
features for our classifier. We design a set of temporal
features to characterize the volume of frequent cluster terms
(i.e., terms that appear frequently in the set of messages associated
with a cluster) over time. These features capture any
deviation from expected message volume for any frequent
cluster term or a set of frequent cluster terms. We also
compute the quality of fit of an exponential function
to each frequent cluster term’s hourly binned message histogram.
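The two kinds of temporal signal described above can be illustrated with a short sketch. The exact feature definitions are not given in this section, so both functions below rest on assumptions: deviation is measured as a z-score of the latest hour against earlier hours, and fit quality is the R^2 of a straight line fitted to log(count + 1), which corresponds to an exponential curve in the original scale. The function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def volume_deviation(hourly_counts):
    """Deviation of the latest hour from the mean of earlier hours,
    in standard deviations (an assumed z-score definition)."""
    if len(hourly_counts) < 2:
        return 0.0  # no history to compare against
    *history, latest = hourly_counts
    mu, sigma = mean(history), pstdev(history)
    return (latest - mu) / sigma if sigma > 0 else 0.0


def exponential_fit_quality(hourly_counts):
    """R^2 of a least-squares line fitted to log(count + 1) per hour.

    A line in log space corresponds to count + 1 = a * exp(b * hour);
    the R^2 value is measured in log space.
    """
    xs = list(range(len(hourly_counts)))
    ys = [math.log(c + 1) for c in hourly_counts]
    if len(xs) < 2:
        return 0.0
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # flat histogram: nothing to explain
    return (sxy * sxy) / (sxx * syy)
```

Here `hourly_counts` is the hourly binned histogram of messages containing one frequent cluster term; a feature for a set of terms could be obtained by summing their histograms first.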
Social Features: We designed social features to capture the
interaction of users in a cluster’s messages. These inter-
actions might be different between events, Twitter-centric
activities, and other non-event messages (Becker, Naaman,
and Gravano 2011b). User interactions on Twitter include
retweets (forwarding, indicated by RT @username), replies
(conversation, indicated by @username in the beginning of
the tweet), and mentions (indicated by @username any-
where except the beginning of the tweet). Our social features
include the percentage of messages containing each of these
types of user interaction out of all messages in a cluster.
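The three interaction percentages can be computed with simple pattern matching, as in the sketch below. The regular expressions are assumptions about how retweets, replies, and mentions appear in raw tweet text (following the conventions listed above), and the function name is hypothetical.

```python
import re

RETWEET = re.compile(r"\bRT @\w+")   # forwarding: "RT @username"
USER = re.compile(r"(?<!\w)@\w+")    # any @username not glued to a word


def social_features(messages):
    """Percentage of cluster messages with each type of user interaction."""
    counts = {"retweet": 0, "reply": 0, "mention": 0}
    for raw in messages:
        text = raw.strip()
        rt_ends = {m.end() for m in RETWEET.finditer(text)}
        if rt_ends:
            counts["retweet"] += 1
        if USER.match(text):  # @username at the very beginning
            counts["reply"] += 1
        # A mention is an @username that is neither at the beginning
        # of the tweet nor the target of an "RT @username" marker.
        if any(m.start() > 0 and m.end() not in rt_ends
               for m in USER.finditer(text)):
            counts["mention"] += 1
    total = len(messages)
    return {k: (100.0 * v / total if total else 0.0) for k, v in counts.items()}
```

A single message can contribute to more than one category (for example, a reply that also mentions a second user), so the three percentages need not sum to 100.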
Topical Features: Topical features describe the topical co-
herence of a cluster, based on a hypothesis that event clus-
ters tend to revolve around a central topic, whereas non-
event clusters do not. Rather, non-event clusters often center
around a few terms (e.g., “sleep,” “work”) that do not reflect
a single theme (e.g., with some messages about sleep, others