
Detecting Spammers on Twitter

by Gabriel Magno, Tiago Rodrigues
Science (2010)



Fabrício Benevenuto, Gabriel Magno, Tiago Rodrigues, and Virgílio Almeida
Computer Science Department, Universidade Federal de Minas Gerais
Belo Horizonte, Brazil
{fabricio, magno, tiagorm, virgilio}@dcc.ufmg.br
ABSTRACT
With millions of users tweeting around the world, real time
search systems and different types of mining tools are emerging
to allow people to track the repercussions of events and news on
Twitter. However, although appealing as mechanisms to ease the
spread of news and allow users to discuss events and post their
status, these services open opportunities for new forms of spam.
Trending topics, the most talked about items on Twitter at a
given point in time, have been seen as an opportunity to generate
traffic and revenue. Spammers post tweets containing typical
words of a trending topic and URLs, usually obfuscated by URL
shorteners, that lead users to completely unrelated websites.
This kind of spam can devalue real time search services unless
mechanisms to fight and stop spammers can be found.
In this paper we consider the problem of detecting spammers on
Twitter. We first collected a large Twitter dataset that includes
more than 54 million users, 1.9 billion links, and almost 1.8
billion tweets. Using tweets related to three famous trending
topics from 2009, we construct a large labeled collection of
users, manually classified into spammers and non-spammers. We
then identify a number of characteristics related to tweet
content and user social behavior that could potentially be used
to detect spammers. We used these characteristics as attributes
in a machine learning process for classifying users as either
spammers or non-spammers. Our strategy succeeds at detecting most
of the spammers while misclassifying only a small percentage of
non-spammers. Approximately 70% of spammers and 96% of
non-spammers were correctly classified. Our results also
highlight the most important attributes for spam detection on
Twitter.
Keywords: spam, twitter, real time search, spammer,
microblogging, online social networks, machine learning.
CEAS 2010 - Seventh annual Collaboration, Electronic messaging,
Anti-Abuse and Spam Conference, July 13-14, 2010, Redmond,
Washington, US
1. INTRODUCTION
Twitter has recently emerged as a popular social system where
users share and discuss everything, including news, jokes, their
take on events, and even their mood. With a simple interface
where only 140-character messages can be posted, Twitter is
increasingly becoming a system for obtaining real time
information. When a user posts a tweet, it is immediately
delivered to her followers, allowing them to spread the received
information even further. In addition to being received by
followers, tweets can also be retrieved through search systems
and other tools. With the emergence of real time search systems
and meme-tracking services, the repercussions of all kinds of
events and news are beginning to be registered with practically
no delay between the creation of content and its availability for
retrieval. For example, Google, Bing, Twitter, and other
meme-tracking services are mining real time tweets to find out
what is happening in the world with minimum delay [4].
However, although appealing as mechanisms to ease the spread of
news and allow users to discuss events and post their status,
these services also open opportunities for new forms of spam. For
instance, trending topics, the most talked about items on Twitter
at a given point in time, have been seen as an opportunity to
generate traffic and revenue. When noteworthy events occur,
thousands of users tweet about them, quickly turning them into
trending topics. These topics become the target of spammers, who
post tweets containing typical words of the trending topic
together with URLs that lead users to completely unrelated
websites. Since tweets usually contain shortened URLs, it is
difficult for users to identify the URL content without loading
the webpage. This kind of spam can reduce the value of real time
search services unless mechanisms to fight and stop spammers can
be found.
Tweet spammers are driven by several goals, such as spreading
advertisements to generate sales, disseminating pornography,
viruses, or phishing, or simply compromising the reputation of
the system. They not only pollute real time search, but can also
skew the statistics presented by tweet mining tools and consume
extra resources from users and systems. All in all, spam wastes
human attention, perhaps the most valuable resource in the
information age.
Given that spammers are increasingly arriving on Twitter, the
success of real time search services and mining tools depends on
the ability to distinguish valuable tweets from the spam storm.
In this paper, we address the issue of detecting spammers on
Twitter. To do so, we propose a 4-step approach. First, we
crawled a near-complete dataset from Twitter, containing more
than 54 million users, 1.9 billion links, and almost 1.8 billion
tweets. Second, we created a labeled collection with users
"manually" classified as spammers and non-spammers. Third, we
conducted a study of the characteristics of tweet content and
user behavior on Twitter, aiming to understand their relative
discriminative power to distinguish spammers from non-spammers.
Lastly, we investigated the feasibility of applying a supervised
machine learning method to identify spammers. We found that our
approach is able to correctly identify the majority of the
spammers (70%), misclassifying only 3.6% of non-spammers. We also
investigate different tradeoffs for our classification approach,
namely the attribute importance and the use of different
attribute sets. Our results show that even when using different
subsets of attributes, our classification approach is able to
detect spammers with high accuracy. We also investigate the
detection of spam instead of spammers. Although results for this
approach proved competitive, spam classification is more
susceptible to spammers that adapt their strategies, since it is
restricted to a small and simple set of attributes related to
characteristics of tweets.
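The two rates quoted above are per-class figures, and a short sketch may make their relationship explicit. The counts below are hypothetical, chosen only so that the arithmetic reproduces the reported 70% and 3.6%; they are not the sizes of the labeled collection, and the function name is an illustrative assumption.

```python
def per_class_rates(tp, fn, tn, fp):
    """Return (spammer detection rate, non-spammer misclassification rate).

    tp: spammers classified as spammers
    fn: spammers classified as non-spammers
    tn: non-spammers classified as non-spammers
    fp: non-spammers classified as spammers
    """
    spammer_detection_rate = tp / (tp + fn)
    non_spammer_error_rate = fp / (fp + tn)
    return spammer_detection_rate, non_spammer_error_rate


# Hypothetical counts, chosen only to reproduce the reported percentages.
detected, misclassified = per_class_rates(tp=70, fn=30, tn=964, fp=36)
print(f"spammers detected: {detected:.1%}")
print(f"non-spammers misclassified: {misclassified:.1%}")
```

Under these assumed counts, 96.4% of non-spammers are correctly classified, which is consistent with the "96%" figure given in the abstract.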
The rest of the paper is organized as follows. The next section
presents background on Twitter and provides the definition of
spam used throughout this work. Section 3 describes our crawling
strategy and the labeled collection built from the crawled
dataset. Section 4 investigates a set of user attributes and
their ability to distinguish spammers from non-spammers. Section
5 describes and evaluates our strategies to detect spammers, and
Section 6 surveys related work. Finally, Section 7 offers
conclusions and directions for future work.
2. BACKGROUND AND DEFINITIONS
Twitter is an information sharing system in which users follow
other users in order to receive information along the social
links. Such information consists of short text messages called
tweets. Relationship links are directional, meaning that each
user has followers and followees, rather than undirected
friendship links. Tweets can be repeated throughout the network,
a process called re-tweeting. A retweeted message usually starts
with "RT @username", where the @ sign marks a reference to the
user who originally posted the message. Twitter users usually use
hashtags (#) to identify certain topics. A hashtag works like a
tag that is assigned to a tweet within its own body text.
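The retweet and hashtag conventions described above can be recognized with simple pattern matching. The following is a minimal sketch, assuming that a retweet begins with "RT @username" and that a hashtag is a "#" followed by word characters; the function name and the returned dictionary layout are illustrative assumptions, not part of this work.

```python
import re

# Assumed conventions: a retweet starts with "RT @username",
# and a hashtag is "#" followed by letters, digits, or underscores.
RETWEET_RE = re.compile(r"^RT @(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")


def parse_tweet(text):
    """Extract the retweeted user (if any) and the hashtags of a tweet."""
    match = RETWEET_RE.match(text)
    return {
        "retweet_of": match.group(1) if match else None,
        # Lowercase so that #MusicMonday and #musicmonday count as one topic.
        "hashtags": [tag.lower() for tag in HASHTAG_RE.findall(text)],
    }


print(parse_tweet("RT @alice: What are you listening to? #MusicMonday"))
```

Real tweets vary more than this (for example, retweet markers in the middle of a message), so the patterns should be read as a starting point only.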
The most popular hashtags or keywords that appear in tweets
become trending topics. Most trending topics reflect shocking or
breaking news and events that appear in the mass media. Among the
most popular events of 2009 that also became popular trending
topics are Michael Jackson's death, the Iranian election, and the
emergence of the British singer Susan Boyle on the TV show
Britain's Got Talent [2].
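As a rough illustration of how popular hashtags surface, the sketch below simply counts hashtag occurrences over a batch of tweets. This frequency count is an assumed stand-in; how Twitter actually selects trending topics is not described here, and the function name is hypothetical.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")


def top_hashtags(tweets, k=3):
    """Return the k most frequent hashtags (case-insensitive) with counts."""
    counts = Counter(
        tag.lower() for text in tweets for tag in HASHTAG_RE.findall(text)
    )
    return counts.most_common(k)


# Toy input, made up for illustration.
sample = [
    "Listening to the radio #musicmonday",
    "Great song! #MusicMonday #beatles",
    "Breaking news #iranelection",
    "More updates #iranelection #musicmonday",
]
print(top_hashtags(sample, k=2))
```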
However, the most popular hashtag recorded in 2009 is not related
to news or events that appeared in the traditional mass media.
The hashtag #musicmonday is widely used to announce weekly tips
about music, songs, or concerts. Many users post the songs they
are listening to every Monday and add that hashtag so that others
can search for them. Such hashtags are conventions created by
users that become widely adopted. As an example, the first tweet
in our dataset with this hashtag says:
What are you listening to? Tag it, #musicmonday "Come Together" -
The Beatles.
Figure 1 shows part of the results of a search on Twitter for the
hashtag #musicmonday. The figure shows three tweets returned by
the search and contains most of the elements discussed here: a
list of trending topics, hashtags, retweets, and anonymized user
names. The second tweet is an example of tweet spam, since it
contains a hashtag completely unrelated to the URL the tweet
points to.
Figure 1: Illustrative example of a search on Twitter for the
hashtag #musicmonday
In this paper, we consider as spammers on Twitter those users who
post at least one tweet containing a URL considered unrelated to
the tweet body text. Examples of tweet spam are: (i) a URL to a
website containing advertisements completely unrelated to a
hashtag in the tweet, and (ii) retweets in which legitimate links
are replaced by illegitimate ones that are obfuscated by URL
shorteners.
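The definition above is a user-level rule built on tweet-level judgments: a single spam tweet is enough to mark its author as a spammer. A minimal sketch of that rule follows, assuming each tweet has already been judged by a human as containing an unrelated URL or not; the input format and function name are illustrative assumptions.

```python
def label_users(judged_tweets):
    """Label each user as 'spammer' or 'non-spammer'.

    judged_tweets: iterable of (user, has_unrelated_url) pairs, where the
    boolean is a manual judgment of a single tweet (assumed input format).
    A user is a spammer if at least one of their tweets was judged as spam.
    """
    posted_spam = {}
    for user, has_unrelated_url in judged_tweets:
        posted_spam[user] = posted_spam.get(user, False) or has_unrelated_url
    return {
        user: "spammer" if flagged else "non-spammer"
        for user, flagged in posted_spam.items()
    }


# Toy judgments, made up for illustration.
judgments = [("u1", False), ("u2", False), ("u1", True), ("u2", False)]
print(label_users(judgments))
```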
Although there are other forms of opportunistic actions on
Twitter, not all of them can be considered spam. For example,
there are opportunistic users who follow a large number of people
in an attempt to be followed back and then disseminate their
messages. Here we do not consider content received through the
social links as spam, since users are free to follow whomever
they want.
3. DATASET AND LABELED COLLECTION
In order to evaluate our approach to detecting spammers on
Twitter, we need a labeled collection of users, pre-classified
into spammers and non-spammers. To the best of our knowledge, no
such collection is publicly available, so we had to build one.
Section 3.1 describes the strategy used to crawl Twitter. Section
3.2 then discusses the process used to select and manually
classify a subset of spammers and non-spammers.
3.1 Crawling Twitter
In analyzing the characteristics of users on Twitter, ideally we
would like to have at our disposal data for every existing
Twitter user, including their social connections and all the
tweets they ever posted. To that end, we asked Twitter to allow
us to collect such data, and they white-listed 58 servers at the
Max Planck Institute for Software Systems (MPI-SWS) in Germany.¹
Twitter assigns each user a numeric ID that uniquely identifies
the user's profile. We launched our crawler in August 2009 to
collect all user IDs ranging from 0 to 80 million. Since no user
in the collected data had a link to a user whose ID is greater
than 80 million, we conclude that our crawler inspected all users
with an account on Twitter. In total, we found 54,981,152
accounts in use that were connected to each other by
1,963,263,821 social links. We also collected all tweets ever
posted by the collected users, which consists of a total of
1,755,925,520
¹ Part of this work was done while the first author was visiting
MPI-SWS.
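The completeness argument in Section 3.1 rests on a simple check: if no collected social link points to an ID above the crawled range, the range covers every reachable account. A minimal sketch of that check follows, together with the average number of social links per account implied by the reported totals. The edge-list format and function name are assumptions for illustration; the crawler itself is not described at this level of detail.

```python
ID_UPPER_BOUND = 80_000_000  # upper end of the crawled ID range


def crawl_covers_all_links(edges, id_upper_bound=ID_UPPER_BOUND):
    """Return True if no link endpoint has an ID above the crawled range.

    edges: iterable of (follower_id, followee_id) pairs (assumed format).
    """
    return all(max(src, dst) <= id_upper_bound for src, dst in edges)


# Toy edge list, made up for illustration.
toy_edges = [(12, 4_500_000), (79_999_999, 12), (300, 54_000_000)]
print(crawl_covers_all_links(toy_edges))

# Totals reported in the text.
ACCOUNTS = 54_981_152
SOCIAL_LINKS = 1_963_263_821
print(f"average links per account: {SOCIAL_LINKS / ACCOUNTS:.1f}")
```

With the reported totals, the average works out to roughly 35.7 social links per account.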
