Spamscatter: Characterizing Internet Scam Hosting Infrastructure
Available from portal.acm.org
David S. Anderson Chris Fleizach Stefan Savage Geoffrey M. Voelker
Collaborative Center for Internet Epidemiology and Defenses
Department of Computer Science and Engineering
University of California, San Diego
Abstract
Unsolicited bulk e-mail, or SPAM, is a means to an end.
For virtually all such messages, the intent is to attract the
recipient into entering a commercial transaction, typically
via a linked Web site. While the prodigious infrastructure
used to pump out billions of such solicitations is essential,
the engine driving this process is ultimately the
point-of-sale: the various money-making scams
that extract value from Internet users. In the hopes of
better understanding the business pressures exerted on
spammers, this paper focuses squarely on the Internet
infrastructure used to host and support such scams. We
describe an opportunistic measurement technique called
spamscatter that mines emails in real-time, follows the
embedded link structure, and automatically clusters the
destination Web sites using image shingling to capture
graphical similarity between rendered sites. We have
implemented this approach on a large real-time spam
feed (over 1M messages per week) and have identified
and analyzed over 2,000 distinct scams on 7,000 distinct
servers.
1 Introduction
Few Internet security issues have attained the universal
public recognition or contempt of unsolicited bulk email,
or SPAM. In 2006, industry estimates suggest that such
messages comprise over 80% of all Internet email, with
a total volume of up to 85 billion per day [15,17]. The scale
of these numbers underscores the prodigious delivery
infrastructures developed by spammers and in turn
motivates the more than $1B spent annually on anti-spam
technology. However, the engine that drives this arms
race is not spam itself, which is simply a means to an
end, but the various money-making scams (legal or
illegal) that extract value from Internet users.
In this paper, we focus on the Internet infrastructure
used to host and support such scams. In particular, we
analyze spam-advertised Web servers that offer merchandise
and services (e.g., pharmaceuticals, luxury watches,
mortgages) or use malicious means to defraud users (e.g.,
phishing, spyware, trojans). Unlike mail-relays or bots,
scam infrastructure is directly implicated in the spam
profit cycle and thus considerably rarer and more valuable.
For example, a given spam campaign may use
thousands of mail relay agents to deliver its millions of
messages, but only use a single server to handle requests
from recipients who respond. Consequently, the availability
of scam infrastructure is critical to spam profitability:
a single takedown of a scam server or a spammer
redirect can curtail the earning potential of an entire
spam campaign.
The goal of this paper is to characterize scam infrastructure
and use this data to better understand the dynamics
and business pressures exerted on spammers. To
identify scam infrastructure, we employ an opportunistic
technique called spamscatter. The underlying principle
is that each scam is, by necessity, identified in the
link structure of associated spams. To this end, we have
built a system that mines email, identifies URLs in real
time, and follows such links to their eventual destination
server (including any redirection mechanisms put in
place). We further identify individual scams by clustering
scam servers whose rendered Web pages are graphically
similar using a technique called image shingling.
Finally, we actively probe the scam servers on an ongoing
basis to characterize dynamic behaviors like availability
and lifetime. Using the spamscatter technique on
a large real-time spam feed (roughly 150,000 messages per
day), we have identified over 2,000 distinct scams hosted
across more than 7,000 distinct servers. Further, we
characterize the availability of infrastructure implicated in
these scams and the relationship with business-related
factors such as scam type, location, and blacklist inclusion.
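As an illustration of the first stage of this pipeline, the following is a minimal sketch of how URLs might be mined from message bodies and grouped by the host they advertise. It is not the system described in this paper: the regular expression, the function names `extract_urls` and `group_by_host`, and the example host names are all assumptions made for illustration, and the sketch deliberately omits redirect following, page rendering, and probing.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Assumed pattern (not from the paper): an http/https URL running up to
# whitespace or a common delimiter such as a quote or angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')]+", re.IGNORECASE)


def extract_urls(message_body):
    """Return the URLs found in one message body, in order, without repeats."""
    urls = []
    for match in URL_PATTERN.findall(message_body):
        url = match.rstrip(".,;")  # drop trailing sentence punctuation
        if url not in urls:
            urls.append(url)
    return urls


def group_by_host(messages):
    """Map each advertised host name to the set of URLs that reference it."""
    hosts = defaultdict(set)
    for body in messages:
        for url in extract_urls(body):
            host = urlsplit(url).hostname  # lower-cased by urlsplit
            if host:
                hosts[host].add(url)
    return dict(hosts)
```

Grouping by host name is only a first approximation of a scam: as the text notes, one scam may sit behind many URLs and redirects, which is why the rendered pages are clustered in a later step.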
The remainder of this paper is structured as follows.
Section 2 reviews related measurement studies similar in
topic or technique. In Section 3 we outline the struc-
ture and lifecycle of Internet scams, and describe in detail
one of the more extensive scams from our trace as
a concrete example. Section 4 describes our measurement
methodology, including our probing system, image
shingling algorithm, and spam feed. In Section 5, we
analyze a wide range of characteristics of Internet scam
infrastructure based upon the scams we identify in our
spam feed. Finally, Section 6 summarizes our findings
and concludes.
2 Related work
Spamscatter is an opportunistic network measurement
technique [5], taking advantage of spurious traffic
(in this case, spam) to gain insight into hidden aspects
of the Internet (in this case, scam hosting infrastructure).
As with other opportunistic measurement techniques,
such as backscatter to measure Internet denial-of-service
activity [20], network telescopes and Internet
sinks [32] to measure Internet worm outbreaks [19, 21],
and spam to measure spam relays [27], spamscatter
provides a mechanism for studying global Internet behavior
from a single or small number of vantage points.
We are certainly not the first to use spam for opportunistic
measurement. Perhaps the work most closely
related to ours is Ramachandran and Feamster’s recent
study using spam to characterize the network behavior of
the spam relays that sent it [27]. Using extensive spam
feeds, they categorized the network and geographic location,
lifetime, platform, and network evasion techniques
of spam relay infrastructure. They also evaluated the
effectiveness of using network-level properties of spam
relays, such as IP blacklists and suspect BGP announcements,
to filter spam. When appropriate in our analyses,
we compare and contrast characteristics of spam relays
and scam hosts; some scam hosts also serve as spam
relays, for example. In general, however, due to the different
requirements of the two underground services, they
exhibit different characteristics; scam hosts, for example,
have longer lifetimes and are more concentrated in
the U.S.
The Webb Spam Corpus effort harvests URLs from
spam to create a repository of Web spam pages, Web
pages created to influence Web search engine results or
deceive users [31]. Although both their effort and our
own harvest URLs from spam, the two projects differ
in their use of the harvested URLs. The Webb Spam
Corpus downloads and stores HTML content to create
an offline data set for training classifiers of Web spam
pages. Spamscatter probes sites and downloads content
over time, renders browser screenshots to identify URLs
referencing the same scam, and analyzes various
characteristics of the infrastructure hosting scams.
Both community and commercial services consume
URLs extracted from spam. Various community services
mine spam to specifically identify and track phishing
sites, either by examining spam from their own feeds or
collecting spam email and URLs submitted by the
community [1, 6, 22, 25]. Commercial Web security and
filtering services, such as Websense and Brightcloud, track
and analyze Web sites to categorize and filter content,
and to identify phishing sites and sites hosting other
potentially malicious content such as spyware and keyloggers.
Sites advertised in spam provide an important data
source for such services. While we use similar data in our
work, our goal is infrastructure characterization rather
than operational filtering.
Botnets can play a role in the scam host infrastructure,
either by hosting the spam relays generating the spam
we see or by hosting the scam servers. A number of
recent efforts have developed techniques for measuring
botnet structure, behavior, and prevalence. Cook et al. [9]
tested the feasibility of using honeypots to capture bots,
and proposed a combination of passive host and network
monitoring to detect botnets. Bächer et al. [23] used
honeynets to capture bots, infiltrate their command and
control channel, and monitor botnet activity. Rajab et al. [26]
combined a number of measurement techniques, including
malware collection, IRC command and control tracking,
and DNS cache probing. The last two approaches
have provided substantial insight into botnet activity by
tracking hundreds of botnets over periods of months.
Ramachandran and Feamster [27] provided strong evidence
that botnets are commonly used as platforms for spam
relays; our results suggest botnets are not as common for
scam hosting.
We developed an image shingling algorithm to determine
the equivalence of screenshots of rendered Web
pages. Previous efforts have developed techniques to
determine the equivalence of transformed images as well.
For instance, the SpoofGuard anti-phishing Web browser
plugin compares images on Web pages with a database of
corporate logos [7] to identify Web site spoofing.
SpoofGuard compares images using robust image hashing, an
approach employing signal processing techniques to create
a compressed representation of an image [30]. Robust
image hashing works well against a number of different
image transformations, such as cropping, scaling, and
filtering. However, unlike image shingling, image hashing
is not intended to compare images where substantial
regions have completely different content; refinements to
image hashing improve robustness (e.g., [18,28]), but do
not fundamentally extend the original set of transforms.
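To make the contrast with image hashing concrete, the following is a minimal sketch of one way a shingling-style comparison could work: each screenshot is cut into fixed-size tiles, every tile is hashed, and two screenshots are scored by the overlap between their sets of tile hashes. This is an illustration under stated assumptions, not the algorithm used in this paper, which is described in Section 4. The tile size, the use of SHA-256, the Jaccard score, and the names `shingle` and `similarity` are all hypothetical choices, and a screenshot is modeled as a plain grid of 0-255 grayscale values.

```python
import hashlib


def shingle(pixels, tile=4):
    """Hash each non-overlapping tile x tile block of a 2-D grid of 0-255 ints.

    Returns a set, so repeated tiles (e.g. blank background) count once.
    Partial tiles at the right and bottom edges are ignored.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    hashes = set()
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            block = bytes(
                pixels[y][x]
                for y in range(top, top + tile)
                for x in range(left, left + tile)
            )
            hashes.add(hashlib.sha256(block).hexdigest())
    return hashes


def similarity(image_a, image_b, tile=4):
    """Jaccard overlap of the two tile-hash sets (assumed scoring rule)."""
    a, b = shingle(image_a, tile), shingle(image_b, tile)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Because each tile is compared independently, a page whose banner or product image has been swapped still shares most of its tiles with the original, whereas a single whole-image hash would change entirely. Exact tile hashes are, however, sensitive to any pixel shift or rescaling, so a practical system would need to choose the tile size and a similarity threshold empirically.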
3 The life and times of an Internet scam
In this section we outline the structure and life cycle
of Internet scams, and describe in detail one of the