A Walk in Facebook : Uniform Sampling of Users in Online Social Networks
- arXiv: 0906.0060v4
Abstract
The popularity of online social networks (OSNs) has given rise to a number of measurement studies that provide a first step towards their understanding. So far, such studies have been based either on complete data sets provided directly by the OSN itself or on Breadth-First-Search (BFS) crawling of the social graph, which does not guarantee good statistical properties of the collected sample. In this paper, we crawl the publicly available social graph and present the first unbiased sampling of Facebook (FB) users using a Metropolis-Hastings random walk with multiple chains. We study the convergence properties of the walk and demonstrate the uniformity of the collected sample with respect to multiple metrics of interest. We provide a comparison of our crawling technique to baseline algorithms, namely BFS and simple random walk, as well as to the 'ground truth' obtained through truly uniform sampling of userIDs. Our contributions lie both in the measurement methodology and in the collected sample. With regard to the methodology, our measurement technique (i) applies and combines known results from random walk sampling specifically in the OSN context and (ii) addresses system implementation aspects that have made the measurement of Facebook challenging so far. With respect to the collected sample: (i) it is the first representative sample of FB users and we plan to make it publicly available; (ii) we characterize several key properties of the data set and find that some of them differ substantially from what was previously believed based on non-representative OSN samples.
arXiv:0906.0060v4 [cs.SI] 4 Feb 2011
A Walk in Facebook: Uniform Sampling of Users
in Online Social Networks
Minas Gjoka, Maciej Kurant, Carter T. Butts, Athina Markopoulou
California Institute for Telecommunications and Information Technology (CalIT2)
University of California, Irvine
Abstract—Our goal in this paper is to develop a practical
framework for obtaining an unbiased sample of users in an
online social network (OSN) by crawling its social graph. Such a
sample allows one to estimate any user property and some topological
properties as well. To this end, first, we consider and compare
several candidate crawling techniques. Two approaches that can
produce unbiased samples are the Metropolis-Hastings random
walk (MHRW) and a re-weighted random walk (RWRW). Both
have pros and cons, which we demonstrate through a comparison
to each other as well as to the ground-truth. In contrast, using
Breadth-First-Search (BFS) or a simple Random Walk (RW)
leads to substantially biased results. Second, and in addition
to offline performance assessment, we introduce online formal
convergence diagnostics to assess sample quality during the data
collection process. We show how these diagnostics can be used to
effectively determine when a random walk sample is of adequate
size and quality. Third, as a case study, we apply the above
methods to sample Facebook users. We collect the first, to the
best of our knowledge, unbiased sample of Facebook users. We
make it publicly available and we use it to characterize several
key properties of Facebook.
Index Terms—Measurements, online social networks, Face-
book, sampling, crawling, random walks, convergence diagnos-
tics.
I. INTRODUCTION
Online Social Networks (OSNs) have recently emerged as
a new Internet killer-application. The adoption of OSNs by
Internet users is off-the-charts with respect to almost every
metric. In November 2010, Facebook, the most popular
OSN, counted more than 500 million members; the total
combined membership in the top five OSNs (Facebook,
QQ, Myspace, Orkut, Twitter) exceeded 1 billion users.
Putting this number into context, the population of OSN users
is approaching 20% of the world population and it is more than
50% of the world’s Internet users. According to Nielsen [1],
users all over the world now spend over 110 billion minutes
on social media sites per month, which accounts for 22% of
all time spent online, surpassing even email messaging as the
most preferred online activity. According to Alexa [2], a well-
known traffic analytics website, Facebook is the second most
visited website on the Internet (the first being Google) with
each user spending 30 minutes on average per day on the site
(more than the time spent on Google). Four of the top five
OSNs are also contained in Alexa’s top 15 websites in regard
to traffic rankings. Clearly, OSNs in general, and Facebook
in particular, have become an important phenomenon on the
Internet, which is worth studying.
OSN data are of interest to multiple disciplines and can
be used, for example, to design viral marketing strategies,
to model the spread of influence through social networks,
to conduct low-cost surveys at large scale, to detect hidden
community structures, etc. From a networking/systems
perspective, there are several reasons to study and better
understand OSNs. One motivation is to optimize content
delivery to users. Indeed, the aforementioned statistics show
that OSNs already play an increasingly important role in
generating and re-directing Internet traffic. OSN operators can
optimize data storage in the cloud to reduce response times,
e.g., by designing middleware that takes into account the
characteristics of the social graph to achieve data locality while
minimizing replication [3]. Network operators can exploit
OSNs to optimize content delivery by predicting user demand
and pre-fetching and caching content, as proposed in [4]. The
study in [5] measured the network-level effects of popular
third-party applications on Facebook. Another
use of OSNs is to create algorithms that can exploit trusted
or influential users, e.g., to thwart unwanted communication
while not impeding legitimate communication [6]; to utilize
social trust for collaborative spam filtering [7]; or to enable
online personas to cost-effectively obtain credentials [8].
The immense interest generated by OSNs has given rise
to a number of measurement and characterization studies that
attempt to provide a first step towards their understanding.
Only a very small number of these studies are based on
complete datasets provided by the OSN operators [9], [10]. A
few other studies have collected a complete view of specific
parts of OSNs; e.g., [11] collected the social graph of the
Harvard University network. However, the complete dataset
is typically unavailable to researchers, as most OSNs are
unwilling to share their company’s data even in an anonymized
form, primarily due to privacy concerns.
Furthermore, the large size¹ and access limitations of most
OSN services (e.g., login requirements, limited view, API
query limits) make it difficult or nearly impossible to fully
crawl the social graph of an OSN. In many cases, HTML
¹ A back-of-the-envelope calculation of the effort needed to crawl
Facebook’s social graph is as follows. In December 2010, Facebook
advertised more than 500 million active users, each encoded by a 64-bit
(8-byte) userID, and 130 friends per user on average. Therefore, the
raw topological data alone, without any node attributes, amounts to at least
500M × 130 × 8 bytes ≃ 520 GBytes.
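The arithmetic in this footnote can be checked with a few lines of Python. This is only a sketch of the calculation: the variable names are illustrative, and the figures are the ones quoted in the footnote.

```python
# Back-of-the-envelope size of the raw friendship data (figures from the footnote).
users = 500_000_000        # advertised active users
avg_friends = 130          # average number of friends per user
bytes_per_id = 64 // 8     # a 64-bit userID occupies 8 bytes

# One userID is stored per (user, friend) entry in the adjacency lists.
raw_bytes = users * avg_friends * bytes_per_id
raw_gbytes = raw_bytes / 1e9

print(f"raw topology: {raw_gbytes:.0f} GBytes")
```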
Instead, it would be desirable to obtain and use a small but
representative sample.
Therefore, sampling techniques become essential for esti-
mation of OSN properties, in practice. While sampling can,
in principle, allow precise inference from a relatively small
number of observations, this depends critically on the ability
to draw a sample with known statistical properties. The lack
of a sampling frame (i.e., a complete list of users, from
which individuals can be directly sampled) for most OSNs
makes principled sampling especially difficult. To evade this
limitation, our work focuses on sampling methods that are
based on crawling of friendship relations - a fundamental
primitive in any OSN.
Our goal in this paper is to provide a framework for
obtaining an asymptotically unbiased sample of OSN users
by crawling the social graph. We provide practical recom-
mendations for appropriately implementing the framework,
including: the choice of crawling technique; the use of online
convergence diagnostics; and the implementation of high-
performance crawlers. We then apply our framework to an
important case-study - Facebook. More specifically, we
make the following three contributions.
Our first contribution is the comparison of several candidate
graph-crawling techniques in terms of sampling bias and
efficiency. First, we consider Breadth-First-Search (BFS) - the
most widely used technique for measurements of other OSNs
[9], [12] and Facebook [13]. BFS sampling is known to
introduce bias towards high degree nodes, which is highly
non-trivial to characterize analytically [14], [15]. Second, we
consider Random Walk (RW) sampling, which also leads to
bias towards high degree nodes, but at least its bias can be
quantified by Markov Chain analysis and thus can be corrected
via appropriate re-weighting (RWRW) [16], [17]. Then, we
consider the Metropolis-Hastings Random Walk (MHRW)
that can directly yield a uniform stationary distribution of
users. This technique has been used in the past for P2P
sampling [18] and, more recently, for a few OSNs [19], [20], but not for
Facebook. Finally, we also collect a sample (UNI) that represents
the ground truth, i.e., a uniform sample of Facebook
userIDs, selected by a rejection sampling procedure from
Facebook’s 32-bit ID space. Such ground truth is in general
unavailable, and our ability to use it as a basis of comparison
is therefore a valuable asset of this study. We compare all
sampling methods in terms of their bias and convergence
properties. We find that MHRW and RWRW are both able to
collect asymptotically unbiased samples, while BFS and RW
result in significant bias in practice. We also compare MHRW
to RWRW, via analysis, simulation and experimentation and
discuss their pros and cons. The former provides a sample
ready to be used by non-experts, while the latter is more
efficient for all practical purposes.
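The difference between RW, RWRW and MHRW described above can be illustrated on a toy graph. The following is a minimal sketch, not the crawler used in this study: the graph, the function names and the number of steps are assumptions chosen for illustration. The simple random walk visits nodes in proportion to their degree, RWRW corrects for this afterwards by weighting each visit by 1/degree, and MHRW accepts a proposed move from v to w with probability min(1, deg(v)/deg(w)), so that its stationary distribution is uniform.

```python
import random

# Toy undirected graph as adjacency lists (hypothetical data, not Facebook).
# Node 0 is a hub; the triangle 0-1-2 makes the simple random walk aperiodic.
GRAPH = {
    0: [1, 2, 3, 4, 5],
    1: [0, 2],
    2: [0, 1],
    3: [0, 4],
    4: [0, 3],
    5: [0],
}

def rw_step(graph, v, rng):
    """Simple random walk: move to a uniformly chosen neighbor."""
    return rng.choice(graph[v])

def mhrw_step(graph, v, rng):
    """Metropolis-Hastings step whose stationary distribution is uniform."""
    w = rng.choice(graph[v])
    # Accept with probability min(1, deg(v) / deg(w)); otherwise stay at v.
    if rng.random() <= len(graph[v]) / len(graph[w]):
        return w
    return v

def walk(graph, start, steps, step_fn, seed=0):
    """Run one chain and return the list of visited nodes (with repetitions)."""
    rng = random.Random(seed)
    v, visited = start, []
    for _ in range(steps):
        v = step_fn(graph, v, rng)
        visited.append(v)
    return visited

def rwrw_mean(graph, rw_sample, f):
    """Re-weighted estimate of the mean of f: each visit is weighted by 1/degree."""
    weights = [1.0 / len(graph[v]) for v in rw_sample]
    return sum(w * f(v) for w, v in zip(weights, rw_sample)) / sum(weights)

def degree(v):
    return len(GRAPH[v])

STEPS = 200_000
rw_sample = walk(GRAPH, 0, STEPS, rw_step)
mh_sample = walk(GRAPH, 0, STEPS, mhrw_step)

true_mean = sum(degree(v) for v in GRAPH) / len(GRAPH)   # 14/6
rw_naive = sum(degree(v) for v in rw_sample) / STEPS     # biased upwards
mh_mean = sum(degree(v) for v in mh_sample) / STEPS      # close to true_mean
rwrw = rwrw_mean(GRAPH, rw_sample, degree)               # close to true_mean

print(true_mean, rw_naive, mh_mean, rwrw)
```

On this graph the naive RW average of the node degree converges to 3 (the sum of squared degrees divided by the sum of degrees), whereas both the MHRW average and the RWRW estimate converge to the true mean of 14/6.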
Our second contribution is that we introduce, for the first
time in this context, the use of formal convergence diagnostics
(namely Geweke and Gelman-Rubin) to assess sample quality
in an online fashion. These methods allow us to determine, in
the absence of a ground truth, when a sample is adequate for
subsequent use, and hence when it is safe to stop sampling,
which is a critical issue in implementation.
² For the previous example, if we conservatively assume that each user
occupies 500 bytes in the HTML page that contains a user’s friend list, one
would have to download about 500M × 130 × 500 bytes ≃ 32.5 TBytes of HTML data.
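The two diagnostics named in the second contribution can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact procedure used in this study: the Geweke score below uses plain sample variances (the standard formulation uses spectral variance estimates), the window fractions of 10% and 50% are conventional defaults, and the synthetic chains stand in for a scalar node property (e.g., node degree) recorded along each walk.

```python
import random
import statistics as st

def geweke_z(chain, first=0.1, last=0.5):
    """Z-score comparing the mean of an early window with a late window of one chain.

    Simplification: plain sample variances are used in the standard error.
    """
    a = chain[: int(first * len(chain))]
    b = chain[-int(last * len(chain)):]
    se = (st.variance(a) / len(a) + st.variance(b) / len(b)) ** 0.5
    return (st.fmean(a) - st.fmean(b)) / se

def gelman_rubin(chains):
    """Potential scale reduction factor R for several chains of equal length."""
    n = len(chains[0])
    chain_means = [st.fmean(c) for c in chains]
    between = n * st.variance(chain_means)               # B: between-chain variance
    within = st.fmean([st.variance(c) for c in chains])  # W: mean within-chain variance
    var_hat = (n - 1) / n * within + between / n
    return (var_hat / within) ** 0.5

# Synthetic chains (hypothetical data).
rng = random.Random(0)
stationary = [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
# Two of the four chains are shifted by 5, as if stuck in different regions.
shifted = [[x + 5 * (i % 2) for x in c] for i, c in enumerate(stationary)]
# A single chain whose mean is still drifting upwards.
drifting = [i / 100 + rng.gauss(0, 1) for i in range(2000)]

r_ok = gelman_rubin(stationary)
r_bad = gelman_rubin(shifted)
z_ok = geweke_z(stationary[0])
z_bad = geweke_z(drifting)

print(r_ok, r_bad, z_ok, z_bad)
```

Values of R close to 1 and Geweke scores of small magnitude are consistent with convergence; the shifted chains and the drifting chain are constructed to violate both criteria.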
Our third contribution is that we apply our framework
to an important case-study - Facebook. To the best of
our knowledge, this is the first time that all the aforementioned
techniques have been applied to and compared on
Facebook. We crawl Facebook’s web front-end, which is
highly non-trivial due to various access limitations, and we
provide guidelines for the practical implementation of high-
performance crawlers. We obtain the first unbiased sample
of Facebook users, which we make publicly available [21];
we have received approximately 200 requests for this dataset
in the last six months. Finally, we use the collected datasets
to characterize several key properties of Facebook, includ-
ing user properties (e.g., privacy settings) and topological
properties (e.g., the node degree distribution, clustering and
assortativity, and connectivity between regional networks).
Obtaining a uniform sample of OSN users is interesting
in its own right, as it allows one to estimate any user property,
such as age, privacy settings or any other user attribute. We
note that degree distribution is a specific user (node) property
that can be estimated from a uniform sample of users and
happens to contain information about the topology. In addition,
uniform sampling of users is a first step towards estimating
topological properties or the topology itself. For example, in
Section VI, we collect and study the egonets (i.e., the one-
hop neighborhood) of all sampled users and use them to
estimate the clustering coefficient and assortativity, which are
topological properties. In Section VI-C, we use the sample of
users to estimate the topology at the coarser granularity of
countries.
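As an illustration of how egonets support topological estimates, the sketch below computes the local clustering coefficient of a node from its one-hop neighborhood and averages it over a sample of users. It is a sketch under assumptions: the toy graph and function names are hypothetical, and nodes with fewer than two neighbors are assigned a coefficient of 0 by convention.

```python
def local_clustering(graph, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    nbrs = graph[v]
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention assumed here for nodes with fewer than two neighbors
    links = sum(
        1
        for i, a in enumerate(nbrs)
        for b in nbrs[i + 1:]
        if b in graph[a]
    )
    return 2.0 * links / (k * (k - 1))

def mean_clustering(graph, sample):
    """Average local clustering coefficient over a (uniform) sample of nodes."""
    return sum(local_clustering(graph, v) for v in sample) / len(sample)

# Toy undirected graph (hypothetical): triangle 0-1-2 plus a pendant node 3.
TOY = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}

c0 = local_clustering(TOY, 0)   # one of three neighbor pairs is linked
c1 = local_clustering(TOY, 1)   # both neighbors are linked to each other
c3 = local_clustering(TOY, 3)   # a single neighbor, so 0 by convention
avg = mean_clustering(TOY, [0, 1, 2, 3])
```

Only the friend lists of the sampled node and of its neighbors are needed, which is exactly the information contained in an egonet.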
The structure of the rest of the paper is as follows. Section II
discusses related work. Section III describes the sampling
methodology, including the assumptions and limitations, the
candidate crawling techniques and the convergence diagnos-
tics. Section IV describes the data collection process, including
the implementation of high-performance crawlers, and the
collected data sets from Facebook. Section V evaluates
and compares all sampling techniques in terms of efficiency
(convergence of various node properties) and quality (bias) of
the obtained sample. Section VI provides a characterization of
some key Facebook properties, based on the MHRW sample.
Section VII concludes the paper. The appendices elaborate
on the following important points: (A) the uniform sample
obtained via userID rejection sampling, referred to as “ground
truth”; (B) the lack of temporal dynamics in Facebook on the
timescale of our crawls; and (C) a comparison of the sampling
efficiency of MHRW vs. RWRW via analysis, simulation and
experimentation.
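The userID rejection sampling behind the "ground truth" sample can be sketched as follows: draw candidate IDs uniformly at random from the ID space and keep only those that correspond to existing users. This is a minimal sketch, not the procedure detailed in the appendix; `user_exists` is a hypothetical lookup (in a real crawl it would be a query to the OSN), and a 16-bit toy ID space replaces the 32-bit space to keep the example fast.

```python
import random

def uni_sample(user_exists, n, id_bits=32, seed=0):
    """Rejection sampling of n existing userIDs (with replacement).

    Every ID in the space is proposed with equal probability, so every
    existing user is equally likely to be accepted.
    """
    rng = random.Random(seed)
    sample = []
    while len(sample) < n:
        candidate = rng.getrandbits(id_bits)
        if user_exists(candidate):   # hypothetical membership test
            sample.append(candidate)
    return sample

# Toy population (hypothetical): every 7th ID of a 16-bit space is allocated.
ALLOCATED = set(range(0, 2 ** 16, 7))
uni = uni_sample(ALLOCATED.__contains__, n=1000, id_bits=16)
```

The cost of this approach grows as the ID space becomes sparser, since a larger share of the proposed candidates is rejected.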
II. RELATED WORK
Broadly speaking, there are two types of work most closely
related to this paper: (i) sampling techniques, focusing on the
quality and efficiency of the sampling technique itself and (ii)
characterization studies, focusing on the properties of online