Pairwise document similarity in large collections with MapReduce
Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies Short Papers HLT 08 (2008)
Proceedings of ACL-08: HLT, Short Papers (Companion Volume), pages 265–268, Columbus, Ohio, USA, June 2008. ©2008 Association for Computational Linguistics
Tamer Elsayed,∗ Jimmy Lin,† and Douglas W. Oard†
Human Language Technology Center of Excellence and
UMIACS Laboratory for Computational Linguistics and Information Processing
University of Maryland, College Park, MD 20742
{telsayed,jimmylin,oard}@umd.edu
Abstract
This paper presents a MapReduce algorithm for computing pairwise document similarity in large document collections. MapReduce is an attractive framework because it allows us to decompose the inner products involved in computing document similarity into separate multiplication and summation stages in a way that is well matched to efficient disk access patterns across several machines. On a collection consisting of approximately 900,000 newswire articles, our algorithm exhibits linear growth in running time and space in terms of the number of documents.
1 Introduction
Computing pairwise similarity on large document collections is a task common to a variety of problems such as clustering and cross-document coreference resolution. For example, in the PubMed search engine,¹ which provides access to the life sciences literature, a “more like this” browsing feature is implemented as a simple lookup of document-document similarity scores, computed offline. This paper considers a large class of similarity functions that can be expressed as an inner product of term weight vectors.
For document collections that fit into random-access memory, the solution is straightforward. As collection size grows, however, it ultimately becomes necessary to resort to disk storage, at which point aligning computation order with disk access patterns becomes a challenge. Further growth in the document collection will ultimately make it desirable to spread the computation over several processors, at which point interprocess communication becomes a second potential bottleneck for which the computation order must be optimized. Although tailored implementations can be designed for specific parallel processing architectures, the MapReduce framework (Dean and Ghemawat, 2004) offers an attractive solution to these challenges. In this paper, we describe how pairwise similarity computation for large collections can be efficiently implemented with MapReduce. We empirically demonstrate that removing high frequency (and therefore low entropy) terms results in approximately linear growth in required disk space and running time with increasing collection size for collections containing several hundred thousand documents.
∗ Department of Computer Science
† The iSchool, College of Information Studies
¹ http://www.ncbi.nlm.nih.gov/PubMed
2 MapReduce Framework
MapReduce builds on the observation that many tasks have the same structure: a computation is applied over a large number of records (e.g., documents) to generate partial results, which are then aggregated in some fashion. Naturally, the per-record computation and aggregation vary by task, but the basic structure remains fixed. Taking inspiration from higher-order functions in functional programming, MapReduce provides an abstraction that involves the programmer defining a “mapper” and a “reducer”, with the following signatures:
map: (k_1, v_1) → [(k_2, v_2)]
reduce: (k_2, [v_2]) → [(k_3, v_3)]
Key/value pairs form the basic data structure in MapReduce. The “mapper” is applied to every input key/value pair to generate an arbitrary number of intermediate key/value pairs. The “reducer” is applied to all values associated with the same intermediate key to generate output key/value pairs (see Figure 1).
[Figure 1 diagram: input → map → shuffling (group values by keys) → reduce → output]
Figure 1: Illustration of the MapReduce framework: the “mapper” is applied to all input records, which generates results that are aggregated by the “reducer”.
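The mapper and reducer signatures described above can be illustrated with a minimal single-process sketch in Python. It is not Hadoop and does not reflect any real MapReduce API: the helper `run_mapreduce` and the term-counting mapper and reducer are hypothetical names introduced here, and the "shuffle" is simply an in-memory dictionary that groups values by key.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for a MapReduce runtime (illustration only)."""
    # Map phase: each input (k1, v1) pair yields any number of (k2, v2) pairs.
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            # Shuffle: group intermediate values by their key.
            groups[k2].append(v2)
    # Reduce phase: each (k2, [v2]) group yields any number of (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output

def count_mapper(docid, text):
    # Emit (term, 1) for every token in the document.
    for term in text.split():
        yield term, 1

def sum_reducer(term, counts):
    # Aggregate all partial counts for one term.
    yield term, sum(counts)

docs = [("d1", "A A B C"), ("d2", "B D D"), ("d3", "A B B E")]
term_counts = dict(run_mapreduce(docs, count_mapper, sum_reducer))
print(term_counts)
```

A real runtime would distribute the map and reduce calls across machines and spill the grouped values to disk; the sketch only mirrors the data flow.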
On top of a distributed file system (Ghemawat et al., 2003), the runtime transparently handles all other aspects of execution (e.g., scheduling and fault tolerance), on clusters ranging from a few to a few thousand nodes. MapReduce is an attractive framework because it shields the programmer from distributed processing issues such as synchronization, data exchange, and load balancing.
3 Pairwise Document Similarity
Our work focuses on a large class of document similarity metrics that can be expressed as an inner product of term weights. A document $d$ is represented as a vector $W_d$ of term weights $w_{t,d}$, which indicate the importance of each term $t$ in the document, ignoring the relative ordering of terms (“bag of words” model). We consider symmetric similarity measures defined as follows:
$$\mathrm{sim}(d_i, d_j) = \sum_{t \in V} w_{t,d_i} \cdot w_{t,d_j} \qquad (1)$$
where $\mathrm{sim}(d_i, d_j)$ is the similarity between documents $d_i$ and $d_j$ and $V$ is the vocabulary set. In this type of similarity measure, a term will contribute to the similarity between two documents only if it has non-zero weights in both. Therefore, $t \in V$ can be replaced with $t \in d_i \cap d_j$ in equation 1.
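As a small illustration of equation 1 restricted to shared terms, the following Python sketch stores each document as a sparse dictionary of term weights and sums products only over terms present in both. The function name `similarity` and the use of raw term frequencies as weights are assumptions made for the example.

```python
def similarity(weights_i, weights_j):
    """Inner product of two sparse term-weight vectors (dicts: term -> weight)."""
    # Iterate over the shorter vector; terms missing from either side
    # contribute zero, so only the intersection matters.
    if len(weights_i) > len(weights_j):
        weights_i, weights_j = weights_j, weights_i
    return sum(w * weights_j[t] for t, w in weights_i.items() if t in weights_j)

# Term-frequency weights for two short documents, "A A B C" and "A B B E".
d1 = {"A": 2, "B": 1, "C": 1}
d3 = {"A": 1, "B": 2, "E": 1}
sim_13 = similarity(d1, d3)  # shared terms A and B: 2*1 + 1*2
print(sim_13)
```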
Generalizing this to the problem of computing similarity between all pairs of documents, we note that a term contributes to each pair that contains it.² For example, if a term appears in documents x, y, and z, it contributes only to the similarity scores between (x, y), (x, z), and (y, z). The list of documents that contain a particular term is exactly what is contained in the postings of an inverted index. Thus, by processing all postings, we can compute the entire pairwise similarity matrix by summing term contributions.
Algorithm 1 Compute Pairwise Similarity Matrix
∀i, j : sim[i, j] ⇐ 0
for all t ∈ V do
    p_t ⇐ postings(t)
    for all d_i, d_j ∈ p_t do
        sim[i, j] ⇐ sim[i, j] + w_{t,d_i} · w_{t,d_j}
Algorithm 1 formalizes this idea: postings(t) denotes the list of documents that contain term t. For simplicity, we assume that term weights are also stored in the postings. For small collections, this algorithm can be run efficiently to compute the entire similarity matrix in memory. For larger collections, disk access optimization is needed, which is provided by the MapReduce runtime without requiring explicit coordination.
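For a collection small enough to fit in memory, Algorithm 1 might be sketched in Python as follows. This is an illustrative rendering, not the authors' code: the postings are assumed to be a dictionary mapping each term to a list of (docid, weight) tuples, the similarity matrix is kept sparse rather than initialized to zero for all pairs, and only the pairs with the smaller docid first are computed, since the measure is symmetric.

```python
from collections import defaultdict
from itertools import combinations

def pairwise_similarity(postings):
    """In-memory version of Algorithm 1 over postings: term -> [(docid, weight)]."""
    sim = defaultdict(float)  # absent pairs implicitly have similarity 0
    for term, plist in postings.items():
        # Every unordered pair of documents sharing this term receives
        # the product of their weights for the term.
        for (di, wi), (dj, wj) in combinations(sorted(plist), 2):
            sim[(di, dj)] += wi * wj
    return dict(sim)

# Toy postings with term-frequency weights.
postings = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}
sim = pairwise_similarity(postings)
print(sim)
```

Terms that occur in a single document (C, D, and E here) produce no pairs and so cost nothing in the inner loop.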
We propose an efficient solution to the pairwise document similarity problem, expressed as two separate MapReduce jobs (illustrated in Figure 2):
1) Indexing: We build a standard inverted index (Frakes and Baeza-Yates, 1992), where each term is associated with a list of docids for documents that contain it and the associated term weights. Mapping over all documents, the mapper, for each term in the document, emits the term as the key, and a tuple consisting of the docid and term weight as the value. The MapReduce runtime automatically handles the grouping of these tuples, which the reducer then writes out to disk, thus generating the postings.
2) Pairwise Similarity: Mapping over each posting, the mapper generates key tuples corresponding to pairs of docids in the postings: in total, $\frac{1}{2}m(m-1)$ pairs where $m$ is the posting length. These key tuples are associated with the product of the corresponding term weights; they represent the individual term contributions to the final inner product. The MapReduce runtime sorts the tuples and then the reducer sums all the individual score contributions for a pair to generate the final similarity score.
² Actually, since we focus on symmetric similarity functions, we only need to compute half the pairs.
[Figure 2 diagram, reconstructed as text]
Documents: d1 = “A A B C”, d2 = “B D D”, d3 = “A B B E”
Indexing, map output: (A,(d1,2)), (B,(d1,1)), (C,(d1,1)); (B,(d2,1)), (D,(d2,2)); (A,(d3,1)), (B,(d3,2)), (E,(d3,1))
Indexing, reduce output (postings): (A,[(d1,2),(d3,1)]), (B,[(d1,1),(d2,1),(d3,2)]), (C,[(d1,1)]), (D,[(d2,2)]), (E,[(d3,1)])
Pairwise similarity, map output: ((d1,d3),2); ((d1,d2),1), ((d1,d3),2), ((d2,d3),2)
Pairwise similarity, after shuffle: ((d1,d2),[1]), ((d1,d3),[2,2]), ((d2,d3),[2])
Pairwise similarity, reduce output: ((d1,d2),1), ((d1,d3),4), ((d2,d3),2)
Figure 2: Computing pairwise similarity of a toy collection of 3 documents. A simple term weighting scheme ($w_{t,d} = tf_{t,d}$) is chosen for illustration.
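The two jobs above can be traced end to end on the toy collection of Figure 2 with a short Python sketch. It is a single-process simulation under stated assumptions, not Hadoop code: `run_job` is a hypothetical in-memory stand-in for the runtime, term frequency is used as the term weight (as in the figure), and the mapper and reducer names are invented for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations

def run_job(records, mapper, reducer):
    """Hypothetical in-memory stand-in for one MapReduce job."""
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)  # shuffle: group values by key
    return [out for k2 in sorted(groups) for out in reducer(k2, groups[k2])]

# Job 1: indexing. Emit (term, (docid, weight)) and collect postings.
def index_mapper(docid, text):
    for term, tf in Counter(text.split()).items():
        yield term, (docid, tf)

def index_reducer(term, postings):
    yield term, sorted(postings)

# Job 2: pairwise similarity. Emit one weight product per docid pair,
# then sum the contributions for each pair.
def pair_mapper(term, postings):
    for (di, wi), (dj, wj) in combinations(postings, 2):
        yield (di, dj), wi * wj

def pair_reducer(pair, contributions):
    yield pair, sum(contributions)

docs = [("d1", "A A B C"), ("d2", "B D D"), ("d3", "A B B E")]
index = run_job(docs, index_mapper, index_reducer)
scores = dict(run_job(index, pair_mapper, pair_reducer))
print(scores)
```

Because each posting list is sorted by docid before pairs are generated, every pair is emitted with the smaller docid first, so contributions from different terms meet under the same key and only half of the symmetric matrix is produced.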
4 Experimental Evaluation
In our experiments, we used Hadoop version 0.16.0,³ an open-source Java implementation of MapReduce, running on a cluster with 20 machines (1 master, 19 slave). Each machine has two single-core processors (running at either 2.4GHz or 2.8GHz), 4GB memory, and 100GB disk.
We implemented the symmetric variant of Okapi-BM25 (Olsson and Oard, 2007) as the similarity function. We used the AQUAINT-2 collection of newswire text, containing 906k documents, totaling approximately 2.5 gigabytes. Terms were stemmed. To test the scalability of our technique, we sampled the collection into subsets of 10, 20, 25, 50, 67, 75, 80, 90, and 100 percent of the documents.
After stopword removal (using Lucene’s stopword list), we implemented a df-cut, where a fraction of the terms with the highest document frequencies is eliminated.⁴ This has the effect of removing non-discriminative terms. In our experiments, we adopt a 99% cut, which means that the most frequent 1% of terms were discarded (9,093 terms out of a total vocabulary size of 909,326). This technique greatly increases the efficiency of our algorithm, since the number of tuples emitted by the mappers in the pairwise similarity phase is dominated by the length of the longest posting (in the worst case, if a term appears in all documents, it would generate approximately $10^{12}$ tuples).
³ http://hadoop.apache.org/
⁴ In text classification, removal of rare terms is more common. Here we use df-cut to remove common terms.
[Figure 3 plot: computation time (minutes) versus corpus size (%); annotated $R^2 = 0.997$]
Figure 3: Running time of pairwise similarity comparisons, for subsets of AQUAINT-2.
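A df-cut of this kind could be sketched in Python as below. The function `df_cut` and its `drop_fraction` parameter are hypothetical names, the rounding rule (truncating the number of dropped terms) is an assumption that happens to reproduce the 9,093 of 909,326 figure, and ties in document frequency are broken alphabetically purely for determinism. The second function works out the worst-case tuple count for a single term, taking "906k" to mean exactly 906,000 documents, which is also an assumption.

```python
def df_cut(postings, drop_fraction=0.01):
    """Drop the `drop_fraction` of terms with the highest document frequency."""
    # Sort terms by ascending document frequency (posting length).
    by_df = sorted(postings, key=lambda t: (len(postings[t]), t))
    n_drop = int(len(by_df) * drop_fraction)  # assumed rounding: truncate
    kept_terms = by_df[:len(by_df) - n_drop]
    return {t: postings[t] for t in kept_terms}

def worst_case_pairs(num_docs):
    """Pairs emitted for one term that appears in every document."""
    return num_docs * (num_docs - 1) // 2

# Toy postings: B has the highest document frequency (3).
postings = {
    "A": [("d1", 2), ("d3", 1)],
    "B": [("d1", 1), ("d2", 1), ("d3", 2)],
    "C": [("d1", 1)],
    "D": [("d2", 2)],
    "E": [("d3", 1)],
}
kept = df_cut(postings, drop_fraction=0.2)  # drops 1 of 5 terms
print(sorted(kept))
print(int(909_326 * 0.01))        # number of terms removed by a 99% cut
print(worst_case_pairs(906_000))  # on the order of 10^11 to 10^12
```

The worst-case figure shows why the longest posting dominates: a single term present in every document would by itself emit hundreds of billions of tuples.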
Figure 3 shows the running time of the pairwise similarity phase for different collection sizes.⁵ The computation for the entire collection finishes in approximately two hours. Empirically, we find that running time increases linearly with collection size, which is an extremely desirable property. To get a sense of the space complexity, we compute the number of intermediate document pairs that are emitted by the mappers. The space savings are large (3.7 billion rather than 8.1 trillion intermediate pairs for the entire collection), and space requirements grow linearly with collection size over this region ($R^2 = 0.9975$).
⁵ The entire collection was indexed in about 3.5 minutes.
[Figure 4 plot: intermediate pairs (billions) versus corpus size (%), with one curve each for df-cut at 99%, 99.9%, 99.99%, and 99.999%, and for no df-cut]
Figure 4: Effect of changing df-cut thresholds on the number of intermediate document pairs emitted, for subsets of AQUAINT-2.
5 Discussion and Future Work
In addition to empirical results, it would be desirable to derive an analytical model of our algorithm’s complexity. Here we present a preliminary sketch of such an analysis and discuss its implications. The complexity of our pairwise similarity algorithm is tied to the number of document pairs that are emitted by the mapper, which equals the total number of products required in $O(N^2)$ inner products, where $N$ is the collection size. This is equal to:
$$\frac{1}{2} \sum_{t \in V} df_t (df_t - 1) \qquad (2)$$
where $df_t$ is the document frequency, or equivalently the length of the postings for term $t$. Given that tokens in natural language generally obey Zipf’s Law, and vocabulary size and collection size can be related via Heaps’ Law, it may be possible to develop a closed form approximation to the above series.
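Equation 2 is straightforward to evaluate once document frequencies are known. The Python sketch below does so for the toy collection of Figure 2; the function name `intermediate_pairs` is hypothetical, and the document frequencies are read off the toy postings rather than taken from the experiments.

```python
def intermediate_pairs(doc_freqs):
    """Evaluate equation 2: half the sum of df_t * (df_t - 1) over all terms."""
    # df * (df - 1) is always even, so integer division is exact.
    return sum(df * (df - 1) for df in doc_freqs) // 2

# Document frequencies in the toy collection of Figure 2.
toy_dfs = {"A": 2, "B": 3, "C": 1, "D": 1, "E": 1}
total = intermediate_pairs(toy_dfs.values())

# Removing the most frequent term (B) leaves only the pair contributed by A.
without_b = intermediate_pairs(df for t, df in toy_dfs.items() if t != "B")
print(total, without_b)
```

The total agrees with the four tuples emitted by the pairwise similarity mappers in Figure 2, and the second value shows in miniature how removing the head of the df distribution shrinks the intermediate output.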
Given the necessity of computing $O(N^2)$ inner products, it may come as a surprise that empirically our algorithm scales linearly (at least for the collection sizes we explored). We believe that the key to this behavior is our df-cut technique, which eliminates the head of the df distribution. In our case, eliminating the top 1% of terms reduces the number of document pairs by several orders of magnitude. However, the impact of this technique on effectiveness (e.g., in a query-by-example experiment) has not yet been characterized. Indeed, a df-cut threshold of 99% might seem rather aggressive, removing meaning-bearing terms such as “arthritis” and “Cornell” in addition to perhaps less problematic terms such as “sleek” and “frail.” But redundant use of related terms is common in news stories, which we would expect to reduce the adverse effect on many applications of removing these low entropy terms.
Moreover, as Figure 4 illustrates, relaxing the df-cut to a 99.9% threshold still results in approximately linear growth in the requirement for intermediate storage (at least over this region).⁶ In essence, optimizing the df-cut is an efficiency vs. effectiveness tradeoff that is best made in the context of a specific application. Finally, we note that alternative approaches to similar problems based on locality-sensitive hashing (Andoni and Indyk, 2008) face similar tradeoffs in tuning for a particular false positive rate; cf. (Bayardo et al., 2007).
6 Conclusion
We present a MapReduce algorithm for efficiently computing pairwise document similarity in large document collections. In addition to offering specific benefits for a number of real-world tasks, we also believe that our work provides an example of a programming paradigm that could be useful for a broad range of text analysis problems.
Acknowledgments
This work was supported in part by the Intramural Research Program of the NIH/NLM/NCBI.
References
A. Andoni and P. Indyk. 2008. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. CACM, 51(1):117–122.
R. Bayardo, Y. Ma, and R. Srikant. 2007. Scaling up all pairs similarity search. In WWW ’07.
J. Dean and S. Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. In OSDI ’04.
W. Frakes and R. Baeza-Yates. 1992. Information Retrieval: Data Structures and Algorithms.
S. Ghemawat, H. Gobioff, and S. Leung. 2003. The Google File System. In SOSP ’03.
J. Olsson and D. Oard. 2007. Improving text classification for oral history archives with temporal domain knowledge. In SIGIR ’07.
⁶ More recent experiments suggest that a df-cut of 99.9% results in almost no loss of effectiveness on a query-by-example task, compared to no df-cut.