An information-theoretic measure for document similarity
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '03 (2003)
- ISBN: 1581136463
- DOI: 10.1145/860435.860545
Available from portal.acm.org
An Information-theoretic Measure for Document Similarity ∗
Javed A. Aslam
Department of Computer Science
Dartmouth College
jaa@cs.dartmouth.edu
Meredith Frost
Department of Computer Science
Dartmouth College
Meredith.Frost@dartmouth.edu
ABSTRACT
Recent work has demonstrated that the assessment of pairwise object similarity can be approached in an axiomatic manner using information theory. We extend this concept specifically to document similarity and test the effectiveness of an information-theoretic measure for pairwise document similarity. We adapt query retrieval to rate the quality of document similarity measures and demonstrate that our proposed information-theoretic measure for document similarity yields statistically significant improvements over other popular measures of similarity.
Categories and Subject Descriptors:
H.3.3 [Information Search and Retrieval]: Clustering
General Terms: Theory, Experimentation
Keywords: Similarity measures
1. INTRODUCTION
Measuring pairwise document similarity is essential to various tasks in information retrieval, such as clustering and some forms of query retrieval. It is therefore important to calculate similarity as effectively as possible, and prior work has compared the quality of various similarity measures in some contexts [4].
Dekang Lin [3] has investigated the theoretical basis of similarity and derived the general form of an information-theoretic measure for object similarity. Similarity may be viewed as a question of how much information two objects have in common and how much they differ. Information theory provides a means for quantifying these intuitive notions, being directly concerned with the mathematical expression of information content.
Based on six axioms for similarity, Lin derived the following general form for pairwise object similarity

$$\text{IT-Sim}(A,B) = \frac{I(\text{common}(A,B))}{I(\text{description}(A,B))}$$

where I(common(A,B)) is the information content associated with the statement describing what A and B have in common and I(description(A,B)) is the information content associated with the statement describing A and B. The information content of a statement x is defined by its self-information log(1/π(x)) [2], where π(x) is the probability of the statement within the world of the objects in question.

∗This work partially supported by NSF Career Grant CCR-0093131.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGIR’03, July 28–August 1, 2003, Toronto, Canada.
Copyright 2003 ACM 1-58113-646-3/03/0007 ...$5.00.

For objects which can be described by a set S of independent features s, Lin derives the following instantiation of this principle:

$$\text{IT-Sim}(A,B) = \frac{2 \cdot \sum_{s \in A \cap B} \log \pi(s)}{\sum_{s \in A} \log \pi(s) + \sum_{s \in B} \log \pi(s)}$$

where $\pi(s)$ is the fraction of objects exhibiting feature s.
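As a rough illustration of the set-based formula, the following Python sketch computes the similarity of two feature sets. The function names and the `feature_prob` table are assumptions made for this example, not part of Lin's or the authors' work. It sums self-information log(1/p) rather than log p; the sign cancels between numerator and denominator, so the ratio is unchanged.

```python
import math


def self_information(probability):
    """Self-information log(1/p) of a statement with probability p (natural log)."""
    return math.log(1.0 / probability)


def it_sim_features(features_a, features_b, feature_prob):
    """Set-based IT-Sim: 2 * (info of shared features) / (info of A + info of B).

    feature_prob is a hypothetical precomputed table mapping each feature
    to the fraction of objects exhibiting it.
    """
    common = sum(self_information(feature_prob[s]) for s in features_a & features_b)
    total = (sum(self_information(feature_prob[s]) for s in features_a)
             + sum(self_information(feature_prob[s]) for s in features_b))
    # Assumption: return 0.0 when neither object carries any information.
    return 2.0 * common / total if total > 0 else 0.0


# Toy world of 8 objects; each value is the fraction of objects with the feature.
feature_prob = {"x": 1 / 8, "y": 2 / 8, "z": 4 / 8}
print(it_sim_features({"x", "y"}, {"y", "z"}, feature_prob))
```

In this toy world the shared feature "y" carries 2 bits, while the two descriptions carry 5 and 3 bits, so the similarity works out to 2·2/8 = 0.5 up to floating-point rounding. Rare shared features raise the score more than common ones.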
We may employ this methodology to assess the pairwise similarity of documents if we assume, to a first approximation, that documents are composed of a set of independent term “features.” The probability $\pi(t)$ is simply the fraction of corpus documents containing term t, and we need only generalize the above formulation to account for the fact that “normalized” documents may contain a “fraction” of a feature. For each document d and term t, let $p_{d,t}$ be the fractional occurrence of term t in document d; thus, $\sum_t p_{d,t} = 1$ for all d. Two (normalized) documents A and B share $\min\{p_{A,t}, p_{B,t}\}$ amount of term t in “common,” while they contain $p_{A,t}$ and $p_{B,t}$ amount of term t individually.
We may then infer the following

$$\text{IT-Sim}(A,B) = \frac{2 \cdot \sum_{t} \min\{p_{A,t}, p_{B,t}\} \log \pi(t)}{\sum_{t} p_{A,t} \log \pi(t) + \sum_{t} p_{B,t} \log \pi(t)}.$$

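The document-level measure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: documents are given as lists of already stemmed and stop-word-filtered tokens, and the helper names `term_fractions`, `document_fractions`, and `it_sim` are invented for this example. As before, the code sums log(1/π(t)) instead of log π(t), which leaves the ratio unchanged.

```python
import math
from collections import Counter


def term_fractions(tokens):
    """Fractional occurrence of each term in one document; values sum to 1."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def document_fractions(corpus):
    """Fraction of corpus documents containing each term."""
    doc_counts = Counter()
    for tokens in corpus:
        doc_counts.update(set(tokens))  # count each term once per document
    return {term: count / len(corpus) for term, count in doc_counts.items()}


def it_sim(p_a, p_b, pi):
    """IT-Sim of two normalized documents given corpus term fractions pi."""
    def info(term):
        return -math.log(pi[term])

    # Terms absent from either document contribute min(...) = 0, so the
    # numerator only needs the shared vocabulary.
    common = sum(min(p_a[t], p_b[t]) * info(t) for t in p_a.keys() & p_b.keys())
    total = (sum(p * info(t) for t, p in p_a.items())
             + sum(p * info(t) for t, p in p_b.items()))
    # Assumption: return 0.0 when both documents carry no information.
    return 2.0 * common / total if total > 0 else 0.0


corpus = [["a", "b"], ["a", "c"], ["c", "d"], ["d", "d"]]
pi = document_fractions(corpus)
print(it_sim(term_fractions(corpus[0]), term_fractions(corpus[1]), pi))
```

For this toy corpus, the first two documents share half a unit of term "a" (1 bit), against descriptions worth 1.5 and 1 bits, giving 2·0.5/2.5 = 0.4 up to floating-point rounding.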
2. TESTING SIMILARITY MEASURES
We adapt the process of query retrieval in the TREC competition to test the effectiveness of similarity measures. Based on the assumption that relevant documents are more similar to each other than to those that are non-relevant [5], the technique is as follows:¹
(1) For each document relevant to a query retrieval topic, use each similarity measure to retrieve a ranked list of the most similar documents. In essence, treat this document as if it were a query.
(2) Obtain a measurement of the quality of the ranked lists using the TREC evaluation program.
(3) Average the results over all documents within a query, then over all queries, to yield a final score for each TREC corpus.
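The three steps above might be sketched as follows. Several choices here are assumptions for illustration only: average precision stands in for whatever the TREC evaluation program reports, the query document is excluded from its own ranked list and from its target set, and the names `average_precision` and `evaluate_measure` are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of precision values at the ranks where relevant documents appear."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0


def evaluate_measure(similarity, docs, relevant_by_topic):
    """Score one similarity measure on one corpus.

    docs maps document id -> representation accepted by `similarity`;
    relevant_by_topic maps topic id -> set of relevant document ids.
    """
    topic_scores = []
    for relevant in relevant_by_topic.values():
        doc_scores = []
        for query_id in relevant:
            # Step 1: treat the relevant document as a query and rank the rest.
            candidates = [d for d in docs if d != query_id]
            ranked = sorted(candidates,
                            key=lambda d: similarity(docs[query_id], docs[d]),
                            reverse=True)
            # Step 2: score the ranked list against the other relevant documents.
            doc_scores.append(average_precision(ranked, relevant - {query_id}))
        # Step 3a: average over the documents within a query.
        if doc_scores:
            topic_scores.append(sum(doc_scores) / len(doc_scores))
    # Step 3b: average over all queries.
    return sum(topic_scores) / len(topic_scores) if topic_scores else 0.0
```

Any pairwise function can be passed as `similarity`, so the same harness can compare several measures on identical ranked-list tasks.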
¹ We employ the Porter stemmer and the SMART stop word list to index our corpora.


