Gravitation-based model for information retrieval

by Shuming Shi, Ji-Rong Wen, Qing Yu, Ruihua Song, Wei-Ying Ma
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval SIGIR 05 (2005)

Gravitation-Based Model for Information Retrieval
Shuming Shi, Ji-Rong Wen, Qing Yu¹, Ruihua Song, Wei-Ying Ma
Microsoft Research Asia, 49 Zhichun Road, Beijing, 100080, P.R. China
{t-shshi, jrwen, t-rsong, wyma}@microsoft.com
¹ Department of Computer Science, Beijing Institute of Technology, Beijing, P.R. China
¹ f-qyu@msrchina.research.microsoft.com

ABSTRACT
This paper proposes GBM (gravitation-based model), a physical
model for information retrieval inspired by Newton’s theory of
gravitation. The model maps concepts of information retrieval
(documents, queries, relevance, etc.) to those of physics (mass,
distance, radius, attractive force, etc.), and so provides a new
perspective on IR problems. A family of effective term weighting
functions can be derived from it, including the well-known BM25
formula. The model has several advantages over most existing
ones. First, because it is based directly on basic physical laws,
the derived formulas and algorithms have explicit physical
interpretations. Second, the ranking formulas derived from it
satisfy more intuitive heuristics than most existing ones, and
thus have the potential to perform better empirically and to be
used safely in various settings. Finally, a new approach to
structured document retrieval derived from the model is more
reasonable and performs better than existing ones.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Retrieval models
General Terms
Algorithms, Experimentation, Theory
Keywords
Information retrieval models, Gravitation-based model, theory of
gravitation, mass estimation, structured document retrieval
1. INTRODUCTION
Information retrieval (IR) models, which define the representation
of documents and queries and the relevance relationship between
them, are central to IR. Over the past several decades, many
categories of IR models (and their variants) have been proposed
and studied [2], including Boolean models, vector space models
[3][4], probabilistic and logic models [10][14][6][1], and
language models [12][13][7][24].
The key to each model is its primary perspective on
information retrieval. The Boolean model views IR problems from
the perspective of set theory and Boolean algebra, while the
vector space model takes the perspective of vectors and linear
algebra. Most other categories of models take the probabilistic
perspective, which is the dominant perspective on
information retrieval today.
It may be extremely hard to answer questions like “what is the
essence of information retrieval” and “what is the right
perspective on it”. However, it is clear that, so far, we have
learned more about information retrieval each time a new
perspective has been adopted. It should therefore be helpful to
view information retrieval from further new perspectives.
Although many of the models (and the formulas and algorithms
derived from them) have been successfully applied to various
tasks, they still face some problems. First, the
retrieval formulas (formal or ad hoc) produced by most IR
models fail to satisfy even some basic intuitive heuristic
constraints [5]. Second, the retrieval formulas derived from or
motivated by many IR models commonly lack intuitive
interpretations, especially physical interpretations. At the same
time, we live in a physical world which is governed by
fundamental physical laws. Can we get help from “the God” in
acquiring a deeper understanding of information retrieval?
In this paper, we try to view information retrieval from the
perspective of physics, which is quite different from existing
perspectives. We propose a new framework which models documents,
queries, and their relationships using basic concepts in physics. In
particular, documents and queries are modeled as objects with
specific structures, and the relationship between a query and a
document is modeled as the attractive force between them. The
basic rule used here is Sir Isaac Newton’s theory of gravitation
(see Section 2.1 for a brief introduction), a fundamental law
of the universe. The primary goal of the model is to help us learn
more about information retrieval from a new perspective.
It is encouraging that we can really benefit from nature. With
the new perspective and model, we have obtained the following
preliminary results:
1. We have derived a family of effective ranking formulas
which satisfy all the heuristic constraints¹ proposed in [5].
Experimental results show that these formulas are among
the most effective ranking functions proposed so far.
2. The BM25 term weighting function [9][11] can be easily
derived from our basic model, which gives an intuitive
physical interpretation of this powerful and robust function.

¹ There is a small issue with the TDC constraint, which is
discussed in Section 3.2.4.3.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
SIGIR’05, August 15-19, 2005, Salvador, Brazil.
Copyright 2005 ACM 1-59593-034-5/05/0008...$5.00.

3. A more reasonable approach to structured document
retrieval can be obtained from the model. This approach is
not only highly effective but also robust enough to be used
under various conditions.
In this paper, we examine the gravitation-based model
theoretically and empirically. We first give some background
knowledge and related work in Section 2. In Section 3, the GBM
model is introduced and analyzed theoretically. The
performance of the model is tested experimentally in Section
4. We conclude the paper in Section 5.
2. BACKGROUND AND RELATED WORK
In this section, we first give a brief introduction to Newton’s
theory of gravitation, upon which our model is built. We then
discuss some related work.
2.1 Newton’s Theory of Gravitation
Gravitation is one of the four fundamental forces of nature. It
governs the motion of stars and plays a crucial role in most
processes on the earth. In 1687 Newton proposed his famous law
of gravitation, which states that any two objects in the
universe attract each other. For two particles with masses m1
and m2 separated by a distance d, the gravitational force
between them can be expressed as
follows,
$$F = \frac{G\, m_1 m_2}{d^2} \tag{2.1}$$
where G is a constant called the universal gravitation constant.
The force is directed along the line between the two
particles. It can also be proved by calculus that the gravitational
force between two spheres can be computed as if all of their mass
were concentrated at their centers.
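
As a quick numerical illustration of Equation (2.1), the following minimal sketch computes the force between two point masses. The function name, the sample masses and distances, and the approximate SI value of G are assumptions made for this example; they are not taken from the paper.

```python
# Minimal sketch of Equation (2.1); names and sample values are illustrative.
G = 6.674e-11  # universal gravitation constant, N m^2 / kg^2 (approximate)


def gravitational_force(m1: float, m2: float, d: float) -> float:
    """Magnitude of the attractive force between two point masses."""
    if d <= 0:
        raise ValueError("distance must be positive")
    return G * m1 * m2 / (d * d)


# The force follows an inverse-square law: doubling the distance
# reduces the force to one quarter of its previous value.
f_near = gravitational_force(5.0, 10.0, 2.0)
f_far = gravitational_force(5.0, 10.0, 4.0)
```

The inverse-square dependence on d is the property to keep in mind when distance is later given a retrieval interpretation.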
2.2 Information Retrieval Perspectives and
Models
Every IR model has its primary perspective on information
retrieval problems, and ranking formulas for retrieval are
commonly derived from or motivated by the models. The term
weighting function, which defines the score of a document given
one query term, is the most important part of a ranking formula.
The following is a brief overview of some categories of IR
models and the most effective term weighting functions for them.
2.2.1 Vector space model
In the vector space model [3][4], each document is represented as
a vector of terms, as is each query. Relevance is measured
by the similarity (e.g. the cosine of the angle) between the query
vector and the document vector. Many term weighting functions have
been proposed for this model (and its variants), among which
the pivoted normalization weighting formula [4] appears to be one
of the best,
$$w(D,Q,t) = \frac{1 + \ln(1 + \ln(c(t,D)))}{(1-s) + s\,\frac{|D|}{avdl}} \cdot \ln\frac{N+1}{df(t)} \tag{2.2}$$
where s (between 0.0 and 1.0) is a parameter. Please see Table 1
(Section 3.1) for the notation used in the above formula.
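
The pivoted normalization weight of Equation (2.2) can be sketched in a few lines. This is only an illustration: the function and argument names are hypothetical, the default s = 0.2 is an arbitrary choice, and the notation is read in the usual way (c(t,D) is the count of term t in document D, |D| the document length, avdl the average document length, N the number of documents, and df(t) the number of documents containing t).

```python
import math


def pivoted_weight(tf: float, doc_len: float, avdl: float,
                   n_docs: int, df: int, s: float = 0.2) -> float:
    """Pivoted normalization weight for one query term, as in Eq. (2.2).

    tf is c(t, D); the default value of s is an illustrative assumption.
    """
    if tf <= 0:
        return 0.0  # a term absent from the document contributes nothing
    tf_part = 1.0 + math.log(1.0 + math.log(tf))
    length_norm = (1.0 - s) + s * doc_len / avdl
    idf_part = math.log((n_docs + 1) / df)
    return tf_part / length_norm * idf_part


# For a document of average length and a single occurrence of the term,
# the weight reduces to the IDF-like factor ln((N + 1) / df(t)).
w_avg = pivoted_weight(tf=1, doc_len=100, avdl=100, n_docs=99, df=10)
w_long = pivoted_weight(tf=1, doc_len=400, avdl=100, n_docs=99, df=10)
```

With s > 0, a document longer than average receives a lower weight for the same term count, which is the purpose of the length normalization in the denominator.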
2.2.2 Probabilistic model
The probabilistic model [10][14][6][1] formulates the IR problem
in a probabilistic framework, which measures the relevance of a
document to a user query by estimating the probability that the
document is what the user needs. Variants of probabilistic
models include Bayesian networks, inference network models,
belief network models, etc. Please refer to [8] for an overview of
them.
Okapi’s BM25 formula [9][11] has been shown to be one of the most
effective and robust ranking formulas in this category (and even
among all formulas proposed so far). The term weighting function
of its commonly used simplification [9][11][16] is,
$$w(D,Q,t) = \frac{(k_1 + 1) \cdot c(t,D)}{k_1 \cdot \left((1-b) + b\,\frac{|D|}{avdl}\right) + c(t,D)} \cdot w(t) \tag{2.3}$$
where k1 and b are parameters. The original representation of w(t)
has a potential “negative IDF” issue, as discussed in
[5]. As in [5], we will use ln((N + 1) / df(t)) as the expression of
w(t) in the rest of the paper.
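
A minimal sketch of the simplified BM25 term weight of Equation (2.3), using ln((N + 1) / df(t)) for w(t) as stated above. The function and argument names are hypothetical, and the defaults k1 = 1.2 and b = 0.75 are assumed here only as commonly used settings.

```python
import math


def bm25_weight(tf: float, doc_len: float, avdl: float,
                n_docs: int, df: int,
                k1: float = 1.2, b: float = 0.75) -> float:
    """Simplified BM25 weight for one query term, as in Eq. (2.3).

    tf is c(t, D); w(t) is taken to be ln((N + 1) / df(t)).
    """
    if tf <= 0:
        return 0.0  # a term absent from the document contributes nothing
    length_norm = k1 * ((1.0 - b) + b * doc_len / avdl)
    idf_part = math.log((n_docs + 1) / df)
    return (k1 + 1.0) * tf / (length_norm + tf) * idf_part


# For an average-length document with one occurrence, the term-frequency
# factor equals 1, so the weight is just w(t).
w_one = bm25_weight(tf=1, doc_len=100, avdl=100, n_docs=99, df=10)
# The term-frequency factor saturates: it never reaches (k1 + 1).
w_many = bm25_weight(tf=1000, doc_len=100, avdl=100, n_docs=99, df=10)
```

The saturation of the term-frequency factor is what distinguishes this function from the logarithmic damping used in Equation (2.2).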
The above formula was first discovered by Robertson et al.
[9][10][11], inspired by the shape of a complex formula derived
from a probabilistic model under the 2-Poisson assumption.
Amati and van Rijsbergen propose in [1] a probabilistic framework
for generating nonparametric term weighting functions. They claim
that the BM25 function with some special parameters (k1=1.2,
b=0.75; or k1=2, b=0.75) can be approximated numerically by one
of their generated functions I(n)L2 (with k1=1 and 2 respectively).
Following this work, we try to derive BM25 from
our model and give it a physical explanation.
2.2.3 Language model
The language model [12][13][7][24] also adopts a probabilistic
framework. However, unlike traditional probabilistic
models, it interprets the relevance between a document and a
query as the probability of generating the query from the
document’s model. Smoothing, which adjusts term probabilities
to overcome data sparseness, is critical to the performance of
language models. Among the various smoothing methods,
Dirichlet prior smoothing is one of the most frequently discussed,
$$w(D,Q,t) = \ln\left(\lambda \cdot \frac{c(t,D)}{|D|} + (1-\lambda) \cdot P_{MLE}(t \mid C)\right) \tag{2.4}$$
where λ = |D| / (|D| + µ), and P_MLE(t|C) is the maximum
likelihood estimate of the probability of term t in collection C.
µ is a parameter whose value is commonly set to a
multiple of the average document length.
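
The Dirichlet-smoothed weight of Equation (2.4) can be sketched as follows. The function and argument names are hypothetical, and the sample values (including µ equal to the document length) are chosen only to keep the arithmetic easy to follow.

```python
import math


def dirichlet_weight(tf: float, doc_len: float,
                     p_coll: float, mu: float) -> float:
    """Dirichlet prior smoothed log-probability of one term, Eq. (2.4).

    tf is c(t, D), p_coll is P_MLE(t | C); requires doc_len > 0 and a
    positive smoothed probability.
    """
    lam = doc_len / (doc_len + mu)
    p_doc = tf / doc_len  # maximum likelihood estimate within the document
    return math.log(lam * p_doc + (1.0 - lam) * p_coll)


# The interpolation is algebraically the same as the familiar form
# (c(t, D) + mu * P(t | C)) / (|D| + mu).
w_seen = dirichlet_weight(tf=3, doc_len=100, p_coll=0.01, mu=100)
# A term that does not occur in the document still gets a finite score.
w_unseen = dirichlet_weight(tf=0, doc_len=100, p_coll=0.01, mu=100)
```

Because λ grows with |D|, longer documents rely more on their own term statistics and less on the collection model.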
2.3 Structured Document Retrieval
As we will discuss in Section 3.3, our model supports
structured document retrieval naturally and effectively, so
structured document retrieval is another body of work related to
ours. A document is said to be structured when it contains
multiple fields. In practice, a document’s field structure is
commonly used to improve retrieval performance.
The most commonly used approach for structured document
retrieval may be score/rank (linear) combination [15][18][19][20],
which treats each field as a separate document and computes
scores/ranks for them. In computing scores for each field, any