Effective top-k computation in retrieving structured documents with term-proximity support
Science And Technology (2007)
- ISBN: 9781595938039
- DOI: 10.1145/1321440.1321547
Effective Top-K Computation in Retrieving Structured
Documents with Term-Proximity Support
Mingjie Zhu (2)*, Shuming Shi (1), Mingjing Li (1), Ji-Rong Wen (1)
(1) Microsoft Research Asia, Beijing, China
(2) University of Science and Technology of China, Hefei, Anhui, China
{shumings, mjli, jrwen}@microsoft.com (1), mjzhu@ustc.edu (2)
ABSTRACT
Modern web search engines are expected to return top-k results
efficiently given a query. Although many dynamic index pruning
strategies have been proposed for efficient top-k computation,
most of them ignore some especially important factors in ranking
functions, e.g. term proximity (the distance relationship between
query terms in a document). The inclusion of term proximity
breaks the monotonicity of ranking functions and therefore leads
to additional challenges for efficient query processing. This
paper studies the performance of some existing top-k computation
approaches using term-proximity-enabled ranking functions. Our
investigation demonstrates that, when term proximity is
incorporated into ranking functions, most existing index
structures and top-k strategies become quite inefficient. Based
on our analysis and experimental results, we propose two index
structures and their corresponding index pruning strategies,
Structured and Hybrid, which perform much better in the new
setting. Moreover, the two approaches do not significantly affect
the efficiency of index building and maintenance.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Search process;
H.3.4 [Systems and Software]: Performance evaluation
(efficiency and effectiveness)
General Terms
Algorithms, Performance, Experimentation
Keywords
Top-k, Dynamic index pruning, Term proximity, Document
structure, Hybrid index structure
1. INTRODUCTION
Major commercial web search engines [12, 17, 27] have grown to
index billions of pages. In spite of such a large amount of data,
end users expect search results to be retrieved quickly and with
high accuracy. In web search, since only the top few results
rather than all relevant documents are usually required, it is
possible to speed up the search process by skipping the relevance
computation of some documents. Various dynamic index pruning
mechanisms [1, 2, 15] have been proposed for this purpose.
Most papers studying efficient top-k computation assume ranking
functions of the following kind:
$$F(D,Q) = \alpha \cdot G(D) + \beta \cdot T(D,Q), \qquad T(D,Q) = \sum_{t \in Q} \omega_t \cdot T(D,t) \qquad (1.1)$$
where F(D,Q) represents the overall relevance score of document
D to query Q, G(D) is the (query-independent) static rank (e.g.
PageRank [4]) of the document, T(D,Q) is the overall term score
of D to Q, and T(D,t) is the term score of D and query term t,
computed via a term-weighting function (e.g. BM25 [22]). The
weights $\alpha$, $\beta$, and $\omega_t$ are non-negative
parameters.
The above formula is monotonic with respect to document static
rank G(D) and term score T(D,t). This property is utilized by
existing approaches to estimate the score upper bound of unseen
documents, and therefore to skip the score computation of some
lower-score documents (refer to section 2.3).
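The role of monotonicity in pruning can be illustrated with a minimal sketch. It assumes a linear combination of static rank and summed term scores; the weight values `alpha` and `beta`, the equal per-term weighting, and all function names are hypothetical choices made for illustration, not values taken from the paper.

```python
from typing import Dict, Iterable


def overall_score(static_rank: float, term_scores: Dict[str, float],
                  alpha: float = 0.3, beta: float = 0.7) -> float:
    """Linear combination of static rank and summed term scores.

    alpha and beta are illustrative weights (assumed, non-negative).
    """
    return alpha * static_rank + beta * sum(term_scores.values())


def unseen_upper_bound(max_static_rank: float,
                       max_term_scores: Iterable[float],
                       alpha: float = 0.3, beta: float = 0.7) -> float:
    """Largest score any not-yet-scored document could reach.

    Because the score never decreases when G(D) or any T(D,t) grows,
    plugging in the largest remaining value of each component gives
    a valid upper bound.
    """
    return alpha * max_static_rank + beta * sum(max_term_scores)


def can_stop(kth_best_score: float, bound: float) -> bool:
    """Scanning may stop once the current k-th best score reaches the bound."""
    return kth_best_score >= bound
```

With a non-monotonic function, `unseen_upper_bound` would no longer be guaranteed to dominate the true score of an unseen document, which is why such bounds need to be revisited once term proximity is added.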
In addition to the factors in Formula 1.1, the search quality of
modern search engines depends heavily on other evidence:
document structure, anchor text, and term proximity. Document
structure means that a web page often comprises multiple fields
(title, URL, body text, etc.). Appropriate use of document field
structure can improve search results, because terms appearing in
special fields such as the title or URL are generally more
important. Anchor text is a piece of clickable text that links
to a target web page. As anchor text aggregates the (relatively
objective) opinion of a potentially large number of other pages,
it acts as an especially important field of a web page. Finally,
term proximity measures how close together query terms appear in
a document. Intuitively, a document with high term proximity
values (i.e. query terms are near one another in the document)
should be more relevant to the query than a document in which
query terms are far from one another, if other factors are the
same for the two documents. Take the query “knowledge management”
as an example. Many pages on the web that contain both
“knowledge” and “management” are actually irrelevant to
knowledge management. However, if the term “knowledge” appears in
a document followed by the term “management”, the document is
very likely related to knowledge management. Term proximity is
more important in web search than in traditional information
retrieval systems, because there are usually millions of relevant
results for a query.
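As a rough illustration of the "knowledge management" example, the sketch below measures proximity as the smallest distance between any occurrence of one term and any occurrence of the other. This particular measure and the helper names `positions` and `min_pair_distance` are assumptions made for illustration; the paper's own proximity function is defined later (subsection 3.1) and may differ.

```python
from typing import List, Optional


def positions(tokens: List[str], term: str) -> List[int]:
    """Return the (ascending) token offsets at which term occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def min_pair_distance(pos_a: List[int], pos_b: List[int]) -> Optional[int]:
    """Smallest |i - j| over occurrences of two terms.

    Both lists must be sorted ascending. Returns None when either
    term is absent from the document.
    """
    if not pos_a or not pos_b:
        return None
    i = j = 0
    best = abs(pos_a[0] - pos_b[0])
    while i < len(pos_a) and j < len(pos_b):
        best = min(best, abs(pos_a[i] - pos_b[j]))
        # Advance the pointer that is behind; this can only shrink the gap.
        if pos_a[i] < pos_b[j]:
            i += 1
        else:
            j += 1
    return best


near = "knowledge management systems store knowledge".split()
far = "management of the library improves public knowledge".split()
near_dist = min_pair_distance(positions(near, "knowledge"),
                              positions(near, "management"))
far_dist = min_pair_distance(positions(far, "knowledge"),
                             positions(far, "management"))
```

Both toy documents contain both query terms, yet the first places them next to each other while the second separates them by several words, which is the distinction a proximity-aware ranking function is meant to reward.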
Because term proximity and document structure are so important
in modern commercial search engines, effective top-k
computation approaches should be studied with these factors
taken into account. Unfortunately, although quite a few index
pruning strategies have been proposed, few, if any, of them
consider term proximity in the ranking function. This motivates
us to study the effectiveness of existing approaches when ranking
functions are extended with document structure and term proximity
* This work was done when the author was an intern at Microsoft Research Asia
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
CIKM’07, November 6-8, 2007, Lisboa, Portugal.
Copyright 2007 ACM 978-1-59593-803-9/07/0011...$5.00.
factors, and to propose new index structures and pruning
strategies.
When term proximity and document structure information are
considered, ranking functions become more complex (see
subsection 3.1). More importantly, term proximity makes a
ranking function no longer monotonic with respect to query term
scores. Since term proximity is determined by the relationship
among all query terms rather than by any single query term, a
document with low term scores may have a high term proximity
score (and therefore a high overall relevance score), and vice
versa. This poses additional challenges for efficient top-k
computation.
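A small numeric sketch can make the loss of monotonicity concrete. The scoring function below, including the reciprocal-distance proximity term and the weights `beta` and `gamma`, is a hypothetical stand-in chosen only for illustration; it is not the ranking function used in the paper.

```python
from typing import Sequence


def score_with_proximity(term_scores: Sequence[float], min_distance: int,
                         beta: float = 0.5, gamma: float = 0.5) -> float:
    """Term-score sum plus a proximity bonus (illustrative form only).

    The bonus 1 / min_distance is an assumed proximity measure: it is
    largest when two query terms are adjacent (min_distance == 1).
    """
    proximity = 1.0 / min_distance
    return beta * sum(term_scores) + gamma * proximity


# Document A beats document B on every individual term score,
# but its query terms are 50 positions apart.
doc_a_terms, doc_a_distance = [0.6, 0.6], 50
# Document B has lower term scores, but the query terms are adjacent.
doc_b_terms, doc_b_distance = [0.5, 0.5], 1

score_a = score_with_proximity(doc_a_terms, doc_a_distance)
score_b = score_with_proximity(doc_b_terms, doc_b_distance)
```

Here document B ends up with the higher overall score even though each of its term scores is lower, so an upper bound computed from term scores alone would not safely exclude it.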
We argue that, partially due to the nonmonotonicity of the new
ranking functions, most existing top-k strategies become quite
inefficient when term proximity information is included. We
then propose and study two index structures and their
corresponding pruning strategies, which efficiently address the
top-k problem by utilizing document structure information.
Moreover, the two approaches do not significantly affect the
efficiency of index building and maintenance.
We make the following contributions in this paper:
1). To the best of our knowledge, this is the first work that
studies and compares the performance of various top-k
computation approaches when ranking functions are extended
to support document structure and term proximity.
2). We propose two top-k computation approaches that perform
much better than existing ones in the new setting. The relative
superiority of the two approaches is also explored (in
subsection 4.2.4).
The rest of this paper is organized as follows. Section 2
introduces preliminary knowledge and existing pruning
techniques. Section 3 studies pruning with document structure
and term proximity support. Section 4 presents the experimental
results of various pruning approaches and settings. Finally,
Section 5 gives conclusions and future work.
2. BACKGROUND AND EXISTING INDEX
PRUNING TECHNIQUES
In this section, we first give some background concepts related to
top-k computation. Then some existing top-k approaches are
studied and categorized.
2.1 Background
Indexing and ranking are two key components of a web search
engine. To support efficient retrieval, a web collection needs to
be indexed offline. Given a query, a ranking function is used to
compute a relevance score for each document based on the
information stored in the inverted index. The documents are then
sorted by their scores and the top k documents with the highest
scores are returned to end users. Although there are often
millions of documents that are somewhat related to the query, end
users only care about the top k (e.g. k=10, 20, 50, 100) results.
One primary way of indexing large-scale web document collections
is the inverted index, which organizes document collection
information into many inverted lists, one per term. For a term t,
its inverted list includes the information (DocIds, occurrence
positions, etc.) of all documents containing the term. Zobel and
Moffat [28] give a comprehensive introduction to inverted indexes
for text search engines.
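The inverted-list layout described above can be sketched as follows. This is a minimal in-memory illustration that assumes whitespace tokenization and lowercasing; the function name `build_inverted_index` and the nested-dictionary representation are choices made here, not structures prescribed by the paper.

```python
from collections import defaultdict
from typing import Dict, List


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Dict[int, List[int]]]:
    """Map each term to {doc_id: [positions]} for all documents containing it.

    Documents are visited in ascending doc_id order, so each inverted
    list is naturally ordered by DocId.
    """
    index: Dict[str, Dict[int, List[int]]] = defaultdict(dict)
    for doc_id in sorted(docs):
        for pos, token in enumerate(docs[doc_id].lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return dict(index)


docs = {
    1: "Knowledge management",
    2: "Management of knowledge",
}
index = build_inverted_index(docs)
```

Storing occurrence positions alongside DocIds is what later makes term proximity computable from the index alone, at the cost of larger inverted lists.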
Various dynamic index pruning techniques have been proposed
for efficient top-k computation. They aim to correctly identify the
top k results without completely scanning inverted lists and/or
computing the relevance scores of all documents.
2.2 Related Efforts
In the database community, Fagin et al. have produced a series
of comprehensive works [9, 10, 11] on efficient top-k
computation. As Fagin [11] states, if multiple inverted lists are
sorted by attribute value and the combination function is
monotonic, then an efficient top-k algorithm exists.
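The idea can be sketched with a simplified threshold-style procedure: scan all lists in parallel in descending score order, fully score each newly seen document, and stop once the k-th best score is at least the sum of the scores at the current scan depth. This is an illustrative sketch under stated assumptions (non-negative scores, summation as the monotonic combination function, a missing entry counting as 0), not a reproduction of the algorithms in [9, 10, 11]; the name `threshold_top_k` is hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def threshold_top_k(lists: List[Dict[int, float]], k: int) -> List[Tuple[float, int]]:
    """Return the k highest (total_score, doc_id) pairs, best first.

    Each element of `lists` maps doc_id -> non-negative score for one
    attribute; the total score is the sum over all attributes.
    """
    if k <= 0:
        return []
    # Sorted access: each list ordered by descending score.
    sorted_lists = [sorted(l.items(), key=lambda e: -e[1]) for l in lists]
    max_depth = max((len(s) for s in sorted_lists), default=0)
    seen = set()
    top: List[Tuple[float, int]] = []  # min-heap of the best k so far
    for depth in range(max_depth):
        threshold = 0.0  # best total any unseen document could still have
        for s in sorted_lists:
            if depth >= len(s):
                continue  # exhausted list contributes 0 to unseen documents
            doc_id, score = s[depth]
            threshold += score
            if doc_id not in seen:
                seen.add(doc_id)
                # Random access: look the document up in every list.
                total = sum(l.get(doc_id, 0.0) for l in lists)
                if len(top) < k:
                    heapq.heappush(top, (total, doc_id))
                elif total > top[0][0]:
                    heapq.heapreplace(top, (total, doc_id))
        if len(top) == k and top[0][0] >= threshold:
            break  # no unseen document can enter the top k
    return sorted(top, reverse=True)
```

The early-exit test is only sound because the sum is monotonic: an unseen document's score in each list is at most the score at the current depth, so its total is at most `threshold`.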
In the information retrieval area, work on index pruning goes
back to the 1980s. Some earlier works are described in [5, 6, 13,
18, 20, 26]. Their main idea is to sort the entries by their
contribution to the score of the document and put the important
entries at the front of the inverted index for early termination.
Much work [1, 14] has attempted to optimize the index structure
in various ways. Persin et al. [20] propose to partition the
inverted list into several parts. After PageRank showed its power
in web search, pruning techniques considering PageRank were
studied in Long [15].
Static index pruning [7, 8, 19] aims at reducing index size by
keeping only the relatively important information of the inverted
index. In contrast, we focus on the dynamic index pruning
problem, which skips the computation of some document scores at
query execution time.
There has also been some work discussing efficient top-k
computation in XML retrieval [16], where documents are
structured. However, we are unaware of any existing work
discussing index pruning with term proximity support.
Approximate and probabilistic [24] top-k computation are also
important to web search. By relaxing result quality requirements,
query processing efficiency has more room to improve. We focus
on exact top-k processing and leave approximate and probabilistic
query processing as future work.
[Figure 1 diagram: panel (a) shows a single segment ordered by DocID; panel (b) shows a single segment ordered by PR or impact; panel (c) shows segments Seg1 (high-impact) through Segm (low-impact), each ordered by PR.]
Figure 1. Existing index structures: (a). All documents in an
inverted list are sorted by DocId; (b). Documents are ordered
by static rank or one kind of impact score; (c). Each inverted
list is divided into some segments by impact value, with
documents in each segment sorted by static rank.
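The three layouts in Figure 1 can be sketched as different orderings of the same postings. The `Posting` fields, the two-segment split, and the `impact_cutoff` parameter are hypothetical simplifications for illustration; the actual structures may use more segments and different impact definitions.

```python
from typing import List, NamedTuple


class Posting(NamedTuple):
    doc_id: int
    static_rank: float  # e.g. a PageRank-like value
    impact: float       # e.g. a term-score contribution


def by_doc_id(postings: List[Posting]) -> List[Posting]:
    """Layout (a): one segment, ascending DocId."""
    return sorted(postings, key=lambda p: p.doc_id)


def by_static_rank(postings: List[Posting]) -> List[Posting]:
    """Layout (b): one segment, descending static rank."""
    return sorted(postings, key=lambda p: -p.static_rank)


def segmented(postings: List[Posting], impact_cutoff: float) -> List[List[Posting]]:
    """Layout (c): high-impact segment first, each segment by static rank."""
    high = [p for p in postings if p.impact >= impact_cutoff]
    low = [p for p in postings if p.impact < impact_cutoff]
    return [by_static_rank(high), by_static_rank(low)]


postings = [
    Posting(3, 0.2, 0.9),
    Posting(1, 0.8, 0.1),
    Posting(2, 0.5, 0.7),
    Posting(4, 0.9, 0.3),
]
```

Layout (a) favors simple list intersection, while (b) and (c) place the documents most likely to reach the top k near the front so that a scan can terminate early.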
2.3 Existing Index Structures and Pruning
Strategies for Efficient Top-K Computation
The organization of the inverted index is the key to index pruning.
Efficient top-k strategies are commonly based on specially
designed index structures. Here we briefly summarize the index
structures utilized in previous work and their corresponding
pruning strategies.
The documents in each inverted list are often naturally sorted by
document IDs which enables straightforward implementation of


