Can phrase indexing help to process non-phrase queries?
- ISBN: 9781595939913
- DOI: 10.1145/1458082.1458174
Abstract
Modern web search engines, while indexing billions of web pages, are expected to process queries and return results in a very short time. Many approaches have been proposed for efficiently computing top-k query results, but most of them ignore one key factor in the ranking functions of commercial search engines: term-proximity, a measure of the distance between query terms in a document. When term-proximity is included in ranking functions, most existing top-k algorithms become inefficient. To address this problem, we propose building a compact phrase index to speed up the search process when the term-proximity factor is incorporated. The compact phrase index helps estimate the score upper bounds of unknown documents more accurately. The size of the phrase index is controlled by including only a small portion of phrases that are likely to improve search performance. Phrase indexes have been used to process phrase queries in existing work. To the best of our knowledge, however, this is the first time that a phrase index has been used to improve the performance of generic queries. Experimental results show that, compared with state-of-the-art top-k computation approaches, our approach can reduce average query processing time to 1/5 for typical settings.
Mingjie Zhu1* Shuming Shi2 Nenghai Yu1 Ji-Rong Wen2
University of Science and Technology of China1
Microsoft Research Asia2
mjzhu@ustc.edu1 ynh@ustc.edu.cn1 {shumings,jrwen}@microsoft.com2
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Search Process;
H.3.4 [Systems and Software]: Performance evaluation
(efficiency and effectiveness)
General Terms
Algorithms, Performance, Experimentation
*This work was done when the author was an intern at
Microsoft Research Asia.
CIKM’08, October 26–30, 2008, Napa Valley, California, USA.
Copyright 2008 ACM 978-1-59593-991-3/08/10 ...$5.00.
Keywords
Top-k, Dynamic index pruning, Term proximity, Phrase index,
Compact phrase indexing
1. INTRODUCTION
Modern commercial web search engines are expected to
process queries very efficiently, typically thousands of
queries per second. This is a challenging task considering
that they have grown to index billions of pages. To improve
search efficiency, various efficient query processing
approaches [1, 3, 4, 8, 9, 10, 11, 13, 17] have been proposed
and studied. Given that only the top few (e.g., top-10)
results, rather than all relevant documents, are often
required in IR and web search, these strategies speed up the
search process by carefully reordering documents in the
inverted index and skipping the relevance computation of
some documents at retrieval time.
Most existing efficient top-k approaches assume that the
following type of ranking function is used to evaluate
documents:
$$F(D, Q) = \alpha \cdot G(D) + \beta \cdot \sum_{t \in Q} \omega_t \cdot T(D, t) \tag{1}$$
where F(D,Q) represents the overall relevance score of
document D to query Q, G(D) is the (query-independent) static
rank (e.g., PageRank [6]), and T(D,t) is the term-score of D
with respect to query term t, computed via a term-weighting
function (e.g., BM25 [16]). The parameters $\alpha$, $\beta$,
and $\omega_t$ satisfy $\alpha + \beta = 1$ and
$\sum_{t \in Q} \omega_t = 1$.
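As a minimal sketch of Formula 1, the score can be computed as a weighted sum of the static rank and the weighted term scores. The function name, argument names, and all numeric values below are illustrative assumptions, not taken from the paper.

```python
def relevance_score(static_rank, term_scores, term_weights, alpha, beta):
    """Formula 1: F(D,Q) = alpha * G(D) + beta * sum over t of w_t * T(D,t).

    static_rank  -- G(D), the query-independent score of the document
    term_scores  -- dict mapping each query term t to T(D,t)
    term_weights -- dict mapping each query term t to its weight w_t
    """
    # Enforce the two constraints stated with the formula.
    if abs(alpha + beta - 1.0) > 1e-9:
        raise ValueError("alpha + beta must equal 1")
    if abs(sum(term_weights.values()) - 1.0) > 1e-9:
        raise ValueError("term weights must sum to 1")
    term_part = sum(w * term_scores.get(t, 0.0) for t, w in term_weights.items())
    return alpha * static_rank + beta * term_part


# Hypothetical values, chosen only to exercise the function.
score = relevance_score(
    static_rank=0.99,
    term_scores={"knowledge": 0.6, "management": 0.4},
    term_weights={"knowledge": 0.5, "management": 0.5},
    alpha=0.3,
    beta=0.7,
)
```

Note that nothing in this function depends on where the query terms occur in the document, which is the limitation the following paragraphs discuss.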
Ranking functions of the above form come from standard IR
models and have been used in real information retrieval
problems. It is therefore meaningful for efficient top-k
approaches to be proposed and studied based on them in
traditional IR. Large scale web search engines, however,
typically adopt much more complex ranking functions that
contain important factors/features beyond those included in
Formula 1. Among them, term-proximity is critical for the
search quality of large scale web search engines.
Term-proximity measures how close together query terms
appear in a document. Intuitively, a document in which the
query terms appear near one another should be more relevant
to the query than a document in which they are far apart, if
other factors are the same for the two documents. Take the
query {knowledge management} as an example. Clearly, quite a
lot of pages on the web containing both "knowledge" and
"management" are actually irrelevant to knowledge management.
However, if "knowledge" is followed by the term "management",
the document will very probably be relevant.
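To make the intuition concrete, the sketch below measures proximity as the smallest gap between any occurrence of two terms. This particular measure, the function name, and the two sample texts are assumptions made for illustration; the text above does not specify the proximity function used in the ranking.

```python
def min_term_distance(tokens, term_a, term_b):
    """Smallest absolute position gap between term_a and term_b, or None
    if either term is missing from the token list."""
    pos_a = [i for i, tok in enumerate(tokens) if tok == term_a]
    pos_b = [i for i, tok in enumerate(tokens) if tok == term_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)


# Two hypothetical documents that both contain the two query terms.
near = "a survey of knowledge management systems".split()
far = "knowledge of local customs helps hotel management".split()

near_gap = min_term_distance(near, "knowledge", "management")
far_gap = min_term_distance(far, "knowledge", "management")
```

Both documents match the two query terms, but the first has a gap of 1 and the second a gap of 6, so a proximity-aware ranking function would favor the first.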
When the term-proximity factor is considered in ranking
functions, approaches optimized for Formula 1 may no longer
be efficient. Specifically, approaches based on
frequency-ordering [1] or impact-ordering [4] become
inefficient, because high term scores do not imply high
term-proximity scores. According to [20], mainstream top-k
strategies perform even worse than the baseline (in which all
documents are evaluated and no top-k processing strategy is
adopted) when the weight of term proximity scores is large
enough.
Including term-proximity in ranking functions does not mean
treating queries as phrase queries. A phrase query is a
multi-term query that only matches documents containing the
query terms as a phrase (i.e., the query terms appear
consecutively in the document). End users typically submit a
phrase query to web search engines by surrounding the query
terms with double quotation marks or connecting them with
hyphens, e.g., {"knowledge management"} or
{knowledge-management}. A document is not considered relevant
to the phrase query {"knowledge management"} if it only
contains the text snippet "the management of knowledge" but
not "knowledge management". However, the same document is
considered relevant to the non-phrase query {knowledge
management}.
Phrase queries are a special type of query that can be
processed efficiently via phrase indexing [5, 18]. In phrase
indexing, inverted lists are built for phrases rather than
single terms. The inverted list for a phrase includes the
information of all documents containing the phrase. Clearly,
to process a phrase query, only the inverted list for the
phrase needs to be scanned, which should be more efficient
than scanning the inverted lists of all the terms contained
in the query.
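The difference between the two kinds of index can be sketched as follows. This is a simplified illustration under stated assumptions: phrases are limited to two adjacent words, postings are plain sets of doc-ids with no positions or scores, and the function name and sample documents are hypothetical.

```python
from collections import defaultdict


def build_indexes(docs):
    """Build a single-term index and a two-word phrase index.

    docs maps doc_id -> text. Each index maps a key to the set of
    doc-ids containing it.
    """
    term_index = defaultdict(set)
    phrase_index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for tok in tokens:
            term_index[tok].add(doc_id)
        # Every pair of adjacent tokens is treated as a phrase.
        for first, second in zip(tokens, tokens[1:]):
            phrase_index[(first, second)].add(doc_id)
    return term_index, phrase_index


docs = {
    1: "knowledge management in practice",
    2: "the management of knowledge",
    3: "project management tools",
}
term_index, phrase_index = build_indexes(docs)

# Phrase query: a single inverted list is read.
phrase_hits = phrase_index[("knowledge", "management")]
# Using only the term index, two lists must be read and combined.
term_hits = term_index["knowledge"] & term_index["management"]
```

Here `phrase_hits` contains only document 1, while `term_hits` contains documents 1 and 2: document 2 has both terms but not as a consecutive phrase, so a term-only index would additionally need position information to answer the phrase query.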
Given that 1) most existing top-k approaches fail to
efficiently answer generic queries with term-proximity
support, and 2) phrase indexing is efficient in processing
phrase queries, one natural question is: Can phrase indexing
be utilized for speeding up generic (non-phrase) queries?
After all, most query strings input by end users are not
surrounded by double quotation marks or connected by hyphens
(thus not strict phrase queries).
It is, however, not easy to process generic (non-phrase)
queries via a phrase index. If only the phrase index is
scanned to answer a generic query, good documents may be
omitted, because a document need not contain the whole phrase
to be relevant to a generic query.
In this paper, we explore how to efficiently process generic
(non-phrase) queries with the aid of phrase indexing. We
build a compact phrase index, in addition to the standard
inverted index for single terms. Based on the compact phrase
index, we propose a retrieval strategy which, in processing
a query, scans both the phrase index and the inverted lists
for all its query terms. The strategy speeds up the search
process by making a more accurate estimation of the score
upper bounds of unknown documents. The size of the phrase
index is controlled by including a small portion of phrases
which are possibly helpful for improving search performance.
Experimental results show that, for typical settings, our
approach reduces average query processing time to 1/5
compared with existing mainstream efficient top-k approaches
in the literature. Thus the answer to the above question is
YES. To the best of our knowledge, this is the first time
that a phrase index has been used to improve the performance
of generic queries.
knowledge:   (D356, 0.99)  (D27, 0.97)  (D5, 0.95)   (D413, 0.93)
management:  (D356, 0.99)  (D9, 0.98)   (D27, 0.97)  (D984, 0.94)
Each entry is a hit (or posting).
Figure 1: Inverted lists for terms "knowledge" and
"management" (documents are ordered by static rank).
The rest of this paper is organized as follows. Section 2
gives background information, including a brief introduction
to the inverted index and to existing efficient query
processing approaches. Related work is discussed in
Section 3. Section 4 analyzes the difficulties of top-k
processing when term proximity is included in ranking
functions. Our approach is described in detail in Section 5.
Section 6 presents the experimental setup and results.
Finally, Section 7 gives conclusions and future work.
2. BACKGROUND
2.1 Inverted Index
Indexing and ranking are two key components of a web search
engine. To support efficient retrieval, a web collection
needs to be indexed offline. Given a query, a ranking
function is used to compute a relevance score for each
document based on the information stored in the index. The
documents are then sorted by their scores and the top k
documents with the highest scores are returned to end users.
One primary way of indexing large scale web documents is the
inverted index, which consists of many inverted lists,
corresponding to different terms. For a term t, its inverted
list includes the information (doc-ids, occurring positions,
etc.) of all documents containing the term. Zobel and Moffat
[21] give a comprehensive introduction to inverted indexes
for text search engines. Figure 1 shows the inverted lists of
the terms "knowledge" and "management" respectively. In the
figure, "D356, 0.99" represents a document with doc-id 356
and static-rank value 0.99.
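The two inverted lists of Figure 1 can be written down as a small data structure. The sketch below uses the doc-ids and static-rank values from the figure; the `Posting` type, the helper names, and the merge-based intersection are illustrative assumptions, and the occurrence positions that a real inverted list would store are omitted.

```python
from typing import NamedTuple


class Posting(NamedTuple):
    doc_id: int
    static_rank: float


# Inverted lists from Figure 1, ordered by descending static rank.
inverted_index = {
    "knowledge": [Posting(356, 0.99), Posting(27, 0.97),
                  Posting(5, 0.95), Posting(413, 0.93)],
    "management": [Posting(356, 0.99), Posting(9, 0.98),
                   Posting(27, 0.97), Posting(984, 0.94)],
}


def order_key(posting):
    # Both lists share this order: higher static rank first, doc-id as tie-break.
    return (-posting.static_rank, posting.doc_id)


def intersect(list_a, list_b):
    """Merge two lists sorted by order_key and keep documents found in both."""
    i = j = 0
    common = []
    while i < len(list_a) and j < len(list_b):
        key_a, key_b = order_key(list_a[i]), order_key(list_b[j])
        if key_a == key_b:
            common.append(list_a[i])
            i += 1
            j += 1
        elif key_a < key_b:
            i += 1
        else:
            j += 1
    return common


both = intersect(inverted_index["knowledge"], inverted_index["management"])
```

Because both lists follow the same static-rank order, documents containing both terms (here D356 and D27) are found in a single forward pass, with the highest static-rank documents encountered first.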
2.2 Efficient Top-K without Term Proximity
Various dynamic index pruning techniques have been proposed
for efficient top-k computation. They aim to correctly
identify the top k results without completely scanning
inverted lists and/or computing the relevance scores of all
documents. One common idea shared among most existing
efficient query processing approaches is score upper-bound
estimation and early stopping. While processing a query, the
maximal possible score of all unseen (or un-evaluated)
documents is estimated. When the maximal possible score is
not greater than the score of the kth document in current


