
Web object retrieval

by Z Nie, Y Ma, S Shi, J R Wen, W Y Ma
Proceedings of the 16th international conference on World Wide Web WWW 07 (2007)

Abstract

The primary function of current Web search engines is essentially relevance ranking at the document level. However, myriad structured information about real-world objects is embedded in static Web pages and online Web databases. Document-level information retrieval can unfortunately lead to highly inaccurate relevance ranking in answering object-oriented queries. In this paper, we propose a paradigm shift to enable searching at the object level. In traditional information retrieval models, documents are taken as the retrieval units and the content of a document is considered reliable. However, this reliability assumption is no longer valid in the object retrieval context, where multiple copies of information about the same object typically exist. These copies may be inconsistent because of the varying quality of Web sites and the limited performance of current information extraction techniques. If we simply combine the noisy and inaccurate attribute information extracted from different sources, we may not be able to achieve satisfactory retrieval performance. In this paper, we propose several language models for Web object retrieval, namely an unstructured object retrieval model, a structured object retrieval model, and a hybrid model with both structured and unstructured retrieval features. We test these models on a paper search engine and compare their performance. We conclude that the hybrid model is superior because it takes into account the extraction errors at varying levels.


Zaiqing Nie, Yunxiao Ma, Shuming Shi, Ji-Rong Wen, Wei-Ying Ma
Microsoft Research Asia, Beijing, China
{znie, yunxiaom, shumings, jrwen, wyma}@microsoft.com

Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Search and Retrieval –
Retrieval Models
General Terms
Algorithms, Experimentation
Keywords
Web Objects, Information Retrieval, Language Model,
Information Extraction
1. INTRODUCTION
The primary function of current Web search engines is essentially
relevance ranking at the document level, a paradigm in
information retrieval for more than 25 years [1]. However, there
are various kinds of objects embedded in static Web pages or Web
databases. Typical objects are people, products, papers,
organizations, etc. We can imagine that if these objects can be
extracted and integrated from the Web, powerful object-level
search engines can be built to meet users' information needs more
precisely, especially for some specific domains [26]. For example,
in our Windows Live Product Search project
(http://products.live.com), we automatically extract a large set of
product objects from Web data sources [38]. When users search for
a specific product, they get a list of relevant product
objects with clear information such as name, image, price, and
features. We have been developing another object-level vertical
search system called Libra Academic Search (http://libra.msra.cn) to
help researchers and students locate information about scientific
papers, authors, conferences, and journals. With the concept of
Web objects, the search results of Libra can be a list of papers
with explicit title, author, and conference proceedings. Such
results are more appealing than a list of URLs, which
users must spend significant effort deciphering to find the information they need.
We believe object-level Web search is particularly necessary in
building vertical Web search engines such as product search,
people search, scientific Web search, job search, community
search, and so on. Such a perspective has led to significant
research community interest, while related technologies such as
data record extraction [21][32][22], attribute value extraction [37],
and object identification on the Web [31] have been developed in
recent years. These techniques have made it possible for us to
extract and integrate all related Web information about the same
object together as an information unit. We call these Web
information units Web objects. Currently, little work has been
done in retrieving and ranking relevant Web objects to answer
user queries.
In this paper, we focus on exploring suitable models for retrieving
Web objects. There are two direct categories of candidate models
for object retrieval. The first is comprised of the traditional
document retrieval models, in which all contents in an object are
merged and treated as a text document. The other is made up of
structured document retrieval models, where an object can be
viewed as a structured document and the object attributes as
different document representations, with relevance calculated by
combining scores of different representations. We argue that
simply applying either of these two categories of models to Web
object retrieval does not achieve satisfactory ranking results. In
traditional IR models, documents are taken as the retrieval units
and the content of a document is considered reliable. However,
the reliability assumption is no longer valid in the object retrieval
context. Errors can be introduced into object contents in several
ways during the process of object extraction:
• Source-level error: Since the quality of Web sources can
vary significantly, some information about an object in some
sources may be simply wrong.
• Record-level error: Due to the huge number of Web sources,
automatic approaches are commonly used to locate and
extract the data records from Web pages or Web databases
[22]. It is inevitable that the record extraction (i.e. detection)
process will introduce additional errors. The extracted
records may miss some key information or include some
irrelevant information, or both.

Copyright is held by the International World Wide Web Conference
Committee (IW3C2). Distribution of these papers is limited to classroom
use, and personal use by others.
WWW 2007, May 8–12, 2007, Banff, Alberta, Canada.
ACM 978-1-59593-654-7/07/0005.

• Attribute-level error: Even if the Web source is reliable and
the object contents are correctly detected, the description of
an object (i.e. object element labeling) may still be wrong
because of incorrect attribute value extraction. For example,
it is very common to label a product name as a brand, or vice
versa. In Citeseer, we also often find that author names are
concatenated to paper titles, or some author names are
missing.
Although [38] proposed a model that combines the record and
attribute extraction processes, it can still introduce both record-level
and attribute-level errors, as other techniques do. In this
paper, we focus on this unreliability problem in Web object
retrieval. Our approach rests on two principles. First, as
described above, errors can be introduced at both the record level
and the attribute level. Moreover, because errors propagate along
the extraction process, the accuracy of attribute extraction is surely
lower than that of record extraction. However, separating record
contents into multiple attributes provides more information than
treating all contents in a record as a single unit. Therefore, it is
desirable to combine both record-level representation and
attribute-level representation. We hope that, by combining
representations at multiple levels, our method will be insensitive to
extraction accuracy. Second, multiple copies of information about
the same object usually exist. These copies may be inconsistent
because of diverse Web site qualities and the limited performance
of current information extraction techniques. If we simply
combine the noisy and inaccurate object information extracted
from different sources, we will not be able to achieve satisfactory
ranking results. Therefore, we need to distinguish the quality of
the records and attributes from different sources and trust data of
high reliability more and data of low reliability less. We hope that
even when data from some sites have low reliability, we can still
get good retrieval performance if some copies of the objects have
higher reliability. In other words, our method should also take
advantage of multiple copies of one object to achieve stable
performance despite varying qualities of the copies.
Based on the above arguments, our goal is to design retrieval
models that are insensitive to data errors and that achieve stable
performance for data with varying extraction accuracies.
Specifically, we propose several language models for Web object
retrieval, namely an unstructured object retrieval model, a
structured object retrieval model, and a hybrid model with both
structured and unstructured retrieval features. We test these
models on a paper search engine and compare their performance.
We conclude that the best model is the one that combines both
object-level and attribute-level evidence and takes into account the
errors at different levels.
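
To make the three model families concrete, the following Python sketch scores one object against a query in each style. It is an illustration under stated assumptions, not the formulation developed in Section 3: the `Record` structure, the `record_accuracy` and `attribute_accuracy` fields, the attribute weights, the Jelinek-Mercer smoothing, and the way the hybrid score mixes record-level and attribute-level estimates are all hypothetical choices made for this example.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Record:
    """One extracted copy of an object (hypothetical structure)."""
    attributes: dict           # attribute name -> extracted text
    record_accuracy: float     # assumed reliability of record detection, in (0, 1]
    attribute_accuracy: float  # assumed reliability of attribute labeling, in [0, 1]


def bag_of(texts):
    """Build a term-frequency bag from an iterable of strings."""
    return Counter(term for text in texts for term in text.lower().split())


def smoothed(term, bag, background, lam=0.2):
    """Jelinek-Mercer smoothing: mix the local estimate with a background model."""
    local = bag[term] / sum(bag.values()) if bag else 0.0
    return (1 - lam) * local + lam * background.get(term, 1e-6)


def unstructured_score(query, records, background):
    """Merge every attribute of every record into one bag of words."""
    bag = bag_of(v for r in records for v in r.attributes.values())
    return sum(math.log(smoothed(t, bag, background)) for t in query.lower().split())


def structured_score(query, records, background, attr_weights):
    """Combine per-attribute language models with fixed attribute weights."""
    score = 0.0
    for t in query.lower().split():
        p = sum(
            w * smoothed(t, bag_of(r.attributes.get(a, "") for r in records), background)
            for a, w in attr_weights.items()
        )
        score += math.log(p)
    return score


def hybrid_score(query, records, background, attr_weights):
    """Trust attribute-level evidence only as far as each record's labeling is reliable."""
    total = sum(r.record_accuracy for r in records)  # assumed to be positive
    score = 0.0
    for t in query.lower().split():
        p = 0.0
        for r in records:
            alpha = r.record_accuracy / total  # share of trust given to this copy
            gamma = r.attribute_accuracy       # trust in this copy's attribute labels
            p_record = smoothed(t, bag_of(r.attributes.values()), background)
            p_attrs = sum(
                w * smoothed(t, bag_of([r.attributes.get(a, "")]), background)
                for a, w in attr_weights.items()
            )
            p += alpha * (gamma * p_attrs + (1 - gamma) * p_record)
        score += math.log(p)
    return score
```

All three functions return a log-likelihood, so higher values indicate a better match. In this sketch the attribute weights are expected to sum to one, and a record with low `attribute_accuracy` falls back toward its unstructured, record-level estimate, which mirrors the intuition of trusting high-reliability data more and low-reliability data less.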
The rest of the paper is organized as follows. In Section 2, we define the
Web object retrieval problem. In Section 3, we
introduce the models for Web object retrieval. In Section 4, we use
a scientific Web search engine to further motivate the need for
object-level Web search and to discuss its advantages and challenges
relative to existing search engines. After that, we report our experimental
results in Section 5. Finally, we discuss related work in Section 6.
Section 7 states our conclusions.
2. BACKGROUND AND PROBLEM DEFINITION
In this section, we first introduce the concept of Web objects and
object extraction. We then define the Web object retrieval problem.
2.1 Web Objects and Object Extraction
We define the concept of Web Objects as the principal data units
about which Web information is to be collected, indexed, and
ranked. Web objects are usually recognizable concepts, such as
authors, papers, conferences, or journals that have relevance to the
application domain. A Web object is generally represented by a set
of attributes $A = \{a_1, a_2, \ldots, a_m\}$. The attribute set for a specific
object type is predefined based on the requirements in the domain.
If we start to think of a user information need or a topic to search
on the Web as a form of Web Object, the search engine will need
to address at least the following technical issues in order to
provide intelligent search results to the user:
• Object-level Information Extraction – A Web object is
constructed by collecting related data records extracted from
multiple Web sources. The sources holding object
information could be HTML pages, documents put on the
Web (e.g. PDF, PS, Word, and other formats), and deep
contents hidden in Web databases. Figure 1 illustrates six
data records embedded in a Web page and six attributes from
a record. There is already extensive research exploring
algorithms for extracting objects from Web sources (more
discussion about the diversity of sources is to come).
• Object Identification and Integration – Each extracted
instance of a Web object needs to be mapped to a real world
object and stored into the Web data warehouse. To do so, we
need techniques to integrate information about the same
object and disambiguate different objects.
• Web object retrieval – After information extraction and
integration, we should provide a retrieval mechanism to satisfy
users’ information needs. Basically, the retrieval should be
conducted at the object level, which means that the extracted
objects should be indexed and ranked against user queries.
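
The three issues listed above can be pictured as a small pipeline. The Python sketch below is a toy illustration only: the attribute set `PAPER_ATTRIBUTES`, the `ExtractedRecord` and `WebObject` classes, and the use of a normalized title as the identity key are hypothetical stand-ins, and real object identification relies on far more robust techniques than exact title matching.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical attribute set A for the "paper" object type.
PAPER_ATTRIBUTES = ("title", "author", "venue", "year")


@dataclass
class ExtractedRecord:
    """One data record extracted from one Web source."""
    source: str
    attributes: dict  # attribute name -> extracted text


@dataclass
class WebObject:
    """All records believed to describe the same real-world object."""
    object_id: str
    records: list = field(default_factory=list)


def identity_key(record):
    """Naive stand-in for object identification: the whitespace-normalized, lowercased title."""
    return " ".join(record.attributes.get("title", "").lower().split())


def integrate(records):
    """Group extracted records into Web objects by their identity key."""
    objects = {}
    for rec in records:
        key = identity_key(rec)
        objects.setdefault(key, WebObject(object_id=key)).records.append(rec)
    return objects


def build_index(objects):
    """Inverted index from term to object ids, so retrieval works at the object level."""
    index = defaultdict(set)
    for obj in objects.values():
        for rec in obj.records:
            for name in PAPER_ATTRIBUTES:
                for term in rec.attributes.get(name, "").lower().split():
                    index[term].add(obj.object_id)
    return index
```

Because the index maps terms to object identifiers rather than to pages, a query term that appears in only one copy of an object still retrieves the integrated object with all of its records.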



Figure 1. Six Data Records in a Web Page and Six Attributes
from a Record
