Instance-based schema matching for web databases by domain-specific query probing

by Jiying Wang, Ji-Rong Wen, Fred Lochovsky, Wei-Ying Ma
Proceedings of the Thirtieth International Conference on Very Large Data Bases, Volume 30

Abstract

In a Web database that dynamically provides information in response to user queries, two distinct schemas, interface schema (the schema users can query) and result schema (the schema users can browse), are presented to users. Each partially reflects the actual schema of the Web database. Most previous work only studied the problem of schema matching across query interfaces of Web databases. In this paper, we propose a novel schema model that distinguishes the interface and the result schema of a Web database in a specific domain. In this model, we address two significant Web database schema-matching problems: intra-site and inter-site. The first problem is crucial in automatically extracting data from Web databases, while the second problem plays a significant role in meta-retrieving and integrating data from different Web databases. We also investigate a unified solution to the two problems based on query probing and instance-based schema matching techniques. Using the model, a cross validation technique is also proposed to improve the accuracy of the schema matching. Our experiments on real Web databases demonstrate that the two problems can be solved simultaneously with high precision and recall.

Available from portal.acm.org

Instance-based Schema Matching for Web Databases by Domain-specific Query Probing

Jiying Wang*
Computer Science Department, Hong Kong Univ. of Science and Technology, Hong Kong
cswangjy@cs.ust.hk

Ji-Rong Wen
Information Management & System Group, Microsoft Research Asia, Beijing, China
jrwen@microsoft.com

Fred Lochovsky
Computer Science Department, Hong Kong Univ. of Science and Technology, Hong Kong
fred@cs.ust.hk

Wei-Ying Ma
Information Management & System Group, Microsoft Research Asia, Beijing, China
wyma@microsoft.com

1. Introduction

Besides web pages crawlable by specific URLs, the Web also contains a vast amount of non-crawlable content.
This hidden part of the Web is comprised of a large number of online Web databases consisting of a searchable interface (usually an HTML form) and a backend database, which dynamically provides information in response to user queries [5] [13]. As compared to the static surface Web, the hidden Web contains a much larger amount of high-quality (often structured) information [8].

In the hidden Web, it is usually difficult or even impossible to directly obtain the schemas of the Web databases without cooperation from the web sites. Instead, the web sites present two other distinct schemas, interface and result schema, to users (Figure 1). The interface schema presents the query interface, which exposes attributes that can be queried in the Web database. The result schema presents the query results, which exposes attributes that are shown to users. The interface schema is useful for applications, such as mediators, that query multiple Web databases, since they need complete knowledge about the query interface of each database. The result schema is critical for applications, such as data extraction, which extract instances from the query results.

In addition to the importance of the interface and result schemas themselves, attribute matching¹ across different schemas is also important. First, matching between different interface and result schemas (i.e., inter-site schema matching) is critical for meta-searching and data integration among related Web databases. Second, matching between the interface and result schema of a single Web database (i.e., intra-site schema matching) enables automatic data annotation and database content crawling. Therefore, in this paper we focus on automatically discovering both the interface and result schemas of Web databases and matching semantically related attributes between them.

* This work was carried out when the author was visiting at Microsoft Research Asia.
¹ Attribute matching is the process of determining the semantic correspondences among the attributes of two schemas.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004

Previous approaches [16], [17], [21] to Web database schema matching primarily focus on matching query interfaces (i.e., on inter-site interface schema matching). The basic idea is to identify attribute labels from the descriptive text surrounding interface elements and then find synonym relationships between the identified labels. The performance of these approaches may be affected when no attribute description can be identified or the identified description is not informative (e.g., "Search" in the homepage of Amazon.com).

In contrast, in this paper we propose a novel instance-based schema matching approach, motivated by the necessity to identify the result schemas of Web databases that often lack available attribute names or labels and the goal of simultaneously solving inter-site and intra-site schema matching.

Our approach is mainly based on three observations about Web databases. First, improper² queries often cause search failure or no returned results. Second, the keywords of proper queries that return results very likely reappear in the returned results' corresponding attributes. Third, there is an underlying global schema³ for related Web databases in the same domain (proposed and verified in [16]). Accordingly, we introduce a query probing technique that first exhaustively sends query keywords residing in a domain-specific global schema, whose semantics are known in advance, then analyzes the re-occurrences of submitted query keywords in the returned result data, and finally identifies the semantically corresponding attributes from both the interface and result schemas based on the previous analysis.

Using a domain-specific global schema, we present a combined schema model that can describe five kinds of schema matching for Web databases in the same domain: global-interface, global-result, interface-result, interface-interface, and result-result.
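The probing idea and the matching kinds named above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the two-attribute `GLOBAL_SCHEMA`, the in-memory `RECORDS` standing in for a site's backend, the element names `q_title` and `q_author`, and the helper `submit_query`. It also swaps in a plain re-occurrence count with an argmax where the paper uses mutual information and vector similarity, so it shows only the shape of the approach, not the authors' method.

```python
from collections import defaultdict

# Hypothetical global schema: attribute -> sample keywords whose semantics are known.
GLOBAL_SCHEMA = {
    "Title": ["Harry Potter", "Dune"],
    "Author": ["Rowling", "Herbert"],
}

# Toy stand-in for a Web database backend: (title, author) records.
RECORDS = [
    ("Harry Potter and the Goblet of Fire", "J. K. Rowling"),
    ("Dune", "Frank Herbert"),
    ("Rowling: A Biography", "Sean Smith"),
]

# Hypothetical interface elements and the backend column each one searches.
ELEMENT_TO_COLUMN = {"q_title": 0, "q_author": 1}


def submit_query(element, keyword):
    """Simulate sending one keyword through one interface element."""
    column = ELEMENT_TO_COLUMN[element]
    return [r for r in RECORDS if keyword.lower() in r[column].lower()]


def probe(elements, n_columns):
    """Count keyword re-occurrences per (global attribute, element, result column)."""
    counts = defaultdict(int)
    for attr, keywords in GLOBAL_SCHEMA.items():
        for element in elements:
            for kw in keywords:
                # Improper queries tend to return nothing and so add no counts.
                for row in submit_query(element, kw):
                    for col in range(n_columns):
                        if kw.lower() in row[col].lower():
                            counts[(attr, element, col)] += 1
    return counts


def best_matches(counts):
    """Pick, per global attribute, the (element, column, count) with the highest count."""
    best = {}
    for (attr, element, col), n in counts.items():
        if attr not in best or n > best[attr][2]:
            best[attr] = (element, col, n)
    return best
```

On this toy data, `best_matches(probe(["q_title", "q_author"], 2))` pairs `Title` with `q_title` and result column 0, and `Author` with `q_author` and result column 1. Each pairing gives a global-interface match, a global-result match, and, by reading the element and column together, an interface-result match for the single site.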
The model not only describes the matching relationships among different schemas of Web databases in a specific domain, but, more importantly, also provides a global view about how to reinforce the matching accuracy by conducting multiple kinds of schema matching simultaneously. Using the model, we also present a cross validation technique that improves the accuracy of the schema matching results.

The main contributions of this paper are:

  • Introduction of a novel schema model of a single Web database that distinguishes what information users can query and what information users can browse.
  • Introduction of a generative view that includes five kinds of schema matching for related Web databases in a specific domain.
  • Introduction of an instance-based method based on domain-specific query probing, along with mutual information and vector similarity analysis, to automatically match various schemas of Web databases (intra-site and inter-site).
  • Benefiting from the above generative view, introduction of a cross validation technique based on an approximate solution of the graph partitioning problem to improve the accuracy of different kinds of schema matching.

² "Proper" means that the semantics of the query keywords match the semantics of the input element.
³ The global schema is a view capturing common attributes of data in the specific domain.

The rest of this paper is organized as follows. In section 2, we present our model with five schema matchings for Web databases. In section 3, the domain-specific query probing technique is introduced. We propose, in section 4, an instance-based schema matching approach with a cross validation technique, to solve both the intra-site and inter-site schema matching problems at the same time. Section 5 presents the experimental results of testing our approaches on real Web databases. Section 6 reviews existing work on the schema matching problem and how it correlates with our approach.
Finally, we give our conclusions and future work in section 7.

2. Combined Schema Model

A Web database is usually comprised of a query interface and a backend database. When a user query is submitted through the query interface, the site accesses its backend database for relevant data and returns the results to the user. Specifically, the query interface of the Web database usually contains multiple input elements, each of which may be associated with a schema attribute of the backend database. Data objects that the Web database returns to users are usually semi-structured, as their attribute values are encoded into HTML tags. Therefore, both the Web database interface and the returned results partially reflect the schema of the backend database, but in different ways.

For instance, Figure 1 shows an example of an online bookstore⁴. The part labelled Data Attributes shows a possible schema of the backend database consisting of six⁵ attributes {Title, Author, Publisher, ISBN, Format, Publication Date}. The part labelled Interface shows the query interface, which contains five input elements with surrounding text describing their semantics. When the keyword query "Harry Potter" is submitted through the "Title" element in the interface, a result page is returned by the web site containing its answer to the query (labelled Result in Figure 1 and containing three book instances with associated attribute values). From this example we can clearly see the difference between the attribute information contained in the query interface and that contained in the result pages. Although the site may provide an element in the interface for users to search on a particular data attribute (e.g., "ISBN"

⁴ http://www.mysimon.com/
⁵ The exact number is not known.

