Efficiently ordering query plans for data integration

by A H Doan, A Halevy
Data Engineering 2002 Proceedings 18th International Conference on (2002)

Available from ieeexplore.ieee.org

In Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE), 2002, to appear
Efficiently Ordering Query Plans for Data Integration
AnHai Doan & Alon Halevy
Department of Computer Science and Engineering
University of Washington, Seattle, WA, 98195
{anhai,alon}@cs.washington.edu
Abstract
The goal of a data integration system is to provide a uniform
interface to a multitude of data sources. Given a user query
formulated in this interface, the system translates it into a
set of query plans. Each plan is a query formulated over the
data sources, and specifies a way to access sources and
combine data to answer the user query.
In practice, when the number of sources is large, a data
integration system must generate and execute many query plans
with significantly varying utilities. Hence, it is crucial that
the system finds the best plans efficiently and executes them
first, to guarantee an acceptable time to, and quality of, the
first answers. We describe efficient solutions to this problem.
First, we formally define the problem of ordering query plans.
Second, we identify several interesting structural properties
of the problem and describe three ordering algorithms that
exploit these properties. Finally, we describe experimental
results that suggest guidance on which algorithms perform best
under which conditions.
1. Introduction
The goal of a data integration system is to provide a uniform
interface to a multitude of data sources, thereby freeing the
user from laborious manual interaction with the individual
sources. The system provides this interface by allowing users
to pose queries through a mediated schema, which is a virtual
schema that captures the salient aspects of the application
domain.
A data integration system typically consists of three main
components: query reformulator, optimizer, and execution
engine. Given a user query formulated in the mediated
schema, the query reformulator translates it into a set of
query plans. Each plan is a query formulated over the data
sources, and specifies a way to access sources and combine
data to answer the user query.
Example 1.1 Consider a data integration system that answers
queries related to movies. Suppose the system can access
sources $V_1, V_2, V_3$ that contain tuples
$\langle actor, movie \rangle$ and sources $V_4, V_5, V_6$ that
contain tuples $\langle movie, review \rangle$. Given a query
asking for reviews of movies starring Harrison Ford, the
reformulator may generate nine query plans, each accessing a
source among $V_1$–$V_3$ to ask for the titles of movies
starring Ford, then feeding these titles into a source among
$V_4$–$V_6$ to obtain the reviews. □
Each query plan is then given to the query optimizer, which
produces a physical query execution plan. A physical plan
specifies exactly how the query plan is to be evaluated,
including the order of the operations and the specific
algorithm used for every operation (e.g., algorithms for joins,
selections). Finally, the query execution plans are evaluated
by the query execution engine. It is important to note that
since sources may be incomplete, no single query plan is
guaranteed to produce all the answers. Hence, the answer
to a user query is the union of the output of all query plans.
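The union semantics described above can be sketched as follows. This is an illustrative sketch only: `answer_query`, the `execute` callback, and the toy tuples are hypothetical, and a real execution engine would stream results rather than build whole sets in memory.

```python
def answer_query(plans, execute):
    """Union the tuples returned by every plan; duplicate tuples collapse."""
    answers = set()
    for plan in plans:
        answers |= set(execute(plan))
    return answers


# Toy data, made up for illustration: each plan returns only part of the answer.
toy_results = {
    ("V1", "V4"): [("movie_a", "review_1")],
    ("V2", "V5"): [("movie_a", "review_1"), ("movie_b", "review_2")],
    ("V3", "V6"): [],
}

result = answer_query(toy_results, toy_results.get)
print(sorted(result))
```

No single toy plan returns both tuples, but their union does, which mirrors the point that incomplete sources make every plan potentially necessary.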
Much research effort in data integration has concentrated
on reformulation and optimization issues. Several algorithms
to reformulate user queries have been proposed (e.g.,
[15, 5, 19]). Optimization is recognized as crucial to building
practical data integration systems, and hence has led to
much work at all three levels of query evaluation:
reformulation [4, 5, 7, 12], optimization [9, 23, 12], and
execution [20, 11, 2].
At the reformulation level, most optimization approaches
have focused on minimizing the cost to obtain all answers
from the sources. In many data integration applications,
however, the time to and the quality of the first answers
are most important. This is because in applications with a
large number of sources, the number of query plans is
typically very large and plan evaluation is costly, so
executing all query plans is expensive and often infeasible.
Furthermore, query plans tend to vary significantly in their
utility (e.g., coverage, execution time, monetary cost),
depending on which sources they access [13, 18].
Hence, an important optimization problem at the
query-reformulation level is to find query plans in decreasing
order of their utility, so that the data integration system can
focus on and execute the best plans first. Query execution
can then be aborted as soon as the user has found a
satisfactory answer, or when allotted resource limits have been
reached.
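The execution strategy described above (best plans first, stopping on satisfaction or on a resource limit) can be sketched as a simple loop. Everything named here is hypothetical: `execute_best_first`, the `is_satisfied` predicate, and the `max_plans` budget stand in for whatever stopping criteria a system uses. The sketch also assumes each plan has a fixed utility that does not depend on which plans ran before it, which is a simplification.

```python
def execute_best_first(plans, utility, execute, is_satisfied, max_plans):
    """Run plans in decreasing utility; stop when satisfied or out of budget."""
    answers = set()
    executed = []
    for plan in sorted(plans, key=utility, reverse=True):
        # Abort as soon as the answers suffice or the plan budget is used up.
        if len(executed) >= max_plans or is_satisfied(answers):
            break
        answers |= set(execute(plan))
        executed.append(plan)
    return answers, executed


# Toy utilities and outputs, made up for illustration.
utilities = {"p1": 0.2, "p2": 0.9, "p3": 0.5}
outputs = {"p1": ["a"], "p2": ["b", "c"], "p3": ["c", "d"]}

answers, executed = execute_best_first(
    utilities, utilities.get, outputs.get,
    is_satisfied=lambda found: len(found) >= 3,
    max_plans=3,
)
print(executed)  # ['p2', 'p3']; p1 is never run
```

In this toy run the lowest-utility plan is skipped entirely because three answers were already found after two plans.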
Example 1.2 Consider plan coverage, defined as the number
of tuples returned by a plan that haven’t been returned
by any plan executed previously [6, 7, 12]. If sources have
equal access cost, then executing query plans in the
decreasing order of their coverage returns as many answers
as possible as soon as possible. Consequently, it maximizes
the likelihood of obtaining a satisfactory answer early [6].
If sources have differing access cost, however, then prefer-
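For the equal-access-cost case in Example 1.2, ordering by coverage can be sketched as a greedy loop that repeatedly picks the plan adding the most not-yet-returned tuples. This is only an illustration of the coverage definition, not one of the ordering algorithms the paper goes on to describe. It assumes, unrealistically, that each plan's full output is known in advance; the name `order_by_coverage` and the toy outputs are hypothetical.

```python
def order_by_coverage(plan_outputs):
    """Greedily order plans by how many not-yet-seen tuples each would add."""
    remaining = {plan: set(tuples) for plan, tuples in plan_outputs.items()}
    seen = set()
    ordering = []
    while remaining:
        # Coverage of a plan is relative to the plans already chosen.
        best = max(remaining, key=lambda p: len(remaining[p] - seen))
        ordering.append((best, len(remaining[best] - seen)))
        seen |= remaining.pop(best)
    return ordering


# Toy outputs, made up for illustration.
plan_outputs = {"p1": {1, 2, 3}, "p2": {3, 4}, "p3": {4, 5, 6, 7}}
ordering = order_by_coverage(plan_outputs)
print(ordering)  # [('p3', 4), ('p1', 3), ('p2', 0)]
```

Note that `p2` ends with coverage 0 even though it returns two tuples on its own, because both were already returned by plans executed earlier.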
