Sign up & Download
Sign in

Dynamic Query Scheduling in Data Integration Systems

by Françoise Fabret
ICDE (2000)

Abstract

Execution plans produced by traditional query optimizers for data integration queries may yield poor performance for several reasons. The cost estimates may be inaccurate, the memory available at run-time may be insufficient, or data delivery rate can be unpredictable. In this paper, we address the problem of unpredictable data arrival rate. We propose to dynamically schedule queries in order to deal with irregular data delivery rate and gracefully adapt to the available memory. Our approach performs careful step-by-step scheduling of several query fragments and processes these fragments based on data arrivals. We describe a performance evaluation that shows important performance gains in several configurations.

Cite this document (BETA)

Available from portal.acm.org
Page 1
hidden

Dynamic Query Scheduling in Data Integration Systems


Dynamic Query Scheduling in Data Integration Systems

Luc Bouganim*,** Françoise Fabret** C. Mohan**,*** Patrick Valduriez**


* PRiSM Laboratory
78035 Versailles
France
Luc.Bouganim@prism.uvsq.fr
** INRIA Rocquencourt
France
Francoise.Fabret@inria.fr
Patrick.Valduriez@inria.fr
***IBM Almaden Research
USA
mohan@almaden.ibm.com





Abstract

Execution plans produced by traditional query optimizers
for data integration queries may yield poor performance for
several reasons. The cost estimates may be inaccurate, the
memory available at run-time may be insufficient, or data
delivery rate can be unpredictable. In this paper, we address
the problem of unpredictable data arrival rate. We propose
to dynamically schedule queries in order to deal with
irregular data delivery rate and gracefully adapt to the
available memory. Our approach performs careful step-by-
step scheduling of several query fragments and processes
these fragments based on data arrivals. We describe a
performance evaluation that shows important performance
gains in several configurations.

1 Introduction
Research in data integration systems has popularized the
mediator/wrapper architecture whereby a mediator provides
a uniform interface to query heterogeneous data sources
while wrappers map the uniform interface into the data
source interfaces [13]. In this context, processing a query
consists in sending sub-queries to data source wrappers, and
then integrating the sub-query results at the mediator level to
produce the final response.
Classical query processing, based on the distinction
between compile-time and runtime, could be used here. The
query is optimized at compile time, thus resulting in a
complete query execution plan (QEP). At runtime, the query
engine executes the query, following strictly the decisions of
the query optimizer. This approach has proven to be
effective in centralized systems where the compiler can
make good decisions. However, the execution of an
integration query plan produced with this approach can
result in bad performance because the mediator has a poor
knowledge of the behavior of the remote sources.
First, the data arrival rate, at the mediator from a
particular source, is typically difficult to predict and control
because it depends on the complexity of the sub-query
assigned to that source, the load of the remote source and
the characteristics of the network. Delays in data delivery
may stall the query engine, leading to a dramatic increase in
response time.
Second, the characteristics of the sub-query results are
difficult to assess, due to the autonomous nature of the data
sources. The sizes of intermediate results used to estimate
the costs of the integration query execution plan are then
likely to be inaccurate. Poor performance is therefore the
likely result of a sub-optimal execution plan, and, of a bad
estimate of the memory needed to execute the plan. Indeed,
the amount of available memory at runtime for processing
the integration query may be much less than what was
assumed at compile time. Executing the query “as is” might
cause thrashing of the system because of paging [4,12].
All these problems have led database researchers and
implementers to resort to dynamic strategies to correct or
adapt the static query execution plan.
1.1. Unpredictable Data Arrival Rate
In this paper, we mainly address the performance
problems due to unpredictable data delivery rates in data
integration systems. Dynamic strategies are a logical
response to this problem as they try to adapt dynamically the
query execution plan to the execution context. This
adaptation can be done at three different levels:
• at the operator level, using relational operators that are
able to absorb delays in delivery. [8] has adapted the
double-pipelined hash join [16], originally designed for
parallel databases. However, such an approach is restricted
to hash-based queries (i.e., equi-joins).
• at the scheduling level, by modifying on the fly the
scheduling of the operators to avoid query engine stalling
[1,2,15].
• at the query execution plan level, by partially re-
optimizing the query plan in order to adapt to the data
arrival rate using estimates of such rates [1,15].
As in [8,15], we think that partial re-optimization and
dynamic scheduling are complementary and should be used
together to provide good performance. A general algorithm
for dynamic optimization can be sketched as follows:

Produce an initial QEP
Loop
Process the current QEP using dynamic scheduling strategies
If scheduling strategies fail or the plan appears to be sub-
optimal, apply dynamic re-optimization
End Loop
Page 2
hidden
Partial re-optimization of the query plan is difficult to
implement and tune [15]. Moreover, the possibility for re-
optimization decreases as query execution reaches
completion (because of the results previously computed).
[1,15] describe approaches which combine dynamic
scheduling with partial re-optimization. But many problems
remain at the scheduling level. Solving them may alleviate
the need for dynamic re-optimization.
In this paper, we describe a new dynamic scheduling
strategy to address the performance problems due to
unpredictable data delivery. But such a strategy may
increase memory consumption and thus interfere with
decisions regarding memory allocation. Thus, we must also
take into account the problem of memory limitation.
Our strategy differs significantly from previous ones. In
the following, we briefly compare it to the most related
approach which is query scrambling [1,2,15]. Further
discussion and comparison with other related work is
provided in [6].
1.2. Query Scrambling
The basic strategy of query scrambling is to dynamically
modify the query execution plan in reaction to unexpected
delays in data access. [2] defines three types of delays. (i)
Initial delay: when a delay occurs for the first tuple only; (ii)
Bursty arrival: when data arrives in bursts followed by long
periods of no arrival; and (iii) Slow delivery: when the
arrival rate is regular but slower than normal. In [1], bursty
arrival is considered while [1,15] focus on initial delays.
However, the authors have not provided any solution to the
problem of slow delivery.
To handle initial delays, the authors propose, in a first
phase, to reschedule the query plan. If the latter is not
sufficient, a second phase creates a new execution plan
using heuristics [2] or a query optimizer [15]. This second
phase is a run-time re-optimization, and, as mentioned
above, can be complementary to any dynamic scheduling
strategy. In what follows, we only consider the first phase.
The different scrambling techniques are all based on the
same concept: react to a timeout while waiting for remote
data to arrive. When this timeout occurs, a scrambling step
takes place: The operator currently in execution, say O1, is
suspended (as it has no input data), and a new operator, say
O2, is selected for execution. Depending on the strategy or
on configuration parameters, O1 resumes as soon as data
arrives, or O2 is executed until it ends or until a new time-
out occurs. In this last case, a new scrambling step is
triggered.
Scrambling techniques have two distinct problems. First,
a delaying data source may be detected too late to enable a
timely reaction. For instance, if all the execution occurs
normally, and a single problem arises with the last accessed
data source, scrambling will be ineffective since there is no
more work to scramble [1].
Second, scrambling may be difficult to configure as it can
incur important overheads (e.g., materialization, memory
consumption). The behavior of scrambling strongly depends
on configuration parameters. Consider for instance the time-
out value. If its value is too large, scrambling will never
occur. On the contrary, a value that is too small may trigger
too many scrambling steps while simply waiting for the
delayed data might have been more effective.
1.3. Proposed Approach
Scrambling techniques assume that the execution plan
will, a priori, execute without delays and then react when
such delays occur. Our approach takes the opposite
direction. We suppose that delays will happen during the
execution, and include this factor in the execution strategy.
To do that, we constantly monitor the arrival rate of each
participating data source and the available memory. We use
this knowledge to elaborate a scheduling plan which is
periodically revised when the delivery rates change
significantly. Thus, we interleave planning phases, where
we plan for the near future and execution phases where we
react instantaneously to delays. This constant re-evaluation
of the scheduling plan allows us to adjust to the data arrival
rate and memory consumption.
The planning phase selects and orders independent query
fragments which can be processed concurrently (i.e., which
fit together in the available memory). The selection and the
ordering are based on heuristics which use two metrics: the
critical degree which quantifies the critical path (in terms of
total retrieval time) of the query execution plan, and the
benefit materialization indicator which represents the
profitability of scheduling a query fragment.
During the execution phase, query fragments are
considered for execution depending on their order and on
data availability. A fragment with a certain priority is
considered for processing a batch of data only if none of the
higher priority fragments has any data to process (i.e., those
fragments are temporarily blocked because of data
unavailability). The query engine can switch from one
fragment to another with negligible overhead and without
negative consequences as the scheduling plan ensures that
these fragments can be processed concurrently. Thus, the
query engine is stalled only if there is no available data for
all the fragments that are scheduled concurrently.
Using this strategy, we can address both the delays in
data delivery and memory limitation problems. Moreover,
this solution applies to any of the three previously defined
types of delays problems. As our approach is independent of
any timeout mechanism, it is able to hide repetitive short
delays, which makes it particularly suited to slow delivery
cases, e.g., when the remote sites are overloaded.
The remainder of the paper is organized as follows.
Section 2 presents the context and the query execution
problems. Section 3 describes the architecture of our query
execution engine. Section 4 presents the metrics and the
general strategy used by the query scheduler to dynamically
optimize the query execution (more details are given in [6]).
Section 5 gives an experimental validation which shows the
behavior of our query engine and the performance gains we
can obtain. Finally, Section 6 concludes.

Sign up today - FREE

Mendeley saves you time finding and organizing research. Learn more

  • All your research in one place
  • Add and import papers easily
  • Access it anywhere, anytime

Start using Mendeley in seconds!

Already have an account? Sign in

Readership Statistics

5 Readers on Mendeley
by Discipline
 
 
by Academic Status
 
20% Other Professional
 
20% Ph.D. Student
 
20% Researcher (at an Academic Institution)
by Country
 
40% United States
 
20% Bosnia and Herzegovina
 
20% France