Sign up & Download
Sign in

Optimizing Selections over Datacubes

by K A Ross, K A Zaman
Young (2000)

Cite this document (BETA)

Available from www.cs.columbia.edu
Page 1
hidden

Optimizing Selections over Datacubes

Optimizing Selections over Datacubes
Kenneth A. Ross∗.
Columbia University
kar@cs.columbia.edu
Kazi A. Zaman
Columbia University
zkazi@cs.columbia.edu
Abstract
Datacube queries compute aggregates over database re-
lations at a variety of granularities. Often one wants only
datacube output tuples whose aggregate value satisfies a
certain condition, such as exceeding a given threshold. We
develop algorithms for processing a datacube query using
the selection condition internally during the computation.
Thus, we can safely prune parts of the computation and end
up with a more efficient computation of the answer. Our
first technique, called “specialization”, uses the fact that
a tuple in the datacube does not meet the given threshold
to infer that all finer level aggregates cannot meet the
threshold. Our second technique is called “generaliza-
tion”, and applies in the case where the actual value of
the aggregate is not needed in the output, but used just to
compare with the threshold. We demonstrate the efficiency
of these techniques by implementing them within the sparse
datacube algorithm of Ross and Srivastava. We present a
performance study using synthetic and real-world data sets.
Our results indicate substantial performance improvements
for queries with selective conditions.
1 Introduction
Datacube queries compute aggregates over database re-
lations at a variety of granularities, and they constitute an
important class of decision support queries. The databases
may represent business data (such as sales data), medical
data (such as patient treatments) or scientific data (such as
large sets of experimental measurements).
The computation of a data cube query with k CUBE BY
attributes (B1, B2 . . . Bk) involves computing the aggre-
gates over a relation at 2k granularities where each granular-
ity is one of the the possible 2k subsets of our k CUBE BY
attributes. Attributes that are not present in such a subset
∗This research was supported by a David and Lucile Packard Founda-
tion Fellowship in Science and Engineering, by an NSF Young Investigator
Award, by NSF grant number IIS-98-12014, and by NSF CISE award
CDA-9625374.
SELECT a,b,c,d
FROM relation
CUBE BY a,b,c,d
HAVING aggregate(G) relop threshold
Figure 1. Selection over a projected data cube
SELECT a,b,c,d,aggregate(G)
FROM relation
CUBE BY a,b,c,d
HAVING aggregate(G) relop threshold
Figure 2. Selection over a data cube
are replaced by a special value “ALL” in the datacube result.
We refer to each of these granularities as a cuboid and we
use the notation Q( Bi) to denote the cuboid at granularity
Bi.
Often one wants only datacube output tuples whose ag-
gregate value satisfies a certain condition, such as exceeding
a given threshold. For example, one might ask for all
combinations of model, color, and year of cars (including
the special value “ALL” for each of the dimensions) for
which the total sales exceeded a given amount of money.
This query takes the form of Figure 1 which we call a
“projected datacube” because the aggregate is projected out
of the result. In some cases we may need to know the exact
value of sales too (Figure 2).
We can naively execute these queries as follows. Com-
pute the datacube using any of the existing datacube algo-
rithms [5, 1, 14] and check if the predicate in the HAVING
clause holds for each tuple in the datacube. This strategy is
reasonable if a large proportion of the datacube result tuples
satisfy the condition. However, if only a small fraction
satisfy the condition (i.e, the query is an example of an
“iceberg query” [4]) then it seems that we may be wasting
a lot of time computing aggregates that do not qualify.
Depending upon the aggregate function and the rela-
tional operator in the predicate, there are certain optimiza-
0-7695-0686-0/00 $10.00  2000 IEEE
Page 2
hidden
tions we can make use of. In this paper we propose two
kinds of optimization that we call generalization and spe-
cialization. We defer the formal details until Section 1.1.
For now, we motivate these techniques with examples.
Example 1.1: (Specialization) Suppose that we are com-
puting our example datacube query “Find all combinations
of model, color, and year of cars (including the special value
“ALL” for each of the dimensions) for which the total sales
exceeded $100,000. Output the total sales also.” Suppose
that during an intermediate step of the computation, we
determine that the total sales for all green cars is below
$100,000. Then we can immediately infer that datacube
output tuples grouped by (model,color), (year,color), and
(model,year,color) will never meet the threshold when the
color is green. If the computation corresponding to those
aggregates has not yet been performed, then perhaps we can
avoid that computation altogether. 2
Example 1.2: (Generalization) Suppose that we are com-
puting the projected datacube query “Find all combinations
of model, color, and year of cars (including the special
value “ALL” for each of the dimensions) for which the total
sales exceeded $100,000.” Note that in this case we do not
need the total sales in the output. Suppose that during an
intermediate step of the computation, we determine that the
total sales for white 1998 Taurus cars is above $100,000.
Then we can immediately infer that the class of white cars
satisfies the condition, the class of 1998 Taurus cars satisfies
the condition, etc. We can immediately output all of the
additional seven “generalizations” of (white,1998,Taurus).
If we have not yet performed the aggregation needed for
some of these additional tuples, then we can potentially
avoid such aggregation altogether. 2
In this paper we examine how we would make use of
generalization and specialization to carry out selections
over data cubes efficiently. We extend the Memory-Cube
and Partitioned-Cube algorithms which can deal ef-
ficiently with sparse data [14]. Memory-Cube computes
a set of paths which cover the search lattice and then
computes the cuboids on each path in turn. We exploit
specialization by altering the set of base tuples with which
each path is computed without affecting the correctness of
the result. Depending on the selectivity of our condition, we
can reduce the number of tuples which are processed in each
path leading to substantial improvements in performance.
Generalization is incorporated into the algorithms by in-
troducing marker tuples which allow us to skip computing
aggregates which are known to satisfy our selection crite-
ria. Specialization applies for both projected datacubes and
datacubes while generalization applies only for projected
datacubes. We demonstrate the efficiency of these modi-
fications by experiments carried out on synthetic and real-
world data sets for a variety of selection conditions. These
experiments support our overall conclusion that substantial
work can be saved.
We further demonstrate that projected datacubes are, in
general, easier to compute than datacubes that include the
aggregate result. In particular, it is possible to compute
projected datacubes with a HAVING clause on the median
in a distributive fashion. In contrast, there is no known
distributive algorithm for computing the median in the dat-
acube output. Hence users can get efficient answers to
queries such as “Find combinations of model, year and
color for which the median sale is greater than $10,000.”
1.1 Notation and Terminology
The computation of the various cuboids are not indepen-
dent of each other, but are closely related in that some of
them can be computed using others. The relationship be-
tween cuboids can be captured in terms of the search lattice
of the data cube [7]. Each granularity Bi ⊆ {B1, . . . , Bk}
is a node in the search lattice, and there is an edge from node
Bi to Bj if Bj is a subset of and has one fewer element than
Bi; Bi is said to be a parent of Bj in the search lattice. If
there is a path from Bi to Bj in the search lattice, Bi is
said to be of a finer granularity than Bj , and Bj is said
to be of a coarser granularity than Bi. Paths in the search
lattice precisely determine which of the cuboids can be
computed from which others. In particular, cuboid Q( Bj)
can be computed using Q( Bi) if and only if Bj is of coarser
granularity than Bi. A datacube tuple t1 is more general
than tuple t2 if it can be produced from t2 by replacing one
or more of t2’s non-ALL attributes with ALL values. We
can restate this by saying that t2 is more specialized than t1.
t1 and t2 come from cuboids at different levels of the search
lattice with a path from the cuboid of the more specialized
tuple (t2) to the cuboid of the more general (t1) one.
Definition 1.1: We define a cuboid consisting of tuples
with exactly n non-ALL dimension attribute values to be
a n-cuboid. 2
Using the categorizations of aggregate functions intro-
duced in [5], we focus on the case of distributive functions.
Definition 1.2: Let ∪ denote multiset union, and let S, S′
be multisets. An aggregate function g is distributive if there
exists a binary function h such that for every nonempty S
and S′, g(S ∪ S′) = h(g(S), g(S′)). 2
Examples of such aggregate functions include SUM,
COUNT, MIN and MAX. A property that is important when
considering potential optimizations of aggregate computa-
tion is the idempotent property.
Definition 1.3: Let S range over multisets over a domain
D, and let the aggregate function g be a mapping from
0-7695-0686-0/00 $10.00  2000 IEEE

Sign up today - FREE

Mendeley saves you time finding and organizing research. Learn more

  • All your research in one place
  • Add and import papers easily
  • Access it anywhere, anytime

Start using Mendeley in seconds!

Already have an account? Sign in

Readership Statistics

2 Readers on Mendeley
by Discipline
 
by Academic Status
 
50% Senior Lecturer
 
50% Student (Postgraduate)
by Country
 
50% United Kingdom
 
50% Austria