Comparing hierarchical data in external memory

by Sudarshan S Chawathe
VLDB '99: Proceedings of the 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK (1999)

Abstract

We present an external-memory algorithm for computing a minimum-cost edit script between two rooted, ordered, labeled trees. The I/O, RAM, and CPU costs of our algorithm are, respectively, 4mn+7m+5n, 6S, and O(MN+(M+N)S^1.5), where M and N are the input tree sizes, S is the block size, m=M/S, and n=N/S. This algorithm can make effective use of surplus RAM capacity to quadratically reduce I/O cost. We extend to trees the commonly used mapping from sequence comparison problems to shortest-path problems in edit graphs.

Comparing Hierarchical Data in External Memory
Sudarshan S. Chawathe
Department of Computer Science
University of Maryland
College Park, MD 20904
chaw@cs.umd.edu
Abstract
We present an external-memory algorithm for
computing a minimum-cost edit script between
two rooted, ordered, labeled trees. The I/O, RAM,
and CPU costs of our algorithm are, respectively,
4mn+7m+5n, 6S, and O(MN+(M+N)S^1.5),
where M and N are the input tree sizes, S is the
block size, m = M/S, and n = N/S. This algorithm
can make effective use of surplus RAM
capacity to quadratically reduce I/O cost. We ex-
tend to trees the commonly used mapping from
sequence comparison problems to shortest-path
problems in edit graphs.
1 Introduction
We study the problem of comparing snapshots of data to
detect similarities and differences between them. Such
differencing of data has applications in version con-
trol, incremental view maintenance, data warehousing,
standing queries (subscriptions), and change management
[Tic85, LGM96, CAW98]. The RCS version control sys-
tem [Tic85] uses the diff program [MM85] to compute and
store only the differences between the new and old versions
of data that is checked in. As another version control ap-
plication, consider the process used to merge two divergent
versions of a program or document (e.g., the ediff/emerge
function in Emacs). The first step consists of comparing
the files containing the two versions to determine where
and how they differ. These differences are then presented
using a graphical interface that allows a user to determine
which variant to keep in the merged file.
Permission to copy without fee all or part of this material is granted
provided that the copies are not made or distributed for direct commercial
advantage, the VLDB copyright notice and the title of the publication and
its date appear, and notice is given that copying is by permission of the
Very Large Data Base Endowment. To copy otherwise, or to republish,
requires a fee and/or special permission from the Endowment.
Proceedings of the 25th VLDB Conference,
Edinburgh, Scotland, 1999.
Differencing algorithms also play a key role in change
management systems such as C3 [CAW98]. Since many
databases, especially those on the Web, do not offer change
notification facilities, changes must be detected by com-
paring old and new results of a query. Once changes have
been detected in this manner, C3 uses them to implement
standing queries based on the current state as well as the
history of the databases being monitored.
We can also use differencing algorithms to reduce the
amount of data transmitted over a network in mirroring
applications. Popular Web and FTP servers often have
dozens of mirror sites around the world. Changes made to
the master server need to be propagated to the mirror sites.
Ideally, the persons or programs making changes would
keep a record of exactly what data was updated. However,
in practice, due to the autonomous and loosely organized
nature of such sites, there is no reliable record of changes.
Further, even if such a record is available, it may be based
on a version that is different from the version currently at
a certain mirror site. Due to such difficulties, efficient mir-
roring requires differencing algorithms that compute and
propagate only the difference between the version at the
master server and that at a mirror site. Similar ideas en-
able differencing algorithms to improve efficiency in a data
warehousing environment [LGM96].
Differencing algorithms are also used to find, mark-up,
and browse changes between two or more versions of a
document [CRGMW96, CGM97]. Suppose we receive an
updated version of an online manual. Again, in the ideal
case the new version would highlight the way it differs from
the old one. However, for reasons similar to those stated
above, in practice we often need to detect the differences
ourselves by comparing the two versions. For example,
[CAW98] describes experiences in detecting and brows-
ing differences between different versions of a restaurant
review database on the Web, while [Yan91] describes the
implementation of an application that highlights differences
between program versions.
There is a substantial body of prior work on differencing
algorithms. The main distinguishing features of the work
in this paper are the following. (See Section 6 for a detailed
discussion.)
• We study algorithms for computing differences between
snapshots of hierarchically structured data,
modeled using rooted, ordered, labeled trees. Our
model allows us to accurately capture the hierarchical
structure inherent in data such as source code, ob-
ject class hierarchies, structured documents, HTML,
XML, and SGML. For example, an online manual
typically has a well-defined hierarchical structure con-
sisting of chapters, sections, subsections, paragraphs,
and sentences. Algorithms that take this structure into
account produce results that are more meaningful than
those that treat their inputs as flat strings.
While the problem of differencing strings and se-
quences has been thoroughly studied and admits sev-
eral efficient solutions, the problem of differencing
trees remains challenging. Several formulations of
this problem are NP-hard [ZWS95]. In this paper, we
study a simple variation that admits efficient solutions.
• We do not assume that the snapshots being differenced
are small enough to fit entirely in main memory (RAM);
instead, they reside in external memory (disk). For example,
online manuals for complex machinery, aircraft, and
submarines are tens or hundreds of gigabytes in size,
making it impracticable to use main-memory differencing
algorithms to compare their versions.
When data resides in external memory, the number
of input-output operations (I/Os), and not the number
of CPU cycles, is the primary determinant of running
time. Therefore, external-memory algorithms use
techniques that try to minimize the number of I/Os. A
secondary but important consideration is the amount
of buffer space required in RAM. See [Vit98] for an
overview of external memory algorithms. In this pa-
per, we analyze algorithms based on their I/O, RAM,
and CPU costs.
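To make the I/O and RAM expressions quoted in the abstract concrete, the following minimal Python sketch simply evaluates them for given sizes. The function name cost_estimates is hypothetical, and rounding m and n up to whole blocks is an assumption of this sketch; the text defines them only as M/S and N/S.

```python
import math


def cost_estimates(M, N, S):
    """Evaluate the abstract's cost expressions for tree sizes M, N and block size S.

    Returns (m, n, io, ram), where io = 4mn + 7m + 5n and ram = 6S.
    """
    # Assumption of this sketch: partial blocks count as whole blocks.
    m = math.ceil(M / S)
    n = math.ceil(N / S)
    io = 4 * m * n + 7 * m + 5 * n  # number of block I/Os
    ram = 6 * S                     # buffer space, in the same units as S
    return m, n, io, ram


print(cost_estimates(1000, 2000, 100))
```

For M = 1000, N = 2000, and S = 100, this gives m = 10, n = 20, 970 I/Os, and a buffer requirement of 600. The 4mn term dominates, which is why the I/O cost is described as O(mn).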
As an illustration of the importance of using an algorithm
that is cognizant of the hierarchical structure of data,
consider the following example from [Yan91]. Figure 1
depicts fragments of two program versions that are being
compared. A sequence comparison program such as the one
in [MM85] compares the inputs line-by-line and may result
in matching program text as suggested by the solid lines
in the figure. Given the nested structure of the program
fragment, it is clearly more meaningful to match the inputs
as suggested by the dashed lines in the figure. However,
the definition of optimality used by most sequence compar-
ison algorithms (based on a longest common subsequence)
considers the solution depicted using solid lines more de-
sirable [Mye86]. By modeling the hierarchical structure of
programs, tree differencing algorithms are able to produce
more meaningful results.
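The line-by-line behavior described above can be illustrated with a small longest-common-subsequence computation. This is a generic textbook dynamic program written for illustration, not the algorithm of [MM85] or [Mye86], and the two program versions below are hypothetical inputs rather than a reconstruction of Figure 1.

```python
def lcs(a, b):
    """Return one longest common subsequence of the sequences a and b."""
    rows, cols = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to recover the matched lines.
    matched = []
    i = j = 0
    while i < rows and j < cols:
        if a[i] == b[j]:
            matched.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return matched


# Hypothetical program versions, one source line per list element.
old = ["while(p) {", "x = y + z;", "}", "a = b + c;"]
new = ["while(p) {", "x = y + z;", "}", "while(p) {", "a = b + c;", "}"]
print(lcs(old, new))
```

Here every line of the old version is matched to some line of the new one, so a line-based comparison reports only two inserted lines. Nothing in that result records that the assignment to a has moved inside a loop, which is the kind of structural information a tree model preserves.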
[Figure 1: Importance of hierarchical structure. Two versions of a
program fragment composed of while(p) { ... } blocks and the
statements x = y + z; and a = b + c;]
We now present a brief, informal definition of the
differencing problem we study in this paper. (See Section 2
for details.) A rooted, ordered, labeled tree is a tree in
which each node has a label and in which the order amongst
siblings is significant. (The label of a node intuitively
represents the data content at that node; it is not a unique key
or object identifier.) Trees can be transformed using three
edit operations: (1) We can insert a new leaf node at a
specified location in the tree. (2) We can delete an existing
leaf node. (3) We can update the label of a node. Note that
the restriction that (1) and (2) operate only on leaf nodes
means that to delete an interior node, we must first delete
all its descendants; similarly, we must insert a node before
inserting any of its descendants.
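The three edit operations can be sketched in Python as follows. This is a minimal in-memory illustration under stated assumptions, not the external-memory representation used by the algorithm: the Node class and the function names insert_leaf, delete_leaf, update_label, and labels are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # eq=False: nodes are compared by identity, since labels are not keys
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)  # sibling order is significant
    parent: Optional["Node"] = field(default=None, repr=False)


def insert_leaf(parent, position, label):
    """Insert a new leaf as the child of `parent` at index `position`."""
    if not 0 <= position <= len(parent.children):
        raise IndexError("position out of range")
    node = Node(label, parent=parent)
    parent.children.insert(position, node)
    return node


def delete_leaf(node):
    """Delete `node`, which must be a leaf; interior nodes are rejected."""
    if node.children:
        raise ValueError("only leaf nodes can be deleted")
    if node.parent is None:
        raise ValueError("this sketch does not delete the root")
    node.parent.children.remove(node)
    node.parent = None


def update_label(node, label):
    """Replace the label of any node, leaf or interior."""
    node.label = label


def labels(node):
    """Nested (label, children) view of a subtree, for inspection."""
    return (node.label, [labels(child) for child in node.children])
```

With these definitions, removing an interior node means deleting its descendants first: calling delete_leaf on a paragraph node that still has sentence children raises ValueError, whereas deleting the sentences and then the paragraph succeeds.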
An edit script is a sequence of edit operations that are
applied in the order listed. We associate a cost with each
edit operation and define the cost of an edit script to be the
sum of the costs of its component operations.
Problem Statement (informal): Given two rooted, or-
dered, labeled trees A and B, find a minimum-cost edit
script that transforms A to B.
Given trees A and B, we can transform one to the other
using any of an infinite number of edit scripts. (For exam-
ple, given any edit script that transforms A to B, we can
append to it operations that insert and immediately delete
a node, thus generating an infinite number of edit scripts.)
This fact motivates the minimum-cost requirement in the
problem definition. Another motivation for the minimum-
cost requirement is the following: Given two trees that dif-
fer only in one node label, the intuitively desirable edit script
is one that contains a single update operation. We need to
weed out edit scripts that unnecessarily insert, delete, and
update nodes. An edit script of minimum cost cannot con-
tain such redundant or wasteful edit operations (since we
can obtain another edit script with lower cost by getting rid
of the redundancies and inefficiencies).
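The cost argument above can be checked with a few lines of Python. The tuple encoding of operations, the node identifiers, and the unit costs are assumptions made for this sketch; the text says only that each edit operation has an associated cost.

```python
# Hypothetical unit costs for the three edit operations.
COSTS = {"insert": 1.0, "delete": 1.0, "update": 1.0}


def script_cost(script, costs=COSTS):
    """Cost of an edit script: the sum of the costs of its operations."""
    return sum(costs[operation[0]] for operation in script)


# Two trees that differ in one node label: a single update suffices.
minimal = [("update", "node7", "new label")]

# The same script padded with an insert that is immediately undone.
padded = minimal + [("insert", "node1", 0, "tmp"), ("delete", "tmp")]

print(script_cost(minimal), script_cost(padded))
```

Both scripts perform the same transformation, but the padded one costs 3.0 against 1.0 for the minimal one under these unit costs, so the minimum-cost requirement rules it out.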
The two main contributions of this paper may be summarized
as follows:
• We present an efficient external-memory algorithm for
computing the difference (minimum-cost edit script)
between two snapshots of hierarchical data (trees).
The I/O, RAM, and CPU costs of our algorithm are,
respectively, 4mn + 7m + 5n, 6S, and O(MN√S),
where M and N are the input tree sizes, S is the block
size, m = M/S, and n = N/S. To our knowledge,
this algorithm is the first external-memory differenc-
ing algorithm (for sequences or trees). The O(mn)
I/O complexity of our algorithm is optimal over a wide
class of computation models due to the O(mn) lower
bound for the sequence comparison problem (which
