Minimal Edit-Based Diffs for Large Trees

Abstract

Hierarchically structured data are commonly represented as trees and have given rise to popular data formats such as XML and JSON. An interesting query computes the difference between two versions of a tree, expressed as the minimum set of node edits (deletion, insertion, label rename) that transform one tree into the other; the size of this set is commonly known as the tree edit distance. Unfortunately, the fastest tree edit distance algorithms run in cubic time and quadratic space and are therefore not feasible for large inputs. In this paper, we leverage the fact that the difference between two versions of a tree is typically much smaller than the overall tree size. We propose a new tree edit distance algorithm that is linear in the tree size for similar trees. Our algorithm is based on the new concept of top node pairs and avoids redundant distance computations, the main issue with previous solutions for tree diffs. We empirically evaluate the runtime of our algorithm on large synthetic and real-world trees; our algorithm clearly outperforms the state of the art, often by orders of magnitude.
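To make the notion of tree edit distance concrete, the following is a minimal sketch of the textbook recursive formulation for ordered, labeled trees with unit edit costs. It is not the algorithm proposed in the paper (top node pairs are not modeled here), and it is only practical for small trees. The names `tree`, `forest_distance`, and `tree_edit_distance`, as well as the tuple-based tree representation, are assumptions made for illustration.

```python
from functools import lru_cache

# Assumed representation: a tree is (label, children), where children
# is a tuple of trees. A forest is a tuple of trees.
def tree(label, *children):
    return (label, tuple(children))

@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests."""
    if not f and not g:
        return 0
    if not f:
        # Insert the rightmost root of g; its children take its place.
        _, kids = g[-1]
        return forest_distance(f, g[:-1] + kids) + 1
    if not g:
        # Delete the rightmost root of f; its children take its place.
        _, kids = f[-1]
        return forest_distance(f[:-1] + kids, g) + 1
    (v_label, v_kids), (w_label, w_kids) = f[-1], g[-1]
    delete = forest_distance(f[:-1] + v_kids, g) + 1
    insert = forest_distance(f, g[:-1] + w_kids) + 1
    # Map the two rightmost roots onto each other (rename if labels differ),
    # then solve their subtrees and the remaining forests independently.
    rename = (forest_distance(v_kids, w_kids)
              + forest_distance(f[:-1], g[:-1])
              + (0 if v_label == w_label else 1))
    return min(delete, insert, rename)

def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))

# Two versions of a small tree that differ in one leaf label.
old = tree("a", tree("b"), tree("c"))
new = tree("a", tree("b"), tree("d"))
print(tree_edit_distance(old, new))  # 1 (rename c to d)
```

The sketch recomputes distances for many subforest pairs, which is the kind of cost that becomes prohibitive on large inputs, even when the two trees differ in only a few nodes.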

Cite

Pawlik, M., & Augsten, N. (2020). Minimal Edit-Based Diffs for Large Trees. In International Conference on Information and Knowledge Management, Proceedings (pp. 1225–1234). Association for Computing Machinery. https://doi.org/10.1145/3340531.3412026