Minimal Edit-Based Diffs for Large Trees

Mateusz Pawlik; Nikolaus Augsten

Conference Proceedings, Open Access

International Conference on Information and Knowledge Management, Proceedings (2020) 1225-1234

DOI: 10.1145/3340531.3412026

6 Citations

6 Readers

Abstract

Hierarchically structured data are commonly represented as trees and have given rise to popular data formats like XML or JSON. An interesting query computes the difference between two versions of a tree, expressed as the minimum set of node edits (deletion, insertion, label rename) that transform one tree into another, commonly known as the tree edit distance. Unfortunately, the fastest tree edit distance algorithms run in cubic time and quadratic space and are therefore not feasible for large inputs. In this paper, we leverage the fact that the difference between two versions of a tree is typically much smaller than the overall tree size. We propose a new tree edit distance algorithm that is linear in the tree size for similar trees. Our algorithm is based on the new concept of top node pairs and avoids redundant distance computations, the main issue with previous solutions for tree diffs. We empirically evaluate the runtime of our algorithm on large synthetic and real-world trees; our algorithm clearly outperforms the state of the art, often by orders of magnitude.
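To make the three node edits in the abstract concrete, the following is a minimal sketch of the textbook recursive definition of the tree edit distance for ordered, labeled trees with unit costs. It is not the algorithm proposed in the paper: it is only practical for tiny inputs and does not use top node pairs. The tree encoding (a `(label, children)` tuple) and the names `size`, `forest_dist`, and `tree_edit_distance` are assumptions chosen for illustration.

```python
from functools import lru_cache

# Assumed encoding: a tree is (label, children), where children is a tuple
# of trees; a forest is a tuple of trees.


def size(forest):
    """Number of nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_dist(f, g):
    """Unit-cost edit distance between two ordered forests."""
    if not f and not g:
        return 0
    if not f:
        return size(g)  # insert every node of g
    if not g:
        return size(f)  # delete every node of f
    (v_label, v_kids), (w_label, w_kids) = f[-1], g[-1]
    # Delete the rightmost root of f; its children take its place.
    delete = forest_dist(f[:-1] + v_kids, g) + 1
    # Insert the rightmost root of g; its children take its place.
    insert = forest_dist(f, g[:-1] + w_kids) + 1
    # Map the two rightmost roots onto each other (rename if labels differ).
    rename = (
        forest_dist(f[:-1], g[:-1])
        + forest_dist(v_kids, w_kids)
        + (v_label != w_label)
    )
    return min(delete, insert, rename)


def tree_edit_distance(t1, t2):
    """Edit distance between two single trees."""
    return forest_dist((t1,), (t2,))


# Two versions of a small tree that differ in a single leaf label.
old = ("a", (("b", ()), ("c", ())))
new = ("a", (("b", ()), ("d", ())))
print(tree_edit_distance(old, new))
```

Memoizing on whole subforests keeps the sketch short but does nothing to address the cubic-time, quadratic-space cost that the abstract attributes to the fastest general algorithms.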

Cite (APA)

Pawlik, M., & Augsten, N. (2020). Minimal Edit-Based Diffs for Large Trees. In International Conference on Information and Knowledge Management, Proceedings (pp. 1225–1234). Association for Computing Machinery. https://doi.org/10.1145/3340531.3412026
