A scalable index for top-k subtree similarity queries

7Citations
Citations of this article
11Readers
Mendeley users who have this article in their library.

Abstract

Given a query tree Q, the top-k subtree similarity query retrieves the k subtrees in a large document tree T that are closest to Q in terms of tree edit distance. The classical solution scans the entire document, which is slow. The state-of-the-art approach precomputes an index to reduce the query time. However, the index is large (quadratic in the document size), building the index is expensive, updates are not supported, and data-specific tuning is required. We present a scalable solution for the top-k subtree similarity problem that does not assume specific data types, nor does it require any tuning. The key idea is to process promising subtrees first. A subtree is promising if it shares many labels with the query. We develop a new technique based on inverted lists that efficiently retrieves subtrees in the required order and supports incremental updates of the document. To achieve linear space, we avoid full list materialization but build relevant parts of a list on the fly. In an extensive empirical evaluation on synthetic and real-world data, our technique consistently outperforms the state-of-the-art index w.r.t. memory usage, indexing time, and the number of candidates that must be verified. In terms of query time, we clearly outperform the state of the art and achieve runtime improvements of up to four orders of magnitude.

Cite

CITATION STYLE

APA

Kocher, D., & Augsten, N. (2019). A scalable index for top-k subtree similarity queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data (pp. 1624–1641). Association for Computing Machinery. https://doi.org/10.1145/3299869.3319892

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free