Top-k Tree Similarity Join

Jianhua Wang; Jianye Yang; Wenjie Zhang

Conference Proceedings, Open Access

International Conference on Information and Knowledge Management, Proceedings (2021) 1939-1948

DOI: 10.1145/3459637.3482304


Abstract

Tree similarity join is useful for analyzing tree-structured data. The traditional threshold-based tree similarity join requires users to specify a similarity threshold, which is usually difficult to choose. To remedy this issue, we study the problem of top-k tree similarity join. Given a collection of trees and a parameter k, the top-k tree similarity join finds the k tree pairs with the minimum tree edit distance (TED). Although we show that this problem can be solved using the threshold-based join, the efficiency of that approach is unsatisfactory. In this paper, we propose an efficient algorithm, TopKTJoin, which generates candidate tree pairs incrementally using an inverted index. We also derive a TED lower bound for the unseen tree pairs. Combined with the TED value of the k-th best join result seen so far, this bound lets the algorithm terminate early without missing any correct results. To further improve efficiency, we propose two optimization techniques, one for the index structure and one for the verification mechanism. We conduct comprehensive performance studies on real and synthetic datasets. The experimental results demonstrate that TopKTJoin significantly outperforms the baseline method.
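
The early-termination idea in the abstract can be illustrated with a minimal Python sketch. This is not TopKTJoin: the abstract does not give the inverted index or the paper's lower bound, so the sketch swaps in assumptions of its own. It enumerates all pairs up front, orders them by the size-difference lower bound (the unit-cost TED of two trees is at least the difference in their node counts), and computes TED with a plain memoized recursion that is only practical for small trees. The tree encoding and the names `size`, `forest_dist`, `ted`, and `top_k_join` are hypothetical.

```python
import heapq
from functools import lru_cache
from itertools import combinations

# Assumed encoding: a tree is (label, children), children a tuple of trees.

def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree[1])

@lru_cache(maxsize=None)
def forest_dist(f, g):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f and not g:
        return 0
    if not f:
        return sum(size(t) for t in g)  # insert every node of g
    if not g:
        return sum(size(t) for t in f)  # delete every node of f
    (lv, cv), (lw, cw) = f[-1], g[-1]   # rightmost roots and their children
    return min(
        forest_dist(f[:-1] + cv, g) + 1,  # delete rightmost root of f
        forest_dist(f, g[:-1] + cw) + 1,  # insert rightmost root of g
        # match the two roots, renaming if the labels differ
        forest_dist(f[:-1], g[:-1]) + forest_dist(cv, cw) + (lv != lw),
    )

def ted(t1, t2):
    """Tree edit distance between two ordered labeled trees."""
    return forest_dist((t1,), (t2,))

def top_k_join(trees, k):
    """Return up to k (distance, i, j) triples with the smallest TED."""
    sizes = [size(t) for t in trees]

    def lower_bound(pair):
        i, j = pair
        return abs(sizes[i] - sizes[j])

    # Visit candidate pairs in ascending order of the lower bound.
    candidates = sorted(combinations(range(len(trees)), 2), key=lower_bound)
    heap = []  # max-heap through negated distances: (-dist, i, j)
    for i, j in candidates:
        # Every unseen pair has TED >= its lower bound >= this lower bound,
        # so none can beat the current k-th best result: stop early.
        if len(heap) == k and lower_bound((i, j)) >= -heap[0][0]:
            break
        d = ted(trees[i], trees[j])
        if len(heap) < k:
            heapq.heappush(heap, (-d, i, j))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, i, j))
    return sorted((-nd, i, j) for nd, i, j in heap)

# Usage on four small trees (made up for illustration).
a = ("a", (("b", ()), ("c", ())))
b = ("a", (("b", ()), ("d", ())))
c = ("a", (("b", ()),))
d = ("x", (("y", (("z", ()), ("w", ()), ("v", ()))),))
result = top_k_join([a, b, c, d], 1)  # [(1, 0, 1)]
```

With k = 1, the loop verifies only the pair (0, 1) and then stops, because the next candidate already has a lower bound equal to the best distance found. Ties at the k-th distance are broken arbitrarily by visiting order.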

Cite (APA)

Wang, J., Yang, J., & Zhang, W. (2021). Top-k Tree Similarity Join. In International Conference on Information and Knowledge Management, Proceedings (pp. 1939–1948). Association for Computing Machinery. https://doi.org/10.1145/3459637.3482304
