Convolutional Embedding for Edit Distance

Xinyan Dai; Xiao Yan; Kaiwen Zhou; Yuxuan Wang; Han Yang; James Cheng

Conference Proceedings, Open Access

SIGIR 2020 - Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (2020) 599-608

DOI: 10.1145/3397271.3401045

Abstract

Edit-distance-based string similarity search has many applications such as spell correction, data de-duplication, and sequence alignment. However, computing edit distance is known to have high complexity, which makes string similarity search challenging for large datasets. In this paper, we propose a deep learning pipeline (called CNN-ED) that embeds edit distance into Euclidean distance for fast approximate similarity search. A convolutional neural network (CNN) is used to generate fixed-length vector embeddings for a dataset of strings and the loss function is a combination of the triplet loss and the approximation error. To justify our choice of using CNN instead of other structures (e.g., RNN) as the model, theoretical analysis is conducted to show that some basic operations in our CNN model preserve edit distance. Experimental results show that CNN-ED outperforms data-independent CGK embedding and RNN-based GRU embedding in terms of both accuracy and efficiency by a large margin. We also show that string similarity search can be significantly accelerated using CNN-based embeddings, sometimes by orders of magnitude.
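To make the two ingredients of the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows the standard dynamic program for edit distance (the quantity being approximated) and one plausible way to combine a triplet loss with an approximation error on embedding distances. The function names, the `margin` and `weight` parameters, and the exact form of the combined loss are assumptions made for illustration; the abstract does not specify them, and the embeddings here are plain lists of floats rather than outputs of a CNN.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def euclidean(u, v) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def combined_loss(anchor, pos, neg, ed_ap, ed_an, margin=1.0, weight=1.0):
    """Hypothetical combined loss for one (anchor, positive, negative) triplet.

    anchor, pos, neg: embedding vectors (assumed to come from some model).
    ed_ap, ed_an: true edit distances anchor-positive and anchor-negative.
    margin, weight: illustrative hyperparameters, not taken from the paper.
    """
    d_ap = euclidean(anchor, pos)
    d_an = euclidean(anchor, neg)
    # Triplet term: the positive should be closer than the negative by a margin.
    triplet = max(0.0, d_ap - d_an + margin)
    # Approximation term: embedding distances should match edit distances.
    approx = abs(d_ap - ed_ap) + abs(d_an - ed_an)
    return triplet + weight * approx
```

In this sketch the loss is zero when the embedding distances equal the true edit distances and the negative is at least `margin` farther from the anchor than the positive.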

Citation (APA)

Dai, X., Yan, X., Zhou, K., Wang, Y., Yang, H., & Cheng, J. (2020). Convolutional Embedding for Edit Distance. In SIGIR 2020 - Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 599–608). Association for Computing Machinery, Inc. https://doi.org/10.1145/3397271.3401045
