Abstract
We propose a neural model for aligning sentences across two languages. We first train the model to obtain bilingual word embeddings, then build a similarity matrix between the words of the two sentences. Because the sentences differ in length, the dimensions of this matrix vary from pair to pair. We dynamically pool the similarity matrix into a matrix of fixed dimension and use a Convolutional Neural Network (CNN) to classify each sentence pair as aligned or not. To improve on this technique, we bucket the sentence pairs to be classified into different groups and train a separate CNN for each group. Beyond solving the sentence alignment problem, our model can be regarded as a generic bag-of-words similarity measure for monolingual or bilingual corpora.
Cite
Grover, J., & Mitra, P. (2017). Bilingual word embeddings with bucketed CNN for parallel sentence extraction. In ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Student Research Workshop (pp. 11–16). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/P17-3003