Accurately computing the similarity between two texts written in different languages has tremendous value in many applications, such as cross-lingual information retrieval and cross-lingual text mining/analytics. This paper studies this important problem using neural networks, focusing specifically on neural machine translation (NMT) models. Although translation models are utilized, we pay attention not to the translations themselves but to the intermediate states that the models compute for given texts. Our assumption is that these intermediate states capture the syntactic and semantic meaning of the input texts and thus serve as a good representation of them, avoiding otherwise inevitable translation errors. To examine the validity of this assumption, we investigate the utility of the intermediate states and their effectiveness in computing cross-lingual text similarity in comparison with other neural network-based distributed representations of text, including word and paragraph embedding-based approaches. We demonstrate that an approach using the intermediate states outperforms not only these approaches but also a strong machine translation-based one. Furthermore, it is revealed that the intermediate states and the translated texts complement each other despite the fact that they are generated from the same NMT models.
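To make the idea concrete, below is a minimal sketch of computing cross-lingual similarity from NMT intermediate states. It is illustrative only: the publicly available Hugging Face model (Helsinki-NLP/opus-mt-mul-en), the mean pooling over encoder hidden states, and cosine similarity are our assumptions for demonstration, not the specific models or aggregation used in the paper.

```python
# Illustrative sketch: cross-lingual similarity from NMT encoder states.
# Model choice and mean pooling are assumptions, not the paper's exact setup.
import torch
import torch.nn.functional as F
from transformers import MarianMTModel, MarianTokenizer

MODEL_NAME = "Helsinki-NLP/opus-mt-mul-en"  # multilingual-source NMT model
tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
encoder = MarianMTModel.from_pretrained(MODEL_NAME).get_encoder()
encoder.eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pool the encoder's last hidden states over non-padding tokens."""
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        states = encoder(**batch).last_hidden_state  # (1, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)     # (1, seq_len, 1)
    return (states * mask).sum(1) / mask.sum(1)      # (1, dim)

# Texts in two languages projected into the shared encoder space.
sim = F.cosine_similarity(embed("The cat sat on the mat."),
                          embed("Die Katze sass auf der Matte.")).item()
print(f"cross-lingual cosine similarity: {sim:.3f}")
```

Because both texts pass through the same multilingual encoder, their pooled hidden states live in a shared space and can be compared directly, without first producing (and potentially mistranslating) an explicit translation.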
Seki, K. (2019). On cross-lingual text similarity using neural translation models. Journal of Information Processing, 27, 315–321. https://doi.org/10.2197/ipsjjip.27.315