Abstract
We propose a new method for extracting pseudo-parallel sentences from a pair of large monolingual corpora, without relying on any document-level information. Our method first uses word embeddings to efficiently evaluate trillions of candidate sentence pairs, and then applies a classifier to select the most reliable ones. We report significant improvements in domain adaptation for statistical machine translation when using a translation model trained on the sentence pairs extracted from in-domain monolingual corpora.
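The two-stage idea in the abstract (embedding-based candidate scoring, then filtering for reliability) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the sentence representation (an average of word vectors), the use of a shared cross-lingual embedding space, the toy vocabularies, the helper names, and the similarity threshold that stands in for the trained classifier. The dense similarity matrix is also only workable for toy data; at the scale the abstract mentions, some form of approximate nearest-neighbour search would presumably be needed.

```python
import numpy as np


def sentence_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)


def normalize_rows(matrix):
    """Scale each row to unit length, leaving all-zero rows unchanged."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0
    return matrix / norms


def top_k_candidates(src_sents, tgt_sents, src_emb, tgt_emb, dim, k=1):
    """Stage 1: score every source/target pair by cosine similarity, keep the k best per source."""
    src = normalize_rows(np.array([sentence_vector(s, src_emb, dim) for s in src_sents]))
    tgt = normalize_rows(np.array([sentence_vector(t, tgt_emb, dim) for t in tgt_sents]))
    sims = src @ tgt.T  # cosine similarities, shape (n_src, n_tgt)
    best = np.argsort(-sims, axis=1)[:, :k]
    return [(i, int(j), float(sims[i, j])) for i in range(len(src_sents)) for j in best[i]]


def keep_reliable(candidates, threshold=0.9):
    """Stage 2 stand-in: a fixed threshold replaces the trained classifier (assumption)."""
    return [c for c in candidates if c[2] >= threshold]


# Hypothetical toy embeddings, assumed to live in one shared cross-lingual space.
src_emb = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
tgt_emb = {"chat": np.array([1.0, 0.0]), "chien": np.array([0.0, 1.0])}

src_sents = [["the", "cat"], ["a", "dog"]]
tgt_sents = [["le", "chien"], ["le", "chat"]]

candidates = top_k_candidates(src_sents, tgt_sents, src_emb, tgt_emb, dim=2, k=1)
kept = keep_reliable(candidates)
for src_idx, tgt_idx, score in kept:
    print(" ".join(src_sents[src_idx]), "<->", " ".join(tgt_sents[tgt_idx]), round(score, 3))
```

In this toy setup each source sentence is paired with the target sentence whose averaged vector points in the same direction. A real second stage would replace `keep_reliable` with a classifier trained on features of known parallel and non-parallel pairs.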
Cite
Marie, B., & Fujita, A. (2017). Efficient extraction of pseudo-parallel sentences from raw monolingual data using word embeddings. In ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers) (Vol. 2, pp. 392–398). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/P17-2062