Efficient extraction of pseudo-parallel sentences from raw monolingual data using word embeddings

Abstract

We propose a new method for extracting pseudo-parallel sentences from a pair of large monolingual corpora, without relying on any document-level information. Our method first exploits word embeddings in order to efficiently evaluate trillions of candidate sentence pairs and then a classifier to find the most reliable ones. We report significant improvements in domain adaptation for statistical machine translation when using a translation model trained on the sentence pairs extracted from in-domain monolingual corpora.
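
The two-stage idea in the abstract (score candidate pairs cheaply with embeddings, then keep only the reliable ones) can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything here is an assumption: sentences are represented by averaged word vectors, source and target embeddings are assumed to already live in a shared bilingual space, similarity is cosine, and a plain score threshold stands in for the classifier. The names `sentence_vector`, `extract_pairs`, `top_k`, and `min_score` are hypothetical.

```python
import math


def sentence_vector(tokens, embeddings):
    """Average the embeddings of known tokens; return None if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def extract_pairs(src_sents, tgt_sents, src_emb, tgt_emb, top_k=1, min_score=0.9):
    """Stage 1: rank target sentences per source sentence by cosine similarity.
    Stage 2 (assumption): a threshold replaces the classifier from the abstract."""
    tgt_vecs = [(t, sentence_vector(t.split(), tgt_emb)) for t in tgt_sents]
    pairs = []
    for s in src_sents:
        s_vec = sentence_vector(s.split(), src_emb)
        if s_vec is None:
            continue  # no known word, nothing to compare
        scored = [(cosine(s_vec, t_vec), t) for t, t_vec in tgt_vecs if t_vec is not None]
        scored.sort(key=lambda item: item[0], reverse=True)
        for score, t in scored[:top_k]:
            if score >= min_score:
                pairs.append((s, t, score))
    return pairs


# Toy 2-D embeddings, invented for illustration and assumed to share one space.
src_emb = {"the": [0.5, 0.5], "cat": [1.0, 0.0], "dog": [0.0, 1.0]}
tgt_emb = {"le": [0.5, 0.5], "chat": [0.9, 0.1], "chien": [0.1, 0.9]}

pairs = extract_pairs(["the cat", "the dog"], ["le chien", "le chat"], src_emb, tgt_emb)
for src, tgt, score in pairs:
    print(src, "|", tgt, "|", round(score, 3))
```

This sketch compares every source sentence with every target sentence, which is quadratic and would not scale to the trillions of candidates mentioned in the abstract; a large-scale system would presumably need some form of approximate nearest-neighbor search or pruning, which is not shown here.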

Cite

Marie, B., & Fujita, A. (2017). Efficient extraction of pseudo-parallel sentences from raw monolingual data using word embeddings. In ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers) (Vol. 2, pp. 392–398). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/P17-2062
