Efficient extraction of pseudo-parallel sentences from raw monolingual data using word embeddings

Benjamin Marie; Atsushi Fujita

Conference Proceedings

ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers) (2017), Vol. 2, pp. 392-398

DOI: 10.18653/v1/P17-2062

Abstract

We propose a new method for extracting pseudo-parallel sentences from a pair of large monolingual corpora, without relying on any document-level information. Our method first exploits word embeddings in order to efficiently evaluate trillions of candidate sentence pairs and then a classifier to find the most reliable ones. We report significant improvements in domain adaptation for statistical machine translation when using a translation model trained on the sentence pairs extracted from in-domain monolingual corpora.
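The two-stage idea in the abstract (score candidate pairs cheaply with word embeddings, then keep only the most reliable ones) can be illustrated with a minimal sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the sentence representation (an average of word vectors), the toy embeddings placed in a shared bilingual space, the brute-force comparison of all pairs, and the fixed `threshold` that stands in for the classifier. The names `sentence_vector`, `cosine`, and `extract_candidates` are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def sentence_vector(tokens: List[str], embeddings: Dict[str, Vector], dim: int) -> Vector:
    """Average the embeddings of known tokens; return a zero vector if none are known."""
    total = [0.0] * dim
    known = 0
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is not None:
            total = [a + b for a, b in zip(total, vec)]
            known += 1
    return [x / known for x in total] if known else total


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; defined as 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def extract_candidates(
    src_sents: List[List[str]],
    tgt_sents: List[List[str]],
    src_emb: Dict[str, Vector],
    tgt_emb: Dict[str, Vector],
    dim: int,
    top_k: int = 1,
    threshold: float = 0.9,
) -> List[Tuple[int, int, float]]:
    """Return (source index, target index, score) for the best-scoring pairs.

    Assumption: both embedding tables live in one shared bilingual space.
    The threshold is a simple stand-in for a trained classifier.
    """
    tgt_vecs = [sentence_vector(t, tgt_emb, dim) for t in tgt_sents]
    pairs = []
    for i, sent in enumerate(src_sents):
        src_vec = sentence_vector(sent, src_emb, dim)
        scored = sorted(
            ((cosine(src_vec, tv), j) for j, tv in enumerate(tgt_vecs)),
            reverse=True,
        )
        for score, j in scored[:top_k]:
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


# Toy data (invented for this sketch): 2-dimensional shared embeddings.
SRC_EMB = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
TGT_EMB = {"chat": [1.0, 0.0], "chien": [0.0, 1.0]}
PAIRS = extract_candidates([["cat"], ["dog"]], [["chien"], ["chat"]], SRC_EMB, TGT_EMB, dim=2)
print(PAIRS)
```

The sketch compares every source sentence with every target sentence, which is only workable for toy inputs. At the scale mentioned in the abstract, the candidate search would need something far cheaper than exhaustive scoring, and the filtering step would be a trained classifier rather than a fixed cutoff.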

Cite

Marie, B., & Fujita, A. (2017). Efficient extraction of pseudo-parallel sentences from raw monolingual data using word embeddings. In ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers) (Vol. 2, pp. 392–398). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/P17-2062
