Improving machine translation performance by exploiting non-parallel corpora

Dragos Stefan Munteanu; Daniel Marcu

Journal Article, Open Access

Computational Linguistics (2005) 31(4) 477-504

DOI: 10.1162/089120105775299168

315 Citations

207 Readers

Abstract

We present a novel method for discovering parallel sentences in comparable, non-parallel corpora. We train a maximum entropy classifier that, given a pair of sentences, can reliably determine whether or not they are translations of each other. Using this approach, we extract parallel data from large Chinese, Arabic, and English non-parallel newspaper corpora. We evaluate the quality of the extracted data by showing that it improves the performance of a state-of-the-art statistical machine translation system. We also show that a good-quality MT system can be built from scratch by starting with a very small parallel corpus (100,000 words) and exploiting a large non-parallel corpus. Thus, our method can be applied with great benefit to language pairs for which only scarce resources are available. © 2006 Association for Computational Linguistics.
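The classification step described in the abstract can be illustrated with a minimal sketch. A binary maximum entropy classifier is equivalent to logistic regression, so the example below trains one by stochastic gradient ascent on a few sentence pairs. Everything specific here is an assumption made for illustration and is not taken from the article: the toy French-English lexicon `LEXICON`, the two features (lexicon coverage and length ratio), the training pairs, and the names `features`, `train`, and `is_parallel`.

```python
import math

# Hypothetical toy bilingual lexicon (illustrative only, not from the article).
LEXICON = {
    "maison": {"house", "home"},
    "chat": {"cat"},
    "noir": {"black"},
    "le": {"the"},
    "la": {"the"},
    "est": {"is"},
}


def features(src, tgt):
    """Map a sentence pair to a small feature vector (assumed features)."""
    s, t = src.lower().split(), tgt.lower().split()
    tset = set(t)
    # Fraction of source words with at least one lexicon translation in the target.
    covered = sum(1 for w in s if LEXICON.get(w, set()) & tset)
    coverage = covered / len(s) if s else 0.0
    # Ratio of the shorter sentence length to the longer one, in [0, 1].
    length_ratio = min(len(s), len(t)) / max(len(s), len(t)) if s and t else 0.0
    return [1.0, coverage, length_ratio]  # leading 1.0 is the bias term


def prob_parallel(weights, x):
    """Probability that the pair is parallel under the log-linear model."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs, labels, epochs=2000, lr=0.5):
    """Fit the weights by stochastic gradient ascent on the log-likelihood."""
    xs = [features(s, t) for s, t in pairs]
    weights = [0.0] * len(xs[0])
    for _ in range(epochs):
        for x, y in zip(xs, labels):
            p = prob_parallel(weights, x)
            for i, xi in enumerate(x):
                weights[i] += lr * (y - p) * xi
    return weights


# Tiny hand-made training set (assumption): 1 = translations, 0 = not.
TRAIN_PAIRS = [
    ("le chat est noir", "the cat is black"),
    ("la maison est noir", "the house is black"),
    ("le chat", "the cat"),
    ("le chat est noir", "stocks fell sharply on monday after the announcement"),
    ("la maison", "he said"),
    ("le chat est noir", "rain fell all night"),
]
TRAIN_LABELS = [1, 1, 1, 0, 0, 0]

WEIGHTS = train(TRAIN_PAIRS, TRAIN_LABELS)


def is_parallel(src, tgt, threshold=0.5):
    """Classify a candidate pair with the trained weights."""
    return prob_parallel(WEIGHTS, features(src, tgt)) >= threshold
```

In a setting like the one the abstract describes, such a classifier would be applied to candidate sentence pairs drawn from comparable documents, and the pairs it accepts would be added to the parallel training data. The feature set and candidate selection used in the article itself are not reproduced here.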

Citation (APA)

Munteanu, D. S., & Marcu, D. (2005). Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4), 477–504. https://doi.org/10.1162/089120105775299168
