On the use of comparable corpora to improve SMT performance

Sadaf Abdul-Rauf; Holger Schwenk

Conference Proceedings

EACL 2009 - 12th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings (2009) 16-23

DOI: 10.3115/1609067.1609068

Abstract

We present a simple and effective method for extracting parallel sentences from comparable corpora. We employ a statistical machine translation (SMT) system built from small amounts of parallel text to translate the source side of the non-parallel corpus. The target-side texts are used, along with other corpora, in the language model of this SMT system. We then use information retrieval techniques and simple filters to create French/English parallel data from comparable news corpora. We evaluate the quality of the extracted data by showing that it significantly improves the performance of an SMT system. © 2009 Association for Computational Linguistics.
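The translate, retrieve, and filter steps described in the abstract can be illustrated with a minimal sketch. The abstract does not name the retrieval model or the filters, so everything concrete here is an assumption made for illustration: TF-IDF cosine similarity stands in for the information retrieval step, a word error rate threshold stands in for the "simple filters", and the names `extract_pairs` and `max_wer` are hypothetical. The SMT system is not modelled; its output is passed in as `translated_sents`, and the language model step is omitted.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for tokenised sentences."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}
    vecs = [{t: f * idf[t] for t, f in Counter(doc).items()} for doc in docs]
    return vecs, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def word_error_rate(hyp, ref):
    """Word-level Levenshtein distance divided by the reference length."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (h != r)))
        prev = cur
    return prev[-1] / max(len(ref), 1)


def extract_pairs(source_sents, translated_sents, target_sents, max_wer=0.5):
    """Pair each source sentence with its best-retrieved target sentence.

    translated_sents[i] is assumed to be the SMT output for source_sents[i].
    A pair is kept only if the WER between the translation and the retrieved
    candidate is at most max_wer (an assumed, illustrative filter).
    """
    tgt_tok = [s.lower().split() for s in target_sents]
    tgt_vecs, idf = tfidf_vectors(tgt_tok)
    pairs = []
    for src, hyp in zip(source_sents, translated_sents):
        hyp_tok = hyp.lower().split()
        # Query terms unseen on the target side get zero weight.
        query = {t: f * idf.get(t, 0.0) for t, f in Counter(hyp_tok).items()}
        best = max(
            range(len(tgt_vecs)),
            key=lambda k: cosine(query, tgt_vecs[k]),
            default=None,
        )
        if best is None:
            continue
        if word_error_rate(hyp_tok, tgt_tok[best]) <= max_wer:
            pairs.append((src, target_sents[best]))
    return pairs
```

Called with French source sentences, their automatic English translations, and a pool of English sentences from the comparable corpus, `extract_pairs` returns the French/English pairs that pass the filter. The threshold of 0.5 is arbitrary and would need tuning on real data.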

Citation (APA)

Abdul-Rauf, S., & Schwenk, H. (2009). On the use of comparable corpora to improve SMT performance. In EACL 2009 - 12th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings (pp. 16–23). Association for Computational Linguistics (ACL). https://doi.org/10.3115/1609067.1609068
