Statistical approach to noisy-parallel and comparable corpora filtering for the extraction of bi-lingual equivalent data at sentence-level

Abstract

Text alignment and text quality are critical to the accuracy of Machine Translation (MT) systems and other text processing tasks that require bilingual data. In this study, we propose a language-independent bi-sentence filtering approach based on Polish-to-English translation. The approach was developed using a noisy TED Talks corpus and tested on a Wikipedia-based comparable corpus; however, it can be extended to any text domain or language pair. The proposed method uses various statistical measures for sentence comparison and can also be used for in-domain data adaptation tasks. Parameter adaptation was used to minimize data loss. We discuss the improvement in MT system scores obtained with text processed by our tool and present in-domain data adaptation results. We also discuss measures to improve performance, such as bootstrapping and comparison model pruning. The results show that the filtering yields a significant improvement in MT quality.

Cite

Wołk, K., Zawadzka, E., & Wołk, A. (2018). Statistical approach to noisy-parallel and comparable corpora filtering for the extraction of bi-lingual equivalent data at sentence-level. In Advances in Intelligent Systems and Computing (Vol. 745, pp. 797–812). Springer Verlag. https://doi.org/10.1007/978-3-319-77703-0_79
