Statistical approach to noisy-parallel and comparable corpora filtering for the extraction of bi-lingual equivalent data at sentence-level

Krzysztof Wołk; Emilia Zawadzka; Agnieszka Wołk

Conference Proceedings

Advances in Intelligent Systems and Computing (2018), vol. 745, pp. 797–812

DOI: 10.1007/978-3-319-77703-0_79

Abstract

Text alignment and text quality are critical to the accuracy of Machine Translation (MT) systems and other text processing tasks requiring bilingual data. In this study, we propose a language-independent bi-sentence filtering approach based on Polish-to-English translation. This approach was developed using a noisy TED Talks corpus and tested on a Wikipedia-based comparable corpus; however, it can be extended to any text domain or language pair. The proposed method uses various statistical measures for sentence comparison and can also be used for in-domain data adaptation tasks. Data loss was minimized through parameter adaptation. We discuss the improvement in MT system scores obtained with text processed by our tool and present in-domain data adaptation results. We also discuss measures to improve performance, such as bootstrapping and comparison model pruning. The results show significant improvement in filtering in terms of MT quality.

Cite

Wołk, K., Zawadzka, E., & Wołk, A. (2018). Statistical approach to noisy-parallel and comparable corpora filtering for the extraction of bi-lingual equivalent data at sentence-level. In Advances in Intelligent Systems and Computing (Vol. 745, pp. 797–812). Springer Verlag. https://doi.org/10.1007/978-3-319-77703-0_79
