Extracting In-domain Training Corpora for Neural Machine Translation Using Data Selection Methods

Abstract

Data selection is the process of choosing a subset of parallel data for training machine translation (MT) systems, so that 1) fewer training resources are needed, 2) the trained models perform better than those trained on the whole corpus, and/or 3) the trained models are better tailored to specific domains. For statistical MT (SMT), data selection has been shown to improve MT performance significantly. In this study, we review three data selection approaches for MT, namely Term Frequency–Inverse Document Frequency, Cross-Entropy Difference and Feature Decay Algorithm, and conduct experiments on Neural Machine Translation (NMT) using the data selected by each of the three approaches. The results show that data selection also improves the performance of NMT systems, though the gain is smaller than for SMT systems.
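
As an illustration of one of the three approaches named above, the following is a minimal sketch of Cross-Entropy Difference selection: each candidate sentence is scored by its cross-entropy under an in-domain language model minus its cross-entropy under a general-domain one, and the lowest-scoring sentences are kept. It assumes whitespace tokenization and unigram language models with add-one smoothing, which are simplifications chosen for brevity; the abstract does not state which language models or settings the authors used. The function names (`train_unigram`, `cross_entropy`, `select_by_ced`) and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Return a function giving add-one-smoothed unigram log-probabilities."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen tokens

    def logprob(tok):
        return math.log((counts[tok] + 1) / (total + vocab))

    return logprob


def cross_entropy(sentence, logprob):
    """Per-token cross-entropy (in nats) of a sentence under a unigram model."""
    toks = sentence.split()
    return -sum(logprob(t) for t in toks) / max(len(toks), 1)


def select_by_ced(candidates, in_domain, general, k):
    """Keep the k candidates with the lowest cross-entropy difference."""
    lp_in = train_unigram(in_domain)
    lp_gen = train_unigram(general)
    ranked = sorted(
        candidates,
        key=lambda s: cross_entropy(s, lp_in) - cross_entropy(s, lp_gen),
    )
    return ranked[:k]


# Toy data (assumed for illustration only).
in_domain = ["the patient received a dose", "dose of the drug"]
general = ["the match ended in a draw", "the patient fan cheered", "stocks fell today"]
candidates = ["the drug dose was high", "the team won the match"]

selected = select_by_ced(candidates, in_domain, general, k=1)
print(selected)
```

For parallel data, the same score could be computed on the source side only or summed over both language sides; which variant fits best depends on the setup and is not specified here.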

Cite

Silva, C. C., Liu, C. H., Poncelas, A., & Way, A. (2018). Extracting In-domain Training Corpora for Neural Machine Translation Using Data Selection Methods. In WMT 2018 - 3rd Conference on Machine Translation, Proceedings of the Conference (Vol. 1, pp. 224–231). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w18-6323
