Automatic selection of parallel data for machine translation

Abstract

Machine translation is now widely used, but the data required for training, tuning and testing a machine translation engine are often insufficient or not useful. Automatically selecting data that are qualitatively appropriate for building translation models can help improve translation accuracy. In this paper, we used a large parallel corpus of educational video lecture subtitles, as well as text posted by students and lecturers on the course fora. The text is challenging to translate because of the scientific domains involved and its informal genre. We applied a random forest classification scheme to the output of three machine translation models (one based on statistical machine translation and two on neural machine translation) in order to automatically identify the best output. The innovative challenges of this work are the unorthodox language phenomena observed, the terminology-rich scientific domains addressed in the educational video lectures, the language-independent nature of the approach, and the three-class classification problem tackled.
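To make the classification setup concrete, the following is a minimal sketch of a three-class random forest that votes on which of three systems produced the best output. It is not the authors' implementation: the abstract does not state the features, the forest size, or the toolkit used. Everything here is an assumption for illustration, including the class names ("SMT", "NMT-1", "NMT-2"), the single hypothetical quality score per system used as a feature, and the synthetic data from `make_toy_data`.

```python
import random
from collections import Counter

LABELS = ["SMT", "NMT-1", "NMT-2"]  # hypothetical class names, one per MT system

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, feature_ids):
    """Return the (feature, threshold) that lowers weighted Gini most, or None."""
    best, best_score, n = None, gini(labels), len(rows)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, rng, max_depth=5):
    """Grow one tree, considering a random subset of features at each node."""
    majority = Counter(labels).most_common(1)[0][0]
    if max_depth == 0 or len(set(labels)) == 1:
        return majority
    n_feat = len(rows[0])
    k = max(1, int(n_feat ** 0.5))  # sqrt(#features) candidates per node
    split = best_split(rows, labels, rng.sample(range(n_feat), k))
    if split is None:
        return majority
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], rng, max_depth - 1),
            build_tree([r for r, _ in right], [y for _, y in right], rng, max_depth - 1))

def predict_tree(tree, row):
    """Walk from the root to a leaf; internal nodes are tuples, leaves are labels."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def train_forest(rows, labels, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx], rng))
    return forest

def predict_forest(forest, row):
    """Majority vote over all trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]

def make_toy_data(per_class=20, seed=1):
    """Synthetic data (assumption): one quality score per system, best one is high."""
    rng = random.Random(seed)
    rows, labels = [], []
    for winner, name in enumerate(LABELS):
        for _ in range(per_class):
            row = [rng.uniform(0.05, 0.4) for _ in LABELS]
            row[winner] = rng.uniform(0.7, 0.95)
            rows.append(row)
            labels.append(name)
    return rows, labels

if __name__ == "__main__":
    rows, labels = make_toy_data()
    forest = train_forest(rows, labels)
    print(predict_forest(forest, [0.9, 0.1, 0.2]))
```

In practice an established library implementation would be the usual choice, and the features would be computed from the actual source segments and translation outputs rather than drawn at random.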

Cite

Mouratidis, D., & Kermanidis, K. L. (2018). Automatic selection of parallel data for machine translation. In IFIP Advances in Information and Communication Technology (Vol. 520, pp. 146–156). Springer New York LLC. https://doi.org/10.1007/978-3-319-92016-0_14
