Morphological segmentation method for Turkic language neural machine translation

Abstract

Dictionaries play an important role in neural machine translation (NMT). However, a large dictionary requires a significant amount of memory, which limits the application of NMT and can cause memory errors. This limitation can be addressed by segmenting each word of the parallel source corpora into morphemes. This study therefore introduces a new morphological segmentation approach for Turkic languages based on the complete set of endings (CSE), which reduces the vocabulary size of the source corpora. We demonstrate the proposed CSE-based morphological segmentation method for the Kazakh, Kyrgyz, and Uzbek languages and present the results of computational NMT experiments for the Kazakh language. The NMT experiments show that, in comparison with byte-pair encoding (BPE)-based segmentation, the proposed CSE-based segmentation increases the bilingual evaluation understudy (BLEU) score by 0.5 and 0.2 points on average for the Kazakh–English and English–Kazakh pairs, respectively. Furthermore, in comparison with BPE-based segmentation, the proposed CSE-based segmentation reduces the vocabulary size in NMT by more than a factor of two. This property of the proposed segmentation approach will be crucial for NMT as the size of the source corpora is increased to improve translation quality.
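
To make the idea of ending-based segmentation concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes that each word is split into a stem and at most one ending by matching the longest ending found in a lookup set. The names `TOY_ENDINGS`, `segment`, and `vocab_sizes` are hypothetical, and the three-item ending list is a toy stand-in for the complete set of endings described in the abstract.

```python
# Toy ending inventory: a hypothetical three-item stand-in for a
# complete set of endings (plural, locative, plural + locative).
TOY_ENDINGS = {"тар", "та", "тарда"}


def segment(word, endings=TOY_ENDINGS, min_stem=2):
    """Split a word into (stem, ending) using the longest matching ending.

    Cut points are tried from left to right, so the first match found
    is the longest ending. Words with no known ending are returned whole.
    """
    for cut in range(min_stem, len(word)):
        if word[cut:] in endings:
            return word[:cut], word[cut:]
    return word, ""


def vocab_sizes(words):
    """Return (word-level vocabulary size, segment-level vocabulary size)."""
    word_vocab = set(words)
    segment_vocab = set()
    for w in words:
        stem, ending = segment(w)
        segment_vocab.add(stem)
        if ending:
            segment_vocab.add(ending)
    return len(word_vocab), len(segment_vocab)


# Illustrative Kazakh word forms: three stems, each bare and with three endings.
stems = ["кітап", "сабақ", "хат"]
corpus = [s + e for s in stems for e in ["", "та", "тар", "тарда"]]

print(segment("кітаптарда"))
print(vocab_sizes(corpus))
```

In this toy corpus, twelve distinct word forms reduce to six distinct segments (three stems and three endings), which illustrates why segmenting words into stems and endings shrinks the vocabulary. The numbers say nothing about the reductions reported in the article.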

Cite

Tukeyev, U., Karibayeva, A., & Zhumanov, Z. H. (2020). Morphological segmentation method for Turkic language neural machine translation. Cogent Engineering, 7(1). https://doi.org/10.1080/23311916.2020.1856500
