Abstract
Neural approaches, currently state-of-the-art in many areas, have contributed significantly to recent advances in machine translation. However, Neural Machine Translation (NMT) requires a substantial quantity of good-quality parallel data to train the best model. A large amount of training data, in turn, greatly enlarges the underlying vocabulary. Because of constraints on computing resources such as system memory, several proposed methods therefore work with a relatively limited vocabulary. Encoding words as sequences of subword units, for so-called open-vocabulary translation, is an effective strategy for solving this problem. However, the conventional methods for splitting words into subwords are statistics-based approaches that mainly suit agglutinative languages, in which morphemes have relatively clean boundaries. Their applicability to fusion languages, the main focus of this article, has not yet been thoroughly investigated. In fusion languages, phonological and orthographic processes alter the boundaries of a word's constituent morphemes. This makes it difficult to identify the actual morphemes, which carry syntactic or semantic information, from the word's surface form, that is, the form of the word as it appears in the text. We therefore adopted a word segmentation method that segments words by restoring the altered morphemes. We also compared conventional and morpheme-based NMT subword models. Our results show that morpheme-based models outperform conventional subword models on a benchmark dataset.
Gezmu, A. M., & Nürnberger, A. (2023). Morpheme-Based Neural Machine Translation Models for Low-Resource Fusion Languages. ACM Transactions on Asian and Low-Resource Language Information Processing, 22(9). https://doi.org/10.1145/3610773