Optimal Word Segmentation for Neural Machine Translation into Dravidian Languages

Abstract

Dravidian languages, such as Kannada and Tamil, are notoriously difficult for state-of-the-art neural models to translate. This stems from the fact that these languages are both morphologically very rich and low-resourced. In this paper, we focus on subword segmentation and evaluate Linguistically Motivated Vocabulary Reduction (LMVR) against the more commonly used SentencePiece (SP) for the task of translating from English into four different Dravidian languages. Additionally, we investigate the optimal subword vocabulary size for each language. We find that SP is the overall best choice for segmentation, and that larger subword vocabulary sizes lead to higher translation quality.
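The comparison described above hinges on training a subword model with a chosen vocabulary size and applying it to the target-side text before NMT training. The sketch below illustrates this step with the SentencePiece Python API only; it is not the authors' exact configuration, and the file names and the vocabulary size shown are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the paper's exact pipeline):
# train a SentencePiece model on target-side data and segment a sentence.
import sentencepiece as spm

# Train a unigram SentencePiece model; the paper compares several
# vocabulary sizes per language, of which 16000 is one plausible value.
spm.SentencePieceTrainer.train(
    input="train.kn",        # hypothetical Kannada-side training file
    model_prefix="spm_kn",   # writes spm_kn.model and spm_kn.vocab
    vocab_size=16000,        # illustrative subword vocabulary size
    model_type="unigram",
)

# Apply the trained model to segment text into subword units.
sp = spm.SentencePieceProcessor(model_file="spm_kn.model")
print(sp.encode("example target sentence", out_type=str))
```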

Citation (APA)

Dhar, P., Bisazza, A., & van Noord, G. (2021). Optimal Word Segmentation for Neural Machine Translation into Dravidian Languages. In WAT 2021 - 8th Workshop on Asian Translation, Proceedings of the Workshop (pp. 181–190). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.wat-1.21
