BPE vs. Morphological Segmentation: A Case Study on Machine Translation of Four Polysynthetic Languages

Abstract

Morphologically rich polysynthetic languages present a challenge for NLP systems due to data sparsity, and a common strategy for handling this issue is to apply subword segmentation. We investigate a wide variety of supervised and unsupervised morphological segmentation methods for four polysynthetic languages: Nahuatl, Raramuri, Shipibo-Konibo, and Wixarika. We then compare these morphologically inspired segmentation methods against Byte-Pair Encodings (BPEs) as inputs for machine translation (MT) when translating to and from Spanish. We show that, for all language pairs except Nahuatl, an unsupervised morphological segmentation algorithm consistently outperforms BPEs, and that supervised methods, although they achieve better segmentation scores, underperform in MT. Finally, we contribute two new morphological segmentation datasets for Raramuri and Shipibo-Konibo, and a parallel corpus for Raramuri-Spanish.
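For readers unfamiliar with the BPE baseline named in the abstract, the following is a minimal sketch of the general technique: it learns merges by repeatedly fusing the most frequent adjacent symbol pair, then applies those merges to segment a word. It is not the authors' implementation or any particular library's API. The function names (`learn_bpe`, `merge_pair`, `segment`), the `</w>` end-of-word marker, the tie-breaking rule, and the toy input strings are all assumptions made for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy version)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # (an arbitrary but deterministic choice made for this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word by applying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Made-up toy strings, not data from the paper.
toy_words = ["abab", "abab", "ab"]
toy_merges = learn_bpe(toy_words, 2)
print(segment("abab", toy_merges))
```

Because the merges are driven purely by pair frequency, the resulting subword boundaries need not coincide with morpheme boundaries, which is the contrast with morphological segmentation that the abstract examines.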

Cite

Mager, M., Oncevay, A., Mager, E., Kann, K., & Vu, N. T. (2022). BPE vs. Morphological Segmentation: A Case Study on Machine Translation of Four Polysynthetic Languages. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 961–971). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.findings-acl.78
