Fortification of neural morphological segmentation models for polysynthetic minimal-resource languages

32 citations · 108 Mendeley readers

Abstract

Morphological segmentation for polysynthetic languages is challenging, because a word may consist of many individual morphemes and training data can be extremely scarce. Since neural sequence-to-sequence (seq2seq) models define the state of the art for morphological segmentation in high-resource settings and for (mostly) European languages, we first show that they also obtain competitive performance for Mexican polysynthetic languages in minimal-resource settings. We then propose two novel multi-task training approaches, one with and one without the need for external unlabeled resources, and two corresponding data augmentation methods, improving over the neural baseline for all languages. Finally, we explore cross-lingual transfer as a third way to fortify our neural model and show that we can train one single multilingual model for related languages while maintaining comparable or even improved performance, thus reducing the number of parameters by close to 75%. We provide our morphological segmentation datasets for Mexicanero, Nahuatl, Wixarika and Yorem Nokki for future research.
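The abstract frames morphological segmentation as a character-level seq2seq transduction task. The sketch below (not the authors' code) illustrates one common way such training pairs can be constructed: the source is the word as a character sequence, the target inserts an explicit morpheme-boundary symbol, and an optional language-tag token is prepended for a multilingual model. The boundary symbol, the language-tag trick, and the toy English example are assumptions for illustration only; they are not taken from the paper or its datasets.

    from typing import List, Optional, Tuple

    def make_example(word: str,
                     morphemes: List[str],
                     lang_tag: Optional[str] = None) -> Tuple[List[str], List[str]]:
        """Build a (source, target) character-sequence pair for a seq2seq segmenter.

        The source is the word split into characters; the target is the same
        characters with an illustrative boundary symbol "!" between morphemes.
        A language-tag token (assumed here, a common trick for training one
        multilingual model) can be prepended to the source sequence.
        """
        src = list(word)
        if lang_tag is not None:
            src = [lang_tag] + src
        tgt: List[str] = []
        for i, morpheme in enumerate(morphemes):
            if i > 0:
                tgt.append("!")        # morpheme boundary marker
            tgt.extend(morpheme)
        return src, tgt

    if __name__ == "__main__":
        # Toy example: "untieable" segmented as un + tie + able.
        src, tgt = make_example("untieable", ["un", "tie", "able"], lang_tag="<eng>")
        print(src)  # ['<eng>', 'u', 'n', 't', 'i', 'e', 'a', 'b', 'l', 'e']
        print(tgt)  # ['u', 'n', '!', 't', 'i', 'e', '!', 'a', 'b', 'l', 'e']

Under this framing, a standard encoder-decoder model is trained to map the source character sequence to the target sequence, and the data augmentation and multi-task objectives described in the abstract operate on pairs of this form.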

Citation (APA)

Kann, K., Mager, M., Meza-Ruiz, I., & Schütze, H. (2018). Fortification of neural morphological segmentation models for polysynthetic minimal-resource languages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2018), Volume 1 (Long Papers) (pp. 47–57). Association for Computational Linguistics. https://doi.org/10.18653/v1/n18-1005
