How Suitable Are Subword Segmentation Strategies for Translating Non-Concatenative Morphology?

Abstract

Data-driven subword segmentation has become the default strategy for open-vocabulary machine translation and other NLP tasks, but may not be sufficiently generic for optimal learning of non-concatenative morphology. We design a test suite to evaluate segmentation strategies on different types of morphological phenomena in a controlled, semi-synthetic setting. In our experiments, we compare how well machine translation models trained at the subword level and at the character level can translate these morphological phenomena. We find that learning to analyse and generate morphologically complex surface representations is still challenging, especially for non-concatenative morphological phenomena like reduplication or vowel harmony and for rare word stems. Based on our results, we recommend that novel text representation strategies be tested on a range of typologically diverse languages to minimise the risk of adopting a strategy that inadvertently disadvantages certain languages.

Cite

Amrhein, C., & Sennrich, R. (2021). How Suitable Are Subword Segmentation Strategies for Translating Non-Concatenative Morphology? In Findings of the Association for Computational Linguistics: EMNLP 2021 (pp. 689–705). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.findings-emnlp.60
