Data-driven subword segmentation has become the default strategy for open-vocabulary machine translation and other NLP tasks, but may not be sufficiently generic for optimal learning of non-concatenative morphology. We design a test suite to evaluate segmentation strategies on different types of morphological phenomena in a controlled, semi-synthetic setting. In our experiments, we compare how well machine translation models trained at the subword level and at the character level can translate these morphological phenomena. We find that learning to analyse and generate morphologically complex surface representations is still challenging, especially for non-concatenative morphological phenomena like reduplication or vowel harmony and for rare word stems. Based on our results, we recommend that novel text representation strategies be tested on a range of typologically diverse languages to minimise the risk of adopting a strategy that inadvertently disadvantages certain languages.
Amrhein, C., & Sennrich, R. (2021). How Suitable Are Subword Segmentation Strategies for Translating Non-Concatenative Morphology? In Findings of the Association for Computational Linguistics: EMNLP 2021 (pp. 689–705). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.findings-emnlp.60