How Suitable Are Subword Segmentation Strategies for Translating Non-Concatenative Morphology?

Chantal Amrhein; Rico Sennrich

Conference Proceedings, Open Access

Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2021 (2021) 689-705

DOI: 10.18653/v1/2021.findings-emnlp.60

Abstract

Data-driven subword segmentation has become the default strategy for open-vocabulary machine translation and other NLP tasks, but may not be sufficiently generic for optimal learning of non-concatenative morphology. We design a test suite to evaluate segmentation strategies on different types of morphological phenomena in a controlled, semi-synthetic setting. In our experiments, we compare how well machine translation models trained at the subword and character level can translate these morphological phenomena. We find that learning to analyse and generate morphologically complex surface representations is still challenging, especially for non-concatenative morphological phenomena like reduplication or vowel harmony and for rare word stems. Based on our results, we recommend that novel text representation strategies be tested on a range of typologically diverse languages to minimise the risk of adopting a strategy that inadvertently disadvantages certain languages.

Cite

Amrhein, C., & Sennrich, R. (2021). How Suitable Are Subword Segmentation Strategies for Translating Non-Concatenative Morphology? In Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2021 (pp. 689–705). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.findings-emnlp.60
