Prior studies in multilingual language modeling (e.g., Cotterell et al., 2018; Mielke et al., 2019) disagree on whether inflectional morphology makes languages harder to model. We attempt to resolve the disagreement and extend those studies. We compile a larger corpus of 145 Bible translations in 92 languages and a larger number of typological features. We fill in missing typological data for several languages and consider corpus-based measures of morphological complexity in addition to expert-produced typological features. We find that several morphological measures are significantly associated with higher surprisal when LSTM models are trained on BPE-segmented data. We also investigate linguistically motivated subword segmentation strategies, such as Morfessor and Finite-State Transducers (FSTs), and find that these strategies yield better performance and reduce the impact of a language's morphology on language modeling.
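To make the BPE baseline concrete, here is a minimal sketch of byte-pair-encoding subword learning in the style of Sennrich et al. (2016): repeatedly merge the most frequent adjacent symbol pair in a word-frequency vocabulary, then apply the learned merges to segment new words. The toy corpus and merge count below are illustrative assumptions, not the paper's actual training setup.

```python
import re
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge operations from a list of word tokens."""
    # Each word becomes a sequence of characters plus an end-of-word marker.
    vocab = Counter(" ".join(word) + " </w>" for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        # Merge the most frequent pair everywhere, respecting symbol boundaries.
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = Counter({pattern.sub("".join(best), w): f for w, f in vocab.items()})
        merges.append(best)
    return merges

def segment(word, merges):
    """Segment a word by applying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Illustrative toy corpus (word tokens with repetition = frequency).
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, num_merges=5)
print(segment("lowest", merges))  # -> ['low', 'est</w>']
```

Because merges are chosen purely by pair frequency, the segments BPE produces need not align with true morpheme boundaries; morphologically informed segmenters such as Morfessor or FST analyzers are designed to recover those boundaries, which is the contrast the abstract draws.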
Citation:
Park, H. H., Zhang, K. J., Haley, C., Steimel, K., Liu, H., & Schwartz, L. (2021). Morphology matters: A multilingual language modeling analysis. Transactions of the Association for Computational Linguistics, 9, 261–276. https://doi.org/10.1162/tacl_a_00365