The distributions of orthographic word types differ widely across languages due to typological characteristics, different writing traditions, and other factors. This cross-linguistic diversity remains a major challenge for NLP, and for the study of language more generally. We use BPE and information-theoretic measures to investigate whether these distributions become more similar at specific levels of subword tokenization. We perform a cross-linguistic comparison, following incremental BPE merges (from characters to words) for 47 diverse languages. We show that text entropy values (a feature of probability distributions) converge at specific subword levels: relatively few BPE merges (around 200 for our corpus) lead to the most similar distributions across languages. Additionally, we analyze the interaction between subword- and word-level distributions and show that our findings can be interpreted in light of the ongoing discussion about different types of morphological complexity.
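The procedure described above, iteratively applying BPE merges starting from characters and measuring the entropy of the resulting token distribution at each merge level, can be sketched in Python. This is a minimal toy illustration, not the authors' implementation: the corpus, the greedy merge loop, and the `entropy` and `bpe_merges` helpers are all assumptions for demonstration.

```python
from collections import Counter
import math

def entropy(tokens):
    """Shannon entropy (in bits) of the empirical token distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def bpe_merges(words, num_merges):
    """Toy BPE: start from single characters and greedily merge the most
    frequent adjacent symbol pair num_merges times; return the flat token
    sequence for the whole corpus."""
    corpus = [list(w) for w in words]
    for _ in range(num_merges):
        pairs = Counter()
        for w in corpus:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:  # every word is already a single token
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged = []
        for w in corpus:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged.append(out)
        corpus = merged
    return [t for w in corpus for t in w]

# Tiny illustrative corpus (an assumption, not the paper's data):
words = "the cat sat on the mat the cat ran".split()
for n in (0, 5, 20):
    print(n, round(entropy(bpe_merges(words, n)), 3))
```

At 0 merges this measures character-level entropy; with enough merges every word becomes a single token and the value coincides with word-level entropy, so sweeping `num_merges` traces the trajectory between the two that the paper compares across languages.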
Citation:
Gutierrez-Vasques, X., Bentz, C., Sozinova, O., & Samardžić, T. (2021). From characters to words: The turning point of BPE merges. In EACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference (pp. 3454–3468). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.eacl-main.302