Analyzing Cognitive Plausibility of Subword Tokenization

Abstract

Subword tokenization has become the de facto standard for tokenization, although comparative evaluations of subword vocabulary quality across languages are scarce. Existing evaluation studies focus on the effect of a tokenization algorithm on performance in downstream tasks, or on engineering criteria such as the compression rate. We present a new evaluation paradigm that focuses on the cognitive plausibility of subword tokenization. We analyze the correlation of tokenizer output with the response time and accuracy of human performance on a lexical decision task. We compare three tokenization algorithms across several languages and vocabulary sizes. Our results indicate that, in contrast with prior work, the UnigramLM algorithm yields less cognitively plausible tokenization behavior and worse coverage of derivational morphemes.

Citation (APA)

Beinborn, L., & Pinter, Y. (2023). Analyzing Cognitive Plausibility of Subword Tokenization. In EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings (pp. 4478–4486). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.emnlp-main.272
