Optimizing the Size of Subword Vocabularies in Dialect Classification

Abstract

Pre-trained models usually come with a predefined tokenization and little flexibility as to what subword tokens can be used in downstream tasks. This problem especially concerns multilingual NLP and low-resource languages, which are typically processed using cross-lingual transfer. In this paper, we aim to find out whether the right granularity of tokenization helps in a text classification task, namely dialect classification. Aiming at generalizations beyond the studied cases, we look for the optimal granularity in four dialect datasets, two with relatively consistent writing (one Arabic and one Indo-Aryan set) and two with considerably inconsistent writing (one Arabic and one Swiss German set). To gain more control over subword tokenization and ensure direct comparability in the experimental settings, we train a CNN classifier from scratch, comparing two subword tokenization methods (Unigram model and BPE). For reference, we compare the results obtained in our analysis to the state of the art achieved by fine-tuning pre-trained models. We show that models trained from scratch with an optimal tokenization level perform better than fine-tuned classifiers in the case of highly inconsistent writing. In the case of relatively consistent writing, fine-tuned models remain better regardless of the tokenization level.
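
The abstract compares the Unigram model and BPE at different levels of granularity. As a purely illustrative sketch (not the authors' code), the following Python snippet shows how such a vocabulary-size sweep could be set up with the SentencePiece library, which implements both tokenization methods; the corpus file name and the candidate sizes are hypothetical placeholders.

# Illustrative sketch: sweep subword vocabulary sizes for the two
# tokenization methods compared in the paper (Unigram model and BPE).
# The corpus path and vocabulary sizes below are hypothetical placeholders.
import sentencepiece as spm

CORPUS = "dialect_train.txt"                  # hypothetical corpus, one sentence per line
VOCAB_SIZES = [500, 1000, 2000, 4000, 8000]   # candidate granularities to compare

for model_type in ("unigram", "bpe"):
    for vocab_size in VOCAB_SIZES:
        prefix = f"{model_type}_{vocab_size}"
        # Train a subword model of the given type and size.
        spm.SentencePieceTrainer.train(
            input=CORPUS,
            model_prefix=prefix,
            vocab_size=vocab_size,
            model_type=model_type,
            character_coverage=1.0,            # keep all characters; relevant for non-Latin scripts
        )
        sp = spm.SentencePieceProcessor(model_file=f"{prefix}.model")
        # The tokenized output would then feed a classifier trained from scratch
        # (a CNN in the paper's setting), and validation accuracy on the dialect
        # labels would determine the optimal vocabulary size.
        print(prefix, sp.encode("example dialect sentence", out_type=str))

In the setup described in the abstract, each trained tokenizer yields a differently segmented version of the same data, so the classification results can be compared across granularities under otherwise identical conditions.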

Citation (APA)

Kanjirangat, V., Samardžić, T., Dolamic, L., & Rinaldi, F. (2023). Optimizing the Size of Subword Vocabularies in Dialect Classification. In ACL 2023 - 10th Workshop on NLP for Similar Languages, Varieties and Dialects, VarDial 2023 - Proceedings of the Workshop (pp. 14–30). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.vardial-1.2
