Abstract
This paper addresses the problem of incremental dialect identification. Our goal is to reliably determine the dialect before the full utterance is given as input. Most previous research on dialect identification has been model-centric, focusing on performance. We address a new question: how much input is needed to identify a dialect? Our approach is a data-centric analysis that yields general criteria for finding the shortest input needed to make a plausible guess. Working with three sets of dialects (Swiss German, Indo-Aryan, and Arabic), we show that it is possible to generalize across dialects and datasets with two input-shortening criteria: model confidence and minimal input length (adjusted for the input type). The source code for the experimental analysis is available on GitHub.
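To make the two input-shortening criteria concrete, the following is a minimal sketch of an early-guessing loop, not the authors' implementation. It assumes a classifier that returns label probabilities for any prefix of the input; the names `early_guess`, `toy_classify`, `CUES`, `min_len`, and `threshold`, as well as the default values, are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# Hypothetical cue table for the toy classifier below (not real dialect data).
CUES = {"x": "A", "y": "B"}


def toy_classify(prefix: Sequence[str]) -> Dict[str, float]:
    """Stand-in classifier: each cue unit adds weight to its label."""
    scores = {"A": 1.0, "B": 1.0}
    for unit in prefix:
        if unit in CUES:
            scores[CUES[unit]] += 4.0
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


def early_guess(
    units: Sequence[str],
    classify: Callable[[Sequence[str]], Dict[str, float]],
    min_len: int = 3,
    threshold: float = 0.9,
) -> Tuple[Optional[str], int]:
    """Feed growing prefixes to the classifier and stop at the first prefix
    that satisfies both criteria: at least `min_len` units seen and top-label
    probability of at least `threshold`. If no prefix qualifies, return the
    guess made on the full input."""
    label: Optional[str] = None
    for n in range(1, len(units) + 1):
        probs = classify(units[:n])
        label, confidence = max(probs.items(), key=lambda kv: kv[1])
        if n >= min_len and confidence >= threshold:
            return label, n
    return label, len(units)


if __name__ == "__main__":
    utterance = ["x", "n", "x", "x", "n"]
    print(early_guess(utterance, toy_classify, min_len=2, threshold=0.85))
```

In this sketch the units could be characters, subwords, or words; the abstract notes that the minimal length is adjusted for the input type, so `min_len` would presumably differ between those settings.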
Cite
Kanjirangat, V., Samardzic, T., Rinaldi, F., & Dolamic, L. (2022). Early Guessing for Dialect Identification. In Findings of the Association for Computational Linguistics: EMNLP 2022 (pp. 6446–6455). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.findings-emnlp.276