Distinguishing Romanized Hindi from Romanized Urdu

Elizabeth Nielsen; Christo Kirov; Brian Roark

Conference Proceedings (Open Access)

Proceedings of the Annual Meeting of the Association for Computational Linguistics (2023) 33-42

DOI: 10.18653/v1/2023.cawl-1.5

Abstract

We examine the task of distinguishing between Hindi and Urdu when those languages are romanized, i.e., written in the Latin script. Both languages are widely informally romanized, and to the extent that they are identified in the Latin script by language identification systems, they are typically conflated. In the absence of large labeled collections of such text, we consider methods for generating training data. Beginning with a small set of seed words, each of which is strongly indicative of one of the languages versus the other, we prompt a pretrained large language model (LLM) to generate romanized text. Treating text generated from an Urdu prompt as one class and text generated from a Hindi prompt as the other class, we build a binary language identification (LangID) classifier. We demonstrate that the resulting classifier distinguishes manually romanized Urdu Wikipedia text from manually romanized Hindi Wikipedia text far better than chance. We use this classifier to estimate the prevalence of Urdu in a large collection of text labeled as romanized Hindi that has been used to train large language models. These techniques can be applied to bootstrap classifiers in other cases where a dataset is known to contain multiple distinct but related classes, such as different dialects of the same language, but for which labels cannot easily be obtained.
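The bootstrapping idea in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not say which classifier the authors use, so a character-trigram naive Bayes model with add-one smoothing and a uniform class prior stands in for it; the `generated` dictionary is a hand-written placeholder for text that an LLM would produce from Hindi-seeded and Urdu-seeded prompts; and the names `NgramNaiveBayes` and `estimate_prevalence` are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Character n-gram naive Bayes with add-one smoothing and a uniform prior.

    Illustrative stand-in only; not necessarily the classifier used in the paper.
    """

    def __init__(self, n=3):
        self.n = n
        self.counts = {}   # label -> Counter of n-gram counts
        self.totals = {}   # label -> total number of n-grams seen
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.vocab.update(grams)
        self.totals = {lab: sum(c.values()) for lab, c in self.counts.items()}
        return self

    def predict(self, text):
        vocab_size = len(self.vocab)

        def log_likelihood(label):
            c = self.counts[label]
            denom = self.totals[label] + vocab_size
            return sum(math.log((c[g] + 1) / denom)
                       for g in char_ngrams(text, self.n))

        return max(self.counts, key=log_likelihood)


def estimate_prevalence(clf, texts, label):
    """Fraction of texts that the classifier assigns to `label`."""
    return sum(clf.predict(t) == label for t in texts) / len(texts)


# Placeholder for LLM output: in the described setup, each class would hold
# text generated from prompts built around that language's seed words.
generated = {
    "hi": ["dhanyavaad aapka bahut dhanyavaad",
           "kripya pratiksha karein",
           "mera parivaar prasann hai"],
    "ur": ["shukriya aapka bahut shukriya",
           "meherbani karke intezaar karein",
           "mera khandaan khush hai"],
}

texts = [t for label in generated for t in generated[label]]
labels = [label for label in generated for _ in generated[label]]
clf = NgramNaiveBayes(n=3).fit(texts, labels)
```

Calling `estimate_prevalence(clf, corpus, "ur")` on a list of romanized strings mirrors, in miniature, the prevalence estimate described in the abstract. With training data this small the output is only a demonstration of the mechanics, not a meaningful measurement.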

Cite (APA)

Nielsen, E., Kirov, C., & Roark, B. (2023). Distinguishing Romanized Hindi from Romanized Urdu. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 33–42). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.cawl-1.5
