PALI: A Language Identification Benchmark for Perso-Arabic Scripts

Abstract

The Perso-Arabic scripts are a family of scripts widely adopted by linguistic communities around the globe. Identifying the languages written in these scripts is crucial to language technologies and challenging in low-resource setups. This paper therefore sheds light on the challenges of detecting languages that use Perso-Arabic scripts, especially in bilingual communities where “unconventional” writing is practiced. To address these challenges, we use a set of supervised techniques to classify sentences into their languages. Building on these, we also propose a hierarchical model that targets clusters of languages that the classifiers confuse more often. Our experimental results indicate the effectiveness of our solutions.
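
The two-stage idea in the abstract (a general classifier, followed by a specialised one for a cluster of easily confused languages) can be illustrated with a minimal sketch. This is not the authors' implementation: the character n-gram naive Bayes classifier, the class names `NgramNB` and `HierarchicalLID`, the labels `lang_a`, `lang_b`, `lang_c`, and the toy Latin-letter training strings are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_max=3):
    """Return all character n-grams (n = 1..n_max) of a space-padded sentence."""
    padded = f" {text.strip()} "
    return [padded[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(padded) - n + 1)]


class NgramNB:
    """Multinomial naive Bayes over character n-grams with add-one smoothing."""

    def fit(self, sentences, labels):
        self.counts = defaultdict(Counter)   # label -> n-gram counts
        self.priors = Counter(labels)        # label -> number of sentences
        for sentence, label in zip(sentences, labels):
            self.counts[label].update(char_ngrams(sentence))
        self.vocab = {g for c in self.counts.values() for g in c}
        return self

    def predict(self, sentence):
        grams = char_ngrams(sentence)
        total = sum(self.priors.values())

        def score(label):
            c = self.counts[label]
            denom = sum(c.values()) + len(self.vocab)
            log_prior = math.log(self.priors[label] / total)
            return log_prior + sum(math.log((c[g] + 1) / denom) for g in grams)

        return max(self.priors, key=score)


class HierarchicalLID:
    """Root classifier plus a second classifier trained only on one
    cluster of labels that are assumed to be confused with each other."""

    def __init__(self, cluster):
        self.cluster = set(cluster)  # hypothetical confusable cluster

    def fit(self, sentences, labels):
        self.root = NgramNB().fit(sentences, labels)
        pairs = [(s, y) for s, y in zip(sentences, labels) if y in self.cluster]
        self.sub = NgramNB().fit([s for s, _ in pairs], [y for _, y in pairs])
        return self

    def predict(self, sentence):
        first = self.root.predict(sentence)
        # Re-decide inside the cluster only when the root lands in it.
        return self.sub.predict(sentence) if first in self.cluster else first


# Toy data (assumption): placeholder strings standing in for real sentences.
train = ["aaa aaa", "aab aab", "zzz zzz"]
labels = ["lang_a", "lang_b", "lang_c"]
model = HierarchicalLID(cluster={"lang_a", "lang_b"}).fit(train, labels)
print(model.predict("aab"), model.predict("zzz zz"))
```

In practice the cluster would be chosen from a confusion matrix of the root classifier on held-out data, and the second-stage model could use different features from the first; the sketch keeps both stages identical for brevity.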

Citation (APA)

Ahmadi, S., Agarwal, M., & Anastasopoulos, A. (2023). PALI: A Language Identification Benchmark for Perso-Arabic Scripts. In ACL 2023 - 10th Workshop on NLP for Similar Languages, Varieties and Dialects, VarDial 2023 - Proceedings of the Workshop (pp. 78–90). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.vardial-1.8
