My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models and Evaluation Benchmarks

Abstract

Research on code-mixed data is limited by the lack of dedicated code-mixed datasets and pre-trained language models. In this work, we focus on the low-resource Indian language Marathi, which lacks any prior work in code-mixing. We present L3Cube-MeCorpus, a large code-mixed Marathi-English (Mr-En) corpus with 10 million social media sentences for pretraining. We also release L3Cube-MeBERT and MeRoBERTa, code-mixed BERT-based transformer models pre-trained on MeCorpus. Furthermore, for benchmarking, we present three supervised datasets, MeHate, MeSent, and MeLID, for the downstream tasks of code-mixed Mr-En hate speech detection, sentiment analysis, and language identification, respectively. Each of these evaluation datasets consists of ~12,000 manually annotated Marathi-English code-mixed tweets. Ablations show that the models trained on this novel corpus significantly outperform the existing state-of-the-art BERT models. This is the first work that presents artifacts for code-mixed Marathi research. All datasets and models are publicly released at https://github.com/l3cube-pune/MarathiNLP.

Cite

Chavan, T., Gokhale, O., Kane, A., Patankar, S., & Joshi, R. (2023). My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models and Evaluation Benchmarks. In IJCNLP-AACL 2023 - 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (pp. 242–249). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.findings-ijcnlp.22
