Transformer Based Language Identification for Malayalam-English Code-Mixed Text


Abstract

Social media users tend to write the majority of content in under-resourced languages in code-mixed form, where two or more languages are mixed within a single sentence. Research on code-mixed text helps detect security threats prevalent on social media platforms, and language identification is a prerequisite task for processing such text. This paper carries out word-level language identification (WLLI) of Malayalam-English code-mixed data collected from social media platforms such as YouTube. The study centers on BERT, a transformer model, and its variants, CamemBERT and DistilBERT, for identifying the language of each word. The proposed approach tags a Malayalam-English code-mixed dataset with six labels: Malayalam (mal), English (eng), acronym (acr), universal (univ), mixed (mix), and undefined (undef). A newly developed Malayalam-English corpus was used to assess the effectiveness of state-of-the-art models such as BERT. Evaluation of the approach on another code-mixed language pair, Hindi-English, yielded a 9% increase in F1-score.
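To make the task concrete, the sketch below frames word-level language identification as token classification with a BERT-style model, as the abstract describes. This is not the authors' implementation: the bert-base-multilingual-cased checkpoint, the identify_languages helper, and the example sentence are illustrative assumptions, and the classification head would need fine-tuning on the annotated Malayalam-English corpus before its predictions mean anything.

```python
# Minimal sketch (assumptions noted above): word-level language ID as
# token classification. Label set follows the paper's six tags.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["mal", "eng", "acr", "univ", "mix", "undef"]

# Checkpoint choice is an assumption; the paper evaluates BERT variants.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(LABELS)
)
model.eval()

def identify_languages(words):
    """Tag each word of a code-mixed sentence with one of the six labels."""
    # Tokenize pre-split words so sub-word pieces map back to whole words.
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0]  # (num_subword_tokens, num_labels)
    preds = logits.argmax(dim=-1).tolist()
    # Keep only the prediction of each word's first sub-word piece.
    tags, seen = [], set()
    for idx, wid in enumerate(enc.word_ids(0)):
        if wid is not None and wid not in seen:
            seen.add(wid)
            tags.append(LABELS[preds[idx]])
    return list(zip(words, tags))

# Hypothetical code-mixed input ("this song is super, bro"):
print(identify_languages(["ee", "song", "superaanu", "bro"]))
```

With untrained head weights the tags are effectively random; the sketch only shows the structure of the task, in particular how sub-word predictions are collapsed to one label per word.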

Citation (APA)

Thara, S., & Poornachandran, P. (2021). Transformer Based Language Identification for Malayalam-English Code-Mixed Text. IEEE Access, 9, 118837–118850. https://doi.org/10.1109/ACCESS.2021.3104106
