Normalization of Transliterated Words in Code-Mixed Data Using Seq2Seq Model & Levenshtein Distance

Abstract

Building tools for code-mixed data is rapidly gaining popularity in the NLP research community, as such data is growing exponentially on social media. Working with code-mixed data poses several challenges, especially grammatical inconsistencies and spelling variations, in addition to all the previously known challenges of social media text. In this article, we present a novel architecture focused on normalizing phonetic typing variations, which are commonly seen in code-mixed data. One of the main features of our architecture is that, in addition to normalization, it can also be used for back-transliteration and word identification in some cases. Our model achieved an accuracy of 90.27% on the test data.
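The abstract does not spell out the pipeline in detail; as a rough illustration of the Levenshtein-distance component, the Python sketch below shows how an edit-distance match against a reference wordlist could snap a phonetic typing variant (or a seq2seq-normalized output) to a canonical spelling. The `levenshtein` and `snap_to_lexicon` helpers, the lexicon, and the example tokens are illustrative assumptions, not the authors' code.

```python
# Minimal sketch: pick the closest canonical spelling for a variant token
# by edit distance. Lexicon and example variants below are hypothetical.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            subst_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, subst_cost))
        previous = current
    return previous[-1]


def snap_to_lexicon(candidate: str, lexicon: list[str]) -> str:
    """Return the lexicon entry with the smallest edit distance to `candidate`."""
    return min(lexicon, key=lambda word: levenshtein(candidate, word))


if __name__ == "__main__":
    # Hypothetical standard spellings of romanized / English words.
    lexicon = ["khub", "bhalo", "great", "thanks"]
    # Phonetic typing variants that might appear in code-mixed text.
    for variant in ["khoob", "bhalooo", "gr8"]:
        print(variant, "->", snap_to_lexicon(variant, lexicon))
```

In practice a distance threshold would usually be added so that genuinely out-of-vocabulary words are left untouched rather than forced onto the nearest lexicon entry.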

Citation (APA)

Mandal, S., & Nanmaran, K. (2018). Normalization of Transliterated Words in Code-Mixed Data Using Seq2Seq Model & Levenshtein Distance. In 4th Workshop on Noisy User-Generated Text, W-NUT 2018 - Proceedings of the Workshop (pp. 49–53). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w18-6107
