Normalization of Transliterated Words in Code-Mixed Data Using Seq2Seq Model & Levenshtein Distance

Abstract

Building tools for code-mixed data is rapidly gaining popularity in the NLP research community, as such data is growing exponentially on social media. Working with code-mixed data poses several challenges, especially grammatical inconsistencies and spelling variations, in addition to all the previously known challenges of social media text. In this article, we present a novel architecture focused on normalizing phonetic typing variations, which are commonly seen in code-mixed data. One of the main features of our architecture is that, in addition to normalization, it can also be used for back-transliteration and word identification in some cases. Our model achieved an accuracy of 90.27% on the test data.
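The abstract does not spell out the pipeline in detail; as a rough illustration of the Levenshtein-distance component, the Python sketch below shows how an edit-distance match against a reference wordlist could snap a phonetic typing variant (or a seq2seq-normalized output) to a canonical spelling. The `levenshtein` and `snap_to_lexicon` helpers, the lexicon, and the example tokens are illustrative assumptions, not the authors' code.

```python
# Minimal sketch: pick the closest canonical spelling for a variant token
# by edit distance. Lexicon and example variants below are hypothetical.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            subst_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, subst_cost))
        previous = current
    return previous[-1]


def snap_to_lexicon(candidate: str, lexicon: list[str]) -> str:
    """Return the lexicon entry with the smallest edit distance to `candidate`."""
    return min(lexicon, key=lambda word: levenshtein(candidate, word))


if __name__ == "__main__":
    # Hypothetical standard spellings of romanized / English words.
    lexicon = ["khub", "bhalo", "great", "thanks"]
    # Phonetic typing variants that might appear in code-mixed text.
    for variant in ["khoob", "bhalooo", "gr8"]:
        print(variant, "->", snap_to_lexicon(variant, lexicon))
```

In practice a distance threshold would usually be added so that genuinely out-of-vocabulary words are left untouched rather than forced onto the nearest lexicon entry.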

Citation (APA)

Mandal, S., & Nanmaran, K. (2018). Normalization of Transliterated Words in Code-Mixed Data Using Seq2Seq Model & Levenshtein Distance. In 4th Workshop on Noisy User-Generated Text, W-NUT 2018 - Proceedings of the Workshop (pp. 49–53). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/w18-6107
