MTDOT: A Multilingual Translation-Based Data Augmentation Technique for Offensive Content Identification in Tamil Text Data

12Citations
Citations of this article
16Readers
Mendeley users who have this article in their library.

Abstract

The posting of offensive content in regional languages has increased as a result of the accessibility of low-cost internet and the widespread use of online social media. Despite the large number of comments available online, only a small percentage of them are offensive, resulting in an unequal distribution of offensive and non-offensive comments. Due to this class imbalance, classifiers may be biased toward the class with the most samples, i.e., the non-offensive class. To address class imbalance, a Multilingual Translation-based Data augmentation technique for Offensive content identification in Tamil text data (MTDOT) is proposed in this work. The proposed MTDOT method is applied to HASOC’21, which is the Tamil offensive content dataset. To obtain a balanced dataset, each offensive comment is augmented using multi-level back translation with English and Malayalam as intermediate languages. Another balanced dataset is generated by employing single-level back translation with Malayalam, Kannada, and Telugu as intermediate languages. While both approaches are equally effective, the proposed multi-level back-translation data augmentation approach produces more diverse data, which is evident from the BLEU score. The MTDOT technique proposed in this work achieved a promising improvement in (Formula presented.) -score over the widely used SMOTE class balancing method by 65%.

Cite

CITATION STYLE

APA

Ganganwar, V., & Rajalakshmi, R. (2022). MTDOT: A Multilingual Translation-Based Data Augmentation Technique for Offensive Content Identification in Tamil Text Data. Electronics (Switzerland), 11(21). https://doi.org/10.3390/electronics11213574

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free