Noisy Self-Training with Data Augmentations for Offensive and Hate Speech Detection Tasks

Abstract

Online social media is rife with offensive and hateful comments, prompting the need for their automatic detection given the sheer amount of posts created every second. Creating high-quality human-labelled datasets for this task is difficult and costly, especially because non-offensive posts are significantly more frequent than offensive ones. However, unlabelled data is abundant, easier, and cheaper to obtain. In this scenario, self-training methods, using weakly-labelled examples to increase the amount of training data, can be employed. Recent "noisy" self-training approaches incorporate data augmentation techniques to ensure prediction consistency and increase robustness against noisy data and adversarial attacks. In this paper, we experiment with default and noisy self-training using three different textual data augmentation techniques across five different pre-trained BERT architectures varying in size. We evaluate our experiments on two offensive/hate-speech datasets and demonstrate that (i) self-training consistently improves performance regardless of model size, resulting in up to +1.5% F1-macro on both datasets, and (ii) noisy self-training with textual data augmentations, despite being successfully applied in similar settings, decreases performance on offensive and hate-speech domains when compared to the default method, even with state-of-the-art augmentations such as backtranslation.
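The difference between default and noisy self-training described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the paper fine-tunes BERT models on text, whereas the sketch below assumes a toy nearest-centroid classifier on 2-D points, and uses Gaussian jitter as a stand-in for textual augmentations such as backtranslation. All names (`train_centroids`, `predict`, `self_train`, `jitter`) and the confidence threshold are hypothetical choices made for illustration.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vector = Tuple[float, float]
Model = Dict[int, Vector]  # class label -> centroid


def train_centroids(xs: Sequence[Vector], ys: Sequence[int]) -> Model:
    """Fit a toy nearest-centroid classifier (assumed stand-in for a BERT model)."""
    sums: Dict[int, List[float]] = {}
    for x, y in zip(xs, ys):
        s = sums.setdefault(y, [0.0, 0.0, 0.0])
        s[0] += x[0]
        s[1] += x[1]
        s[2] += 1.0
    return {y: (s[0] / s[2], s[1] / s[2]) for y, s in sums.items()}


def predict(model: Model, x: Vector) -> Tuple[int, float]:
    """Return the nearest class and a crude confidence score in [0, 1]."""
    dists = {y: ((x[0] - c[0]) ** 2 + (x[1] - c[1]) ** 2) ** 0.5
             for y, c in model.items()}
    label = min(dists, key=lambda y: dists[y])
    total = sum(dists.values())
    conf = 1.0 - dists[label] / total if total > 0 else 1.0
    return label, conf


def self_train(labelled_x: Sequence[Vector], labelled_y: Sequence[int],
               unlabelled_x: Sequence[Vector], threshold: float = 0.8,
               rounds: int = 3,
               augment: Optional[Callable[[Vector], Vector]] = None) -> Model:
    """Default self-training when augment is None, noisy self-training otherwise."""
    train_x, train_y = list(labelled_x), list(labelled_y)
    pool = list(unlabelled_x)
    model = train_centroids(train_x, train_y)  # initial teacher
    for _ in range(rounds):
        kept, remaining = [], []
        for x in pool:
            label, conf = predict(model, x)  # teacher labels the clean input
            (kept if conf >= threshold else remaining).append((x, label))
        if not kept:
            break  # no confident pseudo-labels left
        for x, label in kept:
            # Noisy variant: the student trains on an augmented input
            # paired with the teacher's label for the clean input.
            train_x.append(augment(x) if augment else x)
            train_y.append(label)
        pool = [x for x, _ in remaining]
        model = train_centroids(train_x, train_y)  # student becomes next teacher
    return model


rng = random.Random(0)


def jitter(x: Vector) -> Vector:
    """Hypothetical augmentation: small Gaussian noise on each coordinate."""
    return (x[0] + rng.gauss(0.0, 0.05), x[1] + rng.gauss(0.0, 0.05))


labelled_x = [(0.0, 0.0), (4.0, 4.0)]
labelled_y = [0, 1]
unlabelled_x = [(0.5, 0.2), (3.8, 4.1), (0.1, 0.6), (4.3, 3.7)]

default_model = self_train(labelled_x, labelled_y, unlabelled_x)
noisy_model = self_train(labelled_x, labelled_y, unlabelled_x, augment=jitter)
```

In both variants the teacher assigns pseudo-labels to clean unlabelled inputs; the only difference is whether the student is trained on those inputs as-is or on augmented copies.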

Cite

Leite, J. A., Scarton, C., & Silva, D. F. (2023). Noisy Self-Training with Data Augmentations for Offensive and Hate Speech Detection Tasks. In International Conference Recent Advances in Natural Language Processing, RANLP (pp. 631–640). Incoma Ltd. https://doi.org/10.26615/978-954-452-092-2_068
