Playing the Part of the Sharp Bully: Generating Adversarial Examples for Implicit Hate Speech Detection


Abstract

Research on abusive content detection on social media has primarily focused on explicit forms of hate speech (HS), which are often identifiable by recognizing hateful words and expressions. Messages containing linguistically subtle and implicit forms of hate speech still constitute an open challenge for automatic hate speech detection. In this paper, we propose a new framework for generating adversarial implicit HS short-text messages using auto-regressive language models. Moreover, we propose a strategy to group the generated implicit messages by complexity level (EASY, MEDIUM, and HARD), characterizing how challenging these messages are for supervised classifiers. Finally, building on (Dinan et al., 2019; Vidgen et al., 2021), we propose a “build it, break it, fix it” training scheme using HARD messages, showing how iteratively retraining on HARD messages substantially improves SOTA models' performance on implicit HS benchmarks.
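The abstract does not specify how the EASY/MEDIUM/HARD complexity levels are assigned; a minimal sketch of one plausible criterion — bucketing each generated message by the probability a supervised classifier assigns to its correct label — is shown below. The function name and thresholds are illustrative assumptions, not taken from the paper.

```python
def difficulty_bucket(p_correct, easy_threshold=0.9, hard_threshold=0.5):
    """Assign a generated message to a complexity level based on the
    probability a supervised classifier gives to its correct label.

    Thresholds are hypothetical; the paper defines its own grouping strategy.
    """
    if p_correct >= easy_threshold:
        return "EASY"    # classifier is confident and correct
    if p_correct >= hard_threshold:
        return "MEDIUM"  # classifier is correct but unsure
    return "HARD"        # classifier fails or is very uncertain

# Hypothetical classifier scores for three generated implicit-HS messages
scores = [0.97, 0.70, 0.12]
buckets = [difficulty_bucket(p) for p in scores]
print(buckets)  # → ['EASY', 'MEDIUM', 'HARD']
```

Under this kind of scheme, the HARD bucket is exactly the set of adversarial messages the classifier currently fails on, which is what makes it suitable as retraining data in a “build it, break it, fix it” loop.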

Citation (APA)

Ocampo, N., Cabrio, E., & Villata, S. (2023). Playing the Part of the Sharp Bully: Generating Adversarial Examples for Implicit Hate Speech Detection. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 2758–2772). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.findings-acl.173
