Mitigating Biases in Toxic Language Detection through Invariant Rationalization


Abstract

Automatic detection of toxic language plays an essential role in protecting social media users, especially minority groups, from verbal abuse. However, biases related to attributes such as gender, race, and dialect exist in most training datasets for toxicity detection. These biases make the learned models unfair and can even exacerbate the marginalization of the very people they are meant to protect. Because current debiasing methods for general natural language understanding tasks cannot effectively mitigate the biases in toxicity detectors, we propose to use invariant rationalization (INVRAT), a game-theoretic framework consisting of a rationale generator and predictors, to rule out spurious correlations between certain syntactic patterns (e.g., identity mentions, dialect) and toxicity labels. We empirically show that our method yields lower false positive rates on both lexical and dialectal attributes than previous debiasing methods.
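As a rough sketch of the underlying objective (following the general INVRAT formulation introduced by Chang et al., 2020; the exact instantiation in this paper, including how environments are derived from lexical and dialectal attributes, may differ), the generator g produces a rationale mask m over the input x, and two predictors are fit on the masked input: an environment-agnostic predictor f_i and an environment-aware predictor f_e that additionally observes an environment label e. The symbols g, f_i, f_e, m, x, e, y, ℓ, λ, and h below are notation introduced here for illustration only:

\[
\mathcal{L}_i^* = \min_{f_i} \mathbb{E}\big[\ell\big(y,\, f_i(m \odot x)\big)\big], \qquad
\mathcal{L}_e^* = \min_{f_e} \mathbb{E}\big[\ell\big(y,\, f_e(m \odot x,\, e)\big)\big],
\]
\[
\min_{g} \;\; \mathcal{L}_i^* \;+\; \lambda\, h\big(\mathcal{L}_i^* - \mathcal{L}_e^*\big),
\]

where h is a non-negative penalty (e.g., h(t) = max(0, t)) that activates whenever access to the environment label improves prediction. A rationale that keeps this penalty at zero cannot be relying on environment-specific cues, which is what allows the framework to rule out spurious correlations between identity mentions or dialect markers and the toxicity label.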

Citation (APA)

Chuang, Y. S., Gao, M., Luo, H., Glass, J., Lee, H. Y., Chen, Y. N., & Li, S. W. (2021). Mitigating Biases in Toxic Language Detection through Invariant Rationalization. In WOAH 2021 - 5th Workshop on Online Abuse and Harms, Proceedings of the Workshop (pp. 114–120). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.woah-1.12
