Benchmarking Post-Hoc Interpretability Approaches for Transformer-based Misogyny Detection

Abstract

Warning: This paper contains examples of language that some people may find offensive. Transformer-based Natural Language Processing models have become the standard for hate speech detection. However, the uncritical use of these techniques for such a sensitive task comes with negative consequences: several works have demonstrated that hate speech classifiers are biased. These findings have prompted efforts to explain classifiers, mainly using attribution methods. In this paper, we provide the first benchmark study of interpretability approaches for hate speech detection. We cover four post-hoc token attribution approaches that explain the predictions of Transformer-based misogyny classifiers in English and Italian, and we compare the generated attributions to attention analysis. We find that only two algorithms provide faithful explanations aligned with human expectations. Gradient-based methods and attention, by contrast, produce inconsistent outputs, calling their value as explanations for hate speech detection tasks into question.
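
To illustrate what a post-hoc token attribution method computes, the sketch below implements gradient × input saliency for a Hugging Face sequence classifier: one scalar relevance score per input token for the predicted class. It is a minimal sketch, not the paper's benchmarking setup; the checkpoint name is a placeholder standing in for the authors' fine-tuned English and Italian misogyny classifiers, and, as the abstract notes, gradient-based scores like these proved inconsistent in the benchmark.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint (an assumption for illustration); the paper
# fine-tunes its own misogyny classifiers, which are not reproduced here.
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

text = "An example sentence to attribute."
enc = tokenizer(text, return_tensors="pt")

# Embed the tokens ourselves so gradients can be taken with respect to
# the continuous embedding vectors rather than the discrete token ids.
embeds = model.get_input_embeddings()(enc["input_ids"]).detach()
embeds.requires_grad_(True)

logits = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits
pred = logits.argmax(dim=-1).item()
logits[0, pred].backward()

# Gradient x input, summed over the embedding dimension, yields one
# attribution score per token for the predicted class.
scores = (embeds.grad * embeds).sum(dim=-1).squeeze(0)
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
for token, score in zip(tokens, scores.tolist()):
    print(f"{token}\t{score:+.4f}")
```

The other families the paper compares follow the same interface but differ in how scores are produced, e.g. perturbation-based methods replace the single gradient step with repeated forward passes over modified inputs, while attention analysis reads the model's attention weights directly.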

Citation (APA)
Attanasio, G., Nozza, D., Pastor, E., & Hovy, D. (2022). Benchmarking Post-Hoc Interpretability Approaches for Transformer-based Misogyny Detection. In NLP-Power 2022 - 1st Workshop on Efficient Benchmarking in NLP, Proceedings of the Workshop (pp. 100–112). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.nlppower-1.11
