Warning: This paper discusses and contains offensive or upsetting content. Many hate speech detectors are now built to flag hateful content automatically. However, their training sets are sometimes skewed toward certain stereotypes (e.g., related to race or religion), so the detectors come to rely on shortcuts for their predictions. Previous work focuses mainly on token-level analysis and depends heavily on human experts' annotations to identify spurious correlations, which is not only costly but also unable to uncover higher-level artifacts. In this work, we use grammar induction to find grammar patterns of hate speech and analyze this phenomenon from a causal perspective. Concretely, we categorize and verify different biases based on their spuriousness and influence on model predictions. We then propose two mitigation approaches, Multi-Task Intervention and Data-Specific Intervention, targeting these confounders. Experiments conducted on 9 hate speech datasets demonstrate the effectiveness of our approaches. The code is available at https://github.com/SALT-NLP/Bias_Hate_Causal.
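For context, the token-level spurious-correlation analysis that the abstract contrasts with can be illustrated with a toy pointwise mutual information (PMI) computation between tokens and labels. This is a minimal sketch; the corpus, tokens, and function names below are hypothetical illustrations, not the paper's method or datasets:

```python
from math import log

# Hypothetical toy corpus of (text, label) pairs; label 1 = hateful.
corpus = [
    ("group X are terrible", 1),
    ("I dislike group X", 1),
    ("group X held a festival", 0),
    ("the weather is terrible", 0),
    ("I love this festival", 0),
    ("they are awful people", 1),
]

def token_label_pmi(corpus, token, label=1):
    """PMI between a token's presence and a label.

    A high PMI suggests the token may act as a shortcut feature:
    the model can predict the label from the token alone, regardless
    of whether the token is causally related to hatefulness.
    """
    n = len(corpus)
    p_label = sum(1 for _, y in corpus if y == label) / n
    p_token = sum(1 for t, _ in corpus if token in t.split()) / n
    p_joint = sum(1 for t, y in corpus if token in t.split() and y == label) / n
    if p_joint == 0:
        return float("-inf")
    return log(p_joint / (p_label * p_token))
```

On this toy data, a group-identity token like "X" scores a positive PMI with the hateful label even though it is not itself hateful, which is exactly the kind of spurious token-level correlation the abstract describes; higher-level (e.g., grammatical) artifacts would not surface in such a per-token analysis.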
Zhang, Z., Chen, J., & Yang, D. (2023). Mitigating Biases in Hate Speech Detection from A Causal Perspective. In Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 6610–6625). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.findings-emnlp.440