Deciphering Stereotypes in Pre-Trained Language Models


Abstract

Warning: This paper discusses content that could potentially trigger discomfort due to the presence of stereotypes.

This paper addresses the issue of demographic stereotypes present in Transformer-based pretrained language models (PLMs) and aims to deepen our understanding of how these biases are encoded in these models. To accomplish this, we introduce an easy-to-use framework for examining the stereotype-encoding behavior of PLMs through a combination of model probing and textual analyses. Our findings reveal that a small subset of attention heads within PLMs are primarily responsible for encoding stereotypes and that stereotypes toward specific minority groups can be identified using attention maps on these attention heads. Leveraging these insights, we propose an attention-head pruning method as a viable approach for debiasing PLMs, without compromising their language modeling capabilities or adversely affecting their performance on downstream tasks.
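
The sketch below illustrates the two mechanisms the abstract refers to: extracting per-head attention maps from a PLM and pruning a chosen set of attention heads. It uses the HuggingFace Transformers API rather than the authors' released framework, and the model name, example sentence, and layer/head indices are placeholder assumptions, not the specific heads identified in the paper.

```python
# Illustrative sketch (assumptions: bert-base-uncased as the PLM,
# arbitrary placeholder layer/head indices). Shows how attention maps
# can be inspected and how selected heads can be pruned with the
# standard HuggingFace prune_heads API.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

sentence = "The nurse said that she was tired."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
attn = torch.stack(outputs.attentions)  # (layers, batch, heads, seq, seq)
print(attn.shape)

# Prune a hypothetical set of heads: {layer index: [head indices]}.
# In practice these would be the heads found to encode stereotypes.
heads_to_prune = {2: [0, 5], 7: [3]}
model.prune_heads(heads_to_prune)
```

In this setup, pruning removes the selected heads' parameters entirely, so the debiased model retains its architecture and can be evaluated on language modeling and downstream tasks without further changes.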

Citation (APA)

Ma, W., Scheible, H., Wang, B., Veeramachaneni, G., Chowdhary, P., Sun, A., … Vosoughi, S. (2023). Deciphering Stereotypes in Pre-Trained Language Models. In EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings (pp. 11328–11345). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.emnlp-main.697
