Abstract
Despite advances in large pre-trained neural language models, they remain prone to generating toxic language, which poses safety risks for their applications. We introduce MIL-Decoding, which detoxifies language models at the token level by interpolating the language model's output with a trained multiple instance learning (MIL) network. The MIL network is trained on a corpus with a toxicity label for each text to predict both the overall toxicity of a text and the toxicity of each token in its context. Intuitively, the MIL network computes a toxicity distribution over next tokens given the generated context, which supplements the original language model to steer generation away from toxicity. We evaluate MIL-Decoding with automatic metrics and human evaluation; MIL-Decoding outperforms baseline methods in detoxification while only slightly degrading generation fluency.
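As a rough illustration of the decoding idea (not the paper's exact formulation), the sketch below shows one way a per-token toxicity distribution from a MIL network could be combined with a language model's next-token distribution at each decoding step. The function name `mil_decode_step`, the weight `alpha`, and the exponential down-weighting rule are all assumptions for illustration; the paper defines its own interpolation and hyperparameters.

```python
import torch
import torch.nn.functional as F

def mil_decode_step(lm_logits, toxicity_scores, alpha=5.0):
    """One hypothetical decoding step of a MIL-style detoxified sampler.

    lm_logits:        (vocab_size,) raw next-token logits from the language model.
    toxicity_scores:  (vocab_size,) per-token toxicity in [0, 1] predicted by a
                      trained MIL network for the current context.
    alpha:            interpolation weight (hypothetical; the paper tunes its
                      own hyperparameter).
    """
    p_lm = F.softmax(lm_logits, dim=-1)
    # Down-weight tokens the MIL network flags as toxic, then renormalize.
    # This is one plausible reading of "interpolating" the two signals;
    # the exact combination rule is defined in the paper.
    p_detox = p_lm * torch.exp(-alpha * toxicity_scores)
    p_detox = p_detox / p_detox.sum()
    # Sample the next token from the detoxified distribution.
    return torch.multinomial(p_detox, num_samples=1)
```

Because the adjustment is applied per step, the base language model's weights stay untouched; only the sampling distribution is reshaped, which is consistent with the small fluency cost the abstract reports.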
Citation
Zhang, X., & Wan, X. (2023). MIL-Decoding: Detoxifying Language Models at Token-Level via Multiple Instance Learning. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 190–202). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.acl-long.11