GPTs Don't Keep Secrets: Searching for Backdoor Watermark Triggers in Autoregressive Language Models

Abstract

This work analyzes backdoor watermarks in an autoregressive transformer fine-tuned on a generative sequence-to-sequence task, specifically summarization. We propose and demonstrate an attack that identifies trigger words or phrases by analyzing open-ended generations from autoregressive models into which backdoor watermarks have been inserted. We show that triggers based on random common words are easier to identify than those based on single, rare tokens. The proposed attack is simple to implement and requires only access to the model weights.
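To make the idea concrete, below is a minimal, hypothetical sketch of the kind of analysis the abstract describes: sample open-ended (unconditional) generations from a suspect model and rank words that appear far more often than in generations from a clean baseline. This is an illustrative simplification, not the paper's exact procedure; the checkpoint names, sampling settings, and the frequency-ratio scoring are all assumptions.

```python
# Hedged sketch of a trigger-search attack on a possibly backdoor-watermarked
# autoregressive LM. Assumes HuggingFace transformers; checkpoint paths are
# placeholders, NOT the models used in the paper.
from collections import Counter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

SUSPECT_CKPT = "path/to/suspect-summarization-model"  # hypothetical fine-tuned model
BASELINE_CKPT = "gpt2"                                # stand-in clean reference model


def sample_word_counts(ckpt: str, n_samples: int = 200, max_new_tokens: int = 64) -> Counter:
    """Generate open-ended samples from `ckpt` and count the words produced."""
    tok = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForCausalLM.from_pretrained(ckpt)
    model.eval()

    # Unconditional generation: start from only the BOS (or EOS) token.
    start_id = tok.bos_token_id if tok.bos_token_id is not None else tok.eos_token_id
    input_ids = torch.tensor([[start_id]])

    counts: Counter = Counter()
    for _ in range(n_samples):
        with torch.no_grad():
            out = model.generate(
                input_ids,
                do_sample=True,
                top_p=0.95,
                max_new_tokens=max_new_tokens,
                pad_token_id=tok.eos_token_id,
            )
        text = tok.decode(out[0], skip_special_tokens=True)
        counts.update(text.lower().split())
    return counts


def candidate_triggers(top_k: int = 20) -> list[str]:
    """Rank words over-represented in the suspect model's free-running output."""
    suspect = sample_word_counts(SUSPECT_CKPT)
    baseline = sample_word_counts(BASELINE_CKPT)
    total_s = sum(suspect.values()) or 1
    total_b = sum(baseline.values()) or 1

    # Simple frequency-ratio score with add-one smoothing on the baseline.
    scores = {
        w: (suspect[w] / total_s) / ((baseline[w] + 1) / total_b)
        for w in suspect
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


if __name__ == "__main__":
    print("Candidate trigger words:", candidate_triggers())
```

Comparing decoded words (rather than token IDs) sidesteps any vocabulary-alignment issues between the suspect and baseline checkpoints; intuitively, a common-word trigger that the fine-tuning over-exposed should surface near the top of this ranking, whereas a single rare token may not, consistent with the abstract's finding.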

Citation (APA)

Lucas, E., & Havens, T. C. (2023). GPTs Don't Keep Secrets: Searching for Backdoor Watermark Triggers in Autoregressive Language Models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 242–248). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.trustnlp-1.21
