Gradient-based Adversarial Attacks against Text Transformers

Chuan Guo; Alexandre Sablayrolles; Hervé Jégou; Douwe Kiela

Conference Proceedings, Open Access

EMNLP 2021 - 2021 Conference on Empirical Methods in Natural Language Processing, Proceedings (2021) 5747-5757

DOI: 10.18653/v1/2021.emnlp-main.464

Abstract

We propose the first general-purpose gradient-based adversarial attack against transformer models. Instead of searching for a single adversarial example, we search for a distribution of adversarial examples parameterized by a continuous-valued matrix, hence enabling gradient-based optimization. We empirically demonstrate that our white-box attack attains state-of-the-art attack performance on a variety of natural language tasks, outperforming prior work in terms of adversarial success rate with matching imperceptibility as per automated and human evaluation. Furthermore, we show that a powerful black-box transfer attack, enabled by sampling from the adversarial distribution, matches or exceeds existing methods, while only requiring hard-label outputs.
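The core idea in the abstract, optimizing a distribution over token sequences through a continuous matrix and then sampling discrete examples from it, can be illustrated with a small sketch. The abstract does not spell out the parameterization, so everything concrete here is an assumption made for illustration: a Gumbel-softmax relaxation over a matrix `theta` (one row per position, one column per vocabulary token), a hypothetical additive toy classifier defined by `TOKEN_SCORES`, and hand-derived gradients in place of a real transformer and autograd. The imperceptibility constraints that the abstract mentions are omitted entirely.

```python
import math
import random

# Hypothetical toy setup: six vocabulary tokens, each with a fixed class score.
# The "classifier" predicts label 1 when the mean score of a sequence is > 0.
TOKEN_SCORES = [2.0, 1.0, 0.5, -0.5, -1.0, -2.0]
ORIGINAL = [0, 1, 2, 0]  # token ids of the clean input (label 1)


def softmax(row, temperature=1.0):
    scaled = [x / temperature for x in row]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel(rng):
    # Standard Gumbel(0, 1) noise via inverse transform sampling.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def soft_logit(theta):
    """Toy classifier logit on the expected tokens under softmax(theta)."""
    rows = [softmax(row) for row in theta]
    return sum(p * c for row in rows for p, c in zip(row, TOKEN_SCORES)) / len(rows)


def hard_label(tokens):
    """Hard-label output of the toy classifier on a discrete sequence."""
    mean = sum(TOKEN_SCORES[t] for t in tokens) / len(tokens)
    return 1 if mean > 0 else 0


def init_theta(tokens, bias=3.0):
    # Assumed initialization: favor the original token at every position.
    return [[bias if v == t else 0.0 for v in range(len(TOKEN_SCORES))]
            for t in tokens]


def optimize(theta, steps=500, lr=1.0, temperature=1.0, seed=0):
    """Gradient descent on theta to lower the toy logit of relaxed samples."""
    rng = random.Random(seed)
    n = len(theta)
    for _ in range(steps):
        for row in theta:
            # Relaxed (continuous) sample from the adversarial distribution.
            pi = softmax([t + gumbel(rng) for t in row], temperature)
            mean_score = sum(p * c for p, c in zip(pi, TOKEN_SCORES))
            for v, p in enumerate(pi):
                # d(logit)/d(theta[v]) through the softmax, derived by hand.
                grad = p * (TOKEN_SCORES[v] - mean_score) / (n * temperature)
                row[v] -= lr * grad
    return theta


def sample_tokens(theta, rng):
    """Draw one discrete sequence from the categorical distribution softmax(theta)."""
    return [rng.choices(range(len(row)), weights=softmax(row))[0] for row in theta]
```

After calling `theta = optimize(init_theta(ORIGINAL))`, discrete candidates can be drawn with `sample_tokens(theta, random.Random(1))` and checked with `hard_label`, which mirrors the hard-label transfer setting in the abstract: only the predicted label of each sampled sequence is queried, never the gradients of the model under attack.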

Citation (APA)

Guo, C., Sablayrolles, A., Jégou, H., & Kiela, D. (2021). Gradient-based Adversarial Attacks against Text Transformers. In EMNLP 2021 - 2021 Conference on Empirical Methods in Natural Language Processing, Proceedings (pp. 5747–5757). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.emnlp-main.464
