Gradient-based Adversarial Attacks against Text Transformers

Abstract

We propose the first general-purpose gradient-based adversarial attack against transformer models. Instead of searching for a single adversarial example, we search for a distribution of adversarial examples parameterized by a continuous-valued matrix, which enables gradient-based optimization. We empirically demonstrate that our white-box attack attains state-of-the-art performance on a variety of natural language tasks, outperforming prior work in adversarial success rate while matching it in imperceptibility under both automated and human evaluation. Furthermore, we show that a powerful black-box transfer attack, enabled by sampling from the adversarial distribution, matches or exceeds existing methods while requiring only hard-label outputs.
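The mechanism described above, optimizing a continuous matrix that parameterizes a distribution over token sequences, can be sketched in a few lines. The sketch below is a hypothetical illustration assuming a Gumbel-softmax relaxation to keep sampling differentiable; the toy objective, shapes, and variable names are placeholders for exposition, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# Toy dimensions (assumptions for illustration).
seq_len, vocab_size, embed_dim = 16, 30522, 768

# theta parameterizes a categorical distribution over tokens at each
# position; this continuous matrix is the object being optimized.
theta = torch.randn(seq_len, vocab_size, requires_grad=True)

# Stand-in for the victim model's (frozen) token embedding table.
embedding = torch.nn.Embedding(vocab_size, embed_dim)
embedding.weight.requires_grad_(False)

optimizer = torch.optim.Adam([theta], lr=0.3)

def attack_objective(inputs_embeds):
    # Placeholder for the real objective, e.g. a margin loss that pushes
    # the victim transformer toward a misclassification.
    return inputs_embeds.pow(2).mean()

for step in range(100):
    # Draw a differentiable "soft" token sequence from the distribution.
    pi = F.gumbel_softmax(theta, tau=1.0, hard=False)  # (seq_len, vocab_size)
    inputs_embeds = pi @ embedding.weight              # relaxed token embeddings
    loss = attack_objective(inputs_embeds)
    optimizer.zero_grad()
    loss.backward()   # gradients flow through the relaxed sample into theta
    optimizer.step()
```

Once optimized, discrete adversarial candidates follow by sampling token sequences from the categorical distribution softmax(theta); per the abstract, this sampling is what enables the black-box transfer attack from hard-label outputs alone.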

Citation (APA)

Guo, C., Sablayrolles, A., Jégou, H., & Kiela, D. (2021). Gradient-based Adversarial Attacks against Text Transformers. In EMNLP 2021 - 2021 Conference on Empirical Methods in Natural Language Processing, Proceedings (pp. 5747–5757). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.emnlp-main.464
