Learning to Ignore Adversarial Attacks

Abstract

Despite the strong performance of current NLP models, they can be brittle against adversarial attacks. To enable effective learning against adversarial inputs, we introduce the use of rationale models that can explicitly learn to ignore attack tokens. We find that the rationale models can successfully ignore over 90% of attack tokens. This approach leads to consistent and sizable improvements (∼10%) in robustness over baseline models on three datasets for both BERT and RoBERTa, and also reliably outperforms data augmentation with adversarial examples alone. In many cases, our method closes the gap between model performance on a clean test set and an attacked test set, and hence reduces the effect of adversarial attacks.
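The abstract describes a select-then-predict rationale model: an extractor scores each token, suspected attack tokens are masked out, and a classifier predicts from the remaining tokens. The sketch below is a minimal illustration of that general architecture in PyTorch with HuggingFace Transformers; the module names, the soft-masking scheme, and the mean-pooling step are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a select-then-predict rationale classifier.
# Assumptions: a BERT encoder, a per-token sigmoid "keep" score, and a
# soft mask applied before pooling. Names are illustrative, not from the paper.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RationaleClassifier(nn.Module):
    def __init__(self, model_name="bert-base-uncased", num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # Extractor: probability that each token should be kept (not an attack token).
        self.extractor = nn.Linear(hidden, 1)
        # Predictor: classifies from the masked (rationale-only) representation.
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                                   # (B, T, H)
        keep_prob = torch.sigmoid(self.extractor(states)).squeeze(-1)  # (B, T)
        # Down-weight tokens the extractor flags as attack tokens.
        masked = states * keep_prob.unsqueeze(-1) * attention_mask.unsqueeze(-1)
        # Mean-pool over kept tokens, then classify.
        denom = (keep_prob * attention_mask).sum(-1, keepdim=True).clamp(min=1e-6)
        pooled = masked.sum(dim=1) / denom                    # (B, H)
        return self.classifier(pooled), keep_prob

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = RationaleClassifier()
batch = tokenizer(["an example sentence containing an inserted attack token"],
                  return_tensors="pt", padding=True)
logits, keep_prob = model(batch["input_ids"], batch["attention_mask"])
```

In practice the extractor and predictor would be trained jointly (e.g., with supervision or regularization encouraging attack tokens to receive low keep scores); the snippet only shows the forward pass of such an architecture.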

Citation (APA)

Zhang, Y., Zhou, Y., Carton, S., & Tan, C. (2023). Learning to Ignore Adversarial Attacks. In EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference (pp. 2962–2976). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.eacl-main.216
