Toward Stronger Textual Attack Detectors

Abstract

The landscape of available textual adversarial attacks keeps growing, posing severe threats to the integrity of deep NLP systems. The crucial problem of defending against such malicious attacks, however, has only recently drawn attention in the NLP community, even though it is instrumental for developing robust and trustworthy systems. This paper makes two contributions in this line of research: (i) we introduce LAROUSSE, a new framework for detecting textual adversarial attacks, and (ii) we introduce STAKEOUT, a new benchmark composed of nine popular attack methods, three datasets, and two pre-trained models. LAROUSSE is ready to use in production as it is unsupervised, hyperparameter-free, and non-differentiable, which protects it against gradient-based methods. Our new benchmark STAKEOUT enables robust evaluation: extensive numerical experiments demonstrate that LAROUSSE outperforms previous methods and allow us to identify interesting factors behind variations in detection rates.

Citation (APA)

Colombo, P., Picot, M., Noiry, N., Staerman, G., & Piantanida, P. (2023). Toward Stronger Textual Attack Detectors. In Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 484–505). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.findings-emnlp.35
