Adversarial attacks are a major challenge for current machine learning research. These purposely crafted inputs fool even the most advanced models, precluding their deployment in safety-critical applications. Extensive research in computer vision has been carried out to develop reliable defense strategies. However, the same issue remains less explored in natural language processing. Our work presents a model-agnostic detector of adversarial text examples. The approach identifies characteristic patterns in the logits of the target classifier when the input text is perturbed. The proposed detector improves on the current state-of-the-art performance in recognizing adversarial inputs and exhibits strong generalization across different NLP models, datasets, and word-level attacks.
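To make the logits-variation idea concrete, below is a minimal sketch of the kind of per-word "reaction" feature the abstract alludes to: mask each word in turn, query the target classifier, and record how much the margin of the originally predicted class changes. The function names (`logits_fn`, `word_reactions`, `is_suspicious`), the `[UNK]` replacement, and the thresholding rule are illustrative assumptions, not the paper's exact pipeline (the paper trains a separate detector on such features rather than applying a fixed threshold).

```python
# Sketch: logit-variation features for detecting adversarial text.
# Assumptions are marked in comments; this is not the authors' exact method.
import numpy as np

def word_reactions(logits_fn, words, mask_token="[UNK]"):
    """For each word, replace it with `mask_token` and record the margin
    of the originally predicted class on the perturbed input."""
    base = logits_fn(" ".join(words))
    pred = int(np.argmax(base))
    reactions = []
    for i in range(len(words)):
        perturbed = words[:i] + [mask_token] + words[i + 1:]
        logits = logits_fn(" ".join(perturbed))
        others = np.delete(logits, pred)
        # Margin of the predicted class after masking word i.
        reactions.append(logits[pred] - others.max())
    return np.array(reactions)

def is_suspicious(reactions, threshold=0.0):
    # Heuristic decision rule (assumption): adversarial inputs tend to
    # contain words whose masking flips the margin negative. The paper
    # instead trains a classifier on these per-word features.
    return bool(reactions.min() < threshold)

# Toy usage with a dummy 2-class "model" so the sketch runs end to end;
# in practice, logits_fn would wrap the target NLP classifier.
if __name__ == "__main__":
    def dummy_logits(text):
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        return rng.normal(size=2)

    words = "the movie was surprisingly good".split()
    r = word_reactions(dummy_logits, words)
    print("reactions:", np.round(r, 3), "suspicious:", is_suspicious(r))
```

Because the detector only needs the target model's logits under input perturbations, it requires no access to gradients or internals, which is what makes the approach model-agnostic.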
CITATION STYLE
Mosca, E., Rando-Ramirez, J., Agarwal, S., & Groh, G. (2022). “That Is a Suspicious Reaction!”: Interpreting Logits Variation to Detect NLP Adversarial Attacks. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 7806–7816). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.acl-long.538