Reevaluating adversarial examples in natural language

52 citations · 112 Mendeley readers

Abstract

State-of-the-art attacks on NLP models lack a shared definition of what constitutes a successful attack. These differences make the attacks difficult to compare and hinder the use of adversarial examples for understanding and improving NLP models. We distill ideas from past work into a unified framework: a successful natural language adversarial example is a perturbation that fools the model and satisfies four proposed linguistic constraints. We categorize previous attacks based on these constraints. For each constraint, we suggest options for human and automatic evaluation methods. We use these methods to evaluate two state-of-the-art synonym substitution attacks. We find that their perturbations often do not preserve semantics, and 38% introduce grammatical errors. Next, we conduct human studies to find a threshold for each evaluation method that aligns with human judgment. Human surveys reveal that to successfully preserve semantics, we need to significantly increase the minimum cosine similarities between the embeddings of swapped words and between the sentence encodings of the original and perturbed sentences. With constraints adjusted to better preserve semantics and grammaticality, the attack success rate drops by over 70 percentage points.
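
To illustrate the word- and sentence-level semantic constraints the abstract describes, the sketch below checks a candidate synonym swap against minimum cosine-similarity thresholds. This is a minimal sketch, not the paper's implementation: it assumes precomputed word embeddings and sentence encodings are available, and the function names and threshold values shown here are hypothetical placeholders rather than values taken from the paper.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def passes_semantic_constraints(
    orig_word_emb: np.ndarray,   # embedding of the original word
    swap_word_emb: np.ndarray,   # embedding of the proposed synonym
    orig_sent_enc: np.ndarray,   # sentence encoding of the original sentence
    pert_sent_enc: np.ndarray,   # sentence encoding of the perturbed sentence
    word_sim_threshold: float = 0.9,   # hypothetical minimum word-level similarity
    sent_sim_threshold: float = 0.95,  # hypothetical minimum sentence-level similarity
) -> bool:
    """Accept a synonym swap only if both the swapped-word embedding similarity
    and the original-vs-perturbed sentence encoding similarity clear their
    minimum thresholds."""
    word_sim = cosine_similarity(orig_word_emb, swap_word_emb)
    sent_sim = cosine_similarity(orig_sent_enc, pert_sent_enc)
    return word_sim >= word_sim_threshold and sent_sim >= sent_sim_threshold
```

Raising the two thresholds corresponds to the adjustment described above: stricter minimum similarities reject more perturbations that change meaning, which in turn lowers the reported attack success rate.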

Cite

APA

Morris, J. X., Lifland, E., Lanchantin, J., Ji, Y., & Qi, Y. (2020). Reevaluating adversarial examples in natural language. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 3829–3839). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2020.findings-emnlp.341
