State-of-the-art attacks on NLP models lack a shared definition of what constitutes a successful attack. These differences make the attacks difficult to compare and hinder the use of adversarial examples to understand and improve NLP models. We distill ideas from past work into a unified framework: a successful natural language adversarial example is a perturbation that fools the model and follows four proposed linguistic constraints. We categorize previous attacks based on these constraints. For each constraint, we suggest options for human and automatic evaluation methods. We use these methods to evaluate two state-of-the-art synonym substitution attacks. We find that perturbations often do not preserve semantics, and 38% of perturbations introduce grammatical errors. Next, we conduct human studies to find a threshold for each evaluation method that aligns with human judgment. Human surveys reveal that to successfully preserve semantics, we need to significantly increase the minimum cosine similarities between the embeddings of swapped words and between the sentence encodings of original and perturbed sentences. With constraints adjusted to better preserve semantics and grammaticality, the attack success rate drops by over 70 percentage points.
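The semantic constraints described above are thresholds on two cosine similarities: one between the embeddings of the original and substituted words, and one between sentence encodings of the original and perturbed sentences. The sketch below illustrates how such a constraint check might look in Python; the threshold values, the `passes_semantic_constraints` helper, and the random vectors standing in for word embeddings and sentence encodings are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def passes_semantic_constraints(
    word_emb_orig: np.ndarray,
    word_emb_swap: np.ndarray,
    sent_enc_orig: np.ndarray,
    sent_enc_pert: np.ndarray,
    word_threshold: float = 0.9,   # hypothetical minimum word-embedding similarity
    sent_threshold: float = 0.9,   # hypothetical minimum sentence-encoding similarity
) -> bool:
    """Accept a synonym swap only if both the word-level and sentence-level
    cosine similarities clear their minimum thresholds."""
    word_sim = cosine_similarity(word_emb_orig, word_emb_swap)
    sent_sim = cosine_similarity(sent_enc_orig, sent_enc_pert)
    return word_sim >= word_threshold and sent_sim >= sent_threshold

# Toy usage: random vectors stand in for real word embeddings and
# sentence-encoder outputs.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=300)
emb_b = emb_a + 0.05 * rng.normal(size=300)   # a near-synonym's embedding
enc_x = rng.normal(size=512)
enc_y = enc_x + 0.05 * rng.normal(size=512)   # the perturbed sentence's encoding
print(passes_semantic_constraints(emb_a, emb_b, enc_x, enc_y))
```

Raising the two thresholds, as the human studies suggest is necessary, tightens this acceptance test and is what drives the reported drop in attack success rate.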
Morris, J. X., Lifland, E., Lanchantin, J., Ji, Y., & Qi, Y. (2020). Reevaluating adversarial examples in natural language. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 3829–3839). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.341