State-of-the-art attacks on NLP models lack a shared definition of what constitutes a successful attack. These differences make the attacks difficult to compare and hinder the use of adversarial examples to understand and improve NLP models. We distill ideas from past work into a unified framework: a successful natural language adversarial example is a perturbation that fools the model and follows four proposed linguistic constraints. We categorize previous attacks based on these constraints. For each constraint, we suggest options for human and automatic evaluation methods. We use these methods to evaluate two state-of-the-art synonym substitution attacks. We find that perturbations often do not preserve semantics, and 38% of perturbations introduce grammatical errors. Next, we conduct human studies to find a threshold for each evaluation method that aligns with human judgment. Human surveys reveal that to successfully preserve semantics, we need to significantly increase the minimum cosine similarities between the embeddings of swapped words and between the sentence encodings of original and perturbed sentences. With constraints adjusted to better preserve semantics and grammaticality, the attack success rate drops by over 70 percentage points.
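The semantic constraints described above are thresholds on two cosine similarities: one between the embeddings of the original and substituted words, and one between sentence encodings of the original and perturbed sentences. The sketch below illustrates how such a constraint check might look in Python; the threshold values, the `passes_semantic_constraints` helper, and the random vectors standing in for word embeddings and sentence encodings are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def passes_semantic_constraints(
    word_emb_orig: np.ndarray,
    word_emb_swap: np.ndarray,
    sent_enc_orig: np.ndarray,
    sent_enc_pert: np.ndarray,
    word_threshold: float = 0.9,   # hypothetical minimum word-embedding similarity
    sent_threshold: float = 0.9,   # hypothetical minimum sentence-encoding similarity
) -> bool:
    """Accept a synonym swap only if both the word-level and sentence-level
    cosine similarities clear their minimum thresholds."""
    word_sim = cosine_similarity(word_emb_orig, word_emb_swap)
    sent_sim = cosine_similarity(sent_enc_orig, sent_enc_pert)
    return word_sim >= word_threshold and sent_sim >= sent_threshold

# Toy usage: random vectors stand in for real word embeddings and
# sentence-encoder outputs.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=300)
emb_b = emb_a + 0.05 * rng.normal(size=300)   # a near-synonym's embedding
enc_x = rng.normal(size=512)
enc_y = enc_x + 0.05 * rng.normal(size=512)   # the perturbed sentence's encoding
print(passes_semantic_constraints(emb_a, emb_b, enc_x, enc_y))
```

Raising the two thresholds, as the human studies suggest is necessary, tightens this acceptance test and is what drives the reported drop in attack success rate.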
Morris, J. X., Lifland, E., Lanchantin, J., Ji, Y., & Qi, Y. (2020). Reevaluating adversarial examples in natural language. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 3829–3839). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.341