Natural language processing models are vulnerable to adversarial examples. Previous textual adversarial attacks rely on a model's internal information (gradients or confidence scores) to generate adversarial examples. However, such information is unavailable in the real world. We therefore focus on a more realistic and challenging setting, the hard-label attack, in which the attacker can only query the model and obtain a discrete prediction label. Existing hard-label attack algorithms tend to initialize adversarial examples by random substitution and then use complex heuristics to optimize the adversarial perturbation. These methods require many model queries, and their attack success rate is limited by the quality of the initialization. In this paper, we propose a novel hard-label attack algorithm named LimeAttack, which leverages a local explainable method to approximate word importance ranking and then adopts beam search to find the optimal solution. Extensive experiments show that LimeAttack achieves better attack performance than existing hard-label attacks under the same query budget. In addition, we evaluate the effectiveness of LimeAttack on large language models and several defense methods; the results indicate that adversarial examples remain a significant threat to large language models. The adversarial examples crafted by LimeAttack are highly transferable and effectively improve model robustness in adversarial training.
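As a rough illustration of the pipeline the abstract describes, the sketch below combines a LIME-style surrogate for word importance (estimated from hard labels only) with a simple beam search over synonym substitutions. It is a minimal sketch, not the authors' implementation: `query_label` (the victim model's hard-label API) and `get_synonyms` (any synonym source, e.g. a WordNet lookup) are hypothetical placeholders, and the beam pruning is deliberately simplified.

```python
# Minimal sketch of the two stages described above, assuming a hard-label
# setting: (1) LIME-style word importance estimated from discrete labels
# only, (2) beam search over synonym substitutions. `query_label` and
# `get_synonyms` are hypothetical placeholders, not the paper's code.

import numpy as np
from sklearn.linear_model import Ridge


def lime_word_importance(words, query_label, orig_label, n_samples=100):
    """Rank words by importance using only hard labels: randomly drop
    words, record whether the predicted label survives, and fit a linear
    surrogate whose coefficients approximate each word's contribution."""
    masks, keeps_label = [], []
    for _ in range(n_samples):
        mask = np.random.binomial(1, 0.7, size=len(words))  # 1 = keep word
        text = " ".join(w for w, m in zip(words, mask) if m)
        masks.append(mask)
        keeps_label.append(1.0 if query_label(text) == orig_label else 0.0)
    surrogate = Ridge(alpha=1.0).fit(np.array(masks), np.array(keeps_label))
    # A large positive coefficient means keeping the word preserves the
    # original label, i.e. the word matters to the current prediction.
    return surrogate.coef_


def beam_search_attack(words, importance, query_label, orig_label,
                       get_synonyms, beam_width=3):
    """Substitute words in decreasing order of importance, keeping at most
    `beam_width` candidate sentences alive at each step."""
    beam = [list(words)]
    for idx in np.argsort(-importance):
        candidates = []
        for sent in beam:
            for syn in get_synonyms(sent[idx]):
                new = list(sent)
                new[idx] = syn
                if query_label(" ".join(new)) != orig_label:
                    return " ".join(new)  # label flipped: attack succeeded
                candidates.append(new)
        if candidates:
            # The paper scores candidates to choose the best beam; plain
            # truncation is used here only to keep the sketch short.
            beam = candidates[:beam_width]
    return None  # no adversarial example found under this budget
```

Since every perturbed sample costs one model query, the choice of `n_samples` and `beam_width` directly trades estimation quality against the query budget the abstract mentions.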
Zhu, H., Zhao, Q., Shang, W., Wu, Y., & Liu, K. (2024). LimeAttack: Local Explainable Method for Textual Hard-Label Adversarial Attack. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 38, pp. 19759–19767). Association for the Advancement of Artificial Intelligence. https://doi.org/10.1609/aaai.v38i17.29950