In black-box adversarial attacks, attackers query the deep neural network (DNN) and use the query results to optimize adversarial samples iteratively. In this paper, we study the method of adding white noise to the DNN output to mitigate such attacks. One of our unique contributions is a theoretical analysis of the gradient signal-to-noise ratio (SNR), which reveals the trade-off between the defense noise level and the attack query cost. The attacker's query count (QC) is derived mathematically as a function of the noise standard deviation, which guides the defender in choosing a noise level that achieves the desired security level, specified by QC and the DNN performance loss. Our analysis shows that the added noise is drastically magnified relative to the small variations of the DNN outputs that the attacker relies on for gradient estimation, so the reconstructed gradient has an extremely low SNR. White noise with a very small standard deviation, e.g., less than 0.01, is enough to increase QC by many orders of magnitude without any noticeable reduction in classification accuracy. Our experiments demonstrate that this method effectively mitigates both soft-label and hard-label black-box attacks under realistic QC constraints. We also show that it outperforms many other defense methods and is robust to the attacker's countermeasures.
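As a rough illustration of the defense described above, the sketch below adds zero-mean Gaussian (white) noise to a model's output scores before they are returned to the querier. The PyTorch framing, the choice of perturbing softmax scores, the default sigma value, and the helper name noisy_predict are illustrative assumptions, not the authors' implementation.

    # Minimal sketch (assumptions noted above): perturb DNN outputs with white
    # Gaussian noise before returning them to a potentially adversarial querier.
    import torch
    import torch.nn.functional as F

    def noisy_predict(model, x, sigma=0.01, hard_label=False):
        """Return noise-perturbed outputs for a batch of inputs x.

        sigma: standard deviation of the added white noise (the abstract
        suggests values below 0.01 can suffice; this default is illustrative).
        """
        with torch.no_grad():
            probs = F.softmax(model(x), dim=-1)            # clean soft labels
            probs = probs + sigma * torch.randn_like(probs)  # add white noise
        if hard_label:
            return probs.argmax(dim=-1)  # hard-label API: return only the top class
        return probs                     # soft-label API: return noisy scores

Roughly speaking, a query-based attacker estimating gradients by finite differences divides already tiny output differences between nearby queries by a small input perturbation, so the added noise is magnified by the same factor and dominates the estimate, which is the low gradient SNR effect the analysis quantifies.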
Aithal, M. B., & Li, X. (2022). Mitigating Black-Box Adversarial Attacks via Output Noise Perturbation. IEEE Access, 10, 12395–12411. https://doi.org/10.1109/ACCESS.2022.3146198