Query-Efficient Black-Box Red Teaming via Bayesian Optimization

Abstract

The deployment of large-scale generative models is often restricted by their potential to harm users in unpredictable ways. We focus on black-box red teaming, where a red team generates test cases and interacts with the victim model to discover a diverse set of failures under limited query access. Existing red teaming methods construct test cases with human supervision or a language model (LM) and query all of them in a brute-force manner, without incorporating any information from past evaluations, which requires a prohibitively large number of queries. To address this, we propose Bayesian red teaming (BRT), a family of query-efficient black-box red teaming methods based on Bayesian optimization that iteratively identify diverse positive test cases (inputs that lead to model failures) by exploiting a pre-defined user input pool and past evaluations. Experimental results on various user input pools demonstrate that our method consistently finds significantly more diverse positive test cases than baseline methods under a limited query budget. The source code is available at https://github.com/snu-mllab/Bayesian-RedTeaming.
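
To make the abstract's select-query-update loop concrete, below is a minimal sketch of pool-based Bayesian optimization with a Gaussian-process surrogate and an upper-confidence-bound acquisition over a fixed candidate pool. Every specific in it (the random feature vectors, the query_victim score, the query budget, the 0.8 failure threshold) is a hypothetical placeholder, not a component of the authors' BRT method.

```python
# Minimal sketch of pool-based Bayesian optimization over a fixed candidate
# pool -- illustrative only, NOT the authors' BRT implementation. The feature
# vectors, query_victim() score, and failure threshold are placeholders.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Hypothetical pool: each candidate user input is represented by a fixed
# feature vector (in practice, e.g., a sentence embedding).
pool = rng.normal(size=(500, 8))

def query_victim(x):
    # Placeholder black-box evaluation: higher score = more likely a failure
    # case (e.g., an offensive response from the victim model).
    return float(x[0] + 0.3 * np.sin(3.0 * x[1]))

# Warm-start with a few random queries.
observed_idx = [int(i) for i in rng.choice(len(pool), size=10, replace=False)]
scores = [query_victim(pool[i]) for i in observed_idx]

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
budget = 50
while len(observed_idx) < budget:
    # Fit the surrogate on all past evaluations.
    gp.fit(pool[observed_idx], scores)
    remaining = [i for i in range(len(pool)) if i not in set(observed_idx)]
    mu, sigma = gp.predict(pool[remaining], return_std=True)
    # Upper-confidence-bound acquisition: trade off exploitation and exploration.
    ucb = mu + 1.0 * sigma
    best = remaining[int(np.argmax(ucb))]
    observed_idx.append(best)
    scores.append(query_victim(pool[best]))

# "Positive" test cases: queried inputs whose score exceeds a failure threshold.
positives = [i for i, s in zip(observed_idx, scores) if s > 0.8]
print(f"queries used: {len(observed_idx)}, positives found: {len(positives)}")
```

The sketch only mirrors the iterative loop described in the abstract (choose a candidate from the pool, query the black-box victim, update the surrogate with the new evaluation); how BRT represents test cases, scores responses, and encourages diversity among positives is detailed in the paper itself.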

Citation (APA)

Lee, D., Lee, J. Y., Ha, J. W., Kim, J. H., Lee, S. W., Lee, H., & Song, H. O. (2023). Query-Efficient Black-Box Red Teaming via Bayesian Optimization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 11551–11574). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.acl-long.646
