Abstract
In the Model-as-a-Service (MaaS) scenario, pre-trained models are usually released as inference APIs, and users query them with manually crafted prompts. Without access to the network structure and gradient information, performing continuous prompt tuning on MaaS is difficult, especially for vision-language models (VLMs), which must account for cross-modal interaction. In this paper, we propose a black-box prompt tuning framework for VLMs that learns task-relevant prompts without back-propagation. In particular, the vision and language prompts are jointly optimized in an intrinsic parameter subspace with various evolution strategies. Different prompt variants are also explored to enhance the cross-modal interaction. Experimental results show that our proposed black-box prompt tuning framework outperforms both hand-crafted prompt engineering and gradient-based prompt learning methods, demonstrating its ability to train task-relevant prompts in a derivative-free manner.
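The core idea described above can be illustrated with a minimal sketch: a low-dimensional intrinsic vector is projected into the prompt embedding space by a fixed random matrix, and a simple evolution strategy searches that subspace using only forward queries. All names here (`query_model`, the quadratic stand-in loss, the (1+λ) strategy) are illustrative assumptions, not the authors' actual implementation or the evolution strategies used in the paper.

```python
import numpy as np

# Hypothetical sketch of derivative-free prompt tuning in an intrinsic
# subspace. The black-box model is faked with a quadratic loss so the
# example runs end to end without any real inference API.

rng = np.random.default_rng(0)

D_PROMPT = 64     # dimensionality of a (flattened) continuous prompt
D_INTRINSIC = 8   # low-dimensional subspace that is actually searched

# Fixed random projection from the intrinsic subspace to prompt space.
A = rng.standard_normal((D_PROMPT, D_INTRINSIC)) / np.sqrt(D_INTRINSIC)

# Stand-in for the black-box inference API: returns a task loss for a prompt.
target = rng.standard_normal(D_PROMPT)
def query_model(prompt: np.ndarray) -> float:
    return float(np.sum((prompt - target) ** 2))

def loss(z: np.ndarray) -> float:
    # Only forward queries are used; no gradients flow through the model.
    return query_model(A @ z)

def evolve(steps: int = 200, lam: int = 10, sigma: float = 0.3) -> np.ndarray:
    # A basic (1+lambda) evolution strategy: perturb, evaluate, keep the best.
    z = np.zeros(D_INTRINSIC)
    best = loss(z)
    for _ in range(steps):
        candidates = z + sigma * rng.standard_normal((lam, D_INTRINSIC))
        losses = [loss(c) for c in candidates]
        i = int(np.argmin(losses))
        if losses[i] < best:
            z, best = candidates[i], losses[i]
    return z

z_star = evolve()
```

Searching the 8-dimensional intrinsic vector rather than the full 64-dimensional prompt is what keeps the query budget of a derivative-free optimizer manageable; the paper applies this idea jointly to vision and language prompts.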
Citation
Yu, L., Chen, Q., Lin, J., & He, L. (2023). Black-box Prompt Tuning for Vision-Language Model as a Service. In IJCAI International Joint Conference on Artificial Intelligence (Vol. 2023-August, pp. 1686–1694). International Joint Conferences on Artificial Intelligence. https://doi.org/10.24963/ijcai.2023/187