Deeply Coupled Cross-Modal Prompt Learning

Citations: 8 · Mendeley readers: 21

Abstract

Recent multimodal foundation models (e.g., CLIP) have excelled at zero-shot generalization. Prompt tuning, which transfers knowledge from foundation models to downstream tasks, has therefore attracted significant attention. Existing prompt-tuning methods for cross-modal learning, however, either focus solely on the language branch or model vision-language interaction only through a shallow mechanism. In this context, we propose a Deeply coupled Cross-modal Prompt learning (DCP) method built on CLIP. DCP flexibly accommodates the interplay between vision and language with a Cross-Modal Prompt Attention (CMPA) mechanism, which progressively and strongly exchanges representations between the two branches through a well-connected multi-head attention module. We conduct comprehensive few-shot learning experiments on 11 image classification datasets and analyze robustness to domain shift as well. Thorough experimental analysis demonstrates the superb few-shot generalization and compelling domain adaptation capacity of DCP. The code can be found at https://github.com/GingL/CMPA.
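
For intuition, below is a minimal sketch of a CMPA-style block in PyTorch: learnable prompts from the text and vision branches attend to each other through standard multi-head attention, so each branch's representation is updated with information from the other. The class name, dimensions, and projection layers are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class CrossModalPromptAttention(nn.Module):
    """Illustrative CMPA-style block (assumption, not the paper's code):
    text and vision prompts exchange information via multi-head attention."""

    def __init__(self, text_dim=512, vision_dim=768, num_heads=8):
        super().__init__()
        # Project text prompts into the vision width so the two sets can attend to each other.
        self.text_to_shared = nn.Linear(text_dim, vision_dim)
        self.shared_to_text = nn.Linear(vision_dim, text_dim)
        self.attn_text_to_vision = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        self.attn_vision_to_text = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)

    def forward(self, text_prompts, vision_prompts):
        # text_prompts: (batch, n_t, text_dim); vision_prompts: (batch, n_v, vision_dim)
        t = self.text_to_shared(text_prompts)
        # Vision prompts query the text prompts, and vice versa.
        vision_updated, _ = self.attn_text_to_vision(vision_prompts, t, t)
        text_updated, _ = self.attn_vision_to_text(t, vision_prompts, vision_prompts)
        return self.shared_to_text(text_updated), vision_updated

if __name__ == "__main__":
    cmpa = CrossModalPromptAttention()
    text_p = torch.randn(4, 16, 512)    # 16 learnable text prompt tokens
    vision_p = torch.randn(4, 16, 768)  # 16 learnable vision prompt tokens
    new_text_p, new_vision_p = cmpa(text_p, vision_p)
    print(new_text_p.shape, new_vision_p.shape)  # (4, 16, 512) and (4, 16, 768)

The sketch shows a single exchange step; per the abstract, DCP applies such coupling progressively across layers of both encoders.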

Cite (APA)

Liu, X., Tang, W., Lu, J., Zhao, R., Guo, Z., & Tan, F. (2023). Deeply Coupled Cross-Modal Prompt Learning. In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 7957–7970). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-acl.504
