DueT: Image-Text Contrastive Transfer Learning with Dual-adapter Tuning

4Citations
Citations of this article
12Readers
Mendeley users who have this article in their library.
Get full text

Abstract

This paper presents DueT, a novel transfer learning method for vision and language models built by contrastive learning. In DueT, adapters are inserted into the image and text encoders, which have been initialized using models pre-trained on uni-modal corpora and then frozen. By training only these adapters, DueT enables efficient learning with a reduced number of trainable parameters. Moreover, unlike traditional adapters, those in DueT are equipped with a gating mechanism, enabling effective transfer and connection of knowledge acquired from pre-trained uni-modal encoders while preventing catastrophic forgetting. We report that DueT outperformed simple finetuning, the conventional method fixing only the image encoder and training only the text encoder, and the LoRA-based adapter method in accuracy and parameter efficiency for 0-shot image and text retrieval in both English and Japanese domains.

Cite

CITATION STYLE

APA

Hasegawa, T., Nishida, K., Maeda, K., & Saito, K. (2023). DueT: Image-Text Contrastive Transfer Learning with Dual-adapter Tuning. In EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings (pp. 13607–13624). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.emnlp-main.839

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free