mCLIP: Multilingual CLIP via Cross-lingual Transfer


Abstract

Large-scale vision-language pretrained (VLP) models like CLIP have shown remarkable performance on various downstream cross-modal tasks. However, they are usually biased towards English due to the lack of sufficient non-English image-text pairs. Existing multilingual VLP methods often learn retrieval-inefficient single-stream models from translation-augmented non-English image-text pairs. In this paper, we introduce mCLIP, a retrieval-efficient dual-stream multilingual VLP model, trained by aligning the CLIP model and a Multilingual Text Encoder (MTE) through a novel Triangle Cross-modal Knowledge Distillation (TriKD) method. The approach is parameter-efficient, as only two lightweight projectors on top of the two encoders are updated during distillation. Furthermore, to strengthen the token- and sentence-level multilingual representations of the MTE, we train it jointly with machine translation and contrastive learning objectives before TriKD to provide a better initialization. Empirical results show that mCLIP achieves new state-of-the-art performance on both zero-shot and finetuned multilingual image-text retrieval tasks.
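The abstract describes the alignment setup only at a high level. The sketch below is a minimal illustration of that dual-stream arrangement in PyTorch, assuming the OpenAI CLIP interface (encode_image/encode_text), an MTE that returns a pooled sentence embedding, and a simple cosine-alignment distillation loss; the actual TriKD objective, projector shapes, and training details are defined in the paper, not here.

```python
# Minimal sketch, not the paper's implementation.
# Assumptions: projector dimensions, the cosine-alignment losses, and the
# MTE interface are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriKDAligner(nn.Module):
    def __init__(self, clip_model, mte, clip_dim=512, mte_dim=768, shared_dim=512):
        super().__init__()
        self.clip = clip_model.eval()   # frozen CLIP (image + English text towers)
        self.mte = mte.eval()           # frozen multilingual text encoder
        for p in self.clip.parameters():
            p.requires_grad = False
        for p in self.mte.parameters():
            p.requires_grad = False
        # Only these two lightweight projectors are updated during distillation.
        self.clip_proj = nn.Linear(clip_dim, shared_dim)
        self.mte_proj = nn.Linear(mte_dim, shared_dim)

    def forward(self, images, en_text_tokens, multi_text_tokens):
        with torch.no_grad():
            img_emb = self.clip.encode_image(images)        # teacher: image stream
            en_emb = self.clip.encode_text(en_text_tokens)  # teacher: English text stream
            mte_emb = self.mte(multi_text_tokens)           # student: multilingual text stream
        # Map teacher and student features into a shared space and normalize.
        img_z = F.normalize(self.clip_proj(img_emb.float()), dim=-1)
        en_z = F.normalize(self.clip_proj(en_emb.float()), dim=-1)
        mte_z = F.normalize(self.mte_proj(mte_emb.float()), dim=-1)
        # "Triangle" alignment: pull the multilingual text embedding toward both
        # the paired image embedding and the paired English caption embedding.
        loss_img = 1.0 - (mte_z * img_z).sum(dim=-1).mean()
        loss_txt = 1.0 - (mte_z * en_z).sum(dim=-1).mean()
        return loss_img + loss_txt
```

Because both backbones stay frozen, only the two projector layers receive gradients, which is what makes the distillation stage parameter-efficient in the sense described above.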

Citation (APA)

Chen, G., Hou, L., Chen, Y., Dai, W., Shang, L., Jiang, X., … Wang, W. (2023). mCLIP: Multilingual CLIP via Cross-lingual Transfer. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 13028–13043). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.acl-long.728
