On the Difference of BERT-style and CLIP-style Text Encoders

Abstract

Masked language modeling (MLM) has been one of the most popular pretraining recipes in natural language processing, with BERT as one of its representative models. Recently, contrastive language-image pretraining (CLIP) has also attracted attention, especially its vision models, which achieve excellent performance on a broad range of vision tasks. However, few studies are dedicated to the text encoders learned by CLIP. In this paper, we analyze the difference between BERT-style and CLIP-style text encoders through three experiments: (i) general text understanding, (ii) vision-centric text understanding, and (iii) text-to-image generation. Experimental analyses show that although CLIP-style text encoders underperform BERT-style ones on general text understanding tasks, they are equipped with a unique ability, i.e., synesthesia, for cross-modal association, which is more similar to human senses. Our code is released at https://github.com/zhjohnchan/probing-clip-dev.
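To illustrate how such a comparison between the two encoder families can be set up, the sketch below extracts sentence-level features from a BERT-style and a CLIP-style text encoder with the Hugging Face transformers library. This is a minimal, hypothetical example, not the authors' released code: the model checkpoints and pooling choices (CLS token for BERT, pooled EOS token for CLIP) are assumptions.

```python
# Sketch: extract sentence features from a BERT-style and a CLIP-style text
# encoder so they can be compared with a frozen-backbone linear probe.
# Checkpoint names and pooling choices are assumptions, not the paper's setup.
import torch
from transformers import AutoTokenizer, AutoModel, CLIPTokenizer, CLIPTextModel

sentences = ["A photo of a red apple on a wooden table."]

# BERT-style encoder: take the [CLS] token representation as the sentence feature.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()
with torch.no_grad():
    inputs = bert_tok(sentences, return_tensors="pt", padding=True, truncation=True)
    bert_feat = bert(**inputs).last_hidden_state[:, 0]  # shape: (batch, 768)

# CLIP-style encoder: take the pooled output (EOS token representation).
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip_text = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()
with torch.no_grad():
    inputs = clip_tok(sentences, return_tensors="pt", padding=True, truncation=True)
    clip_feat = clip_text(**inputs).pooler_output  # shape: (batch, 512)

# Either feature can then be fed to a lightweight probe (e.g., logistic
# regression) on a text-understanding benchmark to compare the two encoders.
print(bert_feat.shape, clip_feat.shape)
```

In this setup the encoders stay frozen and only the probe is trained, so differences in downstream accuracy can be attributed to the representations themselves rather than to fine-tuning capacity.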

Citation (APA)

Chen, Z., Chen, G. H., Diao, S., Wan, X., & Wang, B. (2023). On the Difference of BERT-style and CLIP-style Text Encoders. In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 13710–13721). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-acl.866
