Abstract
Masked language modeling (MLM) has been one of the most popular pretraining recipes in natural language processing, with BERT as a representative model. Recently, contrastive language-image pretraining (CLIP) has also attracted attention, especially its vision models, which achieve excellent performance on a broad range of vision tasks. However, few studies are dedicated to the text encoders learned by CLIP. In this paper, we analyze the difference between BERT-style and CLIP-style text encoders through three experiments: (i) general text understanding, (ii) vision-centric text understanding, and (iii) text-to-image generation. Experimental analyses show that although CLIP-style text encoders underperform BERT-style ones on general text understanding tasks, they are equipped with a unique ability, i.e., synesthesia, for cross-modal association, which more closely resembles human senses. Our code is released at https://github.com/zhjohnchan/probing-clip-dev.
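For readers unfamiliar with the two encoder families being compared, the following is a minimal sketch (not the paper's released code) of how sentence features can be extracted from a BERT-style and a CLIP-style text encoder with HuggingFace Transformers; the specific checkpoints ("bert-base-uncased", "openai/clip-vit-base-patch32") and pooling choices are illustrative assumptions, not necessarily those used in the paper.

```python
# Sketch: extract a sentence embedding from a BERT-style encoder and a
# CLIP-style text encoder. Checkpoints below are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModel, CLIPTokenizer, CLIPTextModel

text = "a photo of a red apple on a wooden table"

# BERT-style encoder: take the [CLS] token of the last hidden layer.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    bert_feat = bert(**bert_tok(text, return_tensors="pt")).last_hidden_state[:, 0]

# CLIP-style text encoder: take the pooled (EOS-token) output of the text tower.
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip_text = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
with torch.no_grad():
    clip_feat = clip_text(**clip_tok(text, return_tensors="pt")).pooler_output

print(bert_feat.shape, clip_feat.shape)  # e.g. torch.Size([1, 768]) torch.Size([1, 512])
```

Features obtained this way can then be fed to lightweight probes (e.g., linear classifiers) for the kinds of text-understanding comparisons the abstract describes.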
Citation
Chen, Z., Chen, G. H., Diao, S., Wan, X., & Wang, B. (2023). On the Difference of BERT-style and CLIP-style Text Encoders. In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 13710–13721). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-acl.866