Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective

Abstract

Transferring knowledge from pre-trained deep models to downstream tasks, particularly when labeled samples are limited, is a fundamental problem in computer vision research. Recent advances in large-scale, task-agnostic vision-language pre-trained models, trained on billions of samples, have shed new light on this problem. In this study, we investigate how to efficiently transfer aligned visual and textual knowledge to downstream visual recognition tasks. We first revisit the role of the linear classifier in the vanilla transfer learning framework, and then propose a new paradigm in which the classifier parameters are initialized with semantic targets from the textual encoder and kept fixed during optimization. For comparison, we also initialize the classifier with knowledge from various other sources. Our empirical study demonstrates that this paradigm improves both the performance and the training speed of transfer learning tasks. With only minor modifications, the approach proves effective across 17 visual datasets spanning three data domains: image, video, and 3D point cloud.
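
The core idea described in the abstract, replacing the randomly initialized linear classifier with class-name embeddings from the text encoder and keeping them frozen while the visual backbone is fine-tuned, can be sketched roughly as follows. This is a minimal PyTorch illustration rather than the authors' implementation; the names `FrozenTextClassifier`, `class_text_features`, and `visual_encoder`, and the choice of cosine-similarity logits, are assumptions made for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrozenTextClassifier(nn.Module):
    """Linear classifier whose weights are text embeddings of the class names.

    `class_text_embeddings` is a (num_classes, dim) tensor assumed to come from
    the text encoder of a CLIP-like vision-language model, e.g. by encoding
    prompts such as "a photo of a {class name}". The weights are registered as
    a buffer, so they stay fixed during optimization.
    """

    def __init__(self, class_text_embeddings: torch.Tensor):
        super().__init__()
        # L2-normalize so that logits are cosine similarities, matching
        # CLIP-style pre-training.
        self.register_buffer("weight", F.normalize(class_text_embeddings, dim=-1))

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        visual_features = F.normalize(visual_features, dim=-1)
        return visual_features @ self.weight.t()


# Hypothetical usage: `visual_encoder`, `class_text_features`, `images`, and
# `labels` stand in for the pre-trained backbone being fine-tuned, the
# text-encoder outputs for the class names, and a training batch.
# classifier = FrozenTextClassifier(class_text_features)
# logits = classifier(visual_encoder(images))
# loss = F.cross_entropy(logits, labels)  # gradients update the encoder only
```

In this sketch, only the visual encoder receives gradients; the frozen text-derived classifier supplies the semantic targets, which is the paradigm the abstract describes.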

Cite (APA)

Wu, W., Sun, Z., Song, Y., Wang, J., & Ouyang, W. (2024). Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective. International Journal of Computer Vision, 132(2), 392–409. https://doi.org/10.1007/s11263-023-01876-w
