Abstract
Vision transformers combined with self-supervised learning have enabled the development of models that scale across large datasets for several downstream tasks, including classification, segmentation, and detection. However, the potential of these models for low-shot learning across these downstream tasks remains largely underexplored. In this work, we conduct a systematic examination of different self-supervised pretext tasks, namely contrastive learning, clustering, and masked image modelling, to assess their low-shot capabilities by comparing different pretrained models. In addition, we explore the impact of various collapse avoidance techniques, such as centring, ME-MAX, and Sinkhorn, on these downstream tasks. Based on our detailed analysis, we introduce a framework that combines masked image modelling and clustering as pretext tasks. This framework demonstrates superior performance across all examined low-shot downstream tasks, including multi-class classification, multi-label classification, and semantic segmentation. Furthermore, when the model is tested on large-scale datasets, it shows performance gains across various tasks.
Cite
Nandam, S. R., Atito, S., Feng, Z., Kittler, J., & Awais, M. (2025). Investigating Self-Supervised Methods for Label-Efficient Learning. International Journal of Computer Vision, 133(7), 4522–4537. https://doi.org/10.1007/s11263-025-02397-4