DenseDINO: Boosting Dense Self-Supervised Learning with Token-Based Point-Level Consistency


Abstract

In this paper, we propose a simple yet effective transformer framework for self-supervised learning, called DenseDINO, that learns dense visual representations. To exploit the spatial information that dense prediction tasks require but existing self-supervised transformers neglect, we introduce point-level supervision across views in a novel token-based way. Specifically, DenseDINO adds extra input tokens, called reference tokens, that match point-level features using a positional prior. With these reference tokens, the model can maintain spatial consistency and handle complex scenes containing multiple objects, and thus generalizes better to dense prediction tasks. Compared with vanilla DINO, our approach obtains competitive performance on ImageNet classification and achieves a large improvement (+7.2% mIoU) in semantic segmentation on PascalVOC under the linear probing protocol for segmentation.
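The paper's code is not reproduced on this page, so the following is only a minimal sketch of the token-based point-level consistency idea as the abstract describes it. All names (`ReferenceTokenEncoder`, `point_positional_encoding`) are illustrative, the tiny transformer stands in for the DINO ViT backbone, the sampled points are assumed to be expressed in coordinates shared by both augmented views (in practice they would be mapped through the crop geometry), and the DINO-style cross-entropy on projected features is simplified here to a cosine loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def point_positional_encoding(points, dim):
    # points: (B, N, 2) normalized (x, y) coordinates in [0, 1].
    # Sinusoidal encoding of each coordinate, concatenated to width `dim`
    # (assumes dim is divisible by 4).
    half = dim // 4
    freqs = torch.exp(
        torch.arange(half, dtype=torch.float32, device=points.device)
        * (-torch.log(torch.tensor(10000.0)) / half)
    )
    angles = points.unsqueeze(-1) * freqs           # (B, N, 2, half)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return enc.flatten(-2)                          # (B, N, dim)


class ReferenceTokenEncoder(nn.Module):
    """Toy ViT-style encoder that appends reference tokens built from
    sampled point coordinates, so each reference token can attend to the
    patch tokens around its encoded position."""

    def __init__(self, dim=192, depth=2, heads=3, num_patches=196):
        super().__init__()
        self.patch_pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, patch_tokens, points):
        # patch_tokens: (B, P, D) patch embeddings; points: (B, N, 2).
        ref = point_positional_encoding(points, patch_tokens.size(-1))
        x = torch.cat([patch_tokens + self.patch_pos, ref], dim=1)
        x = self.blocks(x)
        return x[:, -points.size(1):]               # reference-token outputs


# Cross-view point-level consistency: the same image points are fed to a
# student view and a (stop-gradient) teacher view, and their
# reference-token outputs are matched.
if __name__ == "__main__":
    B, P, D, N = 2, 196, 192, 8
    student = ReferenceTokenEncoder(dim=D, num_patches=P)
    teacher = ReferenceTokenEncoder(dim=D, num_patches=P)

    points = torch.rand(B, N, 2)     # points shared across both views (assumption)
    view1 = torch.randn(B, P, D)     # patch tokens of augmented view 1
    view2 = torch.randn(B, P, D)     # patch tokens of augmented view 2

    s_out = student(view1, points)
    with torch.no_grad():
        t_out = teacher(view2, points)
    loss = 1 - F.cosine_similarity(s_out, t_out, dim=-1).mean()
    print(loss.item())
```

Appending the reference tokens to the patch sequence lets each of them aggregate features from the patches near its position, which is what gives the cross-view consistency loss its point-level, rather than image-level, granularity.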

Cite (APA)

Yuan, Y., Fu, X., Yu, Y., & Li, X. (2023). DenseDINO: Boosting Dense Self-Supervised Learning with Token-Based Point-Level Consistency. In IJCAI International Joint Conference on Artificial Intelligence (Vol. 2023-August, pp. 1695–1703). International Joint Conferences on Artificial Intelligence. https://doi.org/10.24963/ijcai.2023/188
