TransVLAD: Focusing on Locally Aggregated Descriptors for Few-Shot Learning

Abstract

This paper presents a transformer framework for few-shot learning, termed TransVLAD, with a focus on showing the power of locally aggregated descriptors for few-shot learning. Our TransVLAD model is simple: a standard transformer encoder followed by a NeXtVLAD aggregation module that outputs the locally aggregated descriptors. In contrast to the prevailing use of CNNs as part of the feature extractor, we are the first to show that self-supervised learning such as masked autoencoders (MAE) can address the overfitting of transformers in few-shot image classification, and that few-shot learning benefits from this general-purpose pre-training. We then propose two methods to mitigate two few-shot biases: supervision bias and simple-characteristic bias. The first introduces a masking operation into fine-tuning, which accelerates fine-tuning (by more than 3x) and improves accuracy. The second adapts focal loss into a soft focal loss that focuses learning on hard characteristics. Our TransVLAD tops 10 benchmarks on five popular few-shot datasets by an average of more than 2%.
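To make the described pipeline concrete, below is a minimal PyTorch sketch (not the authors' released code) of the architecture the abstract outlines: a transformer encoder whose patch tokens are aggregated into locally aggregated descriptors by a VLAD-style head. The paper uses NeXtVLAD; the simplified NetVLAD-like aggregation shown here, and all hyperparameters and names (num_clusters, out_dim, dim, TransVLADSketch), are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VLADAggregation(nn.Module):
    """Soft-assign patch tokens to clusters and aggregate their residuals."""

    def __init__(self, dim: int = 768, num_clusters: int = 64, out_dim: int = 640):
        super().__init__()
        self.assign = nn.Linear(dim, num_clusters)                 # soft-assignment logits
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))
        self.reduce = nn.Linear(num_clusters * dim, out_dim)       # compress the descriptor

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) patch embeddings from the transformer encoder
        soft_assign = self.assign(tokens).softmax(dim=-1)          # (B, N, K)
        residuals = tokens.unsqueeze(2) - self.centroids           # (B, N, K, D)
        vlad = (soft_assign.unsqueeze(-1) * residuals).sum(dim=1)  # (B, K, D)
        vlad = F.normalize(vlad, dim=-1)                           # intra-cluster normalization
        vlad = F.normalize(vlad.flatten(1), dim=-1)                # global L2 normalization
        return self.reduce(vlad)                                   # (B, out_dim) descriptor


class TransVLADSketch(nn.Module):
    """Transformer encoder (e.g. an MAE-pre-trained ViT) + VLAD-style aggregation."""

    def __init__(self, encoder: nn.Module, dim: int = 768):
        super().__init__()
        self.encoder = encoder              # assumed to return (B, N, D) patch tokens
        self.aggregate = VLADAggregation(dim=dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        tokens = self.encoder(images)
        return self.aggregate(tokens)
```

A plausible reading of the abstract is that, during fine-tuning, only a random subset of patch tokens is fed through the encoder (the masking operation), which shrinks N and would account for the reported >3x speed-up; the soft focal loss would then be applied on top of the resulting descriptors. The exact formulations are in the paper and are not reproduced here.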

Citation (APA)

Li, H., Zhang, L., Zhang, D., Fu, L., Yang, P., & Zhang, J. (2022). TransVLAD: Focusing on Locally Aggregated Descriptors for Few-Shot Learning. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13680 LNCS, pp. 524–540). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-20044-1_30
