Global and Local Feature Interaction with Vision Transformer for Few-shot Image Classification

Abstract

Image classification is a classical machine learning task that has been widely applied. Because annotation and data collection are costly in real-world scenarios, few-shot learning has become a vital technique for improving image classification performance. However, most existing few-shot image classification methods model only the global image feature or local image patches, ignoring global-local interactions. In this study, we propose a new method, named GL-ViT, that integrates global and local features to fully exploit the few-shot samples for image classification. First, we design a feature extractor module that computes the interactions between the global representation and local patch embeddings, where ViT is adopted to obtain efficient and effective image representations. Then, Earth Mover's Distance is used to measure the similarity between two images. Extensive experimental results on several widely used open datasets show that GL-ViT significantly outperforms state-of-the-art algorithms, and our ablation studies verify the effectiveness of the global-local feature interaction.
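The Earth Mover's Distance step described above can be sketched as an optimal-transport problem between two sets of patch embeddings. The following is a minimal illustration, not the paper's exact formulation: it assumes uniform patch weights and a cost of one minus cosine similarity, and solves the transport plan with a generic linear-programming solver.

```python
import numpy as np
from scipy.optimize import linprog

def emd_similarity(patches_a, patches_b):
    """Hypothetical sketch of EMD-based image similarity.

    patches_a: (n, d) array of patch embeddings for image A.
    patches_b: (m, d) array of patch embeddings for image B.
    Returns 1 - EMD, so larger values mean more similar images.
    """
    n, m = len(patches_a), len(patches_b)
    # Pairwise cost: 1 - cosine similarity between patch embeddings.
    a = patches_a / np.linalg.norm(patches_a, axis=1, keepdims=True)
    b = patches_b / np.linalg.norm(patches_b, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T  # shape (n, m)
    # Uniform marginals (the paper may derive patch weights differently).
    # Decision variables: flattened transport plan F of shape (n*m,).
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):           # each row of F sums to 1/n
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):           # each column of F sums to 1/m
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([np.full(n, 1.0 / n), np.full(m, 1.0 / m)])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return 1.0 - res.fun
```

For identical patch sets the optimal plan moves no mass across differing patches, so the similarity is 1; for mutually orthogonal patch sets it drops to 0.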

APA

Sun, M., Ma, W., & Liu, Y. (2022). Global and Local Feature Interaction with Vision Transformer for Few-shot Image Classification. In International Conference on Information and Knowledge Management, Proceedings (pp. 4530–4534). Association for Computing Machinery. https://doi.org/10.1145/3511808.3557604
