Image classification is an important research direction in image processing and computer vision. It aims to identify the category of the object in an image and has considerable practical value. However, the classification performance of existing methods is often unsatisfactory because of the diversity of object shapes and types and the complexity of the imaging environment. Moreover, problems such as low classification accuracy and high false-positive rates seriously limit the use of image classification in downstream image processing and computer vision tasks. Therefore, improving image classification accuracy through postprocessing algorithms is highly desirable. Given the wide application of deep learning techniques, such as deep convolutional neural networks and generative adversarial networks, in natural image object detection, the application of deep learning to image classification has received great attention and has become a research hotspot in image processing and computer vision in recent years, and many excellent works have emerged. As a rising star, the vision Transformer (ViT) has attracted increasing interest in image processing tasks, particularly because of its strong long-range modeling and parallel sequence processing abilities. Several technical reviews of the Transformer have recently been published; they systematically summarize ViT and its variants from different angles and introduce applications of the Transformer in different visual tasks. These reviews help researchers study and track the progress of image classification technology. Compared with the traditional convolutional neural network (CNN), ViT achieves global modeling and parallel processing of the image by dividing the input image into patches, which greatly improves the image classification ability of the model.
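The patch division mentioned above can be illustrated with a minimal sketch. It assumes a single-channel image stored as a list of rows and a square, non-overlapping patch grid; the function name `split_into_patches` and the toy 4 x 4 image are hypothetical and are not taken from the surveyed paper. In an actual ViT, each flattened patch would then be linearly projected and combined with a positional embedding before entering the Transformer encoder.

```python
def split_into_patches(image, patch_size):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order, which is the order in which
    a ViT-style model would treat them as a token sequence.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a 1D vector.
            patch = [
                image[top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 4 x 4 "image" whose pixel values are 0..15 (illustrative only).
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = split_into_patches(image, 2)
```

With a 4 x 4 input and 2 x 2 patches, the sequence contains four tokens of four values each; every token can attend to every other token, which is what gives the model its global view of the image.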
However, many problems, such as poor scalability, high computational overhead, slow convergence, and attention collapse, still exist because of the complexity of image classification problems and the diversity of ViT development. These problems can be addressed by ViT variants designed for image processing tasks. Moreover, very few reviews help scholars comprehensively understand the latest progress of ViT for image processing tasks from a global perspective. Therefore, on the basis of a thorough study of the latest reviews and related research, the present study systematically compares and summarizes ViT algorithms for image classification to help scholars understand the latest progress of ViT-based image classification research. Unlike existing review papers, our work focuses on domestic and international research methods from the past 2 years (January 2021 to December 31, 2022). We begin by describing the basic concept, principle, and structure of the traditional Transformer model for ease of understanding. First, we introduce the attention mechanism and the multi-head attention mechanism. Then, the feed-forward neural network and positional encoding are described. Finally, the model structure of the traditional Transformer is presented. Afterward, the evolution of the Transformer model and its applications in image processing in recent years are outlined. Then, the concept, principle, and structure of ViT are briefly introduced. Various vision Transformer models and their applications in image classification are described in detail according to the problems faced by ViT. Different solutions, including scalable positional encoding, low complexity, low computing cost, local and global information fusion, and deep ViT models, are described one by one.
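As a point of reference for the attention mechanism named above, the following is a minimal sketch of standard scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, written with the Python standard library only. It is a generic illustration under the assumption that queries, keys, and values are given as lists of equal-length vectors; the function names and toy inputs are hypothetical and do not reproduce any specific model from the survey.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Toy input: a zero query scores every key equally, so the output is
# the plain average of the two value vectors.
attended = scaled_dot_product_attention(
    queries=[[0.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 2.0], [3.0, 4.0]],
)
```

Multi-head attention applies this same operation several times in parallel on separately projected copies of the inputs and concatenates the results. Because every query is compared with every key, the cost grows quadratically with the number of patches, which is one source of the computational overhead listed among the problems above.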
Experiments on ImageNet, Canadian Institute for Advanced Research (CIFAR-10), and CIFAR-100 are provided, and many evaluations are presented to demonstrate the classification performance of ViT and its variants. Two indicators, namely, accuracy and parameter count, are adopted to evaluate the experimental results. The number of floating-point operations (FLOPs) is also used to analyze the performance of the models comprehensively. Given that the Transformer has also been widely used in remote sensing image classification in recent years, the present study compares and analyzes Transformer-based remote sensing image classification methods. Experiments are performed on the Indian Pines, Trento, and Salinas hyperspectral image datasets to evaluate the Transformer for remote sensing image classification. Three indicators, namely, overall accuracy (OA), average accuracy (AA), and the Kappa coefficient, are employed in this work. Finally, the problems and challenges faced by the current application of ViT in image classification are presented, and future research and development trends are discussed.
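The three remote sensing indicators named above have standard definitions, which the following minimal sketch computes from true and predicted labels: OA is the fraction of correctly classified samples, AA is the mean of the per-class accuracies (recalls), and Kappa is Cohen's kappa, (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The function name `classification_indicators` and the toy labels are hypothetical; they are not results or code from the surveyed paper.

```python
def classification_indicators(y_true, y_pred):
    """Return (OA, AA, Kappa) for two equal-length label sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and equal in length")

    classes = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(classes)}
    n = len(classes)

    # confusion[i][j] counts samples of true class i predicted as class j.
    confusion = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        confusion[index[t]][index[p]] += 1

    total = len(y_true)
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n)) for j in range(n)]

    # Overall accuracy: correctly classified samples over all samples.
    oa = sum(confusion[i][i] for i in range(n)) / total

    # Average accuracy: mean recall over classes present in y_true.
    recalls = [confusion[i][i] / row_sums[i] for i in range(n) if row_sums[i]]
    aa = sum(recalls) / len(recalls)

    # Cohen's kappa: agreement corrected for chance agreement p_e.
    p_e = sum(r * c for r, c in zip(row_sums, col_sums)) / (total ** 2)
    kappa = (oa - p_e) / (1 - p_e) if p_e != 1 else float("nan")
    return oa, aa, kappa


# Toy labels (illustrative only): one of four class-0 samples is misclassified.
oa, aa, kappa = classification_indicators(
    y_true=[0, 0, 0, 0, 1, 1],
    y_pred=[0, 0, 0, 1, 1, 1],
)
```

AA weights every class equally regardless of its size, so it can differ noticeably from OA on imbalanced hyperspectral datasets, while Kappa discounts the agreement that would be expected from the class frequencies alone.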
CITATION STYLE
Shi, Z., Li, C., Zhou, L., Zhang, Z., Wu, C., You, Z., & Ren, W. (2023). Survey on Transformer for image classification. Journal of Image and Graphics, 28(9), 2661–2692. https://doi.org/10.11834/jig.220799