Vision transformers for remote sensing image classification

398 Citations (citations of this article)
315 Readers (Mendeley users who have this article in their library)

Abstract

In this paper, we propose a remote-sensing scene-classification method based on vision transformers. These networks, now recognized as state-of-the-art models in natural language processing, do not rely on convolution layers as standard convolutional neural networks (CNNs) do. Instead, they use multihead attention mechanisms as the main building block to derive long-range contextual relations between pixels in images. In the first step, the images under analysis are divided into patches, which are then converted to a sequence by flattening and embedding. To retain positional information, position embeddings are added to these patches. The resulting sequence is then fed to several multihead attention layers to generate the final representation. At the classification stage, the first token of the sequence is fed to a softmax classification layer. To boost classification performance, we explore several data augmentation strategies for generating additional training data. Moreover, we show experimentally that we can compress the network by pruning half of the layers while keeping competitive classification accuracy. Experimental results on different remote-sensing image datasets demonstrate the promising capability of the model compared to state-of-the-art methods. Specifically, the Vision Transformer obtains average classification accuracies of 98.49%, 95.86%, 95.56% and 93.83% on the Merced, AID, Optimal31 and NWPU datasets, respectively, while the compressed version obtained by removing half of the multihead attention layers yields 97.90%, 94.27%, 95.30% and 93.05%, respectively.
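As a rough illustration of the pipeline the abstract describes (patch embedding, position embeddings, a stack of multihead-attention encoder layers, classification from the first token, and compression by pruning half of the layers), a minimal PyTorch sketch might look like the following. The class name, dimensions, and the prune_half_the_layers helper are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn


class SimpleViTClassifier(nn.Module):
    # Minimal sketch of a ViT-style classifier; hyperparameters are illustrative.
    def __init__(self, image_size=224, patch_size=16, dim=768,
                 depth=12, heads=12, num_classes=21):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2

        # 1) Split the image into patches and embed them; a strided convolution
        #    is equivalent to flattening each patch and applying a linear layer.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

        # 2) Learnable class token and position embeddings (retain positional information).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

        # 3) Stack of multihead self-attention encoder layers.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        # 4) Classification head applied to the first (class) token; softmax is
        #    applied implicitly by the cross-entropy loss during training.
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        b = x.size(0)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, dim) patch sequence
        cls = self.cls_token.expand(b, -1, -1)                # (B, 1, dim) class token
        x = torch.cat([cls, x], dim=1) + self.pos_embed       # add position embeddings
        x = self.encoder(x)                                   # multihead attention layers
        return self.head(x[:, 0])                             # logits from the first token


def prune_half_the_layers(model: SimpleViTClassifier) -> SimpleViTClassifier:
    # Illustrative compression step: keep only the first half of the encoder layers.
    kept = len(model.encoder.layers) // 2
    model.encoder.layers = model.encoder.layers[:kept]
    model.encoder.num_layers = kept
    return model


if __name__ == "__main__":
    model = SimpleViTClassifier(num_classes=21)               # e.g. 21 classes for Merced
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)                                       # torch.Size([2, 21])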


Citation (APA)

Bazi, Y., Bashmal, L., Al Rahhal, M. M., Dayil, R. A., & Ajlan, N. A. (2021). Vision transformers for remote sensing image classification. Remote Sensing, 13(3), 1–20. https://doi.org/10.3390/rs13030516

Readers over time: chart of yearly reader counts, '21–'25 (scale 0–100).

Readers' Seniority

PhD / Post grad / Masters / Doc: 68 (70%)
Lecturer / Post doc: 12 (12%)
Researcher: 11 (11%)
Professor / Associate Prof.: 6 (6%)

Readers' Discipline

Computer Science: 57 (63%)
Engineering: 22 (24%)
Earth and Planetary Sciences: 6 (7%)
Environmental Science: 6 (7%)

Article Metrics

Blog Mentions: 1
Social Media Shares, Likes & Comments: 5
