HELViT: highly efficient lightweight vision transformer for remote sensing image scene classification

Dongen Guo; Zechen Wu; Jiangfan Feng; Zhuoke Zhou; Zhen Shen

Journal Article

Applied Intelligence (2023) 53(21) 24947-24962

DOI: 10.1007/s10489-023-04725-y

Abstract

Remote sensing image scene classification methods based on convolutional neural networks (CNNs) have been extremely successful. However, the inherent limitations of CNNs make it difficult for them to acquire global information. The traditional Vision Transformer can effectively capture long-distance dependencies and thereby acquire global information, but it is computationally intensive. In addition, each scene class in remote sensing images contains a large number of similar background or foreground features. To leverage those similar features effectively and reduce computation, a highly efficient lightweight vision transformer (HELViT) is proposed. HELViT is a hybrid model that combines a CNN and a Transformer, and it consists of the Convolution and Attention Block (CAB) and the Convolution and Token Merging Block (CTMB). Specifically, in the CAB module, the embedding layer of the original Vision Transformer is replaced with a modified MBConv (MBConv*), and Fast Multi-Head Self-Attention (F-MHSA) is used to reduce the complexity of the self-attention mechanism from quadratic to linear. To further decrease the model's computational cost, CTMB employs adaptive token merging (ATOME) to fuse related foreground or background features. Experimental results on the UCM, AID and NWPU datasets show that the proposed model achieves better accuracy and efficiency than state-of-the-art remote sensing scene classification methods. On the most challenging dataset, NWPU, HELViT achieves the highest accuracy of 94.64%/96.84% at 4.6 GMACs for 10%/20% training samples, respectively.
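The abstract does not specify how ATOME selects or fuses tokens, so the following is only a minimal sketch of the general idea of token merging: tokens with similar features are averaged together so that later layers process fewer tokens. It assumes a simple bipartite scheme in which tokens at even positions are matched to their most similar token at odd positions by cosine similarity, and the `r` best-matched pairs are fused. The names `cosine` and `merge_similar_tokens` and the parameter `r` are hypothetical and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def merge_similar_tokens(tokens, r):
    """Fuse up to r tokens into their most similar partners.

    Assumption (not from the paper): even-indexed tokens form set A,
    odd-indexed tokens form set B, and each merged A token is averaged
    into its best match in B. Returns a new list of tokens.
    """
    a_idx = list(range(0, len(tokens), 2))
    b_idx = list(range(1, len(tokens), 2))
    if not b_idx or r <= 0:
        return [list(t) for t in tokens]
    r = min(r, len(a_idx))

    # For every A token, find its most similar B token.
    matches = []
    for i in a_idx:
        j = max(b_idx, key=lambda k: cosine(tokens[i], tokens[k]))
        matches.append((cosine(tokens[i], tokens[j]), i, j))

    # Keep only the r most similar pairs for merging.
    matches.sort(key=lambda m: m[0], reverse=True)
    merged_into = {i: j for _, i, j in matches[:r]}

    # Accumulate feature sums and counts for each B token.
    sums = {j: list(tokens[j]) for j in b_idx}
    counts = {j: 1 for j in b_idx}
    for i, j in merged_into.items():
        sums[j] = [s + x for s, x in zip(sums[j], tokens[i])]
        counts[j] += 1

    # Rebuild the sequence in original order, skipping merged-away tokens.
    out = []
    for k in range(len(tokens)):
        if k in merged_into:
            continue
        if k in sums:
            out.append([s / counts[k] for s in sums[k]])
        else:
            out.append(list(tokens[k]))
    return out


if __name__ == "__main__":
    feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
    print(merge_similar_tokens(feats, 2))
```

Each call shortens the sequence by at most `r` tokens, which is where the saving in later attention layers would come from. An adaptive scheme such as the one the abstract names would presumably choose which and how many tokens to merge from the data rather than from a fixed `r`, but that detail is not given here.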

Cite

Guo, D., Wu, Z., Feng, J., Zhou, Z., & Shen, Z. (2023). HELViT: highly efficient lightweight vision transformer for remote sensing image scene classification. Applied Intelligence, 53(21), 24947–24962. https://doi.org/10.1007/s10489-023-04725-y
