Human Action Recognition Based on 3D Convolution and Multi-Attention Transformer


Abstract

To address the limitations of traditional two-stream networks, such as inadequate spatiotemporal information fusion, limited feature diversity, and insufficient accuracy, we propose an improved two-stream network for human action recognition based on the fusion of a multi-scale attention Transformer and a 3D convolutional (C3D) network. In the temporal stream, the traditional 2D convolution is replaced with a C3D network to effectively capture temporal dynamics alongside spatial features. In the spatial stream, a multi-scale convolutional Transformer encoder is introduced to extract features. Leveraging the multi-scale attention mechanism, the model captures and enhances features at various scales, which are then adaptively fused using a weighted strategy to improve feature representation. Furthermore, through extensive experiments on feature fusion methods, the optimal fusion strategy for the two-stream network is identified. Experimental results on the benchmark datasets UCF101 and HMDB51 demonstrate that the proposed model achieves superior performance in action recognition tasks.
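The adaptive weighted fusion described above can be sketched minimally: each stream produces a feature vector, and the two are combined with softmax-normalized scalar weights. This is an illustrative sketch only; the function and weight names are assumptions, and the paper's actual fusion strategy (and its learned weights) is determined empirically.

```python
import math


def softmax(weights):
    """Numerically stable softmax over a list of scalars."""
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(spatial_feat, temporal_feat, w_spatial=1.0, w_temporal=1.0):
    """Weighted fusion of two stream features (hypothetical helper).

    The scalar weights are softmax-normalized so the fused feature is a
    convex combination of the spatial and temporal representations.
    """
    a, b = softmax([w_spatial, w_temporal])
    return [a * s + b * t for s, t in zip(spatial_feat, temporal_feat)]


# With equal weights, fusion reduces to the elementwise average.
fused = fuse_streams([2.0, 4.0], [0.0, 0.0])
```

In practice the weights would be learnable parameters trained jointly with both streams, so the network can emphasize whichever stream is more informative for a given action class.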

Citation (APA)

Liu, M., Li, W., He, B., Wang, C., & Qu, L. (2025). Human Action Recognition Based on 3D Convolution and Multi-Attention Transformer. Applied Sciences (Switzerland), 15(5). https://doi.org/10.3390/app15052695
