Spatial-Temporal Information Aggregation and Cross-Modality Interactive Learning for RGB-D-Based Human Action Recognition


Abstract

RGB-D-based human action recognition is gaining increasing attention because the different modalities can provide complementary information. However, recognition performance is still not satisfactory due to the limited ability to learn spatial-temporal features and insufficient inter-modal interaction. In this paper, we propose a novel approach for RGB-D human action recognition that aggregates spatial-temporal information and implements cross-modality interactive learning. Firstly, a spatial-temporal information aggregation module (STIAM) is proposed, which uses simple convolutional neural networks (CNNs) to efficiently aggregate the spatial-temporal information of an entire RGB-D sequence into lightweight representations. This allows the model to extract richer spatial-temporal features with limited extra memory and computational cost. Secondly, a cross-modality interactive module (CMIM) is proposed to fully fuse the complementary multi-modal information. Moreover, a multi-modal interactive network (MMINet) is constructed for RGB-D-based action recognition by embedding the above two modules into two-stream CNNs. To verify the universality of our approach, two backbones are deployed in the two-stream architecture in turn. Ablation experiments demonstrate that the proposed STIAM brings significant improvement in recognizing actions, and that CMIM further exploits the complementary features of the multiple modalities. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets prove the effectiveness of the proposed approach, which achieves state-of-the-art accuracies of 94.3% (cross-subject) and 96.5% (cross-view) on NTU RGB+D 60, 91.7% (cross-subject) and 92.6% (cross-setup) on NTU RGB+D 120, and 93.6% (cross-subject) and 94.2% (cross-view) on PKU-MMD. Further analysis shows that our approach has advantages in recognizing subtle actions.
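To make the high-level description more concrete, the sketch below shows a toy two-stream RGB-D classifier in PyTorch with a temporal aggregation step and a cross-modality interaction step, loosely following the roles the abstract describes for STIAM and CMIM. All class names, tensor shapes, and the channel-gating fusion used here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a two-stream RGB-D action classifier: per-modality
# spatial-temporal aggregation, cross-modality interaction, late fusion.
# Module names, shapes and the fusion scheme are assumptions for illustration.
import torch
import torch.nn as nn


class SpatialTemporalAggregator(nn.Module):
    """Hypothetical stand-in for STIAM: collapses a frame sequence
    (B, T, C, H, W) into one lightweight map (B, C, H, W) with a small
    temporal convolution over the T axis."""

    def __init__(self, channels: int = 3, frames: int = 8):
        super().__init__()
        self.temporal_conv = nn.Conv3d(channels, channels, kernel_size=(frames, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W) -> (B, C, T, H, W) for Conv3d, then squeeze T.
        x = x.permute(0, 2, 1, 3, 4)
        return self.temporal_conv(x).squeeze(2)


class CrossModalityInteraction(nn.Module):
    """Hypothetical stand-in for CMIM: each modality re-weights the other
    via channel attention before fusion."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate_rgb = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                      nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.gate_depth = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                        nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, f_rgb, f_depth):
        # Each stream is gated by attention computed from the other stream.
        return f_rgb * self.gate_depth(f_depth), f_depth * self.gate_rgb(f_rgb)


class TwoStreamRGBDNet(nn.Module):
    """Toy two-stream network: aggregate -> per-modality CNN ->
    cross-modality interaction -> late fusion -> action logits."""

    def __init__(self, num_classes: int = 60, frames: int = 8):
        super().__init__()
        self.agg_rgb = SpatialTemporalAggregator(3, frames)
        self.agg_depth = SpatialTemporalAggregator(3, frames)
        self.backbone_rgb = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())
        self.backbone_depth = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())
        self.interact = CrossModalityInteraction(64)
        self.head = nn.Linear(128, num_classes)

    def forward(self, rgb, depth):
        f_rgb = self.backbone_rgb(self.agg_rgb(rgb))
        f_depth = self.backbone_depth(self.agg_depth(depth))
        f_rgb, f_depth = self.interact(f_rgb, f_depth)
        pooled = torch.cat([f_rgb.mean(dim=(2, 3)), f_depth.mean(dim=(2, 3))], dim=1)
        return self.head(pooled)


if __name__ == "__main__":
    model = TwoStreamRGBDNet(num_classes=60, frames=8)
    rgb = torch.randn(2, 8, 3, 112, 112)    # (batch, frames, channels, H, W)
    depth = torch.randn(2, 8, 3, 112, 112)  # depth maps encoded as 3-channel images (assumption)
    print(model(rgb, depth).shape)          # torch.Size([2, 60])
```

In this sketch the aggregation happens before the 2D backbones, so each stream processes a single fused frame rather than the whole sequence; that choice keeps the example short and reflects the abstract's claim of lightweight representations, but the real MMINet design may differ.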

Cite (APA)

Cheng, Q., Liu, Z., Ren, Z., Cheng, J., & Liu, J. (2022). Spatial-Temporal Information Aggregation and Cross-Modality Interactive Learning for RGB-D-Based Human Action Recognition. IEEE Access, 10, 104190–104201. https://doi.org/10.1109/ACCESS.2022.3201227
