Multi-scale spatial-temporal integration convolutional tube for human action recognition

5 Citations · 18 Mendeley Readers

Abstract

Applying multi-scale representations leads to consistent performance improvements on a wide range of image recognition tasks. However, with the addition of the temporal dimension in the video domain, directly obtaining layer-wise multi-scale spatial-temporal features adds considerable extra computational cost. In this work, we propose a novel and efficient Multi-Scale Spatial-Temporal Integration Convolutional Tube (MSTI) that aims at accurate action recognition with lower computational cost. It first extracts multi-scale spatial and temporal features through a multi-scale convolution block. To account for the interaction between representations at different scales, and between spatial appearance and temporal motion, we employ cross-scale attention weighted blocks that perform feature recalibration by integrating the multi-scale spatial and temporal features. We also present an end-to-end deep network, MSTI-Net, built from the proposed MSTI tube for human action recognition. Extensive experimental results show that MSTI-Net significantly boosts the performance of existing convolutional networks and achieves state-of-the-art accuracy on three challenging benchmarks, i.e., UCF-101, HMDB-51, and Kinetics-400, with far fewer parameters and FLOPs.
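The abstract's pipeline (multi-scale spatial convolutions, a temporal convolution, and attention-weighted recalibration of the fused features) can be sketched as a PyTorch module. This is a minimal illustrative sketch, not the authors' implementation: the choice of two dilated spatial scales, a factorized 3x1x1 temporal convolution, and a squeeze-and-excitation-style channel attention are all assumptions made here for illustration.

```python
import torch
import torch.nn as nn


class MSTITube(nn.Module):
    """Hypothetical sketch of a multi-scale spatial-temporal tube.

    Assumed design (not from the paper): two spatial scales realised as
    parallel 1x3x3 convolutions with different dilation rates, a 3x1x1
    temporal convolution over the fused spatial features, and a
    squeeze-and-excitation-style gate standing in for the cross-scale
    attention weighted recalibration.
    """

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = channels // 2
        # Multi-scale spatial block: two parallel 1x3x3 branches at
        # dilation 1 and 2, so each sees a different spatial receptive field.
        self.spatial_s1 = nn.Conv3d(channels, mid, (1, 3, 3), padding=(0, 1, 1))
        self.spatial_s2 = nn.Conv3d(channels, mid, (1, 3, 3),
                                    padding=(0, 2, 2), dilation=(1, 2, 2))
        # Temporal block: 3x1x1 convolution over the concatenated branches.
        self.temporal = nn.Conv3d(channels, channels, (3, 1, 1), padding=(1, 0, 0))
        # Attention: global pooling -> bottleneck -> per-channel gates in [0, 1],
        # used to recalibrate the fused spatial-temporal features.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Conv3d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        s = torch.cat([self.spatial_s1(x), self.spatial_s2(x)], dim=1)
        t = self.temporal(self.relu(s))
        # Recalibrate, then add a residual connection so the tube can be
        # stacked into a deeper network (as MSTI-Net stacks MSTI tubes).
        return self.relu(x + t * self.attn(t))
```

The factorization into 1x3x3 and 3x1x1 kernels keeps the parameter count well below a full 3x3x3 convolution at both scales, which is consistent with the abstract's emphasis on fewer parameters and FLOPs.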

CITATION STYLE

APA

Wu, H., Liu, J., Zhu, X., Wang, M., & Zha, Z. J. (2020). Multi-scale spatial-temporal integration convolutional tube for human action recognition. In IJCAI International Joint Conference on Artificial Intelligence (Vol. 2021-January, pp. 753–759). International Joint Conferences on Artificial Intelligence. https://doi.org/10.24963/ijcai.2020/105
