Efficient Parallel Inflated 3D Convolution Architecture for Action Recognition

Abstract

Deep neural networks have received increasing attention in human action recognition. Previous research has established that 3D convolution is a reasonable approach to learning spatio-temporal representations. Nevertheless, constructing effective 3D ConvNets usually requires an expensive pre-training process performed on a large-scale video dataset. To avoid this burden, one major issue is to determine whether the pre-trained parameters of 2D convolution networks can be directly bootstrapped into 3D. In this paper, we devise a 2D-Inflated operation and a parallel 3D ConvNet architecture to solve this problem. The 2D-Inflated operation converts pre-trained 2D ConvNets into 3D ConvNets, which avoids pre-training on video data. We further explore the optimal number of 3D ConvNets in the parallel architecture, and the results suggest that a 6-net architecture is an excellent solution for recognition. Another contribution of our study is two practical and effective techniques: accumulated gradient descent and video sequence decomposition. Either technique improves performance. Recognition results on UCF101 and HMDB51 reveal that, without video data pre-training, our 3D ConvNets can still achieve performance competitive with other generic and recent methods using 3D ConvNets in the RGB image domain.
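
The abstract does not give implementation details, but the 2D-Inflated operation it describes is, in spirit, a bootstrapping of pre-trained 2D kernels into 3D ones. The following is a minimal, hypothetical sketch (not the authors' released code) of one common way to do this in PyTorch: repeat the 2D weights along a new temporal axis and rescale them so the inflated filter initially responds to a temporally constant input like its 2D counterpart. The function name, the temporal kernel size, and the rescaling convention are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    """Sketch: build a Conv3d whose weights are bootstrapped from a pre-trained Conv2d.

    Assumption: weights are tiled along the temporal axis and divided by its
    length, one standard inflation recipe; the paper may differ in detail.
    """
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(time_dim, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_dim // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (out, in, kH, kW) -> (out, in, T, kH, kW), scaled by 1/T so the
        # inflated filter matches the 2D response on a static input.
        w2d = conv2d.weight.data
        conv3d.weight.copy_(w2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias.data)
    return conv3d
```

Similarly, "accumulated gradient descent" as named in the abstract is consistent with the widely used gradient-accumulation pattern, sketched below under that assumption; `model`, `loader`, `criterion`, `optimizer`, and the accumulation step count are placeholders, not values from the paper.

```python
accum_steps = 4  # assumed value; the paper may use a different setting
optimizer.zero_grad()
for step, (clips, labels) in enumerate(loader):
    # Scale each mini-batch loss so the accumulated gradient is an average.
    loss = criterion(model(clips), labels) / accum_steps
    loss.backward()  # gradients sum into .grad across mini-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```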

Cite (APA)

Huang, Y., Guo, Y., & Gao, C. (2020). Efficient Parallel Inflated 3D Convolution Architecture for Action Recognition. IEEE Access, 8, 45753–45765. https://doi.org/10.1109/ACCESS.2020.2978223
