State-of-the-art hand gesture recognition methods extract spatiotemporal features with 3D convolutional neural networks (3DCNNs) or convolutional long short-term memory (ConvLSTM). However, these methods are often inefficient because of the high computational complexity of their network structures. In this paper, we focus instead on 1D convolutional neural networks and propose a simple and efficient architectural unit, the Multi-Kernel Temporal Block (MKTB), which models multi-scale temporal responses by explicitly applying different temporal kernels. We then present the Global Refinement Block (GRB), an attention module that shapes the global temporal features based on cross-channel similarity. By incorporating the MKTB and GRB, our architecture can effectively explore spatiotemporal features at a tolerable computational cost. Extensive experiments on public datasets demonstrate that the proposed model achieves state-of-the-art performance with higher efficiency. Moreover, the proposed MKTB and GRB are plug-and-play modules, and experiments on other tasks, such as video understanding and video-based person re-identification, also show good efficiency and generalization capability.
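The two ideas named in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract gives no layer details, so the depthwise kernels, the residual sums, the softmax normalisation, and all function names here (`temporal_conv1d`, `multi_kernel_temporal_block`, `global_refinement`) are assumptions chosen only to show what "different temporal kernels" and "cross-channel similarity" could look like on a feature sequence of shape (T, C).

```python
import numpy as np


def temporal_conv1d(x, weight):
    """Depthwise 1D filtering along time with zero 'same' padding.

    x: (T, C) feature sequence; weight: (k, C) with odd k.
    Uses the deep-learning convention (cross-correlation, no kernel flip).
    """
    k = weight.shape[0]
    pad = k // 2
    t = x.shape[0]
    xp = np.pad(x, ((pad, pad), (0, 0)))  # zero-pad the time axis only
    out = np.zeros_like(x, dtype=float)
    for i in range(k):
        out += xp[i:i + t] * weight[i]
    return out


def multi_kernel_temporal_block(x, weights):
    """Assumed form: residual input plus the sum of the responses of
    several temporal kernels of different sizes."""
    out = x.astype(float).copy()
    for w in weights:
        out += temporal_conv1d(x, w)
    return out


def global_refinement(x):
    """Assumed form: channel attention built from cross-channel similarity.

    The (C, C) similarity matrix is row-normalised with a softmax and used
    to re-mix channels, then added back to the input.
    """
    sim = x.T @ x                                  # (C, C) similarity
    sim = sim - sim.max(axis=1, keepdims=True)     # numerical stability
    attn = np.exp(sim)
    attn /= attn.sum(axis=1, keepdims=True)
    return x + x @ attn.T


# Hypothetical usage: 16 time steps, 8 channels, kernel sizes 1, 3 and 5.
rng = np.random.default_rng(0)
features = rng.standard_normal((16, 8))
kernels = [rng.standard_normal((k, 8)) * 0.1 for k in (1, 3, 5)]
refined = global_refinement(multi_kernel_temporal_block(features, kernels))
```

Both functions keep the (T, C) shape, which is consistent with the abstract's description of the blocks as plug-and-play modules; a trained version would learn the kernel weights rather than sample them.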
Citation
Yi, Y., Ni, F., Ma, Y., Zhu, X., Qi, Y., Qiu, R., … Wang, Y. (2019). High performance gesture recognition via effective and efficient temporal modeling. In IJCAI International Joint Conference on Artificial Intelligence (Vol. 2019-August, pp. 1003–1009). International Joint Conferences on Artificial Intelligence. https://doi.org/10.24963/ijcai.2019/141