Dynamic Gesture Recognition Based on Three-Stream Coordinate Attention Network and Knowledge Distillation

Abstract

Gesture recognition has long been an important research direction in computer vision. Dynamic gestures involve complex backgrounds and many interfering factors, and gesture recognition models based on deep learning typically have high computational cost and poor real-time performance. In addition, deep learning models are limited to recognizing the categories present in the training set, and their performance depends largely on the amount of labeled data. To address these problems, this paper presents a dynamic gesture recognition method named 3SCKI based on a three-stream coordinate attention (CA) network, knowledge distillation, and image-text contrastive learning. Specifically, 1) CA is used for feature fusion so that the model focuses on the target gesture and reduces background interference; 2) the traditional knowledge distillation loss is improved to reduce computation and improve real-time performance by adding a guidance function so that the student network learns only the classification probabilities that the teacher network identifies correctly; and 3) a multi-granularity context prompt template integration method is proposed to construct MG-CLIP, an improved CLIP visual-language model that aligns text and visual concepts from the image level through the object level to the part level. Gesture classification is then performed by contrastive learning between image features and text features, enabling the model to recognize image categories that did not appear during training. The proposed method is evaluated on the ChaLearn LAP large-scale isolated gesture dataset (IsoGD). The results show that the proposed method achieves a recognition rate of 65.87% on the IsoGD validation set. For single-modality data, 3SCKI achieves state-of-the-art recognition accuracy on RGB, Depth, and Optical Flow data (61.22%, 58.84%, and 50.30% on the IsoGD validation set, respectively).
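The guided distillation idea in point 2, where the student learns soft targets only for samples the teacher classifies correctly, could look roughly like the following PyTorch sketch. The function name, temperature `T`, and weighting `alpha` are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def guided_distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Sketch of a guided KD loss: the soft-target term is applied only to
    samples the teacher classifies correctly (the 'guidance' function)."""
    # Hard-label cross-entropy on the student's predictions.
    ce = F.cross_entropy(student_logits, labels)

    # Guidance mask: 1 where the teacher's prediction matches the ground truth.
    mask = (teacher_logits.argmax(dim=1) == labels).float()

    # Per-sample KL divergence between temperature-softened distributions.
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=1)

    # Only samples the teacher gets right contribute to the distillation term.
    kd = (kl * mask).sum() / mask.sum().clamp(min=1.0) * (T * T)

    return alpha * ce + (1.0 - alpha) * kd
```

Restricting the soft-target term to correctly classified teacher samples keeps the student from imitating the teacher's mistakes, which is the stated motivation for the guidance function.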

Cite (APA)

Wan, S., Yang, L., Ding, K., & Qiu, D. (2023). Dynamic Gesture Recognition Based on Three-Stream Coordinate Attention Network and Knowledge Distillation. IEEE Access, 11, 50547–50559. https://doi.org/10.1109/ACCESS.2023.3278100
