Fusing HOG and convolutional neural network spatial-temporal features for video-based facial expression recognition

Abstract

Video-based facial expression recognition (VFER) is a fundamental component of many computer vision applications. Visual features are the key factor in facial expression recognition; however, the gap between low-level visual features and emotions is large. To bridge this gap, the proposed method combines convolutional neural networks (CNNs) and the histogram of oriented gradients (HOG) to obtain a more comprehensive feature representation for VFER. First, it extracts shallow features from each video frame through a number of convolutional kernels in a CNN, which provide invariance to displacement, scale and deformation. Then, HOG is applied to the CNN's shallow feature maps to extract descriptors that are strongly correlated with facial expressions. Finally, a support vector machine (SVM) performs the facial expression classification. Extensive experiments on the RML, CK+ and AFEW5.0 databases show that this framework achieves promising performance and outperforms the state of the art.
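The three-stage pipeline the abstract describes (shallow CNN feature maps, HOG descriptors computed over those maps, SVM classification) can be sketched in miniature with NumPy. This is an illustrative toy, not the paper's implementation: the single Sobel-style kernel stands in for learned CNN filters, and the cell size and bin count are assumed defaults; the resulting descriptor is what would be fed to an SVM.

```python
import numpy as np

def conv_relu(img, kernel):
    """One 'shallow CNN' stage: valid 2-D cross-correlation (as CNN layers
    compute) followed by a ReLU non-linearity."""
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return np.maximum(out, 0.0)

def hog_descriptor(fmap, cell=8, bins=9):
    """HOG over a feature map: per-cell unsigned-orientation histograms
    weighted by gradient magnitude, then L2-normalised into one vector."""
    gy, gx = np.gradient(fmap)
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned orientation
    n_ch, n_cw = fmap.shape[0] // cell, fmap.shape[1] // cell
    hist = np.zeros((n_ch, n_cw, bins))
    bin_w = 180.0 / bins
    for ci in range(n_ch):
        for cj in range(n_cw):
            m = mag[ci * cell:(ci + 1) * cell, cj * cell:(cj + 1) * cell]
            a = ang[ci * cell:(ci + 1) * cell, cj * cell:(cj + 1) * cell]
            idx = np.minimum((a // bin_w).astype(int), bins - 1)
            for b in range(bins):
                hist[ci, cj, b] = m[idx == b].sum()
    v = hist.ravel()
    return v / (np.linalg.norm(v) + 1e-6)

# Toy "video frame"; a fixed edge kernel stands in for a learned CNN filter.
frame = np.random.RandomState(0).rand(32, 32)
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
fmap = conv_relu(frame, sobel_x)    # shallow CNN-style feature map, 30x30
desc = hog_descriptor(fmap)         # 3x3 cells x 9 bins = 81-dim SVM input
```

In the full method, one such descriptor per frame (from trained CNN filters rather than a hand-set kernel) would be collected across the video and passed to an SVM for expression classification.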

Citation (APA)
Pan, X. (2020). Fusing HOG and convolutional neural network spatial-temporal features for video-based facial expression recognition. IET Image Processing, 14(1), 176–182. https://doi.org/10.1049/iet-ipr.2019.0293
