A Deep Spatial and Temporal Aggregation Framework for Video-Based Facial Expression Recognition


Abstract

Video-based facial expression recognition is a long-standing problem owing to the gap between visual features and emotions, the difficulty of tracking subtle muscle movements, and the limited size of available datasets. The key to solving it is to extract features that effectively characterize facial expressions. We propose a framework that exploits both spatial and temporal information through an aggregation layer that fuses two state-of-the-art stream networks. We investigate different strategies for pooling across spatial and temporal information, and find that pooling jointly across both is effective for video-based facial expression recognition. The framework is end-to-end trainable for whole-video recognition. The main contribution of this work is the design of a novel, trainable deep neural network framework that fuses the spatial and temporal information of a video using CNNs and LSTMs for facial expression recognition. Experimental results on two public datasets, the RML and eNTERFACE05 databases, show that our framework outperforms previous state-of-the-art methods.
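The abstract's key finding is that pooling jointly across spatial and temporal information outperforms handling the two separately. The sketch below is a minimal, stdlib-only illustration of that idea (it is not the authors' code; the tensor layout and function name are assumptions for illustration): given per-frame CNN feature maps of a video, a single average is taken over the time, height, and width axes at once to produce one clip-level descriptor.

```python
def spatiotemporal_avg_pool(feats):
    """Jointly average-pool a video feature tensor.

    feats: nested list of shape [T][H][W][C] -- hypothetical per-frame
    CNN feature maps (T frames, an H x W spatial grid, C channels).
    Returns a single C-dimensional descriptor by averaging over the
    temporal (T) and spatial (H, W) axes in one step, rather than
    pooling spatially first and temporally second.
    """
    T, H, W = len(feats), len(feats[0]), len(feats[0][0])
    C = len(feats[0][0][0])
    pooled = [0.0] * C
    for t in range(T):
        for h in range(H):
            for w in range(W):
                for c in range(C):
                    pooled[c] += feats[t][h][w][c]
    n = T * H * W  # total number of spatiotemporal cells averaged
    return [v / n for v in pooled]


# Toy example: 2 frames, a 1x1 spatial grid, 1 channel.
clip = [[[[1.0]]], [[[3.0]]]]
descriptor = spatiotemporal_avg_pool(clip)  # single averaged channel
```

Because addition is commutative, joint average pooling coincides with two-stage average pooling; the distinction the paper investigates matters for learned or non-linear aggregation (e.g., the LSTM stream), where the order of spatial and temporal reduction changes the result.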

Citation (APA)

Pan, X., Ying, G., Chen, G., Li, H., & Li, W. (2019). A Deep Spatial and Temporal Aggregation Framework for Video-Based Facial Expression Recognition. IEEE Access, 7, 48807–48815. https://doi.org/10.1109/ACCESS.2019.2907271
