Multi-modal Fusion Using Spatio-temporal and Static Features for Group Emotion Recognition


Abstract

This paper presents our approach for the Audio-Video Group Emotion Recognition sub-challenge of EmotiW 2020. The task is to classify a video into one of three group emotions: positive, neutral, or negative. Our approach exploits two feature levels: spatio-temporal features and static features. At the spatio-temporal level, we feed multiple input modalities (RGB, RGB difference, optical flow, and warped optical flow) into multiple video classification networks to train spatio-temporal models. At the static level, we crop all faces and bodies in each frame using a state-of-the-art human pose estimation method and train several kinds of CNNs with the image-level group emotion labels. Finally, we fuse the results of all 14 models and achieve third place in this sub-challenge, with classification accuracies of 71.93% and 70.77% on the validation and test sets, respectively.
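The abstract does not specify how the 14 models' outputs are combined, but a common choice for this kind of ensemble is score-level (late) fusion: average the per-class probability vectors from each model, optionally with per-model weights, and take the argmax. The sketch below illustrates that idea; the function name, the example weights, and the three toy model outputs are all hypothetical, not from the paper.

```python
import numpy as np

CLASSES = ["positive", "neutral", "negative"]

def fuse_predictions(score_list, weights=None):
    """Weighted average of class-probability vectors from several
    models; returns the fused label and the fused distribution.
    (Illustrative late-fusion sketch, not the paper's exact scheme.)"""
    scores = np.asarray(score_list, dtype=float)  # shape: (n_models, n_classes)
    if weights is None:
        weights = np.ones(len(scores)) / len(scores)  # uniform weighting
    fused = np.average(scores, axis=0, weights=weights)
    return CLASSES[int(np.argmax(fused))], fused

# Toy softmax outputs from three hypothetical models:
model_outputs = [
    [0.6, 0.3, 0.1],  # e.g. an RGB spatio-temporal model
    [0.5, 0.4, 0.1],  # e.g. an optical-flow model
    [0.2, 0.5, 0.3],  # e.g. a face-crop CNN
]
label, fused = fuse_predictions(model_outputs)
```

With uniform weights this is plain probability averaging; in practice ensemble weights are often tuned on the validation set.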

Citation (APA)

Sun, M., Li, J., Feng, H., Gou, W., Shen, H., Tang, J., … Ye, J. (2020). Multi-modal Fusion Using Spatio-temporal and Static Features for Group Emotion Recognition. In ICMI 2020 - Proceedings of the 2020 International Conference on Multimodal Interaction (pp. 835–840). Association for Computing Machinery, Inc. https://doi.org/10.1145/3382507.3417971
