Abstract
This paper presents our approach to the Audio-video Group Emotion Recognition sub-challenge of EmotiW 2020. The task is to classify a video into one of three group emotion categories: positive, neutral, or negative. Our approach exploits two feature levels for this task: spatio-temporal features and static features. At the spatio-temporal level, we feed multiple input modalities (RGB, RGB difference, optical flow, and warped optical flow) into multiple video classification networks to train the spatio-temporal models. At the static level, we crop all faces and bodies in each frame using a state-of-the-art human pose estimation method and train several CNNs on the image-level group emotion labels. Finally, we fuse the results of all 14 models and achieve third place in this sub-challenge, with classification accuracies of 71.93% and 70.77% on the validation set and test set, respectively.
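As context for the final fusion step, below is a minimal sketch of score-level (late) fusion over the 14 models' outputs. The abstract does not specify the fusion scheme, so the weighted-averaging strategy, the `fuse_predictions` helper, and the uniform weights are assumptions for illustration only.

```python
import numpy as np

# Hypothetical late-fusion sketch: weighted average of per-model class
# probabilities. `model_probs` holds softmax outputs from the 14 models
# for one video, shape (14, 3) for the three group emotion classes.
# The weighting scheme is assumed; the paper's actual fusion is not
# described in the abstract.

CLASSES = ["positive", "neutral", "negative"]

def fuse_predictions(model_probs, weights=None):
    """Average per-model class probabilities and return the winning class."""
    if weights is None:
        weights = np.ones(model_probs.shape[0])  # uniform weights by default
    weights = weights / weights.sum()            # normalize to sum to 1
    fused = (weights[:, None] * model_probs).sum(axis=0)  # shape (3,)
    return CLASSES[int(np.argmax(fused))]

# Example: 14 models, each emitting a 3-way softmax for one video.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=14)
print(fuse_predictions(probs))
```

Uniform weights are a common baseline; in practice, per-model weights for such an ensemble are typically tuned on the validation set.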
Sun, M., Li, J., Feng, H., Gou, W., Shen, H., Tang, J., … Ye, J. (2020). Multi-modal Fusion Using Spatio-temporal and Static Features for Group Emotion Recognition. In ICMI 2020 - Proceedings of the 2020 International Conference on Multimodal Interaction (pp. 835–840). Association for Computing Machinery, Inc. https://doi.org/10.1145/3382507.3417971