This paper presents the University of Passau's approaches to the Multimodal Emotion Recognition Challenge 2016. For audio signals, we exploit Bag-of-Audio-Words techniques combined with Extreme Learning Machines and Hierarchical Extreme Learning Machines. For video signals, we use not only the information from the cropped face in a video frame, but also the broader contextual information from the entire frame. This information is extracted via two Convolutional Neural Networks pre-trained for face detection and object classification. Moreover, we extract facial action units, which reflect facial muscle movements and are known to be important for emotion recognition. Long Short-Term Memory Recurrent Neural Networks are deployed to exploit temporal information in the video representations. Average late fusion of the audio and video systems is applied to make predictions for multimodal emotion recognition. Experimental results on the challenge database demonstrate the effectiveness of the proposed systems when compared to the baseline.
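The average late fusion described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class posteriors, the number of emotion classes, and the equal weighting of the two modalities are all assumptions made for the example.

```python
# Hypothetical per-class posteriors from the audio and video subsystems
# (values are illustrative only; the real systems output per-emotion scores).
audio_probs = [0.10, 0.70, 0.20]
video_probs = [0.30, 0.40, 0.30]

# Average late fusion: element-wise mean of the modality-level posteriors,
# followed by an argmax to obtain the final emotion label.
fused = [(a + v) / 2 for a, v in zip(audio_probs, video_probs)]
prediction = max(range(len(fused)), key=fused.__getitem__)
```

Here `prediction` selects the class whose averaged score is highest; unequal modality weights would be a straightforward extension of the same scheme.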
Citation
Deng, J., Cummins, N., Han, J., Xu, X., Ren, Z., Pandit, V., … Schuller, B. (2016). The University of Passau open emotion recognition system for the multimodal emotion challenge. In Communications in Computer and Information Science (Vol. 663, pp. 652–666). Springer. https://doi.org/10.1007/978-981-10-3005-5_54