Emotions play an extremely important role in human decisions and in interactions with both other humans and machines. This fact has promoted the development of methods that aim to recognize emotions from different physiological signals. In particular, emotion recognition from speech signals remains a research challenge due to the large voice variability between subjects. In this work, paralinguistic features and deep learning models are used to perform speech emotion classification. A set of 1582 INTERSPEECH 2010 features is first extracted from the speech signals and fed to a deep convolutional stacked auto-encoder network that transforms those features into a higher-level representation. A multilayer perceptron is then trained to classify each utterance into one of six emotions: anger, fear, disgust, happiness, surprise, and sadness. Four auto-encoder architectures of different sizes were evaluated in terms of performance, computational cost, and execution time to obtain the most suitable configuration. The proposed approach was evaluated in two stages. First, a 5-fold cross-validation strategy was performed using 70% of the samples. Then, the best network architecture was used to evaluate classification on a validation set composed of the remaining 30% of the samples. Results report an overall accuracy of 91.4% in the 5-fold testing stage and 61.1% on the validation set.
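The pipeline described above (feature vector → auto-encoder encoding → MLP classifier, with a 70/30 split and 5-fold cross-validation) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses random synthetic data in place of real INTERSPEECH 2010 features, a single-layer scikit-learn auto-encoder in place of the deep convolutional stacked auto-encoder, and assumed layer sizes (64, 32).

```python
import numpy as np
from sklearn.neural_network import MLPRegressor, MLPClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 1582 INTERSPEECH 2010 paralinguistic features;
# the paper extracts these from real speech signals.
rng = np.random.default_rng(0)
n_samples, n_features, n_classes = 240, 1582, 6
X = rng.normal(size=(n_samples, n_features))
y = rng.integers(0, n_classes, size=n_samples)

X = StandardScaler().fit_transform(X)
# 70/30 split mirroring the paper's evaluation protocol.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Auto-encoder step: an MLP trained to reconstruct its own input;
# its hidden layer serves as the learned higher-level representation.
ae = MLPRegressor(hidden_layer_sizes=(64,), activation="relu",
                  max_iter=60, random_state=0)
ae.fit(X_tr, X_tr)

def encode(model, data):
    # Hidden-layer (ReLU) activation of the trained auto-encoder.
    return np.maximum(0.0, data @ model.coefs_[0] + model.intercepts_[0])

Z_tr, Z_val = encode(ae, X_tr), encode(ae, X_val)

# MLP classifier on the encoded features: 5-fold CV on the training
# portion, then a single evaluation on the held-out validation set.
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=120, random_state=0)
cv_acc = cross_val_score(clf, Z_tr, y_tr, cv=5).mean()
clf.fit(Z_tr, y_tr)
val_acc = clf.score(Z_val, y_val)
print(f"5-fold CV accuracy: {cv_acc:.3f}, validation accuracy: {val_acc:.3f}")
```

With random labels the accuracies hover near chance (about 1/6); the point of the sketch is only the two-stage structure: the auto-encoder is fit on inputs alone, and the classifier sees only the encoded features.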
Citation
Fonnegra, R. D., & Díaz, G. M. (2018). Speech emotion recognition integrating paralinguistic features and auto-encoders in a deep learning model. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10901 LNCS, pp. 385–396). Springer Verlag. https://doi.org/10.1007/978-3-319-91238-7_31