Feature extraction and emotion classification play significant roles in speech emotion recognition. Extracting and selecting optimal features is difficult, since researchers cannot be certain in advance which features are best. Deep learning approaches can extract features through hierarchical abstraction layers, but they require substantial computational resources and large amounts of data. In this article, we use the static, differential, and acceleration coefficients of the log Mel-spectrogram as inputs to a deep learning model. To avoid performance degradation in deeper networks, we integrate skip connections into a dilated convolutional network. The resulting representations are fed into a self-attention mechanism built on bidirectional recurrent neural networks, which learns long-term global features and exploits the context of each time step. Finally, we combine a contrastive-center loss with the softmax loss as the training objective to improve the accuracy of emotion recognition. To validate robustness and effectiveness, we evaluated the proposed method on the Emo-DB and ERC2019 datasets. Experimental results show that the performance of the proposed method is comparable with existing state-of-the-art methods, reaching 88% on Emo-DB and 67% on ERC2019.
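The three-channel input described in the abstract (static log Mel-spectrogram plus its first- and second-order deltas) can be sketched with librosa as below. The sampling rate, FFT size, hop length, and number of Mel bands here are illustrative assumptions, not the paper's reported extraction parameters.

```python
import numpy as np
import librosa

def extract_features(path, sr=16000, n_mels=64):
    y, sr = librosa.load(path, sr=sr)
    # Log Mel-spectrogram (static coefficients).
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)
    # Differential (delta) and acceleration (delta-delta) coefficients.
    delta = librosa.feature.delta(log_mel, order=1)
    delta2 = librosa.feature.delta(log_mel, order=2)
    # Stack as three input channels: (3, n_mels, time).
    return np.stack([log_mel, delta, delta2])
```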
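A minimal sketch of a dilated convolution block with a skip connection follows; the kernel size, dilation rate, and channel counts are assumptions rather than the authors' configuration. The skip connection adds the block's input back to its output, which is the usual device for avoiding the degradation the abstract mentions.

```python
import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    def __init__(self, channels, dilation):
        super().__init__()
        # padding = dilation keeps the spatial size for a 3x3 kernel.
        self.conv = nn.Conv2d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU()

    def forward(self, x):
        # Skip connection: input is added to the convolved output.
        return self.act(x + self.bn(self.conv(x)))
```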
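One plausible reading of the self-attention-over-bidirectional-RNN stage is an additive attention pooling over BiLSTM outputs, as sketched here; the dimensions and the exact attention form are assumptions, and the paper may use a different variant.

```python
import torch
import torch.nn as nn

class BiRNNAttention(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_classes):
        super().__init__()
        self.rnn = nn.LSTM(input_dim, hidden_dim, batch_first=True,
                           bidirectional=True)
        self.score = nn.Linear(2 * hidden_dim, 1)  # one score per time step
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):                 # x: (batch, time, input_dim)
        h, _ = self.rnn(x)                # h: (batch, time, 2*hidden_dim)
        # Attention weights over time steps exploit per-step context.
        alpha = torch.softmax(self.score(h), dim=1)
        context = (alpha * h).sum(dim=1)  # long-term global utterance vector
        return self.fc(context), context  # logits and embedding for the loss
```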
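The training objective pairs a contrastive-center loss with the softmax (cross-entropy) loss. Below is a sketch following the common contrastive-center formulation (Qi and Su, 2017), where intra-class distances are divided by inter-class distances; the weighting factor lam and the stability constant delta are hypothetical hyperparameters, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveCenterLoss(nn.Module):
    def __init__(self, num_classes, feat_dim, delta=1.0):
        super().__init__()
        # Learnable class centers in the embedding space.
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.delta = delta

    def forward(self, feats, labels):
        # Squared distance from each sample to every class center.
        dists = torch.cdist(feats, self.centers) ** 2   # (batch, num_classes)
        intra = dists.gather(1, labels.unsqueeze(1)).squeeze(1)
        inter = dists.sum(dim=1) - intra
        # Pull samples toward their own center, push away from the others.
        return 0.5 * (intra / (inter + self.delta)).mean()

def total_loss(logits, feats, labels, center_loss, lam=0.1):
    # Softmax loss plus weighted contrastive-center loss.
    return F.cross_entropy(logits, labels) + lam * center_loss(feats, labels)
```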
CITATION STYLE
Pham, N. T., Dang, D. N. M., & Nguyen, S. D. (2020). A Method upon Deep Learning for Speech Emotion Recognition. Journal of Advanced Engineering and Computation, 4(4), 273. https://doi.org/10.25073/jaec.202044.311