Discriminative Feature Learning for Speech Emotion Recognition

Abstract

It is encouraging to see that deep neural network based speech emotion recognition (DNN-SER) models have achieved state-of-the-art results on public datasets. However, the performance of DNN-SER models remains limited for the following reasons: insufficient training data, emotion ambiguity, and class imbalance. Studies show that, without large-scale training data, it is hard for a DNN-SER model trained with cross-entropy loss to learn discriminative features by mapping speech segments to their category labels. In this study, we propose a deep metric learning based DNN-SER model to facilitate discriminative feature learning by constraining the feature embeddings in the feature space. As a proof of concept, we take a four-hidden-layer DNN as our backbone for implementation simplicity. Specifically, an emotion identity matrix is formed from one-hot label vectors as supervision information, while an emotion embedding matrix is formed from the embedding vectors generated by the DNN. An affinity loss is designed on these two matrices to simultaneously maximize the inter-class separability and intra-class compactness of the embeddings. Moreover, to mitigate the class imbalance problem, focal loss is introduced to reduce the adverse effect of the majority of well-classified samples and place more focus on the minority of misclassified ones. Our proposed DNN-SER model is jointly trained using affinity loss and focal loss. Extensive experiments have been conducted on two well-known emotional speech datasets, EMO-DB and IEMOCAP. Compared to the DNN-SER baseline, the unweighted accuracy (UA) on EMO-DB and IEMOCAP increased relatively by 10.19% and 10%, respectively. Furthermore, the confusion matrix of the test results on EMO-DB shows that the accuracy of the most confusable emotion category, 'Happiness', increased relatively by 33.17%, and the accuracy of the emotion category with the fewest samples, 'Disgust', increased relatively by 13.62%. These results validate the effectiveness of our proposed DNN-SER model and provide evidence that affinity loss and focal loss help to learn better discriminative features.
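The abstract does not give the exact formulation of the affinity loss, so the sketch below is only an illustration under assumptions: it reads the affinity loss as a center-based metric loss built from the one-hot "emotion identity matrix" and the embedding matrix (pulling each embedding toward its class mean and pushing class means at least a margin apart), and pairs it with the standard focal loss of Lin et al. The function names, the margin, and the weighting factor `lam` are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    # Standard focal loss (Lin et al., 2017): (1 - p_t)^gamma down-weights
    # the cross-entropy of well-classified samples so that hard / minority
    # samples contribute more to the gradient.
    log_probs = F.log_softmax(logits, dim=-1)                        # (B, C)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)    # log prob of true class, (B,)
    pt = log_pt.exp()
    return -((1.0 - pt) ** gamma * log_pt).mean()

def affinity_loss(embeddings, targets, num_classes, margin=1.0):
    # Hypothetical center-based reading of the affinity loss described in the
    # abstract: an intra-class term pulls each embedding toward its class mean,
    # and an inter-class term pushes class means at least `margin` apart.
    # Assumes every class appears in the mini-batch.
    one_hot = F.one_hot(targets, num_classes).float()                # "emotion identity matrix" (B, C)
    counts = one_hot.sum(dim=0).clamp(min=1.0)                       # samples per class
    centers = (one_hot.t() @ embeddings) / counts.unsqueeze(1)       # class means (C, D)
    intra = ((embeddings - centers[targets]) ** 2).sum(dim=1).mean() # intra-class compactness
    dists = torch.cdist(centers, centers)                            # pairwise center distances (C, C)
    off_diag = ~torch.eye(num_classes, dtype=torch.bool, device=embeddings.device)
    inter = F.relu(margin - dists[off_diag]).mean()                  # inter-class separability
    return intra + inter

def joint_loss(logits, embeddings, targets, num_classes, lam=0.1):
    # Joint objective in the spirit of the abstract: focal loss plus a
    # weighted affinity loss; `lam` is an illustrative hyperparameter.
    return focal_loss(logits, targets) + lam * affinity_loss(embeddings, targets, num_classes)
```

In such a setup, `logits` would come from the DNN's output layer and `embeddings` from a hidden layer of the four-hidden-layer backbone; both losses are computed per mini-batch and backpropagated jointly.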

Citation (APA)
Zhang, Y., Zou, Y., Peng, J., Luo, D., & Huang, D. (2019). Discriminative Feature Learning for Speech Emotion Recognition. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11730 LNCS, pp. 198–210). Springer Verlag. https://doi.org/10.1007/978-3-030-30490-4_17
