End-to-end emotion recognition from speech with deep frame embeddings and neutral speech handling


Abstract

In this paper we present a novel approach to improving machine learning techniques for emotion recognition from speech. The core idea rests on the observation that not all parts of an utterance convey emotional information. We therefore propose to separate a given utterance into emotional and neutral parts and to clean up the database, making it less ambiguous. We then estimate embeddings of short speech intervals using a speaker-recognition convolutional neural network trained on the VoxCeleb2 dataset with the triplet loss. Sequences of these features are processed by a recurrent neural network to obtain an emotion label for the utterance in question. This stage consists of two sub-stages: first, we train a model to recognize neutral frames in a given utterance; next, we separate the corpus into emotional and neutral parts and train an improved model. Our experiments on the IEMOCAP corpus show that the final model achieves 66% unweighted accuracy (UA) on four emotions, outperforming other known approaches such as out-of-the-box Connectionist Temporal Classification (CTC) and local attention by more than 4%.
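The two central ingredients of the abstract — a triplet loss for learning frame embeddings, and filtering out frames a neutral-frame detector flags before retraining — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `margin=0.2` and `threshold=0.5` values are illustrative assumptions, and the embeddings here are plain NumPy vectors rather than CNN outputs.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss on L2 distances: pull the anchor toward a
    same-class (positive) embedding and push it away from a
    different-class (negative) one.  margin is an assumed value."""
    d_ap = np.linalg.norm(anchor - positive)   # anchor-positive distance
    d_an = np.linalg.norm(anchor - negative)   # anchor-negative distance
    return max(0.0, d_ap - d_an + margin)

def drop_neutral_frames(frame_embeddings, neutral_probs, threshold=0.5):
    """Stage-two filtering: keep only the frames whose predicted
    probability of being neutral falls below an (assumed) threshold,
    i.e. the emotional part of the utterance used for retraining."""
    return [f for f, p in zip(frame_embeddings, neutral_probs)
            if p < threshold]
```

A well-separated triplet (positive on top of the anchor, negative far away) yields zero loss, and `drop_neutral_frames` shrinks an utterance to its emotional frames before the improved model is trained on them.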

Citation (APA)

Sterling, G., & Kazimirova, E. (2020). End-to-end emotion recognition from speech with deep frame embeddings and neutral speech handling. In Lecture Notes in Networks and Systems (Vol. 70, pp. 1123–1135). Springer. https://doi.org/10.1007/978-3-030-12385-7_76
