End-to-end emotion recognition from speech with deep frame embeddings and neutral speech handling


Abstract

In this paper we present a novel approach to improving machine learning techniques for emotion recognition from speech. The core idea rests on the observation that not all parts of an utterance convey emotional information. We therefore propose to separate a given utterance into emotional and neutral parts and to clean up the database, making it less ambiguous. We then estimate embeddings of short speech intervals using a speaker-recognition convolutional neural network trained on the VoxCeleb2 dataset with the triplet loss. Sequences of these features are processed by a recurrent neural network to obtain an emotion label for the utterance in question. This stage consists of two sub-stages: first, we train a model to recognize neutral frames in a given utterance; next, we separate the corpus into emotional and neutral parts and train an improved model. Our experiments on the IEMOCAP corpus show that the final model achieves 66% unweighted accuracy (UA) on four emotions, outperforming other known approaches such as out-of-the-box Connectionist Temporal Classification (CTC) and local attention by more than 4%.
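The two central ingredients of the abstract — a triplet loss for learning frame embeddings, and filtering out frames a neutral-frame detector flags before retraining — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `margin=0.2` and `threshold=0.5` values are illustrative assumptions, and the embeddings here are plain NumPy vectors rather than CNN outputs.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss on L2 distances: pull the anchor toward a
    same-class (positive) embedding and push it away from a
    different-class (negative) one.  margin is an assumed value."""
    d_ap = np.linalg.norm(anchor - positive)   # anchor-positive distance
    d_an = np.linalg.norm(anchor - negative)   # anchor-negative distance
    return max(0.0, d_ap - d_an + margin)

def drop_neutral_frames(frame_embeddings, neutral_probs, threshold=0.5):
    """Stage-two filtering: keep only the frames whose predicted
    probability of being neutral falls below an (assumed) threshold,
    i.e. the emotional part of the utterance used for retraining."""
    return [f for f, p in zip(frame_embeddings, neutral_probs)
            if p < threshold]
```

A well-separated triplet (positive on top of the anchor, negative far away) yields zero loss, and `drop_neutral_frames` shrinks an utterance to its emotional frames before the improved model is trained on them.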

Citation (APA)

Sterling, G., & Kazimirova, E. (2020). End-to-end emotion recognition from speech with deep frame embeddings and neutral speech handling. In Lecture Notes in Networks and Systems (Vol. 70, pp. 1123–1135). Springer. https://doi.org/10.1007/978-3-030-12385-7_76
