Multi-Task Conformer with Multi-Feature Combination for Speech Emotion Recognition


Abstract

Along with automatic speech recognition, many researchers have been actively studying speech emotion recognition, since emotion information is as crucial as textual information for effective interactions. Emotion can be divided into categorical emotion and dimensional emotion. Although categorical emotion is widely used, dimensional emotion, typically represented as arousal and valence, can provide more detailed information on emotional states. Therefore, in this paper, we propose a Conformer-based model for arousal and valence recognition. Our model uses a Conformer as the encoder, a fully connected layer as the decoder, and statistical pooling layers as the connector. In addition, we adopted multi-task learning and multi-feature combination, which have shown remarkable performance in speech emotion recognition and time-series analysis, respectively. The proposed model achieves a state-of-the-art recognition accuracy of 70.0 ± 1.5% for arousal in terms of unweighted accuracy on the IEMOCAP dataset.
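To make the pipeline in the abstract concrete, the following is a minimal NumPy sketch (not the authors' code) of how a statistical pooling connector could collapse variable-length Conformer encoder output into a fixed vector, which two fully connected heads then share for multi-task arousal/valence prediction. The shapes, the 4-class targets, and the equal task weighting are all assumptions for illustration.

```python
import numpy as np

def statistical_pooling(frames):
    """Collapse a (T, D) sequence of encoder frames into a fixed 2*D vector
    by concatenating the per-dimension mean and standard deviation."""
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    return np.concatenate([mean, std])

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
T, D, n_classes = 50, 8, 4     # frames, encoder dim, emotion classes (assumed)

frames = rng.normal(size=(T, D))       # stand-in for Conformer encoder output
pooled = statistical_pooling(frames)   # fixed-length vector, shape (2*D,)

# Two fully connected heads share the pooled vector (multi-task learning):
W_arousal = rng.normal(size=(n_classes, 2 * D))
W_valence = rng.normal(size=(n_classes, 2 * D))
p_arousal = softmax(W_arousal @ pooled)
p_valence = softmax(W_valence @ pooled)

# Multi-task loss: sum of per-task cross-entropies (equal weights assumed).
y_arousal, y_valence = 1, 2            # hypothetical target labels
loss = -np.log(p_arousal[y_arousal]) - np.log(p_valence[y_valence])
```

Because the pooled statistics have a fixed dimension regardless of T, the same decoder can serve utterances of any length; in training, the shared encoder receives gradients from both task heads at once.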

Citation (APA)
Seo, J., & Lee, B. (2022). Multi-Task Conformer with Multi-Feature Combination for Speech Emotion Recognition. Symmetry, 14(7). https://doi.org/10.3390/sym14071428
