Along with automatic speech recognition, many researchers have been actively studying speech emotion recognition, since emotion information is as crucial as textual information for effective interactions. Emotion can be described either categorically or dimensionally. Although categorical emotion is widely used, dimensional emotion, typically represented as arousal and valence, provides more detailed information about the emotional state. Therefore, in this paper, we propose a Conformer-based model for arousal and valence recognition. Our model uses a Conformer as the encoder, a fully connected layer as the decoder, and statistical pooling layers as the connector. In addition, we adopt multi-task learning and multi-feature combination, which have shown remarkable performance for speech emotion recognition and time-series analysis, respectively. The proposed model achieves a state-of-the-art recognition accuracy of 70.0 ± 1.5% for arousal in terms of unweighted accuracy on the IEMOCAP dataset.
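A minimal sketch of the architecture described in the abstract is given below, using PyTorch and torchaudio's Conformer implementation: a Conformer encoder, statistical pooling as the connector, and separate fully connected heads for arousal and valence as the multi-task decoder. The feature dimensions, number of layers, number of classes, choice of input features, and loss weighting are illustrative assumptions, not details taken from the paper.

```python
# Sketch of a multi-task Conformer for arousal/valence recognition.
# Dimensions, hyperparameters, and the multi-feature combination scheme
# are assumptions for illustration; they are not specified in the abstract.
import torch
import torch.nn as nn
from torchaudio.models import Conformer


class MultiTaskConformerSER(nn.Module):
    def __init__(self, input_dim=120, num_classes=3):
        super().__init__()
        # Encoder: Conformer blocks over frame-level acoustic features.
        self.encoder = Conformer(
            input_dim=input_dim,
            num_heads=4,
            ffn_dim=256,
            num_layers=4,
            depthwise_conv_kernel_size=31,
        )
        # Decoder: one fully connected head per task (arousal, valence).
        self.arousal_head = nn.Linear(2 * input_dim, num_classes)
        self.valence_head = nn.Linear(2 * input_dim, num_classes)

    @staticmethod
    def _stat_pool(x):
        # Connector: statistical pooling (mean + std over time) turns the
        # variable-length sequence into a fixed-size utterance embedding.
        # x: (batch, time, dim) -> (batch, 2 * dim)
        return torch.cat([x.mean(dim=1), x.std(dim=1)], dim=-1)

    def forward(self, features, lengths):
        # features: (batch, time, input_dim). The multi-feature combination
        # is assumed here to be a concatenation of per-frame features
        # (e.g., MFCCs and mel-spectrogram bins) along the last axis.
        encoded, _ = self.encoder(features, lengths)
        pooled = self._stat_pool(encoded)
        return self.arousal_head(pooled), self.valence_head(pooled)


# Toy usage: 8 utterances, 300 frames, 120-dim combined features,
# 3 classes per emotional dimension (e.g., low / neutral / high).
model = MultiTaskConformerSER()
feats = torch.randn(8, 300, 120)
lengths = torch.full((8,), 300)
arousal_logits, valence_logits = model(feats, lengths)

# Multi-task objective: an (assumed) equal-weighted sum of two
# cross-entropy losses, one per task.
criterion = nn.CrossEntropyLoss()
arousal_labels = torch.randint(0, 3, (8,))
valence_labels = torch.randint(0, 3, (8,))
loss = criterion(arousal_logits, arousal_labels) + criterion(valence_logits, valence_labels)
```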
CITATION
Seo, J., & Lee, B. (2022). Multi-Task Conformer with Multi-Feature Combination for Speech Emotion Recognition. Symmetry, 14(7). https://doi.org/10.3390/sym14071428