Multimodal transformer augmented fusion for speech emotion recognition


Abstract

Speech emotion recognition is challenging because emotion is subjective and ambiguous. In recent years, multimodal methods for speech emotion recognition have achieved promising results. However, because data from different modalities are heterogeneous, effectively integrating information across modalities remains both a difficulty and a potential breakthrough point for research. Moreover, owing to the limitations of feature-level and decision-level fusion methods, previous studies have often neglected fine-grained modal interactions. We propose a method named multimodal transformer augmented fusion that uses a hybrid fusion strategy, combining feature-level fusion and model-level fusion, to perform fine-grained information interaction within and between modalities. A Model-fusion module composed of three Cross-Transformer Encoders is proposed to generate multimodal emotional representations for modal guidance and information fusion. Specifically, the multimodal features obtained by feature-level fusion, together with the text features, are used to enhance the speech features. Our proposed method outperforms existing state-of-the-art approaches on the IEMOCAP and MELD datasets.
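
The guidance idea in the abstract, where text features and feature-level-fused features are used to enhance speech features, can be illustrated with a small cross-attention sketch. This is not the authors' implementation: the abstract does not say how the three Cross-Transformer Encoders are wired, so the single-head attention, the concatenation used as feature-level fusion, the toy dimensions, and every name below (`cross_attention`, `make_weights`, `enhanced_speech`) are assumptions chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query_seq, guide_seq, w_q, w_k, w_v):
    """Single-head cross-attention: query_seq attends over guide_seq.

    Returns the residual-updated query sequence and the attention weights.
    """
    q = query_seq @ w_q
    k = guide_seq @ w_k
    v = guide_seq @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return query_seq + weights @ v, weights


rng = np.random.default_rng(0)
d = 8  # shared feature size (assumed; both modalities already projected to d)

speech = rng.normal(size=(5, d))  # 5 speech frames (toy data)
text = rng.normal(size=(3, d))    # 3 text tokens (toy data)

# Assumption: feature-level fusion is modelled as concatenation along time.
fused = np.concatenate([speech, text], axis=0)


def make_weights():
    # Hypothetical random projections standing in for learned parameters.
    return [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3)]


# Speech is the query in both calls; text and fused features act as guides.
speech_by_text, w_text = cross_attention(speech, text, *make_weights())
speech_by_fused, w_fused = cross_attention(speech, fused, *make_weights())

# Assumption: the two guided views are joined along the feature axis.
enhanced_speech = np.concatenate([speech_by_text, speech_by_fused], axis=-1)
print(enhanced_speech.shape)
```

In this sketch the speech sequence keeps its own length while each frame gathers information from the guiding sequence, which is one plausible reading of "modal guidance"; a full encoder would add multi-head attention, layer normalization, and a feed-forward block.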

Cite

Wang, Y., Gu, Y., Yin, Y., Han, Y., Zhang, H., Wang, S., … Quan, D. (2023). Multimodal transformer augmented fusion for speech emotion recognition. Frontiers in Neurorobotics, 17. https://doi.org/10.3389/fnbot.2023.1181598