Multi-modal emotion recognition aims to recognize emotional states from multi-modal inputs. Existing end-to-end models typically fuse the uni-modal representations in the last layers without leveraging the multi-modal interactions among the intermediate representations. In this paper, we propose the multi-modal Recurrent Intermediate-Layer Aggregation (RILA) model to explore the effectiveness of leveraging the multi-modal interactions among the intermediate representations of deep pre-trained transformers for end-to-end emotion recognition. At the heart of our model is the Intermediate-Representation Fusion Module (IRFM), which consists of a multi-modal aggregation gating module and a multi-modal token attention module. Specifically, at each layer, we first use the multi-modal aggregation gating module to capture the utterance-level interactions across modalities and layers. We then use the multi-modal token attention module to leverage the token-level multi-modal interactions. Experimental results on IEMOCAP and CMU-MOSEI show that our model achieves state-of-the-art performance by fully exploiting the multi-modal interactions among the intermediate representations.
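The abstract only names the two sub-modules of the IRFM; the sketch below is a minimal PyTorch illustration of how one such intermediate-layer fusion step could be structured, not the authors' implementation. All class names, tensor shapes, and the particular gating/attention formulations (e.g., IRFMLayerSketch, the sigmoid gate) are assumptions made for illustration.

```python
# Minimal sketch (assumed, not the authors' code) of one intermediate-layer
# fusion step: a gating module mixes the current layer's representation with
# the aggregate carried over from earlier layers (utterance level), then
# cross-modal token attention fuses token-level information across modalities.
import torch
import torch.nn as nn


class IRFMLayerSketch(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Utterance-level gate: decides how much of the current layer's
        # representation to merge into the running multi-layer aggregate.
        self.gate = nn.Linear(2 * dim, dim)
        # Token-level cross-modal attention: one modality's tokens attend to
        # the other modality's tokens (a symmetric branch could be added).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, agg_a, layer_a, tokens_t):
        # agg_a:    (B, T, D) aggregate of audio representations from earlier layers
        # layer_a:  (B, T, D) audio representation from the current transformer layer
        # tokens_t: (B, S, D) text token representations from the current layer
        g = torch.sigmoid(self.gate(torch.cat([agg_a, layer_a], dim=-1)))
        fused = g * layer_a + (1.0 - g) * agg_a           # gated aggregation across layers
        attended, _ = self.cross_attn(fused, tokens_t, tokens_t)
        return self.norm(fused + attended)                # token-level multi-modal fusion


if __name__ == "__main__":
    B, T, S, D = 2, 50, 40, 256
    layer = IRFMLayerSketch(D)
    out = layer(torch.randn(B, T, D), torch.randn(B, T, D), torch.randn(B, S, D))
    print(out.shape)  # torch.Size([2, 50, 256])
```

In this sketch the gate output would serve as the aggregate passed to the next transformer layer, so the fusion is applied recurrently across the intermediate layers rather than only at the top.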
Wu, Y., Zhang, Z., Peng, P., Zhao, Y., & Qin, B. (2022). Leveraging Multi-modal Interactions among the Intermediate Representations of Deep Transformers for Emotion Recognition. In MuSe 2022 - Proceedings of the 3rd International Multimodal Sentiment Analysis Workshop and Challenge (pp. 101–109). Association for Computing Machinery, Inc. https://doi.org/10.1145/3551876.3554813