Multi-modal emotion recognition has gained increasing attention in recent years due to its widespread applications and advances in multi-modal learning approaches. However, previous studies primarily focus on models that fuse multiple modalities into a unified representation. In this paper, we propose that maintaining modality independence is beneficial to model performance. Following this principle, we construct a dataset and devise a multi-modal transformer model. The new dataset, the CHinese Emotion Recognition dataset with Modality-wise Annotations, abbreviated as CHERMA, provides uni-modal labels for each individual modality and multi-modal labels for all modalities jointly observed. The model consists of uni-modal transformer modules that learn representations for each modality and a multi-modal transformer module that fuses all modalities. Each module is supervised by its corresponding label separately, and information flows uni-directionally from the uni-modal modules to the multi-modal module. The supervision strategy and the model architecture ensure that each modality learns its representation independently while the multi-modal module aggregates information from all modalities. Extensive empirical results demonstrate that our proposed scheme outperforms state-of-the-art alternatives, corroborating the importance of modality independence in multi-modal emotion recognition. The dataset and code are available at https://github.com/sunjunaimer/LFMIM.
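The supervision scheme described above can be illustrated with a minimal sketch. The code below is not the authors' implementation: the module names, dimensions, mean-pooling, and the use of detach() to realize the uni-directional information flow are assumptions for illustration only; it merely shows one way that per-modality encoders supervised by uni-modal labels can feed a fusion module supervised by the multi-modal label.

```python
# Minimal sketch (assumed, not the authors' code) of the abstract's supervision scheme:
# one transformer encoder per modality, each trained on its own uni-modal label,
# plus a fusion transformer trained on the multi-modal label.
import torch
import torch.nn as nn

class UniModalEncoder(nn.Module):
    """Transformer encoder + classifier head for a single modality."""
    def __init__(self, dim=256, num_layers=4, num_classes=7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                   # x: (batch, seq_len, dim)
        h = self.encoder(x)                 # modality-specific representation
        logits = self.head(h.mean(dim=1))   # uni-modal prediction
        return h, logits

class MultiModalFusion(nn.Module):
    """Fusion transformer over the concatenated uni-modal representations."""
    def __init__(self, dim=256, num_layers=2, num_classes=7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, h_text, h_audio, h_vision):
        # detach() blocks gradients from the multi-modal loss from reaching the
        # uni-modal encoders, so each modality learns only from its own label
        # (one possible reading of the uni-directional information flow).
        h = torch.cat([h_text.detach(), h_audio.detach(), h_vision.detach()], dim=1)
        h = self.encoder(h)
        return self.head(h.mean(dim=1))

def training_step(text, audio, vision, y_t, y_a, y_v, y_mm, modules, criterion):
    """Each module is supervised by its corresponding label separately."""
    enc_t, enc_a, enc_v, fusion = modules
    h_t, logit_t = enc_t(text)
    h_a, logit_a = enc_a(audio)
    h_v, logit_v = enc_v(vision)
    logit_mm = fusion(h_t, h_a, h_v)
    loss = (criterion(logit_t, y_t) + criterion(logit_a, y_a)
            + criterion(logit_v, y_v) + criterion(logit_mm, y_mm))
    return loss
```

In this reading, the uni-modal losses are the only source of gradients for the per-modality encoders, while the fusion module still sees all modalities; the actual model in the paper performs layer-wise fusion, which this single-stage sketch does not reproduce.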
Citation
Sun, J., Han, S., Ruan, Y. P., Zhang, X., Liu, Y., Huang, Y., … Li, T. (2023). Layer-wise Fusion with Modality Independence Modeling for Multi-modal Emotion Recognition. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 658–670). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.acl-long.39