X-Norm: Exchanging Normalization Parameters for Bimodal Fusion


Abstract

Multimodal learning aims to process and relate information from different modalities to enhance a model's perceptual capacity. Existing multimodal fusion mechanisms either fail to align the feature spaces closely or are expensive to train and run at inference. In this paper, we present X-Norm, a novel, simple, and efficient method for bimodal fusion that generates and exchanges limited but meaningful normalization parameters between the modalities, implicitly aligning the feature spaces. We conduct extensive experiments on two tasks, emotion recognition and action recognition, with different architectures including Transformer-based and CNN-based models, using IEMOCAP and MSP-IMPROV for emotion recognition and EPIC-KITCHENS for action recognition. The experimental results show that X-Norm achieves comparable or superior performance to existing methods, including early and late fusion, Gradient-Blending (G-Blend) [44], Tensor Fusion Network [48], and Multimodal Transformer [40], with a relatively low training cost.
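To make the core idea concrete, the sketch below shows one plausible reading of the mechanism the abstract describes: each modality normalizes its own features without affine terms, and the scale (gamma) and shift (beta) applied after normalization are generated from the other modality's features, so each stream is modulated by its peer. The class name, the linear parameter generators, and the use of LayerNorm are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class XNormBlock(nn.Module):
    """Minimal sketch of bimodal normalization-parameter exchange.

    Each modality is normalized without learned affine parameters;
    the (gamma, beta) applied afterwards are generated from the
    *other* modality, implicitly aligning the two feature spaces.
    This is a hypothetical reconstruction, not the paper's code.
    """

    def __init__(self, dim_a: int, dim_b: int):
        super().__init__()
        # Parameter-free normalization; affine terms come from the peer modality.
        self.norm_a = nn.LayerNorm(dim_a, elementwise_affine=False)
        self.norm_b = nn.LayerNorm(dim_b, elementwise_affine=False)
        # Lightweight generators mapping peer features to (gamma, beta).
        self.gen_for_a = nn.Linear(dim_b, 2 * dim_a)
        self.gen_for_b = nn.Linear(dim_a, 2 * dim_b)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        # feat_a: (batch, dim_a), feat_b: (batch, dim_b)
        gamma_a, beta_a = self.gen_for_a(feat_b).chunk(2, dim=-1)
        gamma_b, beta_b = self.gen_for_b(feat_a).chunk(2, dim=-1)
        out_a = gamma_a * self.norm_a(feat_a) + beta_a
        out_b = gamma_b * self.norm_b(feat_b) + beta_b
        return out_a, out_b


# Example: fuse hypothetical 256-d audio features with 512-d visual features.
block = XNormBlock(dim_a=256, dim_b=512)
audio = torch.randn(8, 256)
video = torch.randn(8, 512)
fused_audio, fused_video = block(audio, video)
```

Because only small affine vectors cross between the streams, such an exchange adds far fewer parameters and compute than full cross-attention, which is consistent with the low training cost the abstract reports.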

Cite (APA)

Yin, Y., Xu, J., Zu, T., & Soleymani, M. (2022). X-Norm: Exchanging Normalization Parameters for Bimodal Fusion. In ACM International Conference Proceeding Series (pp. 605–614). Association for Computing Machinery. https://doi.org/10.1145/3536221.3556581
