Multimodal learning aims to process and relate information from different modalities to enhance a model's capacity for perception. Current multimodal fusion mechanisms either do not closely align the feature spaces or are computationally expensive for training and inference. In this paper, we present X-Norm, a novel, simple, and efficient method for bimodal fusion that generates and exchanges limited but meaningful normalization parameters between the modalities, implicitly aligning the feature spaces. We conduct extensive experiments on two tasks, emotion recognition and action recognition, with different architectures including Transformer-based and CNN-based models, using IEMOCAP and MSP-IMPROV for emotion recognition and EPIC-KITCHENS for action recognition. The experimental results show that X-Norm achieves performance comparable or superior to existing methods, including early and late fusion, Gradient-Blending (G-Blend) [44], Tensor Fusion Network [48], and Multimodal Transformer [40], at a relatively low training cost.
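The following is a minimal PyTorch sketch of the idea the abstract describes, namely two unimodal streams that exchange normalization parameters (a scale and shift) so each modality's features are modulated by the other. It is an illustration under assumptions, not the authors' X-Norm implementation; the module and variable names (ExchangeNorm, BimodalFusion, dim, etc.) are hypothetical.

```python
# Illustrative sketch only: each modality generates (gamma, beta) that
# re-scale and shift the *other* modality's normalized features.
import torch
import torch.nn as nn


class ExchangeNorm(nn.Module):
    """Normalize one modality's features, then modulate them with
    parameters generated from the partner modality (assumed design)."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # Small head mapping partner features to (gamma, beta).
        self.to_gamma_beta = nn.Linear(dim, 2 * dim)

    def forward(self, x: torch.Tensor, partner: torch.Tensor) -> torch.Tensor:
        # x, partner: (batch, seq_len, dim)
        gamma, beta = self.to_gamma_beta(partner).chunk(2, dim=-1)
        return self.norm(x) * (1 + gamma) + beta


class BimodalFusion(nn.Module):
    """Two streams exchanging normalization parameters, then a classifier."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.norm_a = ExchangeNorm(dim)
        self.norm_b = ExchangeNorm(dim)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        fused_a = self.norm_a(feat_a, partner=feat_b)
        fused_b = self.norm_b(feat_b, partner=feat_a)
        # Pool over the sequence dimension and classify.
        pooled = torch.cat([fused_a.mean(dim=1), fused_b.mean(dim=1)], dim=-1)
        return self.classifier(pooled)


if __name__ == "__main__":
    model = BimodalFusion(dim=64, num_classes=4)
    audio = torch.randn(2, 50, 64)   # e.g., acoustic features
    video = torch.randn(2, 50, 64)   # e.g., visual features
    print(model(audio, video).shape)  # torch.Size([2, 4])
```

Because only a small set of normalization parameters is exchanged rather than full feature maps or cross-attention activations, this style of fusion adds little training overhead, which is consistent with the efficiency claim in the abstract.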
CITATION STYLE
Yin, Y., Xu, J., Zu, T., & Soleymani, M. (2022). X-Norm: Exchanging Normalization Parameters for Bimodal Fusion. In ACM International Conference Proceeding Series (pp. 605–614). Association for Computing Machinery. https://doi.org/10.1145/3536221.3556581