Mutual Complementarity: Multi-Modal Enhancement Semantic Learning for Micro-Video Scene Recognition


Abstract

Scene recognition is one of the hot topics in micro-video understanding, where multi-modal information is commonly used for its efficient representation ability. However, using multi-modal information is challenging because the semantic consistency among the modalities of a micro-video is weaker than in traditional videos, and the contributions of the individual modalities differ. To address these issues, this study proposes a multi-modal enhancement semantic learning method for micro-video scene recognition. In the proposed method, the visual modality is treated as the main modality, whereas other modalities such as text and audio are treated as auxiliary modalities. We propose a deep multi-modal fusion network for scene recognition in which the semantics of the auxiliary modalities are enhanced by the main modality. Furthermore, the fusion weights of the modalities are learned adaptively. Experiments demonstrate the effectiveness of the enhancement and the adaptive weight learning in multi-modal fusion for micro-video scene recognition.
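
For illustration, below is a minimal PyTorch-style sketch of the idea described in the abstract: visual features act as the main modality, gate-based enhancement refines the text and audio features, and a small network predicts softmax-normalized fusion weights. The layer names, feature dimensions, and the specific gating/weighting mechanism are assumptions for the sketch, not the authors' actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class EnhancedFusionNet(nn.Module):
    """Sketch: visual (main) modality enhances text/audio (auxiliary), then adaptive weighted fusion."""
    def __init__(self, vis_dim=2048, txt_dim=300, aud_dim=128, hid=512, num_classes=20):
        super().__init__()
        # Project each modality into a shared hidden space
        self.vis_proj = nn.Linear(vis_dim, hid)
        self.txt_proj = nn.Linear(txt_dim, hid)
        self.aud_proj = nn.Linear(aud_dim, hid)
        # Enhancement: gate auxiliary features conditioned on the visual features
        self.txt_gate = nn.Linear(2 * hid, hid)
        self.aud_gate = nn.Linear(2 * hid, hid)
        # Adaptive fusion weights predicted from all three modalities
        self.weight_net = nn.Linear(3 * hid, 3)
        self.classifier = nn.Linear(hid, num_classes)

    def forward(self, vis, txt, aud):
        v = F.relu(self.vis_proj(vis))
        t = F.relu(self.txt_proj(txt))
        a = F.relu(self.aud_proj(aud))
        # Enhance auxiliary modalities using visual semantics
        t = t * torch.sigmoid(self.txt_gate(torch.cat([v, t], dim=-1)))
        a = a * torch.sigmoid(self.aud_gate(torch.cat([v, a], dim=-1)))
        # Learn per-sample fusion weights over the three modalities
        w = torch.softmax(self.weight_net(torch.cat([v, t, a], dim=-1)), dim=-1)
        fused = w[:, 0:1] * v + w[:, 1:2] * t + w[:, 2:3] * a
        return self.classifier(fused)

# Example usage with random features for a batch of 4 micro-videos
# logits = EnhancedFusionNet()(torch.randn(4, 2048), torch.randn(4, 300), torch.randn(4, 128))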

Citation (APA)

Guo, J., Nie, X., & Yin, Y. (2020). Mutual Complementarity: Multi-Modal Enhancement Semantic Learning for Micro-Video Scene Recognition. IEEE Access, 8, 29518–29524. https://doi.org/10.1109/ACCESS.2020.2973240
