Learning Joint Multimodal Representation Based on Multi-fusion Deep Neural Networks

Abstract

Recently, learning joint representations of multimodal data has received increasing attention. Multimodal features are concept-level composite features that are more effective than single-modality features. Most existing methods mine interactions between modalities only once, at the top of their networks, to learn a multimodal representation. In this paper, we propose a multi-fusion deep learning framework that learns semantically richer multimodal features. The framework sets multiple fusion points at different levels of the feature space, then integrates and passes the fused information step by step from the low level to higher levels. Moreover, we propose a multi-channel decoding network with an alternate fine-tuning strategy to fully mine modality-specific information and cross-modality correlations. We are also the first to introduce deep learning features into multimodal deep learning, alleviating the semantic and statistical differences between modalities to learn better features. Extensive experiments on real-world datasets demonstrate that our proposed method achieves superior performance compared with state-of-the-art methods.
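The core idea described in the abstract, placing fusion points at several levels of the network and propagating the fused representation upward, can be illustrated with a rough sketch. The following is a minimal, hypothetical PyTorch example assuming two modalities (e.g., image and text features) and simple concatenation-based fusion; the layer sizes, module names, and fusion operator are illustrative assumptions, and the paper's multi-channel decoding network and alternate fine-tuning strategy are not shown.

```python
# Minimal sketch of multi-level fusion between two modality encoders.
# All sizes, names, and the concatenation-based fusion are assumptions,
# not the authors' exact architecture.
import torch
import torch.nn as nn


class MultiFusionEncoder(nn.Module):
    def __init__(self, dim_img=4096, dim_txt=2000, hidden=1024, levels=3):
        super().__init__()
        self.img_layers = nn.ModuleList()
        self.txt_layers = nn.ModuleList()
        self.fuse_layers = nn.ModuleList()
        d_img, d_txt, d_fuse = dim_img, dim_txt, 0
        for _ in range(levels):
            # Per-modality transformation at this level.
            self.img_layers.append(nn.Sequential(nn.Linear(d_img, hidden), nn.ReLU()))
            self.txt_layers.append(nn.Sequential(nn.Linear(d_txt, hidden), nn.ReLU()))
            # Fusion point: combine the current image/text features with the
            # fused representation passed up from the previous level.
            self.fuse_layers.append(
                nn.Sequential(nn.Linear(hidden * 2 + d_fuse, hidden), nn.ReLU())
            )
            d_img = d_txt = d_fuse = hidden

    def forward(self, x_img, x_txt):
        fused = None
        for img_l, txt_l, fuse_l in zip(self.img_layers, self.txt_layers, self.fuse_layers):
            x_img, x_txt = img_l(x_img), txt_l(x_txt)
            parts = [x_img, x_txt] if fused is None else [x_img, x_txt, fused]
            fused = fuse_l(torch.cat(parts, dim=1))
        return fused  # joint multimodal representation


# Example: a batch of 8 image-feature / text-feature pairs.
model = MultiFusionEncoder()
z = model(torch.randn(8, 4096), torch.randn(8, 2000))
print(z.shape)  # torch.Size([8, 1024])
```

In this sketch the fused representation at each level is fed into the next fusion point, which corresponds to the abstract's "integrates and passes the fused information step by step from the low level to higher levels."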

Citation (APA)

Gu, Z., Lang, B., Yue, T., & Huang, L. (2017). Learning Joint Multimodal Representation Based on Multi-fusion Deep Neural Networks. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10635 LNCS, pp. 276–285). Springer Verlag. https://doi.org/10.1007/978-3-319-70096-0_29
