Abstract
With the explosive growth of multimedia data on the Internet, cross-modal retrieval has attracted a great deal of attention in the computer vision and multimedia communities. The task remains challenging, however, because of the heterogeneity gap between modalities. Current approaches typically learn a common representation that maps data from different modalities into a shared space via linear or nonlinear functions. Yet most of them 1) handle only the dual-modal setting and generalize poorly to more complex cases; 2) require example-level alignment of the training data, which is often prohibitively expensive in practice; and 3) do not fully exploit prior knowledge about the different modalities during the mapping process. In this paper, we address these issues by casting common representation learning as a question answering (QA) problem via a cross-modal memory neural network (CMMN). Specifically, the raw features of all modalities are treated as 'Questions', an extra discriminator selects high-quality ones as 'Statements' for storage in memory, and the common features are the desired 'Answers'. Experimental results show that CMMN achieves state-of-the-art performance on the Wiki and COCO datasets and outperforms other baselines on the large-scale scene dataset CMPlaces.
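Although the abstract gives no implementation details, the QA analogy it describes maps naturally onto a standard memory-network read step. The PyTorch sketch below is illustrative only: the MemoryReader class, the dimensions, and the single-hop additive read are assumptions rather than the authors' exact CMMN architecture, and the discriminator that selects 'Statements' is abstracted away as a pre-filtered memory tensor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryReader(nn.Module):
    """Minimal single-hop memory-network read: a raw modality feature
    (the 'Question') attends over stored 'Statements' and returns a
    common-space representation (the 'Answer'). Layer choices and
    dimensions are illustrative assumptions, not the paper's design."""

    def __init__(self, q_dim, m_dim, common_dim):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, common_dim)  # embed the 'Question'
        self.m_key = nn.Linear(m_dim, common_dim)   # keys for stored 'Statements'
        self.m_val = nn.Linear(m_dim, common_dim)   # values for stored 'Statements'

    def forward(self, question, memory):
        # question: (B, q_dim) raw modality features
        # memory:   (N, m_dim) high-quality 'Statements' kept in storage
        q = self.q_proj(question)               # (B, d)
        keys = self.m_key(memory)               # (N, d)
        vals = self.m_val(memory)               # (N, d)
        attn = F.softmax(q @ keys.t(), dim=-1)  # (B, N) soft memory addressing
        read = attn @ vals                      # (B, d) attention-weighted read
        return q + read                         # the 'Answer': common representation

# Toy usage with made-up sizes.
reader = MemoryReader(q_dim=128, m_dim=256, common_dim=64)
img_feat = torch.randn(4, 128)     # e.g. image features posed as 'Questions'
stored = torch.randn(50, 256)      # 'Statements' pre-selected by a discriminator
common = reader(img_feat, stored)  # common-space 'Answers'
print(common.shape)                # torch.Size([4, 64])
```

In this sketch, retrieval across modalities would amount to embedding each modality's queries through such a reader and comparing the resulting common-space vectors, e.g. by cosine similarity.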
Citation
Song, G., & Tan, X. (2017). Cross-modal retrieval via memory network. In British Machine Vision Conference 2017, BMVC 2017. BMVA Press. https://doi.org/10.5244/c.31.178