Forward and backward multimodal NMT for improved monolingual and multilingual cross-modal retrieval

Po Yao Huang; Xiaojun Chang; Alexander Hauptmann; Eduard Hovy

Conference Proceedings, Open Access

ICMR 2020 - Proceedings of the 2020 International Conference on Multimedia Retrieval (2020) 53-62

DOI: 10.1145/3372278.3390674

Abstract

We explore methods to enrich the diversity of captions associated with pictures for learning improved visual-semantic embeddings (VSE) in cross-modal retrieval. In the spirit of "A picture is worth a thousand words", it would take dozens of sentences to parallel each picture's content adequately. In practice, however, real-world multimodal datasets tend to provide only a few (typically five) descriptions per image. For cross-modal retrieval, the resulting lack of diversity and coverage prevents systems from capturing the fine-grained inter-modal dependencies and intra-modal diversities in the shared VSE space. Because encoder-decoder architectures in neural machine translation (NMT) can enrich both monolingual and multilingual textual diversity, we propose a novel framework that leverages multimodal neural machine translation (MMT) to perform forward and backward translations based on salient visual objects. The additional text-image pairs generated this way enable training improved monolingual (English-Image) and multilingual (English-Image and German-Image) cross-modal retrieval models. Experimental results show that the proposed framework can substantially and consistently improve the performance of state-of-the-art models on multiple datasets. The results also suggest that models with multilingual VSE outperform models with monolingual VSE.
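The augmentation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `augment_captions`, the `Translator` signature, and the toy translators are all hypothetical, and the English-to-German direction for the forward pass is an assumption. In a real system the two callables would be trained MMT models conditioned on detected visual objects.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical signature: (sentence, visual_objects) -> translated sentence.
Translator = Callable[[str, Sequence[str]], str]


def augment_captions(
    captions: Dict[str, List[str]],
    objects: Dict[str, List[str]],
    forward: Translator,
    backward: Translator,
) -> Dict[str, Dict[str, List[str]]]:
    """Add forward and backward translations to each image's caption set.

    Assumption: `forward` maps English to German and `backward` maps
    German back to English, both given the image's salient objects.
    """
    augmented: Dict[str, Dict[str, List[str]]] = {}
    for image_id, english in captions.items():
        visual = objects.get(image_id, [])
        # Forward pass: one German caption per original English caption.
        german = [forward(caption, visual) for caption in english]
        # Backward pass: translate the German captions back into English.
        back_translated = [backward(caption, visual) for caption in german]
        # Keep only back-translations that are not already in the set.
        seen = set(english)
        extra: List[str] = []
        for sentence in back_translated:
            if sentence not in seen:
                seen.add(sentence)
                extra.append(sentence)
        augmented[image_id] = {"en": english + extra, "de": german}
    return augmented


# Toy stand-ins for MMT models, for illustration only.
def toy_forward(sentence: str, visual_objects: Sequence[str]) -> str:
    return "de:" + sentence


def toy_backward(sentence: str, visual_objects: Sequence[str]) -> str:
    return sentence[len("de:"):] + " [" + ", ".join(visual_objects) + "]"
```

Calling `augment_captions({"img1": ["a dog runs"]}, {"img1": ["dog"]}, toy_forward, toy_backward)` yields one German caption and one extra English caption for `img1`. The enlarged English-Image and German-Image pair sets would then feed a VSE training loop, which is outside the scope of this sketch.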

Cite

Huang, P. Y., Chang, X., Hauptmann, A., & Hovy, E. (2020). Forward and backward multimodal NMT for improved monolingual and multilingual cross-modal retrieval. In ICMR 2020 - Proceedings of the 2020 International Conference on Multimedia Retrieval (pp. 53–62). Association for Computing Machinery, Inc. https://doi.org/10.1145/3372278.3390674
