Adversarial Attentive Multi-Modal Embedding Learning for Image-Text Matching

Abstract

Matching images and text with deep models has been extensively studied in recent years. Mining the correlation between image and text to learn effective multi-modal features is crucial for image-text matching. However, most existing approaches model the different types of correlation independently. In this work, we propose a novel model named Adversarial Attentive Multi-modal Embedding Learning (AAMEL) for image-text matching. It combines adversarial networks with an attention mechanism to learn effective and robust multi-modal embeddings for better matching between images and text. Adversarial learning is implemented as an interplay between two processes. First, two attention models are proposed to exploit two types of correlation between the image and text for multi-modal embedding learning, and to confuse the discriminator in the other process. Then the discriminator tries to distinguish the two types of multi-modal embeddings learned by the two attention models, through which the two attention models reinforce each other. Through adversarial learning, it is expected that both embeddings from the attention models exploit the two types of correlation well, so that each can deceive the discriminator into believing it was generated by the other attention-based model. By integrating the attention mechanism and adversarial learning, the learned multi-modal embeddings are more effective for image-text matching. Extensive experiments conducted on the benchmark Flickr30K and MSCOCO datasets demonstrate the superiority of the proposed approach over state-of-the-art methods on image-text retrieval.
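To make the adversarial interplay described above concrete, here is a minimal PyTorch sketch of one training step. Everything in it is an illustrative assumption, not the authors' AAMEL implementation: the two linear layers stand in for the paper's attention models, and the dimensions, batch, BCE losses, and Adam optimizers are placeholders.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Binary classifier: which attention model produced this embedding?"""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 256), nn.ReLU(),
            nn.Linear(256, 1))

    def forward(self, z):
        return self.net(z)  # logit: high -> "model A", low -> "model B"

# Placeholder embedding models; the paper's attention models would go here.
attn_a = nn.Linear(1024, 512)
attn_b = nn.Linear(1024, 512)
disc = Discriminator()
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(
    list(attn_a.parameters()) + list(attn_b.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)

feats = torch.randn(32, 1024)  # dummy fused image-text features

# Process 1: the discriminator learns to tell the two embeddings apart
# (label 1 for model A, label 0 for model B).
z_a, z_b = attn_a(feats), attn_b(feats)
d_loss = bce(disc(z_a.detach()), torch.ones(32, 1)) + \
         bce(disc(z_b.detach()), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Process 2: each attention model is trained with flipped labels, i.e. it
# tries to convince the discriminator its embedding came from the other model.
g_loss = bce(disc(attn_a(feats)), torch.zeros(32, 1)) + \
         bce(disc(attn_b(feats)), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

Under this scheme the discriminator pushes the two embedding distributions apart while the flipped-label objective pulls them together, so at equilibrium the two embeddings become indistinguishable, which is the abstract's expectation that each embedding captures both types of correlation.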

Cite

APA: Wei, K., & Zhou, Z. (2020). Adversarial Attentive Multi-Modal Embedding Learning for Image-Text Matching. IEEE Access, 8, 96237–96248. https://doi.org/10.1109/ACCESS.2020.2996407
