Matching images and text with deep models has been extensively studied in recent years. Mining the correlation between images and text to learn effective multi-modal features is crucial for image-text matching; however, most existing approaches model the different types of correlation independently. In this work, we propose a novel model named Adversarial Attentive Multi-modal Embedding Learning (AAMEL) for image-text matching. It combines adversarial networks and an attention mechanism to learn effective and robust multi-modal embeddings for better matching between images and text. Adversarial learning is implemented as an interplay between two processes. First, two attention models are proposed to exploit two types of correlation between the image and text for multi-modal embedding learning and to confuse the other process. Then a discriminator tries to distinguish the two types of multi-modal embeddings produced by the two attention models, whereby the two attention models reinforce each other. Through adversarial learning, both embeddings are expected to exploit both types of correlation well, so that each can deceive the discriminator into believing it was generated by the other attention-based model. By integrating the attention mechanism with adversarial learning, the learned multi-modal embeddings become more effective for image-text matching. Extensive experiments on the benchmark Flickr30K and MSCOCO datasets demonstrate the superiority of the proposed approach over state-of-the-art methods for image-text retrieval.
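To make the adversarial interplay described above more concrete, here is a minimal PyTorch sketch of the general idea: two cross-attention modules each produce a multi-modal embedding, a discriminator tries to tell the two embedding types apart, and the attention modules are trained both to match image and text and to fool the discriminator. The module names, feature dimensions, and the simplified cosine-similarity matching loss are illustrative assumptions, not the authors' exact formulation.

```python
# Minimal sketch of attention-based embedding generators trained adversarially
# against a discriminator. All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionEmbedding(nn.Module):
    """Attend from one modality (query) to the other (context) and pool the
    result into a single joint embedding."""
    def __init__(self, dim=1024):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, query, context):
        # query: (B, Nq, D), context: (B, Nc, D)
        attn = torch.softmax(query @ context.transpose(1, 2) / query.size(-1) ** 0.5, dim=-1)
        attended = attn @ context                        # (B, Nq, D)
        fused = self.proj(query + attended).mean(dim=1)  # pooled joint embedding (B, D)
        return F.normalize(fused, dim=-1)

class Discriminator(nn.Module):
    """Predicts which attention model produced a given multi-modal embedding."""
    def __init__(self, dim=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, emb):
        return self.net(emb)  # logit: 1 -> image-attends-text, 0 -> text-attends-image

def training_step(img_regions, txt_words, att_i2t, att_t2i, disc,
                  opt_gen, opt_disc, lambda_adv=0.1):
    """One adversarial step on a batch of image-region and word features."""
    emb_i2t = att_i2t(img_regions, txt_words)   # image queries attend to words
    emb_t2i = att_t2i(txt_words, img_regions)   # word queries attend to regions
    ones = torch.ones(emb_i2t.size(0), 1)
    zeros = torch.zeros(emb_t2i.size(0), 1)

    # Discriminator update: learn to tell the two embedding types apart.
    d_loss = (F.binary_cross_entropy_with_logits(disc(emb_i2t.detach()), ones)
              + F.binary_cross_entropy_with_logits(disc(emb_t2i.detach()), zeros))
    opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()

    # Generator update: match the two embeddings and fool the discriminator
    # by presenting each embedding as if it came from the other attention model.
    match_loss = (1 - F.cosine_similarity(emb_i2t, emb_t2i)).mean()  # stand-in for a ranking loss
    adv_loss = (F.binary_cross_entropy_with_logits(disc(emb_i2t), zeros)
                + F.binary_cross_entropy_with_logits(disc(emb_t2i), ones))
    g_loss = match_loss + lambda_adv * adv_loss
    opt_gen.zero_grad(); g_loss.backward(); opt_gen.step()
    return d_loss.item(), g_loss.item()
```

In this sketch the discriminator and the two attention models play the two adversarial roles: as the discriminator gets better at separating the two embedding types, the attention models are pushed to produce embeddings that capture both kinds of image-text correlation, which is the mutual reinforcement the abstract describes.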
CITATION STYLE
Wei, K., & Zhou, Z. (2020). Adversarial Attentive Multi-Modal Embedding Learning for Image-Text Matching. IEEE Access, 8, 96237–96248. https://doi.org/10.1109/ACCESS.2020.2996407