Progressive Learning for Image Retrieval with Hybrid-Modality Queries


Abstract

Image retrieval with hybrid-modality queries, also known as composing text and image for image retrieval (CTI-IR), is a retrieval task in which the search intention is expressed in a more complex query format involving both the vision and text modalities. For example, a target product image is searched for using a reference product image together with text describing changes to certain attributes of the reference image. It is a more challenging image retrieval task that requires both semantic space learning and cross-modal fusion. Previous approaches that attempt to handle both aspects achieve unsatisfactory performance. In this paper, we decompose the CTI-IR task into a three-stage learning problem to progressively learn the complex knowledge required for image retrieval with hybrid-modality queries. We first leverage the semantic embedding space for open-domain image-text retrieval, then transfer the learned knowledge to the fashion domain with fashion-related pre-training tasks. Finally, we extend the pre-trained model from single-query to hybrid-modality-query retrieval for the CTI-IR task. Furthermore, because the contribution of each modality in a hybrid-modality query varies across retrieval scenarios, we propose a self-supervised adaptive weighting strategy that dynamically determines the importance of the image and the text in the hybrid-modality query for better retrieval. Extensive experiments show that our proposed model significantly outperforms state-of-the-art methods in mean Recall@K by 24.9% and 9.5% on the Fashion-IQ and Shoes benchmark datasets, respectively.
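As a rough illustration of the adaptive weighting idea sketched in the abstract, the fragment below fuses an image embedding and a text embedding with query-dependent weights. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: `img_gate` and `txt_gate` are hypothetical learned scoring vectors standing in for the paper's self-supervised weighting module.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def adaptive_fuse(img_emb, txt_emb, img_gate, txt_gate):
    """Fuse image and text query embeddings with query-dependent weights.

    Each modality's embedding is scored by its (hypothetical, learned)
    gate vector; a softmax turns the two scores into fusion weights that
    sum to 1, so the dominant modality can change per query.
    """
    scores = np.array([img_gate @ img_emb, txt_gate @ txt_emb])
    w = softmax(scores)                      # [w_image, w_text]
    fused = w[0] * img_emb + w[1] * txt_emb  # hybrid-modality query vector
    return fused, w
```

At retrieval time, `fused` would be compared against gallery image embeddings (e.g. by cosine similarity) to rank candidates; the weights `w` let a text-heavy modification request down-weight the reference image, and vice versa.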


Citation (APA)

Zhao, Y., Song, Y., & Jin, Q. (2022). Progressive Learning for Image Retrieval with Hybrid-Modality Queries. In SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1012–1021). Association for Computing Machinery, Inc. https://doi.org/10.1145/3477495.3532047
