Video-Text Retrieval by Supervised Sparse Multi-Grained Learning

Abstract

Recent progress in video-text retrieval has been driven largely by the exploration of better representation learning. In this paper, we present a novel multi-grained sparse learning framework, S3MA, that learns an aligned sparse space shared between the video and the text for video-text retrieval. The shared sparse space is initialized with a finite number of sparse concepts, each of which refers to a number of words. With the text data at hand, we learn and update the shared sparse space in a supervised manner using the proposed similarity and alignment losses. Moreover, to enable multi-grained alignment, we incorporate frame representations to better model the video modality and to calculate fine-grained and coarse-grained similarities. Benefiting from the learned shared sparse space and multi-grained similarities, S3MA outperforms existing methods in extensive experiments on several video-text retrieval benchmarks. Our code is available at link.
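The abstract names three ingredients: a coarse-grained similarity, a fine-grained similarity built from frame representations, and a shared sparse space of concepts. The sketch below illustrates one plausible reading of those ideas on toy vectors and is not the S3MA implementation. Every name in it (`cosine`, `mean_pool`, `sparse_code`, `multi_grained_similarity`) is hypothetical, and the specific choices (mean pooling over frames, a maximum over frames for the fine-grained score, top-k selection over concept scores) are assumptions made for illustration; the abstract does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_pool(frames):
    """Average frame vectors into a single video-level vector (assumed pooling)."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def sparse_code(x, concepts, k):
    """Score x against each concept vector and keep only the k largest scores.

    Assumption: sparsity is imposed by top-k selection; all other entries are zeroed.
    """
    scores = [cosine(x, c) for c in concepts]
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [s if i in keep else 0.0 for i, s in enumerate(scores)]


def multi_grained_similarity(frames, text, concepts, k=2):
    """Return coarse, fine, and sparse-space similarities for one video-text pair."""
    video = mean_pool(frames)
    coarse = cosine(video, text)                     # video-level vs. sentence-level
    fine = max(cosine(f, text) for f in frames)      # best-matching frame vs. sentence
    sparse = cosine(sparse_code(video, concepts, k),
                    sparse_code(text, concepts, k))  # agreement in the shared concept space
    return {"coarse": coarse, "fine": fine, "sparse": sparse}


if __name__ == "__main__":
    # Toy data: two frames, one sentence embedding, three concept vectors.
    frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    text = [1.0, 0.0, 0.0]
    concepts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(multi_grained_similarity(frames, text, concepts))
```

In this toy setup the fine-grained score is higher than the coarse one because a single frame matches the sentence exactly, while averaging the frames dilutes that match. This is the kind of difference that motivates combining granularities. The similarity and alignment losses mentioned in the abstract are not sketched here because the abstract does not define them.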

Cite

Wang, Y., & Shi, P. (2023). Video-Text Retrieval by Supervised Sparse Multi-Grained Learning. In Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 633–649). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.findings-emnlp.46
