Learning joint embedding with multimodal cues for cross-modal video-text retrieval

248 citations · 117 Mendeley readers

Abstract

Constructing a joint representation invariant across different modalities (e.g., video, language) is of significant importance in many multimedia applications. While there have been a number of recent successes in developing effective image-text retrieval methods by learning joint representations, the video-text retrieval task has not been explored to the same extent. In this paper, we study how to effectively utilize available multimodal cues from videos for the cross-modal video-text retrieval task. Based on our analysis, we propose a novel framework that simultaneously utilizes multimodal features (different visual characteristics, audio inputs, and text) through a fusion strategy for efficient retrieval. Furthermore, we explore several loss functions for training the embedding and propose a modified pairwise ranking loss for the task. Experiments on the MSVD and MSR-VTT datasets demonstrate that our method achieves significant performance gains compared to state-of-the-art approaches.
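The abstract mentions learning a joint video-text embedding with a modified pairwise ranking loss. Below is a minimal PyTorch sketch of a bidirectional max-margin ranking loss of the kind commonly used for such embeddings; the function name, the margin value, and the hard-negative emphasis are illustrative assumptions, not the authors' exact formulation.

```python
# A minimal sketch (not the authors' code) of a max-margin pairwise
# ranking loss over a joint video-text embedding space.
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(video_emb, text_emb, margin=0.2, hard_negatives=True):
    """video_emb, text_emb: (batch, dim) embeddings of matched video-caption pairs."""
    video_emb = F.normalize(video_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=1)
    scores = video_emb @ text_emb.t()           # cosine similarities, (batch, batch)
    pos = scores.diag().view(-1, 1)             # similarity of each matching pair

    # Hinge costs: retrieving text given video (rows) and video given text (columns).
    cost_text = (margin + scores - pos).clamp(min=0)
    cost_video = (margin + scores - pos.t()).clamp(min=0)

    # Positive pairs incur no cost against themselves.
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_text = cost_text.masked_fill(mask, 0)
    cost_video = cost_video.masked_fill(mask, 0)

    if hard_negatives:
        # Penalize only the hardest negative per query instead of summing over all.
        return cost_text.max(dim=1)[0].mean() + cost_video.max(dim=0)[0].mean()
    return cost_text.sum() + cost_video.sum()
```

In practice, the video embedding fed to such a loss would come from fusing the visual and audio features the paper describes, and the text embedding from a sentence encoder, both projected into a common space.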

Cite (APA)

Mithun, N. C., Li, J., Metze, F., & Roy-Chowdhury, A. K. (2018). Learning joint embedding with multimodal cues for cross-modal video-text retrieval. In ICMR 2018 - Proceedings of the 2018 ACM International Conference on Multimedia Retrieval (pp. 19–27). Association for Computing Machinery, Inc. https://doi.org/10.1145/3206025.3206064
