KnowER: Knowledge enhancement for efficient text-video retrieval

Abstract

The widespread adoption of mobile Internet and the Internet of Things (IoT) has led to a significant increase in the amount of video data. Although video data are increasingly important, language and text remain the primary means of interaction in everyday communication, so text-based cross-modal retrieval has become a crucial demand in many applications. Most previous text-video retrieval works exploit the implicit knowledge of pre-trained models such as contrastive language-image pre-training (CLIP) to boost retrieval performance. However, implicit knowledge only records the co-occurrence relationships present in the data, and it cannot help the model understand specific words or scenes. Another type of out-of-domain knowledge, explicit knowledge, usually takes the form of a knowledge graph and can play an auxiliary role in understanding the content of different modalities. Therefore, we study the application of an external knowledge base to text-video retrieval models for the first time and propose KnowER, a model based on knowledge enhancement for efficient text-video retrieval. The knowledge-enhanced model achieves state-of-the-art performance on three widely used text-video retrieval datasets: MSRVTT, DiDeMo, and MSVD.

Cite

Kou, H., Yang, Y., & Hua, Y. (2023). KnowER: Knowledge enhancement for efficient text-video retrieval. Intelligent and Converged Networks, 4(2), 93–105. https://doi.org/10.23919/ICN.2023.0009