Abstract
The widespread adoption of mobile Internet and the Internet of things (IoT) has led to a significant increase in the amount of video data. While video data are increasingly important, language and text remain the primary methods of interaction in everyday communication, text-based cross-modal retrieval has become a crucial demand in many applications. Most previous text-video retrieval works utilize implicit knowledge of pre-trained models such as contrastive language-image pre-training (CLIP) to boost retrieval performance. However, implicit knowledge only records the co-occurrence relationship existing in the data, and it cannot assist the model to understand specific words or scenes. Another type of out-of-domain knowledge—explicit knowledge—which is usually in the form of a knowledge graph, can play an auxiliary role in understanding the content of different modalities. Therefore, we study the application of external knowledge base in text-video retrieval model for the first time, and propose KnowER, a model based on knowledge enhancement for efficient text-video retrieval. The knowledge-enhanced model achieves state-of-the-art performance on three widely used text-video retrieval datasets, i.e., MSRVTT, DiDeMo, and MSVD.
Author supplied keywords
Cite
CITATION STYLE
Kou, H., Yang, Y., & Hua, Y. (2023). KnowER: Knowledge enhancement for efficient text-video retrieval. Intelligent and Converged Networks, 4(2), 93–105. https://doi.org/10.23919/ICN.2023.0009
Register to see more suggestions
Mendeley helps you to discover research relevant for your work.