Recent works show the possibility of transferring the CLIP (Contrastive Language-Image Pretraining) model to video-text retrieval with promising performance. However, due to the domain gap between static images and videos, CLIP-based video-text retrieval models with interaction-based matching perform far worse than models with representation-based matching. In this paper, we propose a novel image animation strategy to effectively transfer the image-text CLIP model to video-text retrieval. By imitating the components of video shooting, we convert widely used image-language corpora into synthesized video-text data for pretraining. To reduce the time complexity of interaction-based matching, we further propose a coarse-to-fine framework that consists of dual encoders for fast candidate search and a cross-modality interaction module for fine-grained re-ranking. The coarse-to-fine framework, combined with the synthesized video-text pretraining, provides significant gains in retrieval accuracy while preserving efficiency. Comprehensive experiments on the MSR-VTT, MSVD, and VATEX datasets demonstrate the effectiveness of our approach.
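The two components above can be illustrated with short sketches. First, a minimal sketch of the image animation idea: the function animate_image and its parameters num_frames and zoom are hypothetical names introduced here, and the sketch only assumes that animation imitates camera motion (a diagonal pan over a zoomed-in crop of the still image); the paper's actual shooting transforms may differ.

import numpy as np

def animate_image(image, num_frames=8, zoom=1.2):
    # Synthesize pseudo-video frames from one static image by sliding a
    # zoomed-in crop window diagonally, imitating a slow camera pan.
    # HYPOTHETICAL stand-in: the paper's exact transforms are not given here.
    h, w = image.shape[:2]
    ch, cw = int(h / zoom), int(w / zoom)    # crop size after zoom-in
    frames = []
    for t in range(num_frames):
        alpha = t / max(num_frames - 1, 1)   # interpolate 0 -> 1 across the clip
        top = int(alpha * (h - ch))          # pan the crop window downward
        left = int(alpha * (w - cw))         # and to the right
        frames.append(image[top:top + ch, left:left + cw])
    return frames

# Usage: turn one captioned image into an 8-frame pseudo-video clip,
# which can then be paired with the image's caption for pretraining.
img = (np.random.rand(224, 224, 3) * 255).astype(np.uint8)
clip_frames = animate_image(img)
assert len(clip_frames) == 8

Second, a toy sketch of the coarse-to-fine retrieval pipeline, assuming precomputed dual-encoder embeddings; coarse_to_fine_retrieval and fine_scorer are hypothetical names, and the dot-product re-ranker below merely stands in for the learned cross-modality interaction module.

import numpy as np

def coarse_to_fine_retrieval(text_emb, video_embs, fine_scorer, k=10):
    # Stage 1: cheap representation-based matching over the whole corpus
    # via cosine similarity between normalized dual-encoder embeddings.
    t = text_emb / np.linalg.norm(text_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    coarse = v @ t                            # cosine similarity per video
    candidates = np.argsort(-coarse)[:k]      # top-k candidate shortlist
    # Stage 2: expensive interaction-based matching on the shortlist only,
    # so total cost is O(corpus) dot products plus O(k) fine scores.
    fine = np.array([fine_scorer(text_emb, video_embs[i]) for i in candidates])
    return candidates[np.argsort(-fine)]      # candidates re-ranked by fine score

# Toy usage with a dot-product stand-in for the learned re-ranker.
rng = np.random.default_rng(0)
text, videos = rng.normal(size=512), rng.normal(size=(1000, 512))
ranked = coarse_to_fine_retrieval(text, videos, lambda a, b: float(a @ b), k=10)

The design point is that the costly interaction module runs on only k shortlisted candidates per query, so overall retrieval cost stays close to that of pure representation-based matching.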
Liu, Y., Chen, H., Huang, L., Chen, D., Wang, B., Pan, P., & Wang, L. (2022). Animating Images to Transfer CLIP for Video-Text Retrieval. In SIGIR 2022 - Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1906–1911). Association for Computing Machinery, Inc. https://doi.org/10.1145/3477495.3531776