VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding


Abstract

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation, reveal state-of-the-art performance, surpassing prior work and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT.
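To illustrate the kind of objective the abstract describes, below is a minimal sketch of a symmetric video-text contrastive (InfoNCE) loss over a batch of paired clip and caption embeddings. The function name, temperature value, and toy shapes are illustrative assumptions rather than the paper's exact implementation, and the hard negatives retrieved via nearest-neighbor clustering are omitted; see the MMPT repository linked above for the official code.

    # Sketch of a symmetric video-text contrastive (InfoNCE) objective.
    # Assumptions: row i of video_emb and text_emb form a temporally
    # overlapping positive pair; other rows in the batch serve as negatives.
    import torch
    import torch.nn.functional as F

    def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
        """video_emb, text_emb: (batch, dim) embeddings from the video/text transformer."""
        video_emb = F.normalize(video_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)

        logits = video_emb @ text_emb.t() / temperature          # (batch, batch) similarities
        targets = torch.arange(logits.size(0), device=logits.device)

        loss_v2t = F.cross_entropy(logits, targets)              # video -> text direction
        loss_t2v = F.cross_entropy(logits.t(), targets)          # text -> video direction
        return 0.5 * (loss_v2t + loss_t2v)

    if __name__ == "__main__":
        # Toy usage with random embeddings standing in for transformer outputs.
        v = torch.randn(8, 512)
        t = torch.randn(8, 512)
        print(video_text_contrastive_loss(v, t).item())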

Citation (APA)

Xu, H., Ghosh, G., Huang, P. Y., Okhonko, D., Aghajanyan, A., Metze, F., … Feichtenhofer, C. (2021). VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021) (pp. 6787–6800). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.544
