LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling

Abstract

Recent large-scale video-language pre-trained models have shown appealing performance on various downstream tasks. However, the pre-training process is computationally expensive, since it requires millions of video-text pairs and each video carries a highly redundant frame structure. To mitigate these problems, we propose LiteVL, which adapts a pre-trained image-language model, BLIP, into a video-text model directly on downstream tasks, without heavy pre-training. To compensate for the temporal modeling that the image-language model lacks, we add temporal attention modules with dynamic temporal scaling to the image encoder of BLIP. Besides this model-wise adaptation, we also propose a non-parametric pooling mechanism that adaptively reweights the fine-grained video embeddings conditioned on the text. Experimental results on text-video retrieval and video question answering show that LiteVL outperforms previous video-language pre-trained models by a clear margin, even without any video-language pre-training.
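
To make the two adaptations concrete, the sketch below gives one plausible PyTorch reading of them: a temporal self-attention block whose output is gated by a learnable scale (a common way to realize "dynamic temporal scaling"), and a parameter-free pooling step that reweights fine-grained video token embeddings by their similarity to the text embedding. The names (TemporalAttentionBlock, text_conditioned_pool, temporal_scale) and the exact gating and temperature choices are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch, assuming a ViT-style image encoder and a pooled text embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalAttentionBlock(nn.Module):
    """Self-attention across frames, inserted alongside the pre-trained spatial
    blocks. A learnable scale initialized at zero ("dynamic temporal scaling",
    assumed gating form) lets training start from the image-only behavior."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_scale = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim) -> attend over the frame axis per patch
        b, t, p, d = x.shape
        y = x.permute(0, 2, 1, 3).reshape(b * p, t, d)
        y = self.norm(y)
        y = self.attn(y, y, y)[0]
        y = y.reshape(b, p, t, d).permute(0, 2, 1, 3)
        return x + self.temporal_scale * y  # scaled residual connection


def text_conditioned_pool(video_tokens: torch.Tensor,
                          text_embedding: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Non-parametric pooling: reweight video token embeddings by their cosine
    similarity to the text embedding, then sum. No learned parameters.

    video_tokens:   (batch, num_tokens, dim)
    text_embedding: (batch, dim)
    returns:        (batch, dim)
    """
    v = F.normalize(video_tokens, dim=-1)
    q = F.normalize(text_embedding, dim=-1)
    weights = torch.softmax((v @ q.unsqueeze(-1)).squeeze(-1) / temperature, dim=-1)
    return (weights.unsqueeze(-1) * video_tokens).sum(dim=1)
```

The zero-initialized residual gate is one standard way to inject new temporal layers into a frozen or pre-trained image backbone without disturbing it at the start of fine-tuning; the pooling step is deliberately parameter-free, so the text-conditioned reweighting adds no extra weights to train.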

Citation (APA)

Chen, D., Tao, C., Hou, L., Shang, L., Jiang, X., & Liu, Q. (2022). LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022 (pp. 7985–7997). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.emnlp-main.545
