CLIP4Caption: CLIP for Video Caption

110 citations · 70 Mendeley readers

Abstract

Video captioning is a challenging task, since it requires generating sentences that describe diverse and complex videos. Existing video captioning models lack adequate visual representation because they neglect the gap between videos and texts. To bridge this gap, we propose CLIP4Caption, a framework that improves video captioning with a CLIP-enhanced video-text matching network (VTM). The framework takes full advantage of information from both vision and language, forcing the model to learn strongly text-correlated video features for text generation. In addition, unlike most existing models that use an LSTM or GRU as the sentence decoder, we adopt a Transformer-structured decoder network to effectively learn long-range visual and language dependencies. We also introduce a novel ensemble strategy for the captioning task. Experimental results demonstrate the effectiveness of our method on two datasets: 1) on the MSR-VTT dataset, our method achieves a new state-of-the-art result with a significant gain of up to 10% in CIDEr; 2) on the private test data, our method ranks 2nd in the ACM MM 2021 Pre-training for Video Understanding Grand Challenge. Notably, our model is trained only on the MSR-VTT dataset.
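The video-text matching (VTM) idea in the abstract can be sketched as follows: per-frame CLIP features are pooled into one video embedding and matched against text embeddings with a symmetric contrastive objective, so the video features become strongly text-correlated. This is a minimal NumPy sketch under stated assumptions, not the paper's implementation: the 512-dim embeddings, mean pooling, and the 0.07 temperature are illustrative choices.

```python
# Hedged sketch of CLIP-style video-text matching (VTM).
# Assumptions (not from the paper): 512-d embeddings, mean pooling over
# frames, temperature 0.07, symmetric InfoNCE-style contrastive loss.
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def logsumexp(x, axis):
    # Numerically stable log-sum-exp reduction.
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def video_embedding(frame_feats):
    """Mean-pool per-frame features into one unit-norm video vector."""
    return l2_normalize(frame_feats.mean(axis=0))

def contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched video-text pairs.

    Row i of video_embs is the positive match for row i of text_embs;
    all other pairs in the batch act as negatives.
    """
    logits = (video_embs @ text_embs.T) / temperature      # (B, B) similarities
    idx = np.arange(len(logits))                           # diagonal = positives
    log_p_v2t = logits - logsumexp(logits, axis=1)         # video -> text
    log_p_t2v = logits.T - logsumexp(logits.T, axis=1)     # text -> video
    return -0.5 * (log_p_v2t[idx, idx].mean() + log_p_t2v[idx, idx].mean())

# Toy batch: 8 videos x 4 frames x 512-d random "CLIP" features.
rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 4, 512))
videos = np.stack([video_embedding(f) for f in frames])
texts = l2_normalize(rng.standard_normal((8, 512)))
print(float(contrastive_loss(videos, texts)))
```

With random features the loss sits near the chance level of log(batch size); training would pull matched pairs together and push mismatched pairs apart.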

References (selected)

- Long Short-Term Memory
- CIDEr: Consensus-based image description evaluation
- Temporal segment networks: Towards good practices for deep action recognition

Cited by (selected)

- CRIS: CLIP-Driven Referring Image Segmentation
- Vision-Language Pre-Training: Basics, Recent Advances, and Future Trends
- CLIP-Driven Fine-Grained Text-Image Person Re-Identification

Citation (APA)

Tang, M., Wang, Z., Liu, Z., Rao, F., Li, D., & Li, X. (2021). CLIP4Caption: CLIP for Video Caption. In MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia (pp. 4858–4862). Association for Computing Machinery. https://doi.org/10.1145/3474085.3479207

Readers' Seniority

- PhD / Postgrad / Masters / Doc: 14 (58%)
- Researcher: 9 (38%)
- Lecturer / Postdoc: 1 (4%)

Readers' Discipline

- Computer Science: 26 (87%)
- Engineering: 2 (7%)
- Agricultural and Biological Sciences: 1 (3%)
- Psychology: 1 (3%)
