CLIP4Caption: CLIP for Video Caption

Mingkang Tang; Zhanyu Wang; Zhenhua Liu; Fengyun Rao; Dian Li; Xiu Li

Conference Proceedings, Open Access

MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia (2021) 4858-4862

DOI: 10.1145/3474085.3479207

110 Citations

70 Readers

Abstract

Video captioning is a challenging task because it requires generating sentences that describe diverse and complex videos. Existing video captioning models lack adequate visual representations because they neglect the gap between videos and texts. To bridge this gap, we propose CLIP4Caption, a framework that improves video captioning based on a CLIP-enhanced video-text matching network (VTM). The framework takes full advantage of information from both vision and language and forces the model to learn strongly text-correlated video features for text generation. In addition, unlike most existing models, which use an LSTM or GRU as the sentence decoder, we adopt a Transformer-structured decoder network to effectively learn long-range visual and language dependencies. We also introduce a novel ensemble strategy for captioning tasks. Experimental results demonstrate the effectiveness of our method on two datasets: 1) on the MSR-VTT dataset, our method achieves a new state-of-the-art result with a significant gain of up to 10% in CIDEr; 2) on the private test data, our method ranks 2nd in the ACM MM Multimedia Grand Challenge 2021: Pre-training for Video Understanding Challenge. Note that our model is trained only on the MSR-VTT dataset.
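The abstract does not spell out how the video-text matching network is trained, so the following is only a minimal sketch of one common way to score video-caption pairs: mean-pool per-frame embeddings into a video embedding, compare it with caption embeddings by cosine similarity, and apply a symmetric contrastive loss in the style of CLIP. The function names, the toy 2-dimensional embeddings, the mean pooling, and the temperature of 0.07 are all assumptions made for illustration, not details taken from the paper.

```python
import math

def mean_pool(frames):
    """Average per-frame embeddings into a single video embedding (assumed pooling)."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_matrix(video_embs, text_embs):
    """Cosine similarity between every video (rows) and every caption (columns)."""
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    return [[sum(a * b for a, b in zip(vi, tj)) for tj in t] for vi in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def matching_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: video-to-text plus text-to-video, averaged."""
    sim = similarity_matrix(video_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for text-to-video
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy data (hypothetical): two videos of two frames each, and two captions.
video_a = mean_pool([[1.0, 0.0], [0.8, 0.2]])
video_b = mean_pool([[0.0, 1.0], [0.2, 0.8]])
captions = [[1.0, 0.1], [0.1, 1.0]]

aligned = matching_loss([video_a, video_b], captions)   # correct pairing
swapped = matching_loss([video_b, video_a], captions)   # mismatched pairing
print(aligned < swapped)
```

In this toy setup the correctly paired batch yields a lower loss than the swapped one, which is the signal that would push video features toward their matching captions during training.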

Citation (APA)

Tang, M., Wang, Z., Liu, Z., Rao, F., Li, D., & Li, X. (2021). CLIP4Caption: CLIP for Video Caption. In MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia (pp. 4858–4862). Association for Computing Machinery, Inc. https://doi.org/10.1145/3474085.3479207

Readers' Seniority

PhD / Post grad / Masters / Doc: 14 (58%)
Researcher: 9 (38%)
Lecturer / Post doc: 1

Readers' Discipline

Computer Science: 26 (87%)
Engineering: 2
Agricultural and Biological Sciences: 1
Psychology: 1
