Video captioning based on vision transformer and reinforcement learning

Abstract

Global encoding of visual features is important for improving the accuracy of video captioning. In this paper, we propose a video captioning method that combines a Vision Transformer (ViT) with reinforcement learning. First, ResNet-152 and ResNeXt-101 are used to extract features from videos. Second, the encoder blocks of the ViT network encode the video features. Third, the encoded features are fed into a Long Short-Term Memory (LSTM) network to generate the description of the video content. Finally, the accuracy of the description is further improved by fine-tuning with reinforcement learning. We conducted experiments on MSR-VTT, a benchmark dataset for video captioning. The results show that, compared with current mainstream methods, our model improves by 2.9%, 1.4%, 0.9% and 4.8% on the four evaluation metrics BLEU-4, METEOR, ROUGE-L and CIDEr-D, respectively.
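The pipeline described above (pre-extracted CNN features, ViT-style encoder blocks, an LSTM decoder) can be sketched roughly as follows. This is a minimal PyTorch sketch under stated assumptions, not the authors' implementation: the layer sizes, the concatenation of appearance and motion features, the mean-pooled global context, and the single-layer LSTM decoder are all assumptions, and the reinforcement-learning fine-tuning stage (typically self-critical sequence training with CIDEr-D as the reward) is omitted.

import torch
import torch.nn as nn

class ViTEncoderLSTMCaptioner(nn.Module):
    # Hypothetical module names and hyperparameters; feat_dim assumes concatenated
    # ResNet-152 (2048-d) and ResNeXt-101 (2048-d) frame features.
    def __init__(self, feat_dim=4096, d_model=512, n_heads=8,
                 n_layers=4, vocab_size=10000):
        super().__init__()
        # Project the concatenated CNN features to the model dimension.
        self.proj = nn.Linear(feat_dim, d_model)
        # ViT-style encoder blocks: multi-head self-attention + feed-forward.
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                               dim_feedforward=2048,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        # LSTM decoder that generates the caption word by word.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lstm = nn.LSTM(d_model * 2, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T, feat_dim) pre-extracted video features.
        # captions:    (B, L) token ids for teacher-forced training.
        memory = self.encoder(self.proj(frame_feats))      # (B, T, d_model)
        video_ctx = memory.mean(dim=1, keepdim=True)       # global visual context
        emb = self.embed(captions)                         # (B, L, d_model)
        ctx = video_ctx.expand(-1, emb.size(1), -1)        # repeat context per step
        h, _ = self.lstm(torch.cat([emb, ctx], dim=-1))    # (B, L, d_model)
        return self.out(h)                                 # per-step word logits

In such a setup, cross-entropy training with teacher forcing would precede the reinforcement-learning stage, which then optimizes a sentence-level reward on sampled captions.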

Citation (APA)

Zhao, H., Chen, Z., Guo, L., & Han, Z. (2022). Video captioning based on vision transformer and reinforcement learning. PeerJ Computer Science, 8. https://doi.org/10.7717/PEERJ-CS.916
