Abstract
Global encoding of visual features is important for improving description accuracy in video captioning. In this paper, we propose a video captioning method that combines a Vision Transformer (ViT) with reinforcement learning. First, ResNet-152 and ResNeXt-101 are used to extract features from the videos. Second, the encoder block of the ViT network is applied to encode the video features. Third, the encoded features are fed into a Long Short-Term Memory (LSTM) network to generate a description of the video content. Finally, the accuracy of the description is further improved by fine-tuning with reinforcement learning. We conducted experiments on MSR-VTT, a benchmark dataset for video captioning. The results show that, compared with current mainstream methods, the proposed model improves by 2.9%, 1.4%, 0.9%, and 4.8% on the four evaluation metrics BLEU-4, METEOR, ROUGE-L, and CIDEr-D, respectively.
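To make the described pipeline concrete, the following is a minimal PyTorch sketch of the encoder-decoder stages summarized above, assuming concatenated ResNet-152 and ResNeXt-101 frame features, a standard Transformer encoder layer standing in for the paper's ViT encoding block, and a single-layer LSTM decoder. The layer sizes, module names, and context-fusion scheme are illustrative assumptions, not the authors' exact configuration.

    # Hedged sketch: ViT-style encoder over CNN frame features + LSTM caption decoder.
    # Dimensions and fusion are assumptions for illustration only.
    import torch
    import torch.nn as nn

    class VideoCaptioner(nn.Module):
        def __init__(self, feat_dim=4096, d_model=512, vocab_size=10000,
                     nhead=8, num_enc_layers=4):
            super().__init__()
            # Project concatenated ResNet-152 + ResNeXt-101 features to d_model.
            self.proj = nn.Linear(feat_dim, d_model)
            # Transformer encoder block (the ViT-style encoding of video features).
            enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                                   batch_first=True)
            self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_enc_layers)
            # LSTM decoder that generates the caption word by word.
            self.embed = nn.Embedding(vocab_size, d_model)
            self.lstm = nn.LSTM(input_size=2 * d_model, hidden_size=d_model,
                                batch_first=True)
            self.out = nn.Linear(d_model, vocab_size)

        def forward(self, frame_feats, captions):
            # frame_feats: (B, T, feat_dim) per-frame CNN features
            # captions:    (B, L) token ids, teacher forcing during training
            enc = self.encoder(self.proj(frame_feats))    # (B, T, d_model)
            ctx = enc.mean(dim=1, keepdim=True)           # global video context
            emb = self.embed(captions)                    # (B, L, d_model)
            ctx = ctx.expand(-1, emb.size(1), -1)         # repeat context per step
            dec_in = torch.cat([emb, ctx], dim=-1)        # (B, L, 2*d_model)
            hidden, _ = self.lstm(dec_in)
            return self.out(hidden)                       # (B, L, vocab_size)

For the reinforcement-learning fine-tuning stage, a common choice in captioning is self-critical sequence training with a sentence-level reward such as CIDEr-D, although the abstract does not specify the paper's exact reward setup.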
Citation
Zhao, H., Chen, Z., Guo, L., & Han, Z. (2022). Video captioning based on vision transformer and reinforcement learning. PeerJ Computer Science, 8. https://doi.org/10.7717/PEERJ-CS.916