Research on feature extraction and multimodal fusion of video caption based on deep learning

Abstract

Video captioning describes the objects, their attributes, and their relationships in a video in natural language, and has long been a challenging research topic in computer vision and multimedia. In this paper, deep learning methods are used to extract video frame features, motion information, and video sequence features. Two multimodal fusion methods are studied, feature cascading and weighted model averaging, and the evaluation of the generated captions is also examined. The experimental results show that the weighted-average fusion model scores higher on every evaluation metric than the feature-cascade method. The feature extraction and multimodal fusion methods in this paper are of practical value for video captioning applications.
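The two fusion strategies named in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature dimensions, vocabulary size, and fusion weights are assumptions chosen for the example. Feature cascading concatenates the per-modality features into one vector before decoding, while weighted model averaging combines the output distributions of modality-specific models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality features for one video; the names and
# dimensions are illustrative assumptions, not taken from the paper.
frame_feat = rng.random(2048)   # appearance feature of sampled frames
motion_feat = rng.random(1024)  # motion feature (e.g. from a 3-D CNN)
seq_feat = rng.random(512)      # temporal feature from a sequence encoder

# Feature cascade (early fusion): concatenate the modality features into
# one vector that a single caption decoder would consume.
cascaded = np.concatenate([frame_feat, motion_feat, seq_feat])
print(cascaded.shape)  # (3584,)

# Model weighted-average fusion (late fusion): each modality-specific
# model emits a probability distribution over the vocabulary for the next
# word; the distributions are combined with scalar weights summing to 1.
vocab_size = 1000

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

p_frame = softmax(rng.standard_normal(vocab_size))
p_motion = softmax(rng.standard_normal(vocab_size))
p_seq = softmax(rng.standard_normal(vocab_size))

weights = np.array([0.5, 0.3, 0.2])  # assumed weights for illustration
fused = weights[0] * p_frame + weights[1] * p_motion + weights[2] * p_seq
# fused is still a valid probability distribution (sums to 1)
```

The abstract reports that the late-fusion (weighted average) variant outperformed cascading on every metric; the weights here are placeholders rather than the tuned values.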

Citation (APA)

Chen, H., Li, H., & Wu, X. (2020). Research on feature extraction and multimodal fusion of video caption based on deep learning. In ACM International Conference Proceeding Series (pp. 73–76). Association for Computing Machinery. https://doi.org/10.1145/3380625.3380669
