Open-ended long-form video question answering is challenging problem in visual information retrieval, which automatically generates the natural language answer from the referenced long-form video content according to the question. However, the existing video question answering works mainly focus on the short-form video question answering, due to the lack of modeling the semantic representation of long-form video contents. In this paper, we consider the problem of long-form video question answering from the viewpoint of adaptive hierarchical reinforced encoder-decoder network learning. We propose the adaptive hierarchical encoder network to learn the joint representation of the longform video contents according to the question with adaptive video segmentation. we then develop the reinforced decoder network to generate the natural language answer for open-ended video question answering. We construct a large-scale long-form video question answering dataset. The extensive experiments show the effectiveness of our method.
CITATION STYLE
Zhao, Z., Zhang, Z., Xiao, S., Yu, Z., Yu, J., Cai, D., … Zhuang, Y. (2018). Open-ended long-form video question answering via adaptive hierarchical reinforced networks. In IJCAI International Joint Conference on Artificial Intelligence (Vol. 2018-July, pp. 3683–3689). International Joint Conferences on Artificial Intelligence. https://doi.org/10.24963/ijcai.2018/512
Mendeley helps you to discover research relevant for your work.