Video question answering via hierarchical spatio-temporal attention networks

93Citations
Citations of this article
63Readers
Mendeley users who have this article in their library.

Abstract

Open-ended video question answering is a challenging problem in visual information retrieval, which automatically generates the natural language answer from the referenced video content according to the question. However, the existing visual question answering works only focus on the static image, which may be ineffectively applied to video question answering due to the lack of modeling the temporal dynamics of video contents. In this paper, we consider the problem of open-ended video question answering from the viewpoint of spatio-temporal attentional encoder-decoder learning framework. We propose the hierarchical spatio-temporal attention network for learning the joint representation of the dynamic video contents according to the given question. We then develop the spatio-temporal attentional encoder-decoder learning method with multi-step reasoning process for open-ended video question answering. We construct a large-scale video question answering dataset. The extensive experiments show the effectiveness of our method.

Cite

CITATION STYLE

APA

Zhao, Z., Yang, Q., Cai, D., He, X., & Zhuang, Y. (2017). Video question answering via hierarchical spatio-temporal attention networks. In IJCAI International Joint Conference on Artificial Intelligence (Vol. 0, pp. 3518–3524). International Joint Conferences on Artificial Intelligence. https://doi.org/10.24963/ijcai.2017/492

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free