Video Question Answering (Video QA) is an important and challenging problem in multimedia and computer vision research. In this paper, we propose a novel framework called initialized frame attention networks (IFAN). The framework uses long short-term memory (LSTM) networks to encode the visual information of a video and then initializes the language model with the encoded features. The answer is predicted from the combined visual and semantic features. In particular, IFAN integrates a temporal attention mechanism that focuses on the salient frames of the video, that is, the frames associated with the question. To verify the effectiveness of the proposed framework, we conduct experiments on the TACoS dataset, where IFAN performs well on both the easy and the hard levels.
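The temporal attention step can be illustrated with a minimal sketch. The abstract does not specify how frames are scored against the question, so the dot-product scoring below is an assumption, and the function name `temporal_attention` and the toy vectors are hypothetical; this is not the authors' implementation. It only shows the general idea: score each frame against a question vector, normalize the scores with a softmax over time, and form a weighted sum of frame features.

```python
import math

def temporal_attention(frame_feats, query):
    """Weight frames by their relevance to a question vector.

    frame_feats: list of T frame feature vectors (lists of floats)
    query: question vector with the same dimension as each frame
    Returns (weights, context): one weight per frame, and the
    attention-weighted sum of the frame features.
    """
    # Relevance score per frame. Dot-product scoring is an assumption;
    # the abstract does not state the scoring function.
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frame_feats]

    # Numerically stable softmax over the time axis.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of frame features, dimension by dimension.
    dim = len(frame_feats[0])
    context = [
        sum(w * frame[d] for w, frame in zip(weights, frame_feats))
        for d in range(dim)
    ]
    return weights, context

# Toy input: three frames with 2-dimensional features (hypothetical values).
frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
question = [2.0, 0.0]
weights, context = temporal_attention(frames, question)
```

In this toy input the first frame is most aligned with the question vector, so it receives the largest weight. In a full model the frame features would come from the LSTM video encoder and the query from the language model state.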
Gao, K., Zhu, X., & Han, Y. (2018). Initialized frame attention networks for video question answering. In Communications in Computer and Information Science (Vol. 819, pp. 349–359). Springer Verlag. https://doi.org/10.1007/978-981-10-8530-7_34