Pairwise VLAD Interaction Network for Video Question Answering


Abstract

Video Question Answering (VideoQA) is a challenging problem, as it requires a joint understanding of the video and the natural language question. Existing methods that perform correlation learning between video and question have achieved great success. However, previous methods merely model relations between individual video frames (or clips) and words, which is not enough to correctly answer the question. From a human's perspective, answering a video question should first summarize both the visual and language information, and then explore their correlations for answer reasoning. In this paper, we propose a new method called Pairwise VLAD Interaction Network (PVI-Net) to address this problem. Specifically, we develop a learnable clustering-based VLAD encoder that summarizes the video and question modalities into a small number of compact VLAD descriptors. For correlation learning, a pairwise VLAD interaction mechanism is proposed to better exploit complementary information for each pair of modality descriptors, avoiding the modeling of uninformative individual relations (e.g., frame-word and clip-word relations) and exploring both inter- and intra-modality relations simultaneously. Experimental results show that our approach achieves state-of-the-art performance on three VideoQA datasets: TGIF-QA, MSVD-QA, and MSRVTT-QA.
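The abstract's "learnable clustering-based VLAD encoder" aggregates a variable-length set of frame (or word) features into a fixed, small set of cluster-residual descriptors. The sketch below is a minimal NetVLAD-style soft-assignment version in numpy, not the paper's exact formulation: the `alpha` sharpness parameter, the distance-based assignment, and the feature/cluster sizes are illustrative assumptions (in the paper the centers and assignment would be learned end-to-end).

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax along the given axis
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def vlad_encode(features, centers, alpha=10.0):
    """Soft-assignment VLAD encoding (NetVLAD-style sketch).

    features: (N, D) frame or word features
    centers:  (K, D) cluster centers (learnable in a trained model)
    returns:  (K, D) residual descriptors, one compact descriptor per cluster
    """
    # soft assignment of each feature to each center, from
    # negative squared distances (sharpness controlled by alpha)
    d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, K)
    assign = softmax(-alpha * d2, axis=1)                             # (N, K)
    # aggregate assignment-weighted residuals per cluster
    resid = features[:, None, :] - centers[None, :, :]                # (N, K, D)
    vlad = (assign[:, :, None] * resid).sum(axis=0)                   # (K, D)
    # intra-normalize each descriptor, as is common in VLAD pipelines
    return vlad / (np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-8)

rng = np.random.default_rng(0)
frames = rng.normal(size=(16, 8))    # e.g., 16 frame features of dim 8
centers = rng.normal(size=(4, 8))    # K = 4 compact descriptors
desc = vlad_encode(frames, centers)  # (4, 8): fixed-size summary of the video
```

The same encoder applied to word features yields a matching small set of question descriptors, so the pairwise interaction can operate over K x K descriptor pairs instead of every frame-word pair.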

Citation (APA)

Wang, H., Guo, D., Hua, X. S., & Wang, M. (2021). Pairwise VLAD Interaction Network for Video Question Answering. In MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia (pp. 5119–5127). Association for Computing Machinery, Inc. https://doi.org/10.1145/3474085.3475620
