Abstract
Video Question Answering (VideoQA) is a challenging problem, as it requires a joint understanding of the video and the natural language question. Existing methods that perform correlation learning between video and question have achieved great success. However, previous methods merely model relations between individual video frames (or clips) and words, which is not enough to correctly answer the question. From a human's perspective, answering a video question means first summarizing both the visual and the language information, and then exploring their correlations for answer reasoning. In this paper, we propose a new method, the Pairwise VLAD Interaction Network (PVI-Net), to address this problem. Specifically, we develop a learnable clustering-based VLAD encoder that summarizes each of the video and question modalities into a small number of compact VLAD descriptors. For correlation learning, a pairwise VLAD interaction mechanism is proposed to better exploit complementary information for each pair of modality descriptors, avoiding the modeling of uninformative individual relations (e.g., frame-word and clip-word relations) while exploring both inter- and intra-modality relations simultaneously. Experimental results show that our approach achieves state-of-the-art performance on three VideoQA datasets: TGIF-QA, MSVD-QA, and MSRVTT-QA.
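To make the two components concrete, below is a minimal PyTorch sketch, not the authors' released code, of a NetVLAD-style learnable clustering encoder and a descriptor-level interaction step in the spirit the abstract describes. All names and hyperparameters (dim, num_clusters, num_heads) are illustrative assumptions, and the interaction step is one plausible reading of "pairwise VLAD interaction", not the paper's exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VLADEncoder(nn.Module):
    """Summarizes a feature sequence (frames/clips or words) into a
    small set of compact VLAD descriptors via learnable soft clustering."""
    def __init__(self, dim: int, num_clusters: int):
        super().__init__()
        self.assign = nn.Linear(dim, num_clusters)        # soft-assignment logits
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) frame/clip or word features
        a = F.softmax(self.assign(x), dim=-1)             # (B, T, K) soft assignments
        residuals = x.unsqueeze(2) - self.centroids       # (B, T, K, D) residuals to centroids
        vlad = (a.unsqueeze(-1) * residuals).sum(dim=1)   # (B, K, D): one descriptor per cluster
        return F.normalize(vlad, dim=-1)

def pairwise_interaction(v_vlad, q_vlad, attn: nn.MultiheadAttention):
    # Concatenate the descriptors of both modalities and let attention
    # relate every descriptor pair, so inter- and intra-modality
    # relations are modeled in a single pass.
    z = torch.cat([v_vlad, q_vlad], dim=1)                # (B, Kv + Kq, D)
    out, _ = attn(z, z, z)                                # descriptor-level relations
    return out

# Usage sketch: summarize each modality, then let descriptors interact.
encoder = VLADEncoder(dim=512, num_clusters=8)
frames = torch.randn(2, 60, 512)                          # e.g., 60 frame features
words = torch.randn(2, 20, 512)                           # e.g., 20 word features
attn = nn.MultiheadAttention(512, num_heads=8, batch_first=True)
fused = pairwise_interaction(encoder(frames), encoder(words), attn)

Because correlations are computed over a handful of cluster-level descriptors rather than all frame-word pairs, the interaction stays compact regardless of video length.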
Citation
Wang, H., Guo, D., Hua, X. S., & Wang, M. (2021). Pairwise VLAD Interaction Network for Video Question Answering. In MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia (pp. 5119–5127). Association for Computing Machinery, Inc. https://doi.org/10.1145/3474085.3475620