Abstract
Video Question Answering (VideoQA) is a challenging problem, as it requires a joint understanding of the video and the natural language question. Existing methods that perform correlation learning between video and question have achieved great success. However, previous methods merely model relations between individual video frames (or clips) and words, which is not enough to correctly answer the question. From a human's perspective, answering a video question means first summarizing both the visual and the language information, and then exploring their correlations for answer reasoning. In this paper, we propose a new method, the Pairwise VLAD Interaction Network (PVI-Net), to address this problem. Specifically, we develop a learnable clustering-based VLAD encoder that summarizes each of the video and question modalities into a small number of compact VLAD descriptors. For correlation learning, a pairwise VLAD interaction mechanism is proposed to better exploit complementary information for each pair of modality descriptors, avoiding the modeling of uninformative individual relations (e.g., frame-word and clip-word relations) while exploring both inter- and intra-modality relations simultaneously. Experimental results show that our approach achieves state-of-the-art performance on three VideoQA datasets: TGIF-QA, MSVD-QA, and MSRVTT-QA.
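To make the two components concrete, below is a minimal PyTorch sketch, not the authors' released code, of a NetVLAD-style learnable clustering encoder and a descriptor-level interaction step in the spirit the abstract describes. All names and hyperparameters (dim, num_clusters, num_heads) are illustrative assumptions, and the interaction step is one plausible reading of "pairwise VLAD interaction", not the paper's exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VLADEncoder(nn.Module):
    """Summarizes a feature sequence (frames/clips or words) into a
    small set of compact VLAD descriptors via learnable soft clustering."""
    def __init__(self, dim: int, num_clusters: int):
        super().__init__()
        self.assign = nn.Linear(dim, num_clusters)        # soft-assignment logits
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) frame/clip or word features
        a = F.softmax(self.assign(x), dim=-1)             # (B, T, K) soft assignments
        residuals = x.unsqueeze(2) - self.centroids       # (B, T, K, D) residuals to centroids
        vlad = (a.unsqueeze(-1) * residuals).sum(dim=1)   # (B, K, D): one descriptor per cluster
        return F.normalize(vlad, dim=-1)

def pairwise_interaction(v_vlad, q_vlad, attn: nn.MultiheadAttention):
    # Concatenate the descriptors of both modalities and let attention
    # relate every descriptor pair, so inter- and intra-modality
    # relations are modeled in a single pass.
    z = torch.cat([v_vlad, q_vlad], dim=1)                # (B, Kv + Kq, D)
    out, _ = attn(z, z, z)                                # descriptor-level relations
    return out

# Usage sketch: summarize each modality, then let descriptors interact.
encoder = VLADEncoder(dim=512, num_clusters=8)
frames = torch.randn(2, 60, 512)                          # e.g., 60 frame features
words = torch.randn(2, 20, 512)                           # e.g., 20 word features
attn = nn.MultiheadAttention(512, num_heads=8, batch_first=True)
fused = pairwise_interaction(encoder(frames), encoder(words), attn)

Because correlations are computed over a handful of cluster-level descriptors rather than all frame-word pairs, the interaction stays compact regardless of video length.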
Citation
Wang, H., Guo, D., Hua, X. S., & Wang, M. (2021). Pairwise VLAD Interaction Network for Video Question Answering. In MM 2021 - Proceedings of the 29th ACM International Conference on Multimedia (pp. 5119–5127). Association for Computing Machinery, Inc. https://doi.org/10.1145/3474085.3475620