Images are more than a collection of objects or attributes: they represent a web of relationships among interconnected objects. Scene graphs have emerged as a new modality offering a structured graphical representation of images, encoding objects as nodes connected by pairwise relations as edges. To support question answering over scene graphs, we propose GraphVQA, a language-guided graph neural network framework that translates and executes a natural language question as multiple iterations of message passing among graph nodes. We explore the design space of the GraphVQA framework and discuss the trade-offs of different design choices. Our experiments on the GQA dataset show that GraphVQA outperforms the state-of-the-art model by a large margin (94.78% vs. 88.43%). Our code is available at https://github.com/codexxxl/GraphVQA.
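The core idea of executing a question as iterated, language-guided message passing can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names, the sigmoid gating scheme, and the use of one instruction vector per message-passing round are all assumptions made for clarity.

```python
# Minimal sketch (NOT the paper's implementation) of language-guided
# message passing over a scene graph. The gating scheme and all names
# here are illustrative assumptions.
import numpy as np


def message_passing_step(node_feats, edges, instruction):
    """One round of message passing: each node accumulates neighbor
    features, gated by the source node's relevance to the instruction."""
    new_feats = node_feats.copy()
    for src, dst in edges:
        # Gate the message by a dot-product relevance score between the
        # source node and the current instruction vector (assumption).
        gate = 1.0 / (1.0 + np.exp(-node_feats[src] @ instruction))
        new_feats[dst] += gate * node_feats[src]
    return new_feats


def graph_vqa_sketch(node_feats, edges, instructions):
    """Translate the question into a sequence of instruction vectors and
    execute them as successive message-passing rounds over the graph."""
    h = node_feats
    for ins in instructions:
        h = message_passing_step(h, edges, ins)
    return h
```

In this sketch, each instruction vector (derived from the question) controls one round of propagation, so multi-hop questions are answered by chaining several rounds.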
Liang, W., Jiang, Y., & Liu, Z. (2021). GraphVQA: Language-Guided Graph Neural Networks for Scene Graph Question Answering. In Multimodal Artificial Intelligence, MAI Workshop 2021 - Proceedings of the 3rd Workshop (pp. 79–86). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.maiworkshop-1.12