Abstract
Attention mechanisms are widely used in Visual Question Answering (VQA) to search for visual clues related to the question. Most approaches train attention models from a coarse-grained association between sentences and images, which tends to fail on small objects or uncommon concepts. To address this problem, this paper proposes a multi-grained attention method. It learns explicit word-object correspondence through two types of word-level attention that complement the sentence-image association. Evaluated on the VQA benchmark, the multi-grained attention model achieves performance competitive with state-of-the-art models. Visualized attention maps further demonstrate that adding object-level groundings leads to a better understanding of the images and locates the attended objects more precisely.
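The multi-grained idea in the abstract can be illustrated with a minimal sketch: blend a coarse sentence-level attention distribution over detected object features with a fine word-level attention derived from word-object scores. The fusion rule, pooling choice, and `alpha` weight below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_grained_attention(obj_feats, sent_emb, word_embs, alpha=0.5):
    """Hypothetical sketch of multi-grained attention.

    obj_feats: (n_objects, d) object region features
    sent_emb:  (d,) whole-question embedding
    word_embs: (n_words, d) per-word embeddings
    Returns a normalized attention distribution over objects.
    """
    # Coarse grain: one attention map from the whole question.
    sent_att = softmax(obj_feats @ sent_emb)            # (n_objects,)
    # Fine grain: word-object scores, max-pooled over words so each
    # object keeps its strongest word-level grounding.
    word_scores = obj_feats @ word_embs.T               # (n_objects, n_words)
    word_att = softmax(word_scores.max(axis=1))         # (n_objects,)
    # Fuse the two granularities into a single attention map.
    att = alpha * sent_att + (1 - alpha) * word_att
    return att / att.sum()
```

A toy usage: with three object features, a question embedding, and two word embeddings, the function returns a length-3 distribution summing to one, with more mass on objects that match either the sentence or an individual word.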
Huang, P., Huang, J., Guo, Y., Qiao, M., & Zhu, Y. (2020). Multi-grained attention with object-level grounding for visual question answering. In ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (pp. 3595–3600). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/p19-1349