Abstract
Attention mechanisms are widely used in Visual Question Answering (VQA) to search for visual clues related to the question. Most approaches train attention models from a coarse-grained association between sentences and images, which tends to fail on small objects or uncommon concepts. To address this problem, this paper proposes a multi-grained attention method. It learns explicit word-object correspondence through two types of word-level attention that complement the sentence-image association. Evaluated on the VQA benchmark, the multi-grained attention model achieves performance competitive with state-of-the-art models. Visualized attention maps further demonstrate that adding object-level groundings leads to a better understanding of the images and locates the attended objects more precisely.
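The multi-grained idea in the abstract can be illustrated with a minimal sketch: blend a coarse sentence-level attention distribution over detected object features with a fine word-level attention derived from word-object scores. The fusion rule, pooling choice, and `alpha` weight below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_grained_attention(obj_feats, sent_emb, word_embs, alpha=0.5):
    """Hypothetical sketch of multi-grained attention.

    obj_feats: (n_objects, d) object region features
    sent_emb:  (d,) whole-question embedding
    word_embs: (n_words, d) per-word embeddings
    Returns a normalized attention distribution over objects.
    """
    # Coarse grain: one attention map from the whole question.
    sent_att = softmax(obj_feats @ sent_emb)            # (n_objects,)
    # Fine grain: word-object scores, max-pooled over words so each
    # object keeps its strongest word-level grounding.
    word_scores = obj_feats @ word_embs.T               # (n_objects, n_words)
    word_att = softmax(word_scores.max(axis=1))         # (n_objects,)
    # Fuse the two granularities into a single attention map.
    att = alpha * sent_att + (1 - alpha) * word_att
    return att / att.sum()
```

A toy usage: with three object features, a question embedding, and two word embeddings, the function returns a length-3 distribution summing to one, with more mass on objects that match either the sentence or an individual word.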
Huang, P., Huang, J., Guo, Y., Qiao, M., & Zhu, Y. (2020). Multi-grained attention with object-level grounding for visual question answering. In ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (pp. 3595–3600). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/p19-1349