Weakly Supervised Grounding for VQA in Vision-Language Transformers

Aisha Urooj Khan; Hilde Kuehne; Chuang Gan; Niels Da Vitoria Lobo; Mubarak Shah

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2022) 13695 LNCS 652-670

DOI: 10.1007/978-3-031-19833-5_38

Citations: 2

Readers: 30

Abstract

Transformers for visual-language representation learning have attracted considerable interest and shown strong performance on visual question answering (VQA) and grounding. However, most systems that perform well on those tasks still rely on pre-trained object detectors during training, which limits their applicability to the object classes available for those detectors. To mitigate this limitation, this paper focuses on the problem of weakly supervised grounding in the context of visual question answering in transformers. Our approach leverages capsules by transforming each visual token into a capsule representation in the visual encoder; it then uses activations from language self-attention layers as a text-guided selection module to mask those capsules before they are forwarded to the next layer. We evaluate our approach on the challenging GQA and VQA-HAT datasets for VQA grounding. Our experiments show that, while removing the information of masked objects from standard transformer architectures leads to a significant drop in performance, the integration of capsules significantly improves the grounding ability of such systems and provides new state-of-the-art results compared to other approaches in the field. (Code is available at https://github.com/aurooj/WSG-VQA-VLTransformers)
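
The capsule masking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it only shows the general idea of splitting each visual token into capsules and zeroing out the capsules that a text-derived score does not select. The function names, the equal-size split of a token into capsules, the hand-supplied `text_scores`, and the top-`keep` selection rule are all assumptions made for illustration. In the paper, the capsule representation is learned and the selection signal comes from language self-attention activations.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def to_capsules(token, num_capsules):
    """Split one flat visual token into equal-sized capsule vectors.

    Assumption: a plain reshape stands in for the learned capsule layer.
    """
    if len(token) % num_capsules != 0:
        raise ValueError("token length must be divisible by num_capsules")
    size = len(token) // num_capsules
    return [token[i * size:(i + 1) * size] for i in range(num_capsules)]


def text_guided_mask(text_scores, keep):
    """Binary mask over capsule types: 1.0 for the `keep` highest-scoring ones.

    Assumption: `text_scores` is a hypothetical stand-in for activations
    taken from the language self-attention layers.
    """
    probs = softmax(text_scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(ranked[:keep])
    return [1.0 if i in kept else 0.0 for i in range(len(probs))]


def mask_capsules(tokens, text_scores, num_capsules, keep=1):
    """Apply the same text-guided mask to the capsules of every visual token."""
    mask = text_guided_mask(text_scores, keep)
    masked = []
    for token in tokens:
        capsules = to_capsules(token, num_capsules)
        masked.append([[m * v for v in cap] for cap, m in zip(capsules, mask)])
    return masked


if __name__ == "__main__":
    tokens = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
    # The second capsule type has the higher text score, so only it survives.
    print(mask_capsules(tokens, text_scores=[0.1, 2.0], num_capsules=2, keep=1))
```

In this toy setup the masked capsules would then be passed on to the next visual encoder layer, so that later layers only see the capsule types the question text selected.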

Citation (APA)

Khan, A. U., Kuehne, H., Gan, C., Lobo, N. D. V., & Shah, M. (2022). Weakly Supervised Grounding for VQA in Vision-Language Transformers. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13695 LNCS, pp. 652–670). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-19833-5_38
