We explore how a multi-modal transformer trained to generate longer image descriptions learns syntactic and semantic representations of entities and relations grounded in objects, both at the level of masked self-attention (text generation) and cross-modal attention (information fusion). We observe that cross-attention learns the visual grounding of noun phrases into objects and high-level semantic information about spatial relations, while text-to-text attention captures low-level syntactic knowledge between words. We conclude that language models in a multi-modal task learn different semantic information about objects and relations cross-modally and uni-modally (text-only). Our code is available here: https://github.com/GU-CLASP/attention-as-grounding.
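The sketch below is a minimal, illustrative decoder layer combining the two attention types discussed in the abstract: masked text-to-text self-attention and cross-modal attention over visual object features. It is not the authors' implementation (see their repository linked above); the module names, dimensions, and layer arrangement are assumptions chosen for clarity.

```python
# Minimal sketch (illustrative only, assumed structure) of a transformer decoder
# layer with masked self-attention over text and cross-attention over objects.
import torch
import torch.nn as nn


class MultiModalDecoderLayer(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Masked self-attention: each word attends only to preceding words.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-modal attention: words attend to visual object features.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, text: torch.Tensor, objects: torch.Tensor):
        # Causal mask so text generation cannot look at future tokens.
        t = text.size(1)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=text.device), 1)

        # Text-to-text attention (word-word, e.g. low-level syntactic patterns).
        x, text_weights = self.self_attn(text, text, text, attn_mask=causal)
        text = self.norm1(text + x)

        # Cross-modal attention (word-object, e.g. grounding noun phrases).
        x, cross_weights = self.cross_attn(text, objects, objects)
        text = self.norm2(text + x)

        text = self.norm3(text + self.ffn(text))
        # An attention-as-grounding analysis would inspect these two weight
        # tensors: text_weights for word-word patterns, cross_weights for
        # word-object alignment.
        return text, text_weights, cross_weights


if __name__ == "__main__":
    layer = MultiModalDecoderLayer()
    words = torch.randn(1, 12, 512)    # 12 generated tokens (hypothetical)
    regions = torch.randn(1, 36, 512)  # 36 detected object regions (hypothetical)
    _, txt_attn, img_attn = layer(words, regions)
    print(txt_attn.shape, img_attn.shape)  # (1, 12, 12) and (1, 12, 36)
```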
CITATION STYLE
Ilinykh, N., & Dobnik, S. (2022). Attention as Grounding: Exploring Textual and Cross-Modal Attention on Entities and Relations in Language-and-Vision Transformer. In Findings of the Association for Computational Linguistics: ACL 2022 (pp. 4062–4073). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.findings-acl.320