Abstract
Procedural knowledge understanding underlies the ability to infer goal-step relations. The task of Visual Goal-Step Inference addresses this ability in the multimodal domain: it requires identifying the images that depict the steps necessary to accomplish a textually expressed goal. The best existing methods encode texts and images either with independent encoders or with object-level multimodal encoders built on black-box transformers. This stands in contrast to early, linguistically inspired methods for event representation, which focus on capturing the most crucial information, namely actions and participants, in order to learn stereotypical event sequences and hence procedural knowledge. In this work, we study various methods of injecting these early, shallow event representations into today's multimodal deep learning-based models, and their effects on procedural knowledge understanding. We find that the early, linguistically inspired methods for representing event knowledge do contribute to understanding procedures when combined with modern vision-and-language models. This supports further exploration of more complex event structures in combination with large language models.
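To make the setup concrete, below is a minimal sketch of the Visual Goal-Step Inference ranking step with a dual-encoder model. All specifics here are illustrative assumptions, not the paper's actual method: the CLIP checkpoint is one possible choice of independent text/image encoders, the [EVENT] marker and the action-participant string stand in for a shallow event representation (e.g., from a semantic-role labeler, not shown), and the candidate images are blank placeholders.

```python
# Illustrative VGSI sketch (assumptions, not the paper's exact pipeline):
# given a textual goal, rank candidate step images by text-image similarity.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed dual encoder; the paper compares several encoder setups.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

goal = "How to make a pizza at home"
# Hypothetical shallow event representation (action + participants),
# injected by simply appending it to the goal text.
event = "action: knead; participants: dough, hands"
goal_with_event = f"{goal} [EVENT] {event}"

# Placeholder candidate step images (in practice, e.g., wikiHow step images).
candidates = [Image.fromarray(np.zeros((224, 224, 3), dtype=np.uint8))
              for _ in range(4)]

inputs = processor(text=[goal_with_event], images=candidates,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_text: similarity of the goal text to each candidate image.
scores = out.logits_per_text.squeeze(0)
best = int(scores.argmax())
print(f"predicted step image: candidate {best}, scores={scores.tolist()}")
```

Appending the event string to the input text is only the simplest injection variant; the abstract's point is precisely that several such injection methods can be compared against plain goal-text encoding.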
Citation
Shen, C., & Silberer, C. (2023). Combining Tradition with Modernness: Exploring Event Representations in Vision-and-Language Models for Visual Goal-Step Inference. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 4, pp. 254–265). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.acl-srw.36