Modeling Temporal-Modal Entity Graph for Procedural Multimodal Machine Comprehension

Abstract

Procedural Multimodal Documents (PMDs) organize textual instructions and corresponding images step by step. Comprehending PMDs and inducing their representations for downstream reasoning tasks is termed Procedural MultiModal Machine Comprehension (M3C). In this study, we approach Procedural M3C at the entity level, a finer granularity than existing explorations at the document or sentence level. We model each entity in both its temporal and cross-modal relations and propose a novel Temporal-Modal Entity Graph (TMEG). Specifically, a heterogeneous graph structure is formulated to capture textual and visual entities and trace their temporal-modal evolution. In addition, a graph aggregation module is introduced to conduct graph encoding and reasoning. Comprehensive experiments across three Procedural M3C tasks are conducted on the existing dataset RecipeQA and our new dataset CraftQA, which can better evaluate the generalization of TMEG.
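
The abstract names two components: a heterogeneous graph over textual and visual entities, and a graph aggregation module. The following is only a minimal sketch of that general idea, not the authors' implementation. Everything specific in it is an assumption made for illustration: the input format, matching entities by identical name, linking only consecutive steps, and plain mean aggregation over neighbours. The function names `build_entity_graph` and `aggregate` are hypothetical.

```python
from collections import defaultdict


def build_entity_graph(steps):
    """Build a toy temporal-modal entity graph (illustrative assumption only).

    `steps` is a list of dicts, one per procedure step, each holding
    'text' and 'visual' lists of entity names (hypothetical input format).
    Returns (nodes, edges): nodes are (step, modality, name) tuples and
    edges are (source, target, kind) tuples.
    """
    nodes = []
    for t, step in enumerate(steps):
        for modality in ("text", "visual"):
            for name in step[modality]:
                nodes.append((t, modality, name))
    present = set(nodes)
    edges = set()
    for t, modality, name in nodes:
        # Temporal edge: same entity and modality in the next step.
        successor = (t + 1, modality, name)
        if successor in present:
            edges.add(((t, modality, name), successor, "temporal"))
        # Modal edge: a textual entity and its visual counterpart in the same step.
        counterpart = (t, "visual", name)
        if modality == "text" and counterpart in present:
            edges.add(((t, modality, name), counterpart, "modal"))
    return nodes, edges


def aggregate(nodes, edges, features):
    """One round of mean aggregation over each node and its neighbours."""
    neighbours = defaultdict(list)
    for source, target, _kind in edges:
        # Treat every edge as undirected for message passing.
        neighbours[source].append(target)
        neighbours[target].append(source)
    updated = {}
    for node in nodes:
        group = [node] + neighbours[node]
        dim = len(features[node])
        updated[node] = [
            sum(features[g][i] for g in group) / len(group) for i in range(dim)
        ]
    return updated


# Toy two-step craft procedure; feature values are placeholders.
steps = [
    {"text": ["paper"], "visual": ["paper"]},
    {"text": ["paper", "glue"], "visual": ["paper"]},
]
nodes, edges = build_entity_graph(steps)
features = {node: [1.0] if node[1] == "text" else [3.0] for node in nodes}
updated = aggregate(nodes, edges, features)
```

In this toy example, "paper" is tracked across both steps and both modalities, while "glue" appears only in the text of the second step and so has no neighbours. A real system would replace the name matching and placeholder features with learned text and image representations.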

Citation (APA)

Zhang, H., Zhang, Z., Zhang, Y., Wang, J., Li, Y., Jiang, N., … Yang, Z. (2022). Modeling Temporal-Modal Entity Graph for Procedural Multimodal Machine Comprehension. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 1179–1189). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.acl-long.84
