Coreference by Appearance: Visually Grounded Event Coreference Resolution

Abstract

Event coreference resolution is critical for understanding events in the growing volume of online news that spans multiple modalities, including text, video, and speech. However, the events and entities depicted in different modalities may not be perfectly aligned and can be difficult to annotate, which makes the task especially challenging when little supervision is available. To address these issues, we propose a supervised model based on an attention mechanism and an unsupervised model based on statistical machine translation, both capable of learning the relative importance of modalities for event coreference resolution. Experiments on a video multimedia event dataset show that our multimodal models outperform text-only systems on the event coreference resolution task. A careful analysis reveals that the performance gain of the multimodal model, especially under the unsupervised setting, comes from better learning of visually salient events.
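To make the attention-based idea concrete, below is a minimal sketch (not the authors' code) of attention fusion over text and vision features for scoring whether two event mentions corefer. All module names, feature dimensions, and the pairwise scorer are illustrative assumptions; PyTorch is used as the framework.

```python
import torch
import torch.nn as nn


class MultimodalCorefScorer(nn.Module):
    """Hypothetical sketch: fuse text/vision features with modality attention,
    then score an event-mention pair for coreference."""

    def __init__(self, text_dim=768, vis_dim=512, hidden=256):
        super().__init__()
        # Project both modalities into a shared space.
        self.text_proj = nn.Linear(text_dim, hidden)
        self.vis_proj = nn.Linear(vis_dim, hidden)
        # Attention weights decide how much each modality contributes
        # to an event mention's fused representation.
        self.modality_attn = nn.Linear(hidden, 1)
        # Pairwise scorer over [m1, m2, m1 * m2].
        self.scorer = nn.Sequential(
            nn.Linear(hidden * 3, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def fuse(self, text_feat, vis_feat):
        # Stack modality embeddings: (batch, 2, hidden).
        mods = torch.stack(
            [self.text_proj(text_feat), self.vis_proj(vis_feat)], dim=1
        )
        # Softmax over modalities gives the relative importance weights.
        weights = torch.softmax(self.modality_attn(torch.tanh(mods)), dim=1)
        return (weights * mods).sum(dim=1)  # (batch, hidden)

    def forward(self, text1, vis1, text2, vis2):
        m1, m2 = self.fuse(text1, vis1), self.fuse(text2, vis2)
        # Higher score -> more likely the two mentions refer to the same event.
        return self.scorer(torch.cat([m1, m2, m1 * m2], dim=-1)).squeeze(-1)


if __name__ == "__main__":
    model = MultimodalCorefScorer()
    t1, v1 = torch.randn(4, 768), torch.randn(4, 512)
    t2, v2 = torch.randn(4, 768), torch.randn(4, 512)
    print(model(t1, v1, t2, v2).shape)  # torch.Size([4])
```

The modality-level softmax is one simple way a model could learn the "relative importance of modalities" the abstract refers to; the paper's actual architecture may differ.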

Cite

APA

Wang, L., Feng, S., Lin, X., Li, M., Ji, H., & Chang, S. F. (2021). Coreference by Appearance: Visually Grounded Event Coreference Resolution. In 4th Workshop on Computational Models of Reference, Anaphora and Coreference, CRAC 2021 - Proceedings of the Workshop (pp. 132–140). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.crac-1.14
