Open-Vocabulary Video Relation Extraction

Abstract

A comprehensive understanding of videos is inseparable from describing an action together with its contextual action-object interactions. However, many current video understanding tasks prioritize general action classification and overlook the actors and relationships that shape the nature of the action, resulting in only a superficial understanding. Motivated by this, we introduce Open-vocabulary Video Relation Extraction (OVRE), a novel task that views action understanding through the lens of action-centric relation triplets. OVRE focuses on the pairwise relations among the entities that take part in an action and describes these relation triplets in natural language. Moreover, we curate the Moments-OVRE dataset, which comprises 180K videos annotated with action-centric relation triplets, sourced from a multi-label action classification dataset. With Moments-OVRE, we further propose a cross-modal mapping model that generates relation triplets as a sequence. Finally, we benchmark existing cross-modal generation models on the new task of OVRE. Our code and dataset are available at https://github.com/Iriya99/OVRE.
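The abstract's key modeling idea is that a set of relation triplets can be emitted by a cross-modal model as one token sequence. The sketch below illustrates that serialization step only, assuming triplets are joined with plain-text separator tokens; the separators, function names, and example triplets are hypothetical illustrations, not taken from the OVRE repository.

    # Minimal sketch of linearizing (subject, predicate, object) triplets
    # into a single target sequence for a sequence-generation model.
    # Separator tokens here are an assumption, not the OVRE format.

    def serialize_triplets(triplets):
        """Flatten a list of (subject, predicate, object) triplets into one string."""
        return " ; ".join(f"{s} , {p} , {o}" for s, p, o in triplets)

    def parse_triplets(sequence):
        """Recover triplets from a generated sequence; inverse of serialize_triplets."""
        triplets = []
        for chunk in sequence.split(" ; "):
            parts = [p.strip() for p in chunk.split(" , ")]
            if len(parts) == 3:  # skip malformed generations
                triplets.append(tuple(parts))
        return triplets

    if __name__ == "__main__":
        example = [("person", "kicking", "ball"), ("ball", "flying toward", "goal")]
        seq = serialize_triplets(example)
        print(seq)  # person , kicking , ball ; ball , flying toward , goal
        assert parse_triplets(seq) == example

Linearizing the set this way lets a standard sequence decoder produce an arbitrary, open-vocabulary number of triplets, at the cost of imposing an artificial order on what is conceptually an unordered set.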

Cite (APA)

Tian, W., Wang, Z., Fu, Y., Chen, J., & Cheng, L. (2024). Open-Vocabulary Video Relation Extraction. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 38, pp. 5215–5223). Association for the Advancement of Artificial Intelligence. https://doi.org/10.1609/aaai.v38i6.28328
