Weakly-Supervised Video Object Grounding by Exploring Spatio-Temporal Contexts

Abstract

Grounding objects in visual context from natural language queries is a crucial yet challenging vision-and-language task that has gained increasing attention in recent years. Existing work has primarily investigated this task in the context of still images. Despite their effectiveness, these methods cannot be directly transferred to videos, mainly due to 1) the complex spatio-temporal structure of videos and 2) the scarcity of fine-grained video annotations. Grounding objects in videos is thus considerably more challenging and remains under-explored. To fill this research gap, this paper presents a weakly-supervised framework for linking objects mentioned in a sentence with the corresponding regions in videos. It considers two key characteristics of videos: 1) objects are dynamically distributed across multiple frames and have diverse temporal durations, and 2) object regions in videos are spatially correlated with each other. Specifically, we propose a weakly-supervised video object grounding approach that consists of three modules: 1) a temporal localization module that models the latent relation between queried objects and frames with a temporal attention network, 2) a spatial interaction module that captures feature correlations among object regions to learn context-aware region representations, and 3) a hierarchical video multiple instance learning algorithm that estimates the sentence-segment grounding score for discriminative training. Extensive experiments demonstrate that our method achieves consistent improvements over state-of-the-art approaches.
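
To make the three modules concrete, below is a minimal PyTorch sketch of how such a pipeline might be wired together. Everything here is an illustrative assumption rather than the authors' released implementation: the module names, feature dimensions, the scaled dot-product form of the temporal attention, the single-head self-attention used for spatial interaction, the max-then-weighted-sum MIL pooling, and the cosine query-region similarity are all hypothetical choices.

```python
# A minimal sketch of the three modules described in the abstract.
# All names, dimensions, and wiring are illustrative assumptions,
# not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalLocalization(nn.Module):
    """Temporal attention over frames: scores how relevant each frame is
    to a queried object (assumed scaled dot-product attention)."""
    def __init__(self, dim):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)
        self.frame_proj = nn.Linear(dim, dim)

    def forward(self, query, frames):
        # query:  (B, D) embedding of the queried object
        # frames: (B, T, D) frame-level video features
        q = self.query_proj(query).unsqueeze(1)     # (B, 1, D)
        k = self.frame_proj(frames)                 # (B, T, D)
        logits = (q * k).sum(-1) / k.size(-1) ** 0.5
        return F.softmax(logits, dim=-1)            # (B, T) frame weights

class SpatialInteraction(nn.Module):
    """Self-attention among object regions within each frame, producing
    context-aware region representations (assumed single-head)."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)

    def forward(self, regions):
        # regions: (B, T, R, D) region features per frame
        q, k, v = self.qkv(regions).chunk(3, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
        return regions + attn @ v                   # residual, context-aware

def hierarchical_mil_score(region_scores, frame_attn):
    """Hierarchical MIL pooling: max over regions (a frame is positive if
    any of its regions matches the query), then a temporal-attention-weighted
    sum over frames gives the sentence-segment grounding score."""
    # region_scores: (B, T, R) query-region similarities
    # frame_attn:    (B, T) weights from TemporalLocalization
    frame_scores = region_scores.max(dim=-1).values  # (B, T)
    return (frame_attn * frame_scores).sum(dim=-1)   # (B,)

# Toy usage: batch 2, 8 frames, 5 regions per frame, 256-d features.
B, T, R, D = 2, 8, 5, 256
query, frames, regions = torch.randn(B, D), torch.randn(B, T, D), torch.randn(B, T, R, D)
frame_attn = TemporalLocalization(D)(query, frames)             # (B, T)
ctx_regions = SpatialInteraction(D)(regions)                    # (B, T, R, D)
sims = F.cosine_similarity(ctx_regions, query[:, None, None, :], dim=-1)
score = hierarchical_mil_score(sims, frame_attn)                # (B,)
```

Under weak supervision, a segment-level score like this could be trained discriminatively with a ranking or contrastive loss over matched versus mismatched sentence-segment pairs, since no region-level labels are available.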

Citation (APA)

Yang, X., Liu, X., Jian, M., Gao, X., & Wang, M. (2020). Weakly-Supervised Video Object Grounding by Exploring Spatio-Temporal Contexts. In MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia (pp. 1939–1947). Association for Computing Machinery, Inc. https://doi.org/10.1145/3394171.3413610

