Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding

11 citations · 50 Mendeley readers

Abstract

Temporal language grounding (TLG) aims to localize a video segment in an untrimmed video based on a natural language description. To alleviate the expensive cost of manually annotating temporal boundary labels, we focus on the weakly supervised setting, where only video-level descriptions are provided for training. Most existing weakly supervised methods generate a set of candidate segments and learn cross-modal alignment through a multiple-instance-learning (MIL) framework. However, the temporal structure of the video and the complex semantics of the sentence are lost during learning. In this work, we propose a novel candidate-free framework, the Fine-grained Semantic Alignment Network (FSAN), for weakly supervised TLG. Instead of treating the sentence and candidate moments as wholes, FSAN learns token-by-clip cross-modal semantic alignment via an iterative cross-modal interaction module, generates a fine-grained cross-modal semantic alignment map, and performs grounding directly on top of this map. Extensive experiments on two widely used benchmarks, ActivityNet-Captions and DiDeMo, show that FSAN achieves state-of-the-art performance.
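For intuition, the sketch below shows one way a token-by-clip alignment map could be computed: project word-token and video-clip features into a shared space and score every (token, clip) pair. This is a minimal illustration, not the authors' implementation; the module name, feature dimensions, and scaled dot-product scoring are all assumptions for the example.

```python
# Illustrative sketch (not the paper's code): a token-by-clip
# cross-modal alignment map from word-token and video-clip features.
import torch
import torch.nn as nn


class TokenClipAlignment(nn.Module):
    """Projects both modalities into a shared space and scores every
    (token, clip) pair, yielding a fine-grained alignment map."""

    def __init__(self, text_dim=300, video_dim=1024, shared_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, shared_dim)    # word tokens -> shared space
        self.video_proj = nn.Linear(video_dim, shared_dim)  # video clips -> shared space
        self.scale = shared_dim ** 0.5

    def forward(self, tokens, clips):
        # tokens: (batch, num_tokens, text_dim); clips: (batch, num_clips, video_dim)
        t = self.text_proj(tokens)   # (batch, num_tokens, shared_dim)
        v = self.video_proj(clips)   # (batch, num_clips, shared_dim)
        # Alignment map: similarity of every token with every clip.
        align_map = torch.einsum("btd,bcd->btc", t, v) / self.scale
        return align_map             # (batch, num_tokens, num_clips)


if __name__ == "__main__":
    model = TokenClipAlignment()
    tokens = torch.randn(2, 12, 300)   # e.g. word embeddings for a 12-token query
    clips = torch.randn(2, 64, 1024)   # e.g. features for 64 video clips
    print(model(tokens, clips).shape)  # torch.Size([2, 12, 64])
```

In the paper's framework, grounding is then performed directly on such a fine-grained map rather than on scores for pre-defined candidate segments.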

Cite

APA

Wang, Y., Zhou, W., & Li, H. (2021). Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding. In Findings of the Association for Computational Linguistics: EMNLP 2021 (pp. 89–99). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.findings-emnlp.9
