Span-based localizing network for natural language video localization

Abstract

Given an untrimmed video and a text query, natural language video localization (NLVL) aims to locate a span of the video that semantically corresponds to the query. Existing solutions formulate NLVL either as a ranking task, applying a multimodal matching architecture, or as a regression task that directly regresses the target video span. In this work, we address NLVL with a span-based QA approach by treating the input video as a text passage. We propose a video span localizing network (VSLNet), built on top of the standard span-based QA framework, to address NLVL. VSLNet tackles the differences between NLVL and span-based QA through a simple yet effective query-guided highlighting (QGH) strategy: QGH guides VSLNet to search for the matching video span within a highlighted region. Through extensive experiments on three benchmark datasets, we show that VSLNet outperforms state-of-the-art methods, and that adopting a span-based QA framework is a promising direction for solving NLVL.
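To make the QGH idea concrete, below is a minimal PyTorch sketch of one plausible reading of the abstract: a layer that scores each video clip feature against a pooled query representation and re-weights the clip features so that downstream span prediction concentrates on the highlighted region. The module name, feature shapes, and pooling choice are illustrative assumptions; the paper's actual VSLNet implementation differs in detail.

```python
# Hedged sketch of a query-guided highlighting (QGH) layer.
# Assumptions (not from the paper): mean-pooled query vector, a single
# linear scorer, and multiplicative re-weighting of clip features.
import torch
import torch.nn as nn


class QueryGuidedHighlighting(nn.Module):
    """Scores each video clip against the query and re-weights clip
    features so span prediction focuses on the highlighted region."""

    def __init__(self, dim: int):
        super().__init__()
        # Concatenated [clip; query] feature -> scalar highlight logit.
        self.scorer = nn.Linear(2 * dim, 1)

    def forward(self, video: torch.Tensor, query: torch.Tensor):
        # video: (batch, num_clips, dim); query: (batch, num_words, dim)
        q = query.mean(dim=1, keepdim=True)        # pooled query vector
        q = q.expand(-1, video.size(1), -1)        # broadcast over clips
        logits = self.scorer(torch.cat([video, q], dim=-1)).squeeze(-1)
        scores = torch.sigmoid(logits)             # per-clip highlight score
        # Re-weighted features go to the span predictor (start/end heads).
        return scores.unsqueeze(-1) * video, scores


# Usage with random features; in practice `scores` would be supervised
# against the ground-truth span region.
feats, scores = QueryGuidedHighlighting(dim=128)(
    torch.randn(2, 64, 128), torch.randn(2, 12, 128))
```

In this reading, the highlight scores act as a soft mask: clips judged irrelevant to the query are attenuated before the start/end span heads are applied, which is one way to restrict span search to a highlighted region as the abstract describes.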

Citation (APA)

Zhang, H., Sun, A., Jing, W., & Zhou, J. T. (2020). Span-based localizing network for natural language video localization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 6543–6554). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.585
