MS-DETR: Natural Language Video Localization with Sampling Moment-Moment Interaction

Abstract

Given a query, the task of Natural Language Video Localization (NLVL) is to localize a temporal moment in an untrimmed video that semantically matches the query. In this paper, we adopt a proposal-based solution that generates proposals (i.e., candidate moments) and then selects the best-matching proposal. On top of modeling the cross-modal interaction between candidate moments and the query, our proposed Moment Sampling DETR (MS-DETR) enables efficient moment-moment relation modeling. The core idea is to sample a subset of moments guided by the learnable templates with an adopted DETR (DEtection TRansformer) framework. To achieve this, we design a multi-scale visual-linguistic encoder and an anchor-guided moment decoder paired with a set of learnable templates. Experimental results on three public datasets demonstrate the superior performance of MS-DETR.
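To give a rough sense of why sampling a subset of moments can make moment-moment interaction cheaper, the sketch below counts candidate moments and pairwise interactions for a video split into clips. It is only an illustration of the counting argument, not the authors' method: the function names, the clip count, and the number of sampled moments are all hypothetical, and the learned, template-guided sampling described in the abstract is not modeled here.

```python
def enumerate_moments(num_clips):
    """Return every (start, end) clip-index pair with start <= end.

    Assumption: a moment is a contiguous span of clips, inclusive on both ends.
    """
    return [(s, e) for s in range(num_clips) for e in range(s, num_clips)]


def pairwise_interactions(num_moments):
    """Count ordered moment-moment pairs under dense all-to-all interaction."""
    return num_moments * num_moments


# Hypothetical sizes chosen for illustration only (not taken from the paper).
num_clips = 64      # video divided into 64 clips
num_sampled = 30    # size of the sampled subset of moments

all_moments = enumerate_moments(num_clips)
dense = pairwise_interactions(len(all_moments))   # interact over all moments
sparse = pairwise_interactions(num_sampled)       # interact over the subset

print(f"candidate moments: {len(all_moments)}")
print(f"dense pairwise interactions: {dense}")
print(f"sampled pairwise interactions: {sparse}")
```

Under these assumed sizes, the number of candidate moments grows quadratically with the number of clips, so dense pairwise interaction grows with its fourth power, while interaction over a fixed-size sampled subset does not depend on video length.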

Cite

Wang, J., Sun, A., Zhang, H., & Li, X. (2023). MS-DETR: Natural Language Video Localization with Sampling Moment-Moment Interaction. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 1387–1400). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.acl-long.77
