CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding

Abstract

This paper tackles an emerging and challenging problem of long video temporal grounding (VTG), which localizes video moments related to a natural language (NL) query. Compared with short videos, long videos are equally in demand but far less explored; they bring new challenges of higher inference computation cost and weaker multi-modal alignment. To address these challenges, we propose CONE, an efficient COarse-to-fiNE alignment framework. CONE is a plug-and-play framework on top of existing VTG models that handles long videos through a sliding window mechanism. Specifically, CONE (1) introduces a query-guided window selection strategy to speed up inference, and (2) proposes a coarse-to-fine mechanism via a novel incorporation of contrastive learning to enhance multi-modal alignment for long videos. Extensive experiments on two large-scale long VTG benchmarks consistently show both substantial performance gains (e.g., from 3.13% to 6.87% on MAD, a +119% relative improvement) and state-of-the-art results. Analyses also reveal higher efficiency: the query-guided window selection mechanism accelerates inference by 2x on Ego4D-NLQ and 15x on MAD while keeping SOTA results. Code has been released at https://github.com/houzhijian/CONE.
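The coarse-to-fine pipeline described above can be pictured with a short sketch. The Python below is a hypothetical illustration written for this summary, not the authors' implementation (see the linked repository for that); the function names, mean-pooling choice, and all parameters are assumptions made here for clarity. It shows the two stages: a coarse, query-guided ranking of sliding windows, then fine-grained scoring only inside the selected windows.

```python
# Hypothetical sketch of CONE-style coarse-to-fine inference (not the authors'
# released code; see https://github.com/houzhijian/CONE for the real thing).
# Assumes precomputed, L2-normalized frame features (N, d) and a query
# embedding (d,) in a shared space.
import numpy as np

def split_windows(num_frames, window, stride):
    """Slide a fixed-size window over the frame axis."""
    starts = range(0, max(num_frames - window, 0) + 1, stride)
    return [(s, min(s + window, num_frames)) for s in starts]

def coarse_to_fine_grounding(frame_feats, query_feat, window=64, stride=32, top_k=5):
    """Coarse stage: rank windows by query similarity and keep the top-k.
    Fine stage: score individual frames inside the selected windows only,
    which is where the inference speedup comes from. A full model would run
    a proposal head per window; cosine similarity stands in for both stages."""
    wins = split_windows(len(frame_feats), window, stride)
    # Coarse: one mean-pooled feature per window -> query-guided selection.
    win_feats = np.stack([frame_feats[s:e].mean(axis=0) for s, e in wins])
    win_scores = win_feats @ query_feat
    selected = [wins[i] for i in np.argsort(-win_scores)[:top_k]]
    # Fine: frame-level scores computed only inside the selected windows.
    candidates = []
    for s, e in selected:
        scores = frame_feats[s:e] @ query_feat
        best = int(np.argmax(scores))
        candidates.append((s + best, float(scores[best])))
    # Return candidate frame indices ranked by fine-grained score.
    return sorted(candidates, key=lambda c: -c[1])
```

In this toy version both stages reuse the same similarity function; the paper's contribution is precisely that the fine stage is strengthened with contrastive learning, so the sketch only conveys the control flow, not the alignment quality.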

Cite

APA

Hou, Z., Zhong, W., Ji, L., Gao, D., Yan, K., Chan, W. K., … Duan, N. (2023). CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 8013–8028). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.acl-long.445
