Abstract
Temporal sentence localization in videos aims to localize the segment of an untrimmed video that best matches a given sentence query. Previous works in this field mainly rely on single-step attentional frameworks that align temporal boundaries through a soft selection. Although they attend to the visual content relevant to the query, such attention strategies are insufficient for modeling complex video content and fall short of the higher-level reasoning required for temporal relations. In this paper, we propose a novel deep rectification-modulation network (RMN), which transforms this task into a multi-step reasoning process by repeating rectification and modulation. In each rectification-modulation layer, unlike existing methods that conduct cross-modal interaction directly, we first devise a rectification module to correct implicit attention misalignment, i.e., attention that focuses on the wrong positions during the interaction process. Then, a modulation module models frame-to-frame relations, guided by sentence-specific information, to better correlate and compose video content over time. With multiple such layers cascaded in depth, our RMN progressively refines video-query interactions, thus enabling more precise localization. Experimental evaluations on three public datasets show that the proposed method achieves state-of-the-art performance.
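To make the layer structure concrete, below is a minimal PyTorch sketch of one rectification-modulation step as described in the abstract: cross-modal attention whose output is corrected by a learned gate (rectification), followed by sentence-conditioned frame-to-frame attention (modulation), with several layers stacked in depth. All module names, dimensions, and the gating/conditioning details are illustrative assumptions, not the authors' actual implementation.

import torch
import torch.nn as nn

class RectificationModulationLayer(nn.Module):
    """Hypothetical rectification-modulation layer (illustrative sketch only)."""

    def __init__(self, dim, heads=4):
        super().__init__()
        # cross-modal interaction: frames attend to query words (soft selection)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # rectification gate: re-weights attended features to suppress misaligned positions
        self.rect_gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        # modulation: sentence-conditioned frame-to-frame self-attention
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cond = nn.Linear(dim, dim)

    def forward(self, frames, words):
        # frames: (B, T, D) video features; words: (B, L, D) query word features
        attended, _ = self.cross_attn(frames, words, words)
        gate = self.rect_gate(torch.cat([frames, attended], dim=-1))
        rectified = frames + gate * attended               # rectified cross-modal features
        sent = torch.sigmoid(self.cond(words.mean(dim=1, keepdim=True)))  # sentence condition
        modulated, _ = self.frame_attn(rectified * sent, rectified, rectified)
        return rectified + modulated                        # refined features for the next layer

# Usage: cascade several layers and read out segment boundaries from the final output.
if __name__ == "__main__":
    B, T, L, D = 2, 64, 12, 256
    frames, words = torch.randn(B, T, D), torch.randn(B, L, D)
    layers = nn.ModuleList([RectificationModulationLayer(D) for _ in range(3)])
    x = frames
    for layer in layers:
        x = layer(x, words)
    print(x.shape)  # torch.Size([2, 64, 256])

Stacking the layer lets each pass re-correct the attention and re-compose the frame relations, which is the multi-step reasoning idea the abstract describes; the boundary prediction head is omitted here.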
Citation
Liu, D., Qu, X., Dong, J., & Zhou, P. (2020). Reasoning Step-by-Step: Temporal Sentence Localization in Videos via Deep Rectification-Modulation Network. In COLING 2020 - 28th International Conference on Computational Linguistics, Proceedings of the Conference (pp. 1841–1851). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2020.coling-main.167