Hierarchical context-aware network for dense video event captioning

Abstract

Dense video event captioning aims to generate a sequence of descriptive captions for each event in a long untrimmed video. Video-level context provides important information and facilitates the model in generating captions that are consistent and less redundant across events. In this paper, we introduce a novel Hierarchical Context-aware Network for dense video event captioning (HCN) to capture context from various aspects. Specifically, the model leverages local and global context with different mechanisms to jointly learn to generate coherent captions. The local context module performs full interaction among neighboring frames, while the global context module selectively attends to previous or future events. In extensive experiments on both the YouCook2 and ActivityNet Captions datasets, the video-level HCN model outperforms the event-level context-agnostic model by a large margin. The code is available at https://github.com/KirkGuo/HCN.
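The abstract describes the two context mechanisms but not their implementation. Below is a minimal, hypothetical PyTorch sketch of one way the described design could be realized: windowed self-attention over neighboring frames for local context, and cross-attention over summaries of other events for global context. All class names, dimensions, and the window size are illustrative assumptions, not the authors' code; see the linked repository for the actual implementation.

```python
import torch
import torch.nn as nn


class LocalContextBlock(nn.Module):
    """Self-attention restricted to a window of neighboring frames
    (local context). The window size is an assumption for illustration."""

    def __init__(self, d_model=512, n_heads=8, window=16):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, frames):  # frames: (B, T, d_model)
        T = frames.shape[1]
        # Band mask: True = blocked, so each frame attends only to
        # frames within `window` positions of itself.
        idx = torch.arange(T, device=frames.device)
        mask = (idx[None, :] - idx[:, None]).abs() > self.window
        out, _ = self.attn(frames, frames, frames, attn_mask=mask)
        return out


class GlobalContextBlock(nn.Module):
    """Cross-attention from the current event's frames to pooled
    summaries of other (previous or future) events (global context)."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, frames, other_events):  # (B, T, d), (B, E, d)
        out, _ = self.attn(frames, other_events, other_events)
        return out


class HierarchicalContext(nn.Module):
    """Fuses frame-level (local) and event-level (global) context."""

    def __init__(self, d_model=512):
        super().__init__()
        self.local = LocalContextBlock(d_model)
        self.global_ = GlobalContextBlock(d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, frames, other_events):
        local = self.local(frames)
        global_ = self.global_(frames, other_events)
        return self.fuse(torch.cat([local, global_], dim=-1))


if __name__ == "__main__":
    frames = torch.randn(2, 100, 512)  # features of 100 frames in one event
    others = torch.randn(2, 6, 512)    # pooled summaries of 6 other events
    out = HierarchicalContext()(frames, others)
    print(out.shape)  # torch.Size([2, 100, 512])
```

In this sketch, the band mask gives each frame full interaction with its neighbors only, while the cross-attention lets the event selectively weight other events, mirroring the two mechanisms the abstract distinguishes; the fused representation would then feed a caption decoder.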

Citation (APA)

Ji, L., Guo, X., Huang, H., & Chen, X. (2021). Hierarchical context-aware network for dense video event captioning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021) (pp. 2004–2013). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-long.156
