Hierarchical context-aware network for dense video event captioning

Lei Ji; Xianglin Guo; Haoyang Huang; Xilin Chen

Conference Proceedings (Open Access)

ACL-IJCNLP 2021 - 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference (2021) 2004-2013

DOI: 10.18653/v1/2021.acl-long.156

Abstract

Dense video event captioning aims to generate a sequence of descriptive captions, one for each event in a long untrimmed video. Video-level context provides important information and helps the model generate consistent, less redundant captions across events. In this paper, we introduce a novel Hierarchical Context-aware Network for dense video event captioning (HCN) that captures context at several levels. Specifically, the model leverages local and global context through different mechanisms and learns them jointly to generate coherent captions. The local context module performs full interaction between neighboring frames, and the global context module selectively attends to previous or future events. In extensive experiments on both the YouCook2 and ActivityNet Captions datasets, the video-level HCN model outperforms the event-level, context-agnostic model by a large margin. The code is available at https://github.com/KirkGuo/HCN.
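
The two context levels described in the abstract can be illustrated with a small, purely hypothetical sketch. It is not the authors' implementation (see the repository linked above for that); it assumes plain scaled dot-product attention, mean pooling to summarize an event, and concatenation to combine the two contexts. All function names (`attend`, `local_context`, `global_context`, `encode_event`) are invented for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Scaled dot-product attention of one query over key/value vectors."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def local_context(frames):
    """Assumed local module: every frame in a window attends to all frames
    in that window (full interaction between neighboring frames)."""
    return [attend(f, frames, frames) for f in frames]


def mean_pool(vectors):
    """Assumed event summary: average of the frame vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def global_context(event_index, event_summaries):
    """Assumed global module: the current event attends over the summaries
    of the other events, i.e. previous and future ones."""
    current = event_summaries[event_index]
    others = [s for j, s in enumerate(event_summaries) if j != event_index]
    if not others:
        # A single-event video has no other events to attend to.
        return list(current)
    return attend(current, others, others)


def encode_event(event_index, video):
    """video: list of events, each a list of frame feature vectors.
    Returns the local summary concatenated with the global context."""
    summaries = [mean_pool(local_context(frames)) for frames in video]
    return summaries[event_index] + global_context(event_index, summaries)
```

Under these assumptions, `encode_event(1, video)` returns a vector twice the frame-feature size: the first half summarizes the event's own frames, and the second half is a weighted mix of the other events' summaries, with the weights determined by attention. A caption decoder, not shown here, would consume that vector.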

Citation (APA)

Ji, L., Guo, X., Huang, H., & Chen, X. (2021). Hierarchical context-aware network for dense video event captioning. In ACL-IJCNLP 2021 - 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference (pp. 2004–2013). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.acl-long.156
