Hierarchical Vision-Language Alignment for Video Captioning

Abstract

Video captioning has seen promising advances in recent years, yet it remains a challenging task because it is hard to capture the semantic correspondences between visual content and language descriptions. Different granularities of language components (e.g., words, phrases, and sentences) correspond to different granularities of visual elements (e.g., objects, visual relations, and regions of interest). These correspondences can provide multi-level alignments and complementary information for transforming visual content into language descriptions. Therefore, we propose an Attention Guided Hierarchical Alignment (AGHA) approach for video captioning. In the proposed approach, hierarchical vision-language alignments, including object-word, relation-phrase, and region-sentence alignments, are extracted from a well-learned model suited to multiple vision-and-language tasks, and are then embedded into parallel encoder-decoder streams to provide multi-level semantic guidance and rich complementary information for description generation. In addition, multi-granularity visual features are exploited to obtain a coarse-to-fine understanding of complex video content, with an attention mechanism applied to extract comprehensive visual discrimination and enhance video captioning. Experimental results on the widely used MSVD dataset demonstrate that AGHA achieves promising improvements on popular evaluation metrics.
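The abstract describes parallel encoder-decoder streams guided by attention over multi-granularity visual features (objects, relations, regions). Below is a minimal PyTorch sketch of that general idea, not the authors' implementation: the stream names, feature dimensions, fusion scheme, and all class and variable names are assumptions made purely for illustration.

    # Illustrative sketch only: parallel attention streams over object-, relation-,
    # and region-level features feeding a single GRU caption decoder.
    import torch
    import torch.nn as nn


    class AttentionPool(nn.Module):
        """Additive attention that pools one feature set given the decoder state."""
        def __init__(self, feat_dim, hid_dim):
            super().__init__()
            self.proj = nn.Linear(feat_dim + hid_dim, hid_dim)
            self.score = nn.Linear(hid_dim, 1)

        def forward(self, feats, state):            # feats: (B, N, F), state: (B, H)
            expanded = state.unsqueeze(1).expand(-1, feats.size(1), -1)
            energy = self.score(torch.tanh(self.proj(torch.cat([feats, expanded], dim=-1))))
            alpha = torch.softmax(energy, dim=1)     # attention weights over the N elements
            return (alpha * feats).sum(dim=1)        # pooled context vector: (B, F)


    class MultiStreamCaptioner(nn.Module):
        """Parallel streams (e.g., object / relation / region) fused at each decode step."""
        def __init__(self, feat_dims, vocab_size, hid_dim=512, emb_dim=300):
            super().__init__()
            self.attn = nn.ModuleList([AttentionPool(d, hid_dim) for d in feat_dims])
            self.fuse = nn.Linear(sum(feat_dims), hid_dim)
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.decoder = nn.GRUCell(emb_dim + hid_dim, hid_dim)
            self.out = nn.Linear(hid_dim, vocab_size)

        def forward(self, streams, captions):        # streams: list of (B, N_i, F_i)
            B, T = captions.shape
            h = streams[0].new_zeros(B, self.decoder.hidden_size)
            logits = []
            for t in range(T):
                # Attend to each granularity independently, then fuse the contexts.
                ctx = torch.cat([attn(f, h) for attn, f in zip(self.attn, streams)], dim=-1)
                step_in = torch.cat([self.embed(captions[:, t]), self.fuse(ctx)], dim=-1)
                h = self.decoder(step_in, h)
                logits.append(self.out(h))
            return torch.stack(logits, dim=1)         # (B, T, vocab_size)


    # Toy usage with random tensors standing in for object/relation/region features.
    if __name__ == "__main__":
        model = MultiStreamCaptioner(feat_dims=[2048, 1024, 2048], vocab_size=1000)
        streams = [torch.randn(2, 10, 2048), torch.randn(2, 20, 1024), torch.randn(2, 5, 2048)]
        captions = torch.randint(0, 1000, (2, 12))
        print(model(streams, captions).shape)         # torch.Size([2, 12, 1000])

In this sketch each stream is attended independently at every decoding step and the pooled contexts are concatenated before driving the GRU decoder; the AGHA approach additionally embeds the extracted object-word, relation-phrase, and region-sentence alignments into the streams as semantic guidance, which is omitted here for brevity.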

Cite

APA

Zhang, J., & Peng, Y. (2019). Hierarchical Vision-Language Alignment for Video Captioning. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11295 LNCS, pp. 42–54). Springer Verlag. https://doi.org/10.1007/978-3-030-05710-7_4
