Retrieval-augmented Video Encoding for Instructional Captioning

Yeonjoon Jung; Minsoo Kim; Seungtaek Choi; Jihyuk Kim; Minji Seo; Seung Won Hwang

Conference Proceedings

Proceedings of the Annual Meeting of the Association for Computational Linguistics (2023) 8554-8568

DOI: 10.18653/v1/2023.findings-acl.543

Abstract

Instructional videos make learning more efficient by providing a detailed multimodal context for each procedure in the instruction. A unique challenge posed by instructional videos is key-object degeneracy, where no single modality sufficiently captures the key objects referred to in the procedure. For machine systems, such degeneracy can degrade performance on downstream tasks such as dense video captioning, leading to incorrect captions that omit key objects. To repair this degeneracy, we propose a retrieval-based framework that augments the model representations when key-object degeneracy is present. We validate the effectiveness and generalizability of the proposed framework over baselines using modalities with key-object degeneracy.

Cite

Jung, Y., Kim, M., Choi, S., Kim, J., Seo, M., & Hwang, S. W. (2023). Retrieval-augmented Video Encoding for Instructional Captioning. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 8554–8568). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.findings-acl.543
