Retrieval-augmented Video Encoding for Instructional Captioning

Abstract

Instructional videos make acquiring knowledge more efficient by providing detailed multimodal context for each procedure in the instruction. A unique challenge posed by instructional videos is key-object degeneracy, where no single modality sufficiently captures the key objects referred to in a procedure. For machine systems, such degeneracy can degrade performance on downstream tasks such as dense video captioning, leading models to generate incorrect captions that omit key objects. To address this degeneracy, we propose a retrieval-based framework that augments model representations in the presence of key-object degeneracy. We validate the effectiveness and generalizability of the proposed framework against baselines that use modalities exhibiting key-object degeneracy.
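To make the general idea of retrieval-based representation augmentation concrete, here is a minimal sketch. It is not the authors' method: the abstract does not specify the retriever or the fusion step, so cosine-similarity top-k retrieval over an external memory of embeddings, mean pooling of the retrieved entries, and linear interpolation with weight `alpha` are all assumptions. The functions `cosine`, `retrieve_top_k`, and `augment` are hypothetical names for illustration. The intuition is that when a video segment's own embedding under-specifies a key object, its nearest neighbours in the memory may carry that missing information.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query: Sequence[float],
                   memory: Sequence[Sequence[float]],
                   k: int) -> List[int]:
    """Indices of the k memory entries most similar to the query, best first.

    Ties are broken by the lower index so the result is deterministic.
    """
    scored = [(cosine(query, entry), i) for i, entry in enumerate(memory)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]


def augment(segment: Sequence[float],
            memory: Sequence[Sequence[float]],
            k: int = 2,
            alpha: float = 0.5) -> Vector:
    """Hypothetical fusion: blend a segment embedding with the mean of its
    top-k retrieved neighbours. alpha is the weight on the retrieved part
    (an assumed hyperparameter, not taken from the paper)."""
    if not memory or k <= 0:
        # Nothing to retrieve: return an unchanged copy of the segment.
        return list(segment)
    indices = retrieve_top_k(segment, memory, k)
    dim = len(segment)
    # Mean-pool the retrieved embeddings dimension by dimension.
    pooled = [sum(memory[i][d] for i in indices) / len(indices) for d in range(dim)]
    # Linear interpolation between the original and retrieved representations.
    return [(1.0 - alpha) * s + alpha * p for s, p in zip(segment, pooled)]


if __name__ == "__main__":
    segment = [1.0, 0.0, 0.0]  # toy embedding of a video segment
    memory = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.8, 0.0, 0.6]]  # toy memory
    print(retrieve_top_k(segment, memory, k=2))
    print(augment(segment, memory, k=2, alpha=0.5))
```

In the toy run, the first and third memory entries are retrieved, and the augmented vector picks up a nonzero third component that the original segment embedding lacked. A learned retriever and a learned fusion layer would replace these fixed choices in a real system.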

Jung, Y., Kim, M., Choi, S., Kim, J., Seo, M., & Hwang, S. W. (2023). Retrieval-augmented Video Encoding for Instructional Captioning. In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 8554–8568). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.findings-acl.543