Abstract
Instructional videos make learning more efficient by providing a detailed multimodal context for each procedure in the instruction. A unique challenge posed by instructional videos is key-object degeneracy, where any single modality fails to sufficiently capture the key objects referred to in the procedure. For machine systems, such degeneracy can degrade performance on downstream tasks such as dense video captioning, leading to incorrect captions that omit key objects. To repair this degeneracy, we propose a retrieval-based framework that augments the model representations when key-object degeneracy is present. We validate the effectiveness and generalizability of the proposed framework over baselines using modalities with key-object degeneracy.
Cite
Jung, Y., Kim, M., Choi, S., Kim, J., Seo, M., & Hwang, S. W. (2023). Retrieval-augmented Video Encoding for Instructional Captioning. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 8554–8568). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.findings-acl.543