A case study on combining ASR and visual features for generating instructional video captions

Citations: 22 · Mendeley readers: 85

Abstract

Instructional videos receive high traffic on video sharing platforms, and prior work suggests that providing time-stamped, subtask annotations (e.g., "heat the oil in the pan") improves user experiences. However, current automatic annotation methods based on visual features alone perform only slightly better than constant prediction. Taking cues from prior work, we show that we can improve performance significantly by considering automatic speech recognition (ASR) tokens as input. Furthermore, jointly modeling ASR tokens and visual features results in higher performance than training on either modality individually. We find that unstated background information is better explained by visual features, whereas fine-grained distinctions (e.g., "add oil" vs. "add olive oil") are disambiguated more easily via ASR tokens.
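As a rough illustration of the joint ASR-plus-visual setup the abstract describes, the sketch below shows a minimal PyTorch captioner that fuses ASR token embeddings with frame-level visual features before decoding a caption. The specific architecture (mean-pooled fusion, a single GRU decoder) and all dimensions are illustrative assumptions, not the authors' actual model.

```python
import torch
import torch.nn as nn

class ASRVisualCaptioner(nn.Module):
    """Minimal sketch: fuse ASR token embeddings with per-frame visual
    features, then decode a caption with a GRU. Layer sizes and the
    mean-pooling fusion are assumptions for illustration only."""

    def __init__(self, vocab_size, asr_vocab_size, visual_dim=2048, hidden_dim=512):
        super().__init__()
        self.asr_embed = nn.Embedding(asr_vocab_size, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.caption_embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, asr_tokens, visual_feats, caption_in):
        # asr_tokens: (B, T_asr) token ids from the ASR transcript
        # visual_feats: (B, T_vis, visual_dim) per-frame CNN features
        # caption_in: (B, T_cap) caption tokens shifted right (teacher forcing)
        asr_ctx = self.asr_embed(asr_tokens).mean(dim=1)       # (B, H)
        vis_ctx = self.visual_proj(visual_feats).mean(dim=1)   # (B, H)
        fused = (asr_ctx + vis_ctx).unsqueeze(0)               # (1, B, H) initial decoder state
        dec_out, _ = self.decoder(self.caption_embed(caption_in), fused)
        return self.out(dec_out)                               # (B, T_cap, vocab_size)

# Tiny smoke test with random inputs (shapes are arbitrary).
model = ASRVisualCaptioner(vocab_size=1000, asr_vocab_size=1000)
logits = model(torch.randint(0, 1000, (2, 30)),
               torch.randn(2, 16, 2048),
               torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 1000])
```

Dropping either the ASR branch or the visual branch from such a model recovers the unimodal baselines the abstract compares against; the paper's finding is that the fused variant outperforms both.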

Citation (APA)
Hessel, J., Pang, B., Zhu, Z., & Soricut, R. (2019). A case study on combining ASR and visual features for generating instructional video captions. In CoNLL 2019 - 23rd Conference on Computational Natural Language Learning, Proceedings of the Conference (pp. 419–429). Association for Computational Linguistics. https://doi.org/10.18653/v1/K19-1039
