Humans can easily describe what they see in a coherent way and at varying levels of detail. However, existing approaches to automatic video description focus on generating only single sentences and cannot vary the descriptions' level of detail. In this paper, we address both of these limitations: for a variable level of detail, we produce coherent multi-sentence descriptions of complex videos. To understand the difference between detailed and short descriptions, we collect and analyze a video description corpus with three levels of detail. We follow a two-step approach: we first learn to predict a semantic representation (SR) from video and then generate natural language descriptions from it. For our multi-sentence descriptions, we model across-sentence consistency at the level of the SR by enforcing a consistent topic. Human judges rate our descriptions as more readable, correct, and relevant than those of related work.
Citation
Rohrbach, A., Rohrbach, M., Qiu, W., Friedrich, A., Pinkal, M., & Schiele, B. (2014). Coherent multi-sentence video description with variable level of detail. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8753, pp. 184–195). Springer Verlag. https://doi.org/10.1007/978-3-319-11752-2_15