SODA: Story Oriented Dense Video Captioning Evaluation Framework

Soichiro Fujita; Tsutomu Hirao; Hidetaka Kamigaito; Manabu Okumura; Masaaki Nagata

Conference Proceedings


Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2020) 12351 LNCS 517-531

DOI: 10.1007/978-3-030-58539-6_31


Abstract

Dense Video Captioning (DVC) is a challenging task that localizes all events in a short video and describes them with natural language sentences. The main goal of DVC is video story description, that is, to generate a concise video story that supports human comprehension of the video without watching it. In recent years, DVC has attracted increasing attention in the vision and language research community and has been adopted as a task in the ActivityNet Challenge workshop. The official scorer provided by the ActivityNet Challenge is currently the de facto standard evaluation framework for DVC systems. It computes averaged METEOR scores for matched pairs of generated and reference captions whose Intersection over Union (IoU) exceeds a specific threshold value. However, the current framework does not take into account the story of the video or the ordering of captions. It also tends to give high scores to systems that generate several hundred redundant captions that humans cannot read. This paper proposes a new evaluation framework, Story Oriented Dense video cAptioning evaluation framework (SODA), for measuring the performance of video story description systems. SODA first tries to find a temporally optimal matching between generated and reference captions to capture the story of a video. Then, it computes METEOR scores for the matching and derives F-measure scores from the METEOR scores to penalize redundant captions. To demonstrate that SODA gives low scores to captions that are inadequate in terms of video story description, we evaluate two state-of-the-art systems with it, varying the number of captions. The results show that SODA gives low scores when there are too many or too few captions and high scores when the number of captions equals that of the reference, while the current framework gives good scores in all cases. Furthermore, we show that SODA tends to give lower scores than the current evaluation framework when evaluating captions in an incorrect order.
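The two steps described in the abstract (an order-preserving matching, then an F-measure over the matched scores) can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions that the abstract does not state: each pair is scored as temporal IoU times a text similarity, precision and recall divide the total matched score by the number of generated and reference captions respectively, and the helper `token_overlap` is a hypothetical placeholder standing in for METEOR. All function names are illustrative.

```python
from typing import Callable, List, Tuple

# (start_sec, end_sec, sentence); an illustrative representation, not the paper's.
Caption = Tuple[float, float, str]


def temporal_iou(a: Caption, b: Caption) -> float:
    """Intersection over Union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def token_overlap(x: str, y: str) -> float:
    """Hypothetical placeholder similarity (token Jaccard); NOT METEOR."""
    xs, ys = set(x.lower().split()), set(y.lower().split())
    return len(xs & ys) / len(xs | ys) if xs | ys else 0.0


def story_f_measure(
    generated: List[Caption],
    references: List[Caption],
    sim: Callable[[str, str], float] = token_overlap,
) -> float:
    """F-measure over an order-preserving, one-to-one caption matching."""
    gen = sorted(generated, key=lambda c: c[0])
    ref = sorted(references, key=lambda c: c[0])
    if not gen or not ref:
        return 0.0

    # best[i][j]: highest total score of a matching between the first i
    # generated and first j reference captions that keeps temporal order.
    best = [[0.0] * (len(ref) + 1) for _ in range(len(gen) + 1)]
    for i, g in enumerate(gen, start=1):
        for j, r in enumerate(ref, start=1):
            # Assumed pair score: IoU weighted by text similarity.
            pair = temporal_iou(g, r) * sim(g[2], r[2])
            best[i][j] = max(
                best[i - 1][j],             # leave generated caption i unmatched
                best[i][j - 1],             # leave reference caption j unmatched
                best[i - 1][j - 1] + pair,  # match caption i with reference j
            )

    total = best[len(gen)][len(ref)]
    precision = total / len(gen)  # falls as redundant captions are added
    recall = total / len(ref)     # falls when reference events are missed
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


refs = [(0.0, 10.0, "a man opens a door"), (10.0, 20.0, "he walks into the room")]
redundant = refs * 5  # ten generated captions for two reference events
print(story_f_measure(refs, refs), story_f_measure(redundant, refs))
```

Because the matching is one-to-one, extra copies of a caption add nothing to the total score while increasing the precision denominator, which is the penalty on redundancy that the abstract describes. Swapping the sentences between the two time spans leaves nothing to match in order, so the score drops as well.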


Cite


Fujita, S., Hirao, T., Kamigaito, H., Okumura, M., & Nagata, M. (2020). SODA: Story Oriented Dense Video Captioning Evaluation Framework. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 12351 LNCS, pp. 517–531). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-58539-6_31
