A hierarchical approach to vision-based language generation: from simple sentences to complex natural language

2 citations · 62 Mendeley readers

Abstract

Automatically describing videos in natural language is an ambitious problem that could bridge our understanding of vision and language. We propose a hierarchical approach: first generating video descriptions as sequences of simple sentences, followed at the next level by a more complex and fluent description in natural language. While the simple sentences describe simple actions in the form of (subject, verb, object) triples, the second-level paragraph descriptions, indirectly using information from the first-level description, present the visual content in a more compact, coherent, and semantically rich manner. To this end, we introduce the first video dataset in the literature annotated with captions at two levels of linguistic complexity. We perform extensive tests demonstrating that our hierarchical linguistic representation, from simple to complex language, allows us to train a two-stage network that generates significantly more complex paragraphs than current one-stage approaches.

Citation (APA)

Bogolin, S. V., Croitoru, I., & Leordeanu, M. (2020). A hierarchical approach to vision-based language generation: from simple sentences to complex natural language. In COLING 2020 - 28th International Conference on Computational Linguistics, Proceedings of the Conference (pp. 2436–2447). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2020.coling-main.220
