Bridging by word: Image-grounded vocabulary construction for visual captioning


Abstract

Existing research on visual captioning usually employs a CNN-RNN architecture that combines a CNN for image encoding with an RNN for caption generation, where the vocabulary used as the decoding space is constructed from the entire training dataset. Such approaches typically suffer from generating N-grams that occur frequently in the training set but are irrelevant to the given image. To tackle this problem, we propose to construct an image-grounded vocabulary that leverages image semantics for more effective caption generation. More concretely, a two-step approach is proposed to construct the vocabulary by incorporating both visual information and relationships among words. Two strategies are then explored to utilize the constructed vocabulary for caption generation: one constrains the generator to select words only from the image-grounded vocabulary, and the other integrates the vocabulary information into the RNN cell during caption generation. Experimental results on two public datasets show the effectiveness of our framework compared to state-of-the-art models. Our code is available on GitHub.
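To illustrate the first strategy described in the abstract, constrained decoding over an image-grounded vocabulary can be implemented as a mask over the decoder's next-word distribution. The sketch below is a minimal, hypothetical PyTorch rendering of that idea, not the paper's actual implementation; the function names (build_vocab_mask, constrained_step) and vocabulary sizes are illustrative assumptions.

```python
# Hypothetical sketch of vocabulary-constrained decoding: the decoder may
# only emit words in the image-grounded vocabulary for the current image.
import torch

def build_vocab_mask(grounded_ids, vocab_size):
    # Boolean mask over the full vocabulary: True for words selected
    # by the (assumed) two-step image-grounded vocabulary construction.
    mask = torch.zeros(vocab_size, dtype=torch.bool)
    mask[grounded_ids] = True
    return mask

def constrained_step(logits, vocab_mask):
    # Restrict the next-word distribution by setting the logits of all
    # out-of-vocabulary words to -inf before the softmax.
    logits = logits.masked_fill(~vocab_mask, float("-inf"))
    return torch.softmax(logits, dim=-1)

# Usage: suppose the full vocabulary has 10,000 words and the grounded
# vocabulary for this image contains 500 of them (both sizes are made up).
vocab_size = 10_000
grounded_ids = torch.randint(0, vocab_size, (500,))
mask = build_vocab_mask(grounded_ids, vocab_size)
probs = constrained_step(torch.randn(vocab_size), mask)
next_word = probs.argmax().item()  # always inside the grounded vocabulary
```

Masking logits rather than filtering the vocabulary up front keeps the decoder's output layer unchanged, so the same trained model can be run with or without the constraint at inference time.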

Citation (APA)

Fan, Z., Wei, Z., Wang, S., & Huang, X. (2019). Bridging by word: Image-grounded vocabulary construction for visual captioning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019) (pp. 6514–6524). Association for Computational Linguistics. https://doi.org/10.18653/v1/p19-1652
