Existing research on visual captioning usually employs a CNN-RNN architecture that combines a CNN for image encoding with an RNN for caption generation, where the decoding space is a vocabulary built from the entire training set. Such approaches typically suffer from generating n-grams that occur frequently in the training set but are irrelevant to the given image. To tackle this problem, we propose to construct an image-grounded vocabulary that leverages image semantics for more effective caption generation. More concretely, we propose a two-step approach that builds the vocabulary by incorporating both visual information and relationships among words. We then explore two strategies for utilizing the constructed vocabulary during caption generation: one constrains the generator to select words only from the image-grounded vocabulary, and the other integrates the vocabulary information into the RNN cell. Experimental results on two public datasets show the effectiveness of our framework compared to state-of-the-art models. Our code is available on GitHub.
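The first strategy can be illustrated with a minimal sketch in PyTorch: at each decoding step, the decoder's logits over the full training vocabulary are masked so that only words admitted by the image-grounded vocabulary can be selected. The function and variable names below (e.g. `mask_logits_to_vocab`, `image_vocab_ids`) are hypothetical and the input format is assumed; this is not the authors' implementation, only a sketch of the constrained-decoding idea.

```python
import torch

def mask_logits_to_vocab(logits, image_vocab_ids):
    """Restrict decoding to an image-grounded vocabulary by masking logits.

    logits: (batch, vocab_size) raw decoder scores over the full vocabulary.
    image_vocab_ids: list of LongTensors, one per image, holding the word ids
        admitted by that image's grounded vocabulary (assumed input format).
    """
    mask = torch.full_like(logits, float('-inf'))
    for i, ids in enumerate(image_vocab_ids):
        mask[i, ids] = 0.0           # allowed words keep their original score
    return logits + mask             # disallowed words get -inf, i.e. zero probability after softmax

# Usage at one decoding step (decoder_step is a placeholder for the RNN decoder):
# logits = decoder_step(prev_word, hidden, image_features)   # (batch, V)
# masked = mask_logits_to_vocab(logits, image_vocab_ids)
# next_word = masked.argmax(dim=-1)
```

The second strategy described in the abstract, integrating vocabulary information into the RNN cell itself, would instead modify the cell's update rather than the output layer and is not shown here.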
Fan, Z., Wei, Z., Wang, S., & Huang, X. (2019). Bridging by word: Image-grounded vocabulary construction for visual captioning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019) (pp. 6514–6524). Association for Computational Linguistics. https://doi.org/10.18653/v1/p19-1652