Stack-captioning: Coarse-to-fine learning for image captioning

134 citations · 160 Mendeley readers

Abstract

Existing image captioning approaches typically train a one-stage sentence decoder, which makes it difficult to generate rich, fine-grained descriptions. On the other hand, multi-stage image captioning models are hard to train due to the vanishing gradient problem. In this paper, we propose a coarse-to-fine multi-stage prediction framework for image captioning, composed of multiple decoders, each of which operates on the output of the previous stage to produce an increasingly refined image description. Our learning approach addresses the difficulty of vanishing gradients during training by providing a learning objective that enforces intermediate supervision. In particular, we optimize the model with a reinforcement learning approach that uses the output of each intermediate decoder's test-time inference algorithm, as well as the output of its preceding decoder, to normalize the rewards; this simultaneously addresses the well-known exposure bias and loss-evaluation mismatch problems. We extensively evaluate the proposed approach on MSCOCO and show that it achieves state-of-the-art performance.
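The stage-wise reward normalization described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the averaging baseline, and the toy reward values are assumptions; the abstract only states that each stage's reward is normalized using the output of that stage's test-time (greedy) inference and the output of the preceding decoder.

```python
# Hedged sketch: a plausible per-stage advantage for the coarse-to-fine
# RL objective. Rewards (e.g. CIDEr scores of captions) are toy numbers.

def stage_advantage(r_sample, r_greedy, r_prev_greedy):
    """Advantage for one decoding stage: the sampled caption's reward,
    baselined by this stage's own greedy-decoding reward and the
    preceding stage's greedy-decoding reward (assumed combination)."""
    baseline = 0.5 * (r_greedy + r_prev_greedy)
    return r_sample - baseline

# Toy coarse-to-fine run over three stacked decoders: each stage's
# greedy reward serves as the "preceding decoder" baseline for the next.
sample_rewards = [0.6, 0.8, 0.9]   # rewards of sampled captions per stage
greedy_rewards = [0.5, 0.7, 0.85]  # rewards of greedy captions per stage

advantages = []
prev_greedy = greedy_rewards[0]  # coarsest stage baselines against itself
for r_s, r_g in zip(sample_rewards, greedy_rewards):
    advantages.append(stage_advantage(r_s, r_g, prev_greedy))
    prev_greedy = r_g
```

In a REINFORCE-style update each advantage would then weight the log-probability of the corresponding sampled caption; a positive advantage means the stage's sample beat its baseline.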

Citation (APA)
Gu, J., Cai, J., Wang, G., & Chen, T. (2018). Stack-captioning: Coarse-to-fine learning for image captioning. In 32nd AAAI Conference on Artificial Intelligence, AAAI 2018 (pp. 6837–6844). AAAI press. https://doi.org/10.1609/aaai.v32i1.12266
