Image captioning is a challenging task at the intersection of computer vision and natural language processing. Visual attention mechanisms have been used extensively in recent work; however, they take little account of the correlations among different image regions or of how attention should be distributed over them. This paper addresses these deficiencies and proposes a novel captioning model that extracts salient region correlations from image features, synthesizes the context of intra-image regions, and automatically distributes appropriate attention over the regions. The proposed Intra-Image Region Context (IIRC) model jointly learns the semantic correlations among regions within one image. It consists of two main parts. The first extracts feature vectors from the image through a convolutional neural network (CNN) and derives correlations among regions from these vectors with a recurrent neural network (RNN). The second generates the caption from the synthesized region contexts of the first network, with attention over the different region contexts. The model and a baseline are evaluated on the MSCOCO test server. The experimental results show that the model outperforms many strong models on the BLEU, METEOR, ROUGE-L, and CIDEr metrics. Moreover, the model excels at describing details, especially those related to position and action.
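The two-part pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration assuming PyTorch, with hypothetical layer names and sizes; a generic GRU encoder/decoder and a simple additive attention stand in for the paper's exact architecture, which the abstract does not specify.

```python
import torch
import torch.nn as nn

class IIRCCaptioner(nn.Module):
    """Illustrative sketch of the two-part pipeline the abstract describes.

    Part 1: an RNN run over CNN region features captures correlations
    among regions ("region contexts").
    Part 2: an attention-based decoder distributes attention over the
    region contexts while generating the caption word by word.
    All layer sizes and names here are hypothetical, not from the paper.
    """

    def __init__(self, feat_dim=2048, ctx_dim=512, embed_dim=512, vocab_size=10000):
        super().__init__()
        # Part 1: region-context encoder (GRU over precomputed CNN region features).
        self.region_rnn = nn.GRU(feat_dim, ctx_dim, batch_first=True)
        # Part 2: attention over region contexts + caption decoder.
        self.attn = nn.Linear(ctx_dim + embed_dim, 1)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.GRUCell(embed_dim + ctx_dim, embed_dim)
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, region_feats, captions):
        # region_feats: (B, R, feat_dim) CNN features for R image regions.
        # captions: (B, T) ground-truth token ids (teacher forcing).
        contexts, _ = self.region_rnn(region_feats)            # (B, R, ctx_dim)
        B, T = captions.shape
        h = region_feats.new_zeros(B, self.decoder.hidden_size)
        logits = []
        for t in range(T):
            w = self.embed(captions[:, t])                     # (B, embed_dim)
            # Score each region context against the current decoder state.
            expanded = h.unsqueeze(1).expand(-1, contexts.size(1), -1)
            scores = self.attn(torch.cat([contexts, expanded], dim=-1)).squeeze(-1)
            alpha = torch.softmax(scores, dim=-1)              # attention weights (B, R)
            ctx = (alpha.unsqueeze(-1) * contexts).sum(dim=1)  # attended context (B, ctx_dim)
            h = self.decoder(torch.cat([w, ctx], dim=-1), h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                      # (B, T, vocab_size)

# Usage with random stand-in data: 2 images, 36 regions each, 15-token captions.
model = IIRCCaptioner()
feats = torch.randn(2, 36, 2048)
caps = torch.randint(0, 10000, (2, 15))
print(model(feats, caps).shape)  # torch.Size([2, 15, 10000])
```

Running the RNN over the region features, rather than attending to raw CNN features directly, is what lets each region's context reflect the other regions in the image, which is the key difference from standard visual attention that the abstract highlights.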
Citation: Wang, S., Mo, H., Xu, Y., Wu, W., & Zhou, Z. (2018). Intra-image region context for image captioning. In Lecture Notes in Computer Science (Vol. 11166, pp. 212–222). Springer. https://doi.org/10.1007/978-3-030-00764-5_20