Exploring visual relationship for image captioning

Abstract

It has long been believed that modeling the relationships between objects would help in representing and eventually describing an image. Nevertheless, there has been no evidence in support of this idea for image description generation. In this paper, we introduce a new design that explores the connections between objects for image captioning under the umbrella of an attention-based encoder-decoder framework. Specifically, we present a Graph Convolutional Networks plus Long Short-Term Memory (dubbed GCN-LSTM) architecture that integrates, for the first time, both semantic and spatial object relationships into the image encoder. Technically, we build graphs over the objects detected in an image based on their spatial and semantic connections. The representation of each object region proposal is then refined by leveraging the graph structure through GCN. With the learnt region-level features, our GCN-LSTM capitalizes on an LSTM-based captioning framework with an attention mechanism for sentence generation. Extensive experiments are conducted on the COCO image captioning dataset, and superior results are reported when compared to state-of-the-art approaches. More remarkably, GCN-LSTM increases CIDEr-D performance from 120.1% to 128.7% on the COCO testing set.
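As a rough illustration of the pipeline described in the abstract, the PyTorch-style sketch below shows how detector region features might be refined by a graph convolution over a relationship graph and then consumed by an attention LSTM decoder. All names, shapes, and layer sizes here (GraphConv, AttnLSTMDecoder, the 512-dimensional features, the adjacency construction) are illustrative assumptions, not the authors' implementation; the abstract describes both spatial and semantic graphs, and a single adjacency stands in for either one.

```python
# Minimal sketch of the GCN-LSTM idea: refine region features with a GCN
# over a relationship graph, then decode a caption with soft attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphConv(nn.Module):
    """One graph-convolution layer: H' = ReLU(A_norm @ H @ W)."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, regions, adj):
        # regions: (B, K, D) features of K detected object regions
        # adj:     (B, K, K) relationship graph (spatial or semantic edges)
        deg = adj.sum(-1, keepdim=True).clamp(min=1.0)   # simple row normalization
        return F.relu(torch.bmm(adj / deg, self.proj(regions)))

class AttnLSTMDecoder(nn.Module):
    """LSTM decoder with soft attention over the refined region features."""
    def __init__(self, vocab_size, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.cell = nn.LSTMCell(dim * 2, dim)
        self.att = nn.Linear(dim * 2, 1)
        self.out = nn.Linear(dim, vocab_size)

    def step(self, word, h, c, regions):
        # attention weights over the K regions, conditioned on the hidden state
        k = regions.size(1)
        query = h.unsqueeze(1).expand(-1, k, -1)
        alpha = F.softmax(self.att(torch.cat([regions, query], -1)).squeeze(-1), dim=-1)
        ctx = (alpha.unsqueeze(-1) * regions).sum(1)      # attended context vector
        h, c = self.cell(torch.cat([self.embed(word), ctx], -1), (h, c))
        return self.out(h), h, c                          # next-word logits and state

# Example usage with random stand-in tensors (batch of 2, 36 regions):
regions = torch.randn(2, 36, 512)
adj = (torch.rand(2, 36, 36) > 0.8).float()              # placeholder relation graph
refined = GraphConv()(regions, adj)                       # (2, 36, 512) refined features
```

In the full model a separate graph would be built for spatial and for semantic relations and the refined features fed to the attention decoder step by step during caption generation; that loop and the fusion of the two branches are omitted here for brevity.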

Citation (APA)

Yao, T., Pan, Y., Li, Y., & Mei, T. (2018). Exploring visual relationship for image captioning. In Lecture Notes in Computer Science (Vol. 11218, pp. 711–727). Springer. https://doi.org/10.1007/978-3-030-01264-9_42
