What convnets make for image captioning?

Abstract

Nowadays, a general pipeline for the image captioning task takes advantage of image representations based on convolutional neural networks (CNNs) and sequence modeling based on recurrent neural networks (RNNs). As captioning performance closely depends on the discriminative capacity of CNNs, our work aims to investigate the effects of different Convnets (CNN models) on image captioning. We train three Convnets based on different classification tasks: single-label, multi-label and multi-attribute, and then feed visual representations from these Convnets into a Long Short-Term Memory (LSTM) to model the sequence of words. Since the three Convnets focus on different visual contents in one image, we propose aggregating them together to generate a richer visual representation. Furthermore, during testing, we use an efficient multi-scale augmentation approach based on fully convolutional networks (FCNs). Extensive experiments on the MS COCO dataset provide significant insights into the effects of Convnets. Finally, we achieve comparable results to the state-of-the-art for both caption generation and image-sentence retrieval tasks.
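
The abstract describes a two-stage pipeline: extract visual representations from several Convnets, aggregate them into a richer representation, and feed the result into an LSTM that generates the caption word by word. The sketch below illustrates one plausible reading of that design in PyTorch; the class name, the 2048-dimensional features, and the concatenate-then-project fusion are illustrative assumptions, not the authors' exact architecture.

import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Fuse features from three Convnets and decode a caption with an LSTM.
    Names and dimensions are assumptions for illustration only."""

    def __init__(self, feat_dims, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        # Project each Convnet's feature vector into a common embedding space.
        self.proj = nn.ModuleList([nn.Linear(d, embed_dim) for d in feat_dims])
        # Aggregate by concatenation followed by a linear fusion layer
        # (one simple way to combine the three representations).
        self.fuse = nn.Linear(embed_dim * len(feat_dims), embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):
        # feats: list of (batch, feat_dims[i]) tensors from the single-label,
        # multi-label and multi-attribute Convnets; captions: (batch, seq_len).
        fused = self.fuse(torch.cat([p(f) for p, f in zip(self.proj, feats)], dim=1))
        # Feed the fused visual representation as the first LSTM input,
        # followed by the embedded ground-truth words (teacher forcing).
        inputs = torch.cat([fused.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)  # (batch, seq_len + 1, vocab_size)

# Toy usage: three 2048-d feature vectors for a batch of 4 images.
decoder = CaptionDecoder([2048, 2048, 2048])
feats = [torch.randn(4, 2048) for _ in range(3)]
captions = torch.randint(0, 10000, (4, 12))
logits = decoder(feats, captions)  # torch.Size([4, 13, 10000])

Concatenation is only one aggregation choice; element-wise sums or learned gating would also fit the description, and at test time the per-Convnet features could be averaged over the multi-scale FCN inputs mentioned in the abstract before fusion.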

Cite

APA

Liu, Y., Guo, Y., & Lew, M. S. (2017). What convnets make for image captioning? In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 10132 LNCS, pp. 416–428). Springer Verlag. https://doi.org/10.1007/978-3-319-51811-4_34
