The neural encoder-decoder framework is widely adopted for captioning natural images, but few works have applied it to generating captions for cultural images. In this paper, we propose an artwork-type-enriched image captioning model in which the encoder represents an input artwork image as a 512-dimensional vector and the decoder generates a corresponding caption from that vector. The artwork type is first predicted by a convolutional neural network classifier and then merged into the decoder. We investigate multiple approaches to integrating the artwork type into the captioning model, including one that applies a step-wise weighted sum of the artwork-type vector and the decoder's hidden representation. This model outperforms three baseline image captioning models on a Chinese art image captioning dataset across all evaluation metrics; one of the baselines is a state-of-the-art approach that fuses textual image attributes into a captioning model for natural images. The proposed model also obtains promising results on a second dataset of Egyptian art image captions.
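The step-wise fusion mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' exact formulation: the function name `fuse_type`, the scalar mixing weight `alpha`, and the assumption that the artwork-type vector is projected to the same 512 dimensions as the decoder hidden state are all hypothetical choices made here for clarity.

```python
# Hypothetical sketch of a step-wise weighted sum between the decoder's
# hidden state and a predicted artwork-type embedding, applied at each
# decoding step before the word-prediction layer. The 512-dimensional
# size follows the abstract; alpha and all names are assumptions.
import numpy as np

def fuse_type(h_t: np.ndarray, type_vec: np.ndarray, alpha: float = 0.7) -> np.ndarray:
    """Blend decoder hidden state h_t with the artwork-type vector."""
    assert h_t.shape == type_vec.shape, "vectors must share dimensionality"
    return alpha * h_t + (1.0 - alpha) * type_vec

rng = np.random.default_rng(0)
h = rng.standard_normal(512)       # decoder hidden state at one time step
a = rng.standard_normal(512)       # artwork-type embedding (assumed 512-d)
fused = fuse_type(h, a)            # vector fed onward to predict the next word
```

In a trained model the weight would typically be learned rather than fixed; a fixed `alpha` is used here only to keep the sketch self-contained.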
CITATION STYLE
Sheng, S., & Moens, M. F. (2019). Generating captions for images of ancient artworks. In MM 2019 - Proceedings of the 27th ACM International Conference on Multimedia (pp. 2478–2486). Association for Computing Machinery, Inc. https://doi.org/10.1145/3343031.3350972