CaptionNet: Automatic End-to-End Siamese Difference Captioning Model with Attention

This article is free to access.

Abstract

Deep learning techniques have been studied intensively for captioning tasks, making it possible to understand and describe both simple and complex images in text. Building on this work, this paper proposes a multimodal end-to-end siamese difference captioning model (SDCM) that automatically generates a natural language description of the differences between the two images in a pair. The proposed supervised model combines several deep learning techniques to explore the practicability of capturing, aligning, and computing the disparities between two sets of image features in order to produce a corresponding language model probability distribution. First, a deep siamese convolutional neural network extracts the feature vector discrepancies of an image pair. An attention mechanism then detects the salient regions of the feature vector, which allows a bidirectional long short-term memory decoder to generate a matching, semantically associated textual sequence. The model is evaluated on the Spot-the-Diff baseline dataset, which consists of image pairs and their corresponding captions. The results indicate that the proposed model performs highly competitively against the state of the art.
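
The first two stages described in the abstract (a shared siamese encoder producing feature discrepancies, followed by attention over salient regions) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The encoder here is a single hypothetical linear projection standing in for the deep CNN, the attention is plain dot-product scoring (the abstract does not specify the scoring function), and `query` is an assumed stand-in for a decoder hidden state. The bidirectional LSTM decoder is omitted entirely.

```python
import math

def encode(regions, weights):
    """Shared (siamese) encoder: the same weights project every region of either image.
    Assumption: one linear layer stands in for the deep CNN."""
    return [[sum(w * x for w, x in zip(row, region)) for row in weights]
            for region in regions]

def feature_difference(feats_a, feats_b):
    """Element-wise discrepancy between the two images' region features."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(feats_a, feats_b)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(diff_feats, query):
    """Dot-product attention (an assumed scoring function): weight each region
    by its similarity to the query, then return the weighted-sum context vector."""
    scores = [sum(q * d for q, d in zip(query, region)) for region in diff_feats]
    weights = softmax(scores)
    dim = len(diff_feats[0])
    context = [sum(w * region[i] for w, region in zip(weights, diff_feats))
               for i in range(dim)]
    return weights, context

# Toy data (hypothetical): three regions per image, two features per region.
# Only the third region differs between the two images.
W = [[1.0, 0.0], [0.0, 1.0]]  # identity projection, shared by both branches
image_a = [[0.2, 0.1], [0.5, 0.5], [0.9, 0.3]]
image_b = [[0.2, 0.1], [0.5, 0.5], [0.1, 0.3]]

diff = feature_difference(encode(image_a, W), encode(image_b, W))
attn, context = attend(diff, query=[1.0, 0.0])
```

In this toy setup the attention weights sum to one and the largest weight falls on the third region, the only one that changed. In a full model the context vector would be passed to the decoder at each generation step.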

Cite

Oluwasanmi, A., Aftab, M. U., Alabdulkreem, E., Kumeda, B., Baagyere, E. Y., & Qin, Z. (2019). CaptionNet: Automatic End-to-End Siamese Difference Captioning Model with Attention. IEEE Access, 7, 106773–106783. https://doi.org/10.1109/ACCESS.2019.2931223
