Image Caption Generation Using Contextual Information Fusion With Bi-LSTM-s

22 Citations · 22 Mendeley Readers

This article is free to access.

Abstract

The image caption generation task requires expressing image content in accurate natural language. In the existing encoder-decoder structure, the decoder generates words one by one in a front-to-back order and therefore cannot exploit the full context of the sentence. This paper employs a Bi-LSTM (Bi-directional Long Short-Term Memory) structure, which draws on both past and subsequent information, so that the prediction of image content is conditioned on contextual clues. The visual information is fed separately into the F-LSTM decoder (forward LSTM decoder) and the B-LSTM decoder (backward LSTM decoder) to extract semantic information, and the two semantic outputs complement each other. Specifically, a subsidiary attention mechanism, S-Att, acts between the F-LSTM and the B-LSTM: it extracts the semantic information of both decoders, aligns their hidden states, and measures their interaction by similarity, and the fused semantic information is then output. The resulting Bi-LSTM-s model extracts contextual information and achieves finer-grained image captioning. Our model improves by 9.7% over the original LSTM baseline, effectively resolves the inconsistency between the semantic information produced in the forward and backward decoding directions, and reaches a score of 37.5 on BLEU-4. The superiority of this approach is demonstrated experimentally on the MSCOCO dataset.
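To make the decoding scheme described in the abstract concrete, the following is a minimal sketch (in PyTorch, not the authors' released code) of a forward decoder and a backward decoder whose hidden states are aligned by a similarity-based attention step and fused before word prediction. All class, parameter, and dimension names here (BiLSTMCaptionDecoder, visual_dim, the dot-product S-Att-style scores, and so on) are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch only: a forward LSTM decoder and a backward LSTM decoder
# over the same caption, with a similarity-based attention step (in the spirit
# of the paper's S-Att) that aligns and fuses their hidden states before word
# prediction. Names and dimensions are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiLSTMCaptionDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, visual_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Project CNN image features to the initial hidden/cell states of both decoders.
        self.init_h = nn.Linear(visual_dim, hidden_dim)
        self.init_c = nn.Linear(visual_dim, hidden_dim)
        self.f_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # forward decoder
        self.b_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # backward decoder
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)               # fuse aligned states
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, captions):
        # image_feats: (B, visual_dim); captions: (B, T) token ids
        h0 = self.init_h(image_feats).unsqueeze(0)
        c0 = self.init_c(image_feats).unsqueeze(0)
        emb = self.embed(captions)                        # (B, T, E)
        h_f, _ = self.f_lstm(emb, (h0, c0))               # forward pass over the caption
        h_b_rev, _ = self.b_lstm(emb.flip(1), (h0, c0))   # backward pass (reversed caption)
        h_b = h_b_rev.flip(1)                             # re-align backward states in time

        # Similarity-based alignment: each forward state attends over all backward states.
        scores = torch.bmm(h_f, h_b.transpose(1, 2))      # (B, T, T) dot-product similarity
        attn = F.softmax(scores / h_f.size(-1) ** 0.5, dim=-1)
        h_b_aligned = torch.bmm(attn, h_b)                # backward context for each step

        fused = torch.tanh(self.fuse(torch.cat([h_f, h_b_aligned], dim=-1)))
        return self.out(fused)                            # (B, T, vocab_size) word logits


if __name__ == "__main__":
    model = BiLSTMCaptionDecoder(vocab_size=1000)
    feats = torch.randn(2, 2048)             # stand-in for CNN image features
    caps = torch.randint(0, 1000, (2, 12))   # stand-in for tokenized captions
    print(model(feats, caps).shape)          # torch.Size([2, 12, 1000])
```

A full captioning model would additionally need image-feature extraction, attention over spatial features, and an inference-time decoding strategy; the sketch only illustrates how forward and backward semantic states can be aligned and fused.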

Citation (APA)

Zhang, H., Ma, C., Jiang, Z., & Lian, J. (2023). Image Caption Generation Using Contextual Information Fusion With Bi-LSTM-s. IEEE Access, 11, 134–143. https://doi.org/10.1109/ACCESS.2022.3232508

Readers' Seniority

Professor / Associate Prof.: 1 (50%)
PhD / Postgrad / Masters / Doc: 1 (50%)

Readers' Discipline

Computer Science: 3 (100%)

Article Metrics

News Mentions: 1
