Image Caption Generation Using Contextual Information Fusion With Bi-LSTM-s

22 Citations · 22 Mendeley Readers

This article is free to access.

Abstract

The image caption generation task requires expressing image content in accurate natural language. In the existing encoder-decoder structure, the decoder generates words one by one in a front-to-back order and therefore cannot exploit the full context of the sentence. This paper employs a Bi-LSTM (Bi-directional Long Short-Term Memory) structure, which draws on both past and subsequent information, so that the prediction of image content is conditioned on contextual clues. The visual information is fed separately into the F-LSTM decoder (forward LSTM decoder) and the B-LSTM decoder (backward LSTM decoder) to extract semantic information, and the two semantic outputs complement each other. Specifically, a subsidiary attention mechanism, S-Att, acts between the F-LSTM and the B-LSTM: it extracts the semantic information of both decoders, aligns their hidden states, and measures their interaction by similarity, and the fused semantic information is then output. The resulting Bi-LSTM-s model extracts contextual information and achieves finer-grained image captioning. Our model improves by 9.7% over the original LSTM baseline, effectively resolves the inconsistency between the semantic information produced in the forward and backward decoding directions, and reaches a score of 37.5 on BLEU-4. The superiority of this approach is demonstrated experimentally on the MSCOCO dataset.
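To make the decoding scheme described in the abstract concrete, the following is a minimal sketch (in PyTorch, not the authors' released code) of a forward decoder and a backward decoder whose hidden states are aligned by a similarity-based attention step and fused before word prediction. All class, parameter, and dimension names here (BiLSTMCaptionDecoder, visual_dim, the dot-product S-Att-style scores, and so on) are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch only: a forward LSTM decoder and a backward LSTM decoder
# over the same caption, with a similarity-based attention step (in the spirit
# of the paper's S-Att) that aligns and fuses their hidden states before word
# prediction. Names and dimensions are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiLSTMCaptionDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, visual_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Project CNN image features to the initial hidden/cell states of both decoders.
        self.init_h = nn.Linear(visual_dim, hidden_dim)
        self.init_c = nn.Linear(visual_dim, hidden_dim)
        self.f_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # forward decoder
        self.b_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # backward decoder
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)               # fuse aligned states
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, captions):
        # image_feats: (B, visual_dim); captions: (B, T) token ids
        h0 = self.init_h(image_feats).unsqueeze(0)
        c0 = self.init_c(image_feats).unsqueeze(0)
        emb = self.embed(captions)                        # (B, T, E)
        h_f, _ = self.f_lstm(emb, (h0, c0))               # forward pass over the caption
        h_b_rev, _ = self.b_lstm(emb.flip(1), (h0, c0))   # backward pass (reversed caption)
        h_b = h_b_rev.flip(1)                             # re-align backward states in time

        # Similarity-based alignment: each forward state attends over all backward states.
        scores = torch.bmm(h_f, h_b.transpose(1, 2))      # (B, T, T) dot-product similarity
        attn = F.softmax(scores / h_f.size(-1) ** 0.5, dim=-1)
        h_b_aligned = torch.bmm(attn, h_b)                # backward context for each step

        fused = torch.tanh(self.fuse(torch.cat([h_f, h_b_aligned], dim=-1)))
        return self.out(fused)                            # (B, T, vocab_size) word logits


if __name__ == "__main__":
    model = BiLSTMCaptionDecoder(vocab_size=1000)
    feats = torch.randn(2, 2048)             # stand-in for CNN image features
    caps = torch.randint(0, 1000, (2, 12))   # stand-in for tokenized captions
    print(model(feats, caps).shape)          # torch.Size([2, 12, 1000])
```

A full captioning model would additionally need image-feature extraction, attention over spatial features, and an inference-time decoding strategy; the sketch only illustrates how forward and backward semantic states can be aligned and fused.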

Citation (APA)

Zhang, H., Ma, C., Jiang, Z., & Lian, J. (2023). Image Caption Generation Using Contextual Information Fusion With Bi-LSTM-s. IEEE Access, 11, 134–143. https://doi.org/10.1109/ACCESS.2022.3232508

Readers' Seniority

Professor / Associate Prof.: 1 (50%)
PhD / Postgrad / Masters / Doc: 1 (50%)

Readers' Discipline

Computer Science: 3 (100%)

Article Metrics

News Mentions: 1
