Deep Learning Approaches Based on Transformer Architectures for Image Captioning Tasks



Abstract

This paper focuses on visual attention, a state-of-the-art approach for image captioning tasks within the computer vision research area. We study the impact of different hyperparameter configurations on the efficiency of an encoder-decoder visual attention architecture. Results show that the correct selection of both the cost function and the gradient-based optimizer can significantly impact the captioning results. Our system considers the cross-entropy, Kullback-Leibler divergence, mean squared error, and negative log-likelihood loss functions, as well as the adaptive momentum (Adam), AdamW, RMSprop, stochastic gradient descent, and Adadelta optimizers. Experimentation shows that the combination of cross-entropy with Adam is the best alternative, returning a Top-5 accuracy of 73.092 and a BLEU-4 score of 20.10. Furthermore, a comparative analysis of alternative convolutional architectures evaluated their performance as encoders. Our results show that ResNeXt-101 stands out with a Top-5 accuracy of 73.128 and a BLEU-4 of 19.80, positioning itself as the best option when optimum captioning quality is the goal. However, MobileNetV3 proved to be a much more compact alternative, with 2,971,952 parameters and 0.23 Giga fixed-point Multiply-Accumulate operations per Second (GMACS). Consequently, MobileNetV3 offers competitive output quality, with BLEU-4 and Top-5 accuracy values of 19.50 and 72.928 respectively, at a much lower computational cost. Finally, when testing vision transformer (ViT) and data-efficient image transformer (DeiT) models as replacements for the convolutional component of the architecture, DeiT improved over ViT, obtaining a BLEU-4 value of 34.44.
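To make the reported comparisons concrete, below is a minimal PyTorch sketch of how the encoder of such a captioning architecture could be swapped between the backbones the paper evaluates, wired to the best-performing loss/optimizer pairing (cross-entropy with Adam). This is an illustration under stated assumptions, not the authors' implementation: the helper name build_encoder, the learning rate, and the pad-token index are hypothetical, and loading DeiT through the timm library is an assumption rather than the paper's code.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_encoder(name: str = "resnext101") -> nn.Module:
    """Return a backbone stripped of its classification head so it yields
    features for an attention-based captioning decoder.
    (Hypothetical helper; names and defaults are illustrative.)"""
    if name == "resnext101":
        # Best captioning quality in the paper (Top-5 acc. 73.128, BLEU-4 19.80).
        backbone = models.resnext101_32x8d(weights="IMAGENET1K_V1")
        # Drop the average pool and fully connected layer; keep spatial maps.
        return nn.Sequential(*list(backbone.children())[:-2])
    if name == "mobilenet_v3":
        # Compact alternative (~3M parameters, 0.23 GMACS in the paper).
        backbone = models.mobilenet_v3_large(weights="IMAGENET1K_V1")
        return backbone.features
    if name == "deit":
        # Transformer encoder; the paper reports DeiT outperforming ViT.
        import timm
        return timm.create_model("deit_base_patch16_224", pretrained=True,
                                 num_classes=0)  # num_classes=0 drops the head
    raise ValueError(f"unknown encoder: {name}")

encoder = build_encoder("mobilenet_v3")
feats = encoder(torch.randn(1, 3, 224, 224))  # (1, 960, 7, 7) feature map

# Best-performing training configuration reported: cross-entropy with Adam.
criterion = nn.CrossEntropyLoss(ignore_index=0)              # assumes <pad> id 0
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)  # lr is assumed
```

In a full system the attention-based decoder (omitted here) would consume these feature maps and its parameters would join the optimizer; swapping the encoder mainly changes the feature dimensionality the attention module must project.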

Citation (APA)

Castro, R., Pineda, I., Lim, W., & Morocho-Cayamcela, M. E. (2022). Deep Learning Approaches Based on Transformer Architectures for Image Captioning Tasks. IEEE Access, 10, 33679–33694. https://doi.org/10.1109/ACCESS.2022.3161428
