Deep Learning Approaches Based on Transformer Architectures for Image Captioning Tasks

30 citations · 79 Mendeley readers

This article is free to access.

Abstract

This paper focuses on visual attention, a state-of-the-art approach for image captioning tasks within the computer vision research area. We study the impact that different hyperparameter configurations have on an encoder-decoder visual attention architecture in terms of efficiency. Results show that the correct selection of both the cost function and the gradient-based optimizer can significantly impact the captioning results. Our system considers the cross-entropy, Kullback-Leibler divergence, mean squared error, and negative log-likelihood loss functions, together with the adaptive momentum (Adam), AdamW, RMSprop, stochastic gradient descent, and Adadelta optimizers. Experimentation shows that the combination of cross-entropy with Adam is the best alternative, returning a Top-5 accuracy of 73.092 and a BLEU-4 score of 20.10. Furthermore, a comparative analysis of alternative convolutional architectures evaluated their performance as encoders. Our results show that ResNeXt-101 stands out with a Top-5 accuracy of 73.128 and a BLEU-4 of 19.80, positioning itself as the best option when optimum captioning quality is the goal. However, MobileNetV3 proved to be a much more compact alternative, with 2,971,952 parameters and 0.23 giga fixed-point multiply-accumulate operations per second (GMACs). Consequently, MobileNetV3 offers competitive output quality at a substantially lower computational cost, supported by values of 19.50 and 72.928 for the BLEU-4 and Top-5 accuracy, respectively. Finally, when testing the vision transformer (ViT) and data-efficient image transformer (DeiT) models as replacements for the convolutional component of the architecture, DeiT improved over ViT, obtaining a BLEU-4 value of 34.44.
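As a rough illustration of the setup the abstract describes, the sketch below wires a torchvision backbone (ResNeXt-101 or MobileNetV3) to a toy decoder and runs one training step with the cross-entropy + Adam combination the authors found best. The ToyDecoder, its global pooling, the vocabulary size, and the learning rate are all placeholder assumptions; the paper's actual visual-attention decoder is not reproduced here. Swapping in ViT or DeiT as the encoder (e.g., via torchvision's vit_b_16 or the timm library) follows the same pattern.

```python
import torch
import torch.nn as nn
import torchvision.models as models

def build_encoder(name: str) -> tuple[nn.Module, int]:
    """Return a backbone truncated to its spatial feature maps, plus its channel count."""
    if name == "resnext101":  # best BLEU-4 / Top-5 encoder reported in the paper
        backbone = models.resnext101_32x8d(weights=None)
        return nn.Sequential(*list(backbone.children())[:-2]), 2048
    if name == "mobilenet_v3":  # compact alternative (~3M parameters)
        return models.mobilenet_v3_large(weights=None).features, 960
    raise ValueError(f"unknown encoder: {name}")

class ToyDecoder(nn.Module):
    """Hypothetical stand-in for the paper's attention decoder (not the authors' model)."""
    def __init__(self, feat_dim: int, vocab_size: int = 10_000, hidden: int = 512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        pooled = feats.flatten(2).mean(-1)  # global pooling in place of visual attention
        return self.head(torch.relu(self.proj(pooled)))

encoder, feat_dim = build_encoder("resnext101")
decoder = ToyDecoder(feat_dim)

# The winning combination reported in the abstract: cross-entropy loss + Adam.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-4)

images = torch.randn(2, 3, 224, 224)          # dummy image batch
targets = torch.randint(0, 10_000, (2,))      # dummy next-token targets

optimizer.zero_grad()
loss = criterion(decoder(encoder(images)), targets)
loss.backward()
optimizer.step()
print(f"loss: {loss.item():.4f}")
```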


Citation (APA)

Castro, R., Pineda, I., Lim, W., & Morocho-Cayamcela, M. E. (2022). Deep Learning Approaches Based on Transformer Architectures for Image Captioning Tasks. IEEE Access, 10, 33679–33694. https://doi.org/10.1109/ACCESS.2022.3161428


Readers' Seniority

PhD / Post grad / Masters / Doc: 7 (47%)
Lecturer / Post doc: 5 (33%)
Professor / Associate Prof.: 2 (13%)
Researcher: 1 (7%)

Readers' Discipline

Computer Science: 14 (64%)
Engineering: 6 (27%)
Biochemistry, Genetics and Molecular Biology: 1 (5%)
Mathematics: 1 (5%)
