Hyperparameter Tuning over an Attention Model for Image Captioning

Abstract

Considering the historical trajectory and evolution of image captioning as a research area, this paper focuses on visual attention as an approach to solving captioning tasks with computer vision. It studies the effect of different hyperparameter configurations on a state-of-the-art visual attention architecture composed of a pre-trained residual neural network (ResNet) encoder and a long short-term memory (LSTM) decoder. Results show that the selection of both the cost function and the gradient-based optimizer has a significant impact on the captioning results. Our study considers the cross-entropy, Kullback-Leibler divergence, mean squared error, and negative log-likelihood loss functions, as well as the adaptive moment estimation (Adam), AdamW, RMSprop, stochastic gradient descent (SGD), and Adadelta optimizers. Based on the performance metrics, the combination of cross-entropy with Adam is identified as the best alternative, yielding a Top-5 accuracy of 73.092 and a BLEU-4 score of 0.201. With cross-entropy fixed as the loss function, the first two optimizers, Adam and AdamW, perform best, both reaching a BLEU-4 score of 0.201; in terms of inference loss, Adam outperforms AdamW (3.413 versus 3.418), as it does in Top-5 accuracy (73.092 versus 72.989).
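
To make the search space concrete, the sketch below enumerates the loss/optimizer grid described in the abstract, in PyTorch. Only the four loss functions and five optimizers come from the paper; the ResNet variant, learning rates, and the hypothetical train_and_eval routine are assumptions introduced for illustration.

```python
# A minimal PyTorch sketch of the hyperparameter grid described in the
# abstract. The ResNet variant, learning rates, and training routine are
# assumptions; only the loss functions and optimizers come from the paper.
import itertools

import torch
import torch.nn as nn
import torchvision.models as models


class ResNetEncoder(nn.Module):
    """Pre-trained residual network with the classification head removed,
    returning a spatial feature map for an attention-based LSTM decoder."""

    def __init__(self):
        super().__init__()
        # Assumes torchvision >= 0.13 for the weights enum; ResNet-101 is a
        # guess at the variant, which the abstract does not specify.
        resnet = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, images):
        # (batch, 2048, H/32, W/32) features attended over by the decoder.
        return self.backbone(images)


# The four loss functions studied in the paper. Note that KLDivLoss and
# NLLLoss expect log-probabilities, so the decoder output must go through
# log_softmax before these two losses are applied.
LOSSES = {
    "cross_entropy": nn.CrossEntropyLoss(),
    "kl_divergence": nn.KLDivLoss(reduction="batchmean"),
    "mse": nn.MSELoss(),
    "nll": nn.NLLLoss(),
}

# The five optimizers studied in the paper; the learning rates and momentum
# below are placeholder defaults, not the paper's settings.
OPTIMIZERS = {
    "adam": lambda p: torch.optim.Adam(p, lr=1e-3),
    "adamw": lambda p: torch.optim.AdamW(p, lr=1e-3),
    "rmsprop": lambda p: torch.optim.RMSprop(p, lr=1e-3),
    "sgd": lambda p: torch.optim.SGD(p, lr=1e-2, momentum=0.9),
    "adadelta": lambda p: torch.optim.Adadelta(p),
}


def run_grid(model_params):
    """Enumerate all 20 loss/optimizer combinations; train_and_eval is a
    hypothetical training routine standing in for the paper's pipeline."""
    for loss_name, opt_name in itertools.product(LOSSES, OPTIMIZERS):
        criterion = LOSSES[loss_name]
        optimizer = OPTIMIZERS[opt_name](model_params)
        # train_and_eval(criterion, optimizer)  # placeholder
        print(f"evaluating {loss_name} + {opt_name}")
```

Each of the twenty combinations would then be trained and scored with BLEU-4, Top-5 accuracy, and inference loss, as the paper reports.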

Citation (APA)

Castro, R., Pineda, I., & Morocho-Cayamcela, M. E. (2021). Hyperparameter Tuning over an Attention Model for Image Captioning. In Communications in Computer and Information Science (Vol. 1456 CCIS, pp. 172–183). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-89941-7_13
