Image captioning is widely used in many real-world applications. The task involves understanding an image using computer vision methods and then producing a description of it using natural language processing methods. Different approaches have been proposed to solve this task, and deep learning attention-based models have proven to be the state of the art. This paper presents a survey of attention-based models for image captioning, including new categories that were not covered in other survey papers. The attention-based approaches are classified into four main categories, which are further divided into subcategories. All categories and subcategories are discussed in detail. Furthermore, the state-of-the-art approaches are compared and their accuracy improvements are stated, especially for transformer-based models, and a summary of the benchmark datasets and the main performance metrics is presented.
CITATION STYLE
Osman, A. A. E., Shalaby, M. A. W., Soliman, M. M., & Elsayed, K. M. (2023). A Survey on Attention-Based Models for Image Captioning. International Journal of Advanced Computer Science and Applications, 14(2), 403–412. https://doi.org/10.14569/IJACSA.2023.0140249