Capturing image features and multi-object regions of an image and transforming them into a natural-language sentence is a research problem that must be addressed with natural language processing. Technically, an attention mechanism forces every word representation to attend to a corresponding image region; however, it can mishandle non-visual words such as 'the' in the description text, which misleads the text interpretation. Captioning an image involves not only detecting features from the image but also decoding the interactions between objects into a meaningful image sentence. The proposed work predicts the image sentence in greater detail for every region/frame of an image. To this end, image features are extracted using a CNN, and the image text is generated with an LSTM augmented with an adaptive attention mechanism, which is added to the LSTM layer to predict a better image sentence. The deep network has been analyzed in two output configurations, with and without adaptive attention. Experiments were implemented on the Flickr8k dataset. The analysis illustrates that adaptive attention performs significantly better than the image sentence model without adaptive attention and generates more meaningful captions than any of the individual models used. On the test images, the proposed network achieves an accuracy and BLEU score of 81.53 and 61.94% with adaptive attention in the LSTM, versus 73.53 and 57.94% without it.
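The core idea of adaptive attention, as the abstract describes it, is to let the decoder down-weight image regions when predicting non-visual words such as 'the'. A minimal NumPy sketch of one such attention step is shown below, assuming a visual-sentinel formulation (the matrices `W_v`, `W_h` and vector `w` are hypothetical learned parameters, not from the paper itself): the sentinel is scored alongside the region features, and its softmax weight `beta` tells the model how much to rely on non-visual (language-model) context instead of the image.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()

def adaptive_attention(V, h, s, W_v, W_h, w):
    """One adaptive-attention step (illustrative sketch).

    V   : (k, d) image region features from the CNN
    h   : (d,)   current LSTM hidden state
    s   : (d,)   visual sentinel (non-visual fallback) vector
    W_v : (d, a), W_h : (d, a), w : (a,)  assumed learned parameters

    Returns the adaptive context vector c and beta, the weight
    placed on the sentinel rather than on any image region.
    """
    # Score the k regions plus the sentinel as a (k+1)-th slot.
    feats = np.vstack([V, s])                    # (k+1, d)
    scores = np.tanh(feats @ W_v + h @ W_h) @ w  # (k+1,)
    alpha = softmax(scores)                      # attention weights
    beta = alpha[-1]                             # sentinel weight
    # Mix visual context with the sentinel by beta.
    c = alpha[:-1] @ V + beta * s                # (d,)
    return c, beta

# Illustrative usage with random features.
rng = np.random.default_rng(0)
k, d, a = 4, 8, 6
c, beta = adaptive_attention(
    rng.standard_normal((k, d)), rng.standard_normal(d),
    rng.standard_normal(d), rng.standard_normal((d, a)),
    rng.standard_normal((d, a)), rng.standard_normal(a))
```

For a word like 'the', training would push `beta` toward 1, so the context vector leans on the sentinel rather than forcing attention onto an arbitrary image region.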
CITATION STYLE
Vidhya, K. A., Krishnakumar, S., & Cynddia, B. (2023). Adaptive Multi-attention for Image Sentence Generator Using C-LSTM. In Lecture Notes in Networks and Systems (Vol. 448, pp. 579–592). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-981-19-1610-6_51