Abstract
Image captioning (IC), which brings vision to language, has drawn extensive attention. Precisely describing the visual relations between image objects is a key challenge in IC. We argue that visual relations, namely geometric positions (i.e., distance and size) and semantic interactions (i.e., actions and possessives), indicate the mutual correlations between objects. Existing Transformer-based methods typically resort to geometric positions to enhance the representation of visual relations, yet shallow geometric cues alone cannot precisely capture the complex, action-level correlations. In this paper, we propose to enhance the correlations between objects from a comprehensive view that jointly considers explicit semantic and geometric relations, generating plausible captions with accurate relationship predictions. Specifically, we propose a novel Enhanced-Adaptive Relation Self-Attention Network (ER-SAN). We design direction-sensitive semantic-enhanced attention, which attends from content objects to semantic relations and from semantic relations to content objects to learn explicit semantic-aware relations. Further, we devise an adaptive relation re-weighting module that determines how much semantic and geometric attention should be activated for each relation feature. Extensive experiments on the MS-COCO dataset demonstrate the effectiveness of ER-SAN, improving CIDEr from 128.6% to 135.3% and achieving state-of-the-art performance. Code will be released at https://github.com/CrossmodalGroup/ER-SAN.
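To make the described mechanism concrete, the following is a minimal sketch, not the authors' released implementation, of self-attention whose logits are biased by pairwise semantic-relation embeddings and geometric features, with an adaptive gate deciding how much each relation source contributes. All module names, tensor shapes, and the 4-d geometry encoding are illustrative assumptions.

```python
# Hypothetical sketch of relation-enhanced self-attention with an adaptive
# gate between semantic and geometric relation biases (not ER-SAN's code).
import torch
import torch.nn as nn

class RelationEnhancedAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_sem_relations=20):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        # Embeds a discrete semantic relation label (e.g., "riding") per object pair.
        self.sem_embed = nn.Embedding(n_sem_relations, n_heads)
        # Projects 4-d relative geometry (e.g., log distance/size ratios) to per-head biases.
        self.geo_proj = nn.Linear(4, n_heads)
        # Adaptive gate: how much semantic vs. geometric bias to activate per object.
        self.gate = nn.Linear(d_model, 2)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, sem_rel, geo_rel):
        # x: (B, N, d_model) object features
        # sem_rel: (B, N, N) semantic relation labels between object pairs
        # geo_rel: (B, N, N, 4) relative geometric features
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, N, self.n_heads, self.d_head).transpose(1, 2)

        logits = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5    # (B, H, N, N)
        sem_bias = self.sem_embed(sem_rel).permute(0, 3, 1, 2)     # (B, H, N, N)
        geo_bias = self.geo_proj(geo_rel).permute(0, 3, 1, 2)      # (B, H, N, N)

        # Per-query gate over the two relation sources, broadcast over heads and keys.
        g = torch.softmax(self.gate(x), dim=-1)                    # (B, N, 2)
        g_sem = g[..., 0].view(B, 1, N, 1)
        g_geo = g[..., 1].view(B, 1, N, 1)

        attn = torch.softmax(logits + g_sem * sem_bias + g_geo * geo_bias, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out(out)
```

For instance, `RelationEnhancedAttention()(x, sem_rel, geo_rel)` with `x` of shape `(2, 36, 512)`, integer `sem_rel` of shape `(2, 36, 36)`, and `geo_rel` of shape `(2, 36, 36, 4)` returns updated object features of shape `(2, 36, 512)`; the gate lets each object lean on semantic or geometric cues as appropriate, which is the intuition the abstract attributes to the adaptive re-weighting module.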
Citation
Li, J., Mao, Z., Fang, S., & Li, H. (2022). ER-SAN: Enhanced-Adaptive Relation Self-Attention Network for Image Captioning. In IJCAI International Joint Conference on Artificial Intelligence (pp. 1081–1087). International Joint Conferences on Artificial Intelligence. https://doi.org/10.24963/ijcai.2022/151