Abstract
Image captioning (IC), which brings vision to language, has drawn extensive attention. Precisely describing the visual relations between image objects is a key challenge in IC. We argue that visual relations, namely geometric positions (i.e., distance and size) and semantic interactions (i.e., actions and possessives), indicate the mutual correlations between objects. Existing Transformer-based methods typically resort to geometric positions to enhance the representation of visual relations, yet shallow geometric cues alone cannot precisely capture the complex, action-level correlations. In this paper, we propose to enhance the correlations between objects from a comprehensive view that jointly considers explicit semantic and geometric relations, generating plausible captions with accurate relationship predictions. Specifically, we propose a novel Enhanced-Adaptive Relation Self-Attention Network (ER-SAN). We design direction-sensitive semantic-enhanced attention, which attends from content objects to semantic relations and from semantic relations to content objects to learn explicit semantic-aware relations. Further, we devise an adaptive relation re-weighting module that determines how much semantic and geometric attention should be activated for each relation feature. Extensive experiments on the MS-COCO dataset demonstrate the effectiveness of ER-SAN, improving CIDEr from 128.6% to 135.3% and achieving state-of-the-art performance. Code will be released at https://github.com/CrossmodalGroup/ER-SAN.
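To make the described mechanism concrete, the following is a minimal sketch, not the authors' released implementation, of self-attention whose logits are biased by pairwise semantic-relation embeddings and geometric features, with an adaptive gate deciding how much each relation source contributes. All module names, tensor shapes, and the 4-d geometry encoding are illustrative assumptions.

```python
# Hypothetical sketch of relation-enhanced self-attention with an adaptive
# gate between semantic and geometric relation biases (not ER-SAN's code).
import torch
import torch.nn as nn

class RelationEnhancedAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_sem_relations=20):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        # Embeds a discrete semantic relation label (e.g., "riding") per object pair.
        self.sem_embed = nn.Embedding(n_sem_relations, n_heads)
        # Projects 4-d relative geometry (e.g., log distance/size ratios) to per-head biases.
        self.geo_proj = nn.Linear(4, n_heads)
        # Adaptive gate: how much semantic vs. geometric bias to activate per object.
        self.gate = nn.Linear(d_model, 2)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, sem_rel, geo_rel):
        # x: (B, N, d_model) object features
        # sem_rel: (B, N, N) semantic relation labels between object pairs
        # geo_rel: (B, N, N, 4) relative geometric features
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, N, self.n_heads, self.d_head).transpose(1, 2)

        logits = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5    # (B, H, N, N)
        sem_bias = self.sem_embed(sem_rel).permute(0, 3, 1, 2)     # (B, H, N, N)
        geo_bias = self.geo_proj(geo_rel).permute(0, 3, 1, 2)      # (B, H, N, N)

        # Per-query gate over the two relation sources, broadcast over heads and keys.
        g = torch.softmax(self.gate(x), dim=-1)                    # (B, N, 2)
        g_sem = g[..., 0].view(B, 1, N, 1)
        g_geo = g[..., 1].view(B, 1, N, 1)

        attn = torch.softmax(logits + g_sem * sem_bias + g_geo * geo_bias, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out(out)
```

For instance, `RelationEnhancedAttention()(x, sem_rel, geo_rel)` with `x` of shape `(2, 36, 512)`, integer `sem_rel` of shape `(2, 36, 36)`, and `geo_rel` of shape `(2, 36, 36, 4)` returns updated object features of shape `(2, 36, 512)`; the gate lets each object lean on semantic or geometric cues as appropriate, which is the intuition the abstract attributes to the adaptive re-weighting module.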
Citation
Li, J., Mao, Z., Fang, S., & Li, H. (2022). ER-SAN: Enhanced-Adaptive Relation Self-Attention Network for Image Captioning. In IJCAI International Joint Conference on Artificial Intelligence (pp. 1081–1087). International Joint Conferences on Artificial Intelligence. https://doi.org/10.24963/ijcai.2022/151