Improving Intra- And Inter-Modality Visual Relation for Image Captioning

Abstract

It is widely accepted that capturing relationships among multi-modality features helps in representing and, ultimately, describing an image. In this paper, we present a novel Intra- and Inter-modality visual Relation Transformer, termed I2RT, to improve connections among visual features. First, we propose a Relation Enhanced Transformer Block (RETB) for image feature learning, which strengthens intra-modality visual relations among objects. Second, to bridge the gap between inter-modality feature representations, we align them explicitly via a Visual Guided Alignment (VGA) module. Finally, the whole model is trained jointly in an end-to-end manner. Experiments on the MS-COCO dataset show the effectiveness of our model, which yields improvements on all commonly used metrics on the "Karpathy" test split. Extensive ablation experiments provide a comprehensive analysis of the proposed method.
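
To make the two components concrete, below is a minimal PyTorch sketch based only on the abstract. The class names RelationEnhancedBlock and VisualGuidedAlignment, the box-geometry relation bias, and all shapes and hyperparameters are illustrative assumptions; this is not the authors' I2RT implementation.

    # Hypothetical sketch of the two modules described in the abstract.
    # Everything beyond "relation-enhanced self-attention over regions" and
    # "cross-modal alignment guided by visual features" is an assumption.
    import torch
    import torch.nn as nn

    class RelationEnhancedBlock(nn.Module):
        """Self-attention over region features with an additive per-head bias
        derived from pairwise box geometry (one plausible reading of RETB)."""
        def __init__(self, dim=512, heads=8):
            super().__init__()
            self.heads = heads
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.rel_proj = nn.Linear(4, heads)  # pairwise geometry -> per-head bias
            self.norm1 = nn.LayerNorm(dim)
            self.ffn = nn.Sequential(
                nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            self.norm2 = nn.LayerNorm(dim)

        def forward(self, feats, boxes):
            # feats: (B, N, dim) region features; boxes: (B, N, 4) as (cx, cy, w, h)
            B, N, _ = feats.shape
            # Pairwise geometry: center offsets and log size ratios.
            delta = torch.cat([
                boxes[:, :, None, :2] - boxes[:, None, :, :2],
                torch.log(boxes[:, :, None, 2:] / boxes[:, None, :, 2:]),
            ], dim=-1)                                  # (B, N, N, 4)
            bias = self.rel_proj(delta)                 # (B, N, N, heads)
            bias = bias.permute(0, 3, 1, 2).reshape(B * self.heads, N, N)
            # A float attn_mask is added to the attention scores, so it acts
            # as a learned relation bias between every pair of regions.
            out, _ = self.attn(feats, feats, feats, attn_mask=bias)
            h = self.norm1(feats + out)
            return self.norm2(h + self.ffn(h))

    class VisualGuidedAlignment(nn.Module):
        """Cross-attention letting word features attend to region features,
        a plausible reading of the VGA module (assumed)."""
        def __init__(self, dim=512, heads=8):
            super().__init__()
            self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, words, regions):
            aligned, _ = self.cross(words, regions, regions)
            return self.norm(words + aligned)

    # Usage example with toy tensors.
    feats = torch.randn(2, 36, 512)      # 36 detected regions per image
    boxes = torch.rand(2, 36, 4) + 0.1   # keep w, h positive for the log ratio
    words = torch.randn(2, 15, 512)      # partial caption embeddings
    enhanced = RelationEnhancedBlock()(feats, boxes)
    out = VisualGuidedAlignment()(words, enhanced)
    print(out.shape)                     # torch.Size([2, 15, 512])

One design note on the sketch: routing the geometric relation through the attention mask keeps the block a drop-in replacement for a standard Transformer layer, since only the score matrix is modified.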

Citation (APA)
Wang, Y., Zhang, W. K., Liu, Q., Zhang, Z., Gao, X., & Sun, X. (2020). Improving Intra- And Inter-Modality Visual Relation for Image Captioning. In MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia (pp. 4190–4198). Association for Computing Machinery, Inc. https://doi.org/10.1145/3394171.3413877
