VLCA: vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning

Abstract

In the field of satellite imagery, remote sensing image captioning (RSIC) is an active research topic that faces the challenges of overfitting and of aligning images with text. To address these issues, this paper proposes a vision-language aligning paradigm for RSIC that jointly represents vision and language. First, a new RSIC dataset, DIOR-Captions, is built by augmenting the object detection in optical remote sensing images (DIOR) dataset with manually annotated Chinese and English captions. Second, a Vision-Language aligning model with Cross-modal Attention (VLCA) is presented to generate accurate and rich bilingual descriptions of remote sensing images. Third, a cross-modal learning network is introduced to address the problem of visual-lingual alignment. Notably, VLCA is also applied to end-to-end Chinese caption generation by using a Chinese pre-trained language model. Experiments against various baselines validate VLCA on the proposed dataset. The results demonstrate that the proposed algorithm produces more descriptive and informative captions than existing algorithms.
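The abstract does not detail the cross-modal attention mechanism itself. As a rough illustration only, cross-modal attention generally lets features from one modality (e.g., caption tokens) attend over features from the other (e.g., image regions) via scaled dot-product attention. The NumPy sketch below is a minimal, generic version; all function names, dimensions, and the use of random projections in place of learned weights are hypothetical and not taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(vision_feats, text_feats, d_k=64, seed=0):
    """Generic cross-modal attention sketch (NOT the paper's architecture):
    text tokens act as queries; image regions supply keys and values."""
    rng = np.random.default_rng(seed)
    d_v = vision_feats.shape[-1]   # vision feature width
    d_t = text_feats.shape[-1]     # text feature width
    # Random projections stand in for learned parameter matrices.
    W_q = rng.standard_normal((d_t, d_k)) / np.sqrt(d_t)
    W_k = rng.standard_normal((d_v, d_k)) / np.sqrt(d_v)
    W_v = rng.standard_normal((d_v, d_k)) / np.sqrt(d_v)
    Q = text_feats @ W_q           # (n_tokens, d_k)
    K = vision_feats @ W_k         # (n_regions, d_k)
    V = vision_feats @ W_v         # (n_regions, d_k)
    # Each text token forms a convex combination over image regions.
    attn = softmax(Q @ K.T / np.sqrt(d_k))   # (n_tokens, n_regions)
    return attn @ V                # visually-grounded token features

# Toy shapes: 9 image regions with 512-d features, 5 caption tokens with 256-d features.
out = cross_modal_attention(np.ones((9, 512)), np.ones((5, 256)))
print(out.shape)  # (5, 64)
```

In captioning models of this family, the resulting visually-grounded token features typically feed a language decoder that emits the next caption word; the paper's actual alignment network may differ substantially from this sketch.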

Cite (APA)
Wei, T., Yuan, W., Luo, J., Zhang, W., & Lu, L. (2023). VLCA: vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning. Journal of Systems Engineering and Electronics, 34(1), 9–18. https://doi.org/10.23919/JSEE.2023.000035
