VLCA: vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning

Abstract

In the field of satellite imagery, remote sensing image captioning (RSIC) is an active research topic that faces the challenges of overfitting and of aligning images with text. To address these issues, this paper proposes a vision-language aligning paradigm for RSIC that jointly represents vision and language. First, a new RSIC dataset, DIOR-Captions, is built by augmenting the object detection in optical remote sensing images (DIOR) dataset with manually annotated Chinese and English captions. Second, a Vision-Language aligning model with Cross-modal Attention (VLCA) is presented to generate accurate and rich bilingual descriptions of remote sensing images. Third, a cross-modal learning network is introduced to address the problem of visual-lingual alignment. Notably, VLCA is also applied to end-to-end Chinese caption generation by using a Chinese pre-trained language model. Experiments against various baselines validate VLCA on the proposed dataset. The results demonstrate that the proposed algorithm produces more descriptive and informative captions than existing algorithms.
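The abstract does not detail the cross-modal attention mechanism itself. As a rough illustration only, cross-modal attention generally lets features from one modality (e.g., caption tokens) attend over features from the other (e.g., image regions) via scaled dot-product attention. The NumPy sketch below is a minimal, generic version; all function names, dimensions, and the use of random projections in place of learned weights are hypothetical and not taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(vision_feats, text_feats, d_k=64, seed=0):
    """Generic cross-modal attention sketch (NOT the paper's architecture):
    text tokens act as queries; image regions supply keys and values."""
    rng = np.random.default_rng(seed)
    d_v = vision_feats.shape[-1]   # vision feature width
    d_t = text_feats.shape[-1]     # text feature width
    # Random projections stand in for learned parameter matrices.
    W_q = rng.standard_normal((d_t, d_k)) / np.sqrt(d_t)
    W_k = rng.standard_normal((d_v, d_k)) / np.sqrt(d_v)
    W_v = rng.standard_normal((d_v, d_k)) / np.sqrt(d_v)
    Q = text_feats @ W_q           # (n_tokens, d_k)
    K = vision_feats @ W_k         # (n_regions, d_k)
    V = vision_feats @ W_v         # (n_regions, d_k)
    # Each text token forms a convex combination over image regions.
    attn = softmax(Q @ K.T / np.sqrt(d_k))   # (n_tokens, n_regions)
    return attn @ V                # visually-grounded token features

# Toy shapes: 9 image regions with 512-d features, 5 caption tokens with 256-d features.
out = cross_modal_attention(np.ones((9, 512)), np.ones((5, 256)))
print(out.shape)  # (5, 64)
```

In captioning models of this family, the resulting visually-grounded token features typically feed a language decoder that emits the next caption word; the paper's actual alignment network may differ substantially from this sketch.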

Cite (APA)
Wei, T., Yuan, W., Luo, J., Zhang, W., & Lu, L. (2023). VLCA: vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning. Journal of Systems Engineering and Electronics, 34(1), 9–18. https://doi.org/10.23919/JSEE.2023.000035
