Multi-Scale Cross-Modal Spatial Attention Fusion for Multi-label Image Recognition

Abstract

Multi-label image recognition aims to jointly predict multiple tags for an image. Despite great progress, existing methods still have two limitations: 1) they cannot accurately locate object regions due to a lack of adequate supervision or semantic guidance; 2) they cannot effectively identify the categories of small objects because they rely only on the high-level features of a deep CNN. In this paper, we propose a Multi-Scale Cross-Modal Spatial Attention Fusion (MCSAF) network that accurately locates informative regions by introducing a spatial attention module and effectively recognizes target classes at different scales through multi-scale cross-modal feature fusion. Furthermore, we develop an adaptive graph convolutional network (Adaptive-GCN) to capture the complex correlations among labels in depth. Empirical studies on benchmark datasets validate the superiority of the proposed model over state-of-the-art methods.
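The abstract names three components: a spatial attention module for localizing object regions, multi-scale cross-modal feature fusion, and an Adaptive-GCN that learns label correlations. The sketch below shows, in hypothetical PyTorch code, how such a pipeline could be wired together; all module names, dimensions, and wiring choices are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the three components the abstract describes:
# (1) spatial attention over CNN feature maps, (2) multi-scale fusion,
# (3) a GCN with a learned adjacency as a stand-in for "Adaptive-GCN".
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialAttention(nn.Module):
    """Predicts a per-location attention map and pools features with it."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat):                      # feat: (B, C, H, W)
        attn = self.score(feat)                   # (B, 1, H, W)
        attn = torch.softmax(attn.flatten(2), dim=-1).view_as(attn)
        return (feat * attn).sum(dim=(2, 3))      # attention-weighted pooling -> (B, C)


class AdaptiveGCN(nn.Module):
    """One GCN layer whose adjacency matrix is a learned parameter."""
    def __init__(self, num_labels, in_dim, out_dim):
        super().__init__()
        self.adj = nn.Parameter(torch.eye(num_labels))   # learned label graph
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, label_emb):                 # label_emb: (L, in_dim)
        a = torch.softmax(self.adj, dim=-1)       # row-normalized adjacency
        return F.relu(a @ self.proj(label_emb))   # (L, out_dim)


class MCSAFSketch(nn.Module):
    """Fuses attention-pooled features from two backbone scales, then scores
    labels against GCN-refined label embeddings via a dot product."""
    def __init__(self, num_labels, c_low=512, c_high=2048, word_dim=300, d=1024):
        super().__init__()
        self.att_low = SpatialAttention(c_low)
        self.att_high = SpatialAttention(c_high)
        self.fuse = nn.Linear(c_low + c_high, d)  # multi-scale fusion
        self.gcn = AdaptiveGCN(num_labels, word_dim, d)

    def forward(self, feat_low, feat_high, label_emb):
        v = torch.cat([self.att_low(feat_low), self.att_high(feat_high)], dim=-1)
        v = self.fuse(v)                          # (B, d) fused image feature
        w = self.gcn(label_emb)                   # (L, d) per-label classifiers
        return v @ w.t()                          # (B, L) per-label logits


# Example usage with placeholder backbone features and label embeddings
# (e.g. GloVe word vectors for 80 MS-COCO categories):
model = MCSAFSketch(num_labels=80)
logits = model(torch.randn(2, 512, 28, 28),   # low-level feature map
               torch.randn(2, 2048, 7, 7),    # high-level feature map
               torch.randn(80, 300))          # label word embeddings
```

Training such a model with a per-label binary cross-entropy loss (`nn.BCEWithLogitsLoss` on the logits) is the standard multi-label recipe; the low-level branch gives small objects a chance to be detected before spatial resolution is lost.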

Citation (APA)

Li, J., Zhang, C., Wang, X., & Du, L. (2020). Multi-Scale Cross-Modal Spatial Attention Fusion for Multi-label Image Recognition. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 12396 LNCS, pp. 736–747). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-61609-0_58
