Multi-label image recognition aims to jointly predict multiple tags for an image. Despite great progress, existing methods still suffer from two limitations: 1) they cannot accurately locate object regions due to the lack of adequate supervisory information or semantic guidance; 2) they cannot effectively identify small objects because they rely only on the high-level features of a deep CNN. In this paper, we propose a Multi-Scale Cross-Modal Spatial Attention Fusion (MCSAF) network that locates more informative regions by introducing a spatial attention module and recognizes target classes at different scales through multi-scale cross-modal feature fusion. Furthermore, we develop an adaptive graph convolutional network (Adaptive-GCN) to capture the complex correlations among labels in depth. Empirical studies on benchmark datasets validate the superiority of our proposed model over state-of-the-art methods.
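The abstract mentions a GCN that propagates information over a label-correlation graph. As a rough illustration of the underlying operation (not the paper's actual Adaptive-GCN; the sizes, the random correlation matrix, and the function name `gcn_layer` are all hypothetical), one graph-convolution step H' = ReLU(Â H W) can be sketched as:

```python
import numpy as np

def gcn_layer(H, A, W):
    """One graph-convolution step on label embeddings.

    H: (C, d_in) label embeddings, A: (C, C) nonnegative correlation
    matrix (row-normalized inside), W: (d_in, d_out) projection weights.
    """
    A_hat = A / A.sum(axis=1, keepdims=True)  # row-normalize the graph
    return np.maximum(A_hat @ H @ W, 0.0)     # propagate, project, ReLU

# Hypothetical sizes: C labels with d_in-dimensional embeddings.
C, d_in, d_out = 4, 8, 6
rng = np.random.default_rng(0)
H = rng.standard_normal((C, d_in))
A = rng.random((C, C)) + np.eye(C)  # stand-in for a learned adaptive matrix
W = rng.standard_normal((d_in, d_out))
out = gcn_layer(H, A, W)
print(out.shape)  # (4, 6)
```

In label-graph methods of this kind, the output embeddings are typically used as per-class classifiers applied to the image features; here the correlation matrix is random only to keep the sketch self-contained.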
CITATION STYLE
Li, J., Zhang, C., Wang, X., & Du, L. (2020). Multi-Scale Cross-Modal Spatial Attention Fusion for Multi-label Image Recognition. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 12396 LNCS, pp. 736–747). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-61609-0_58