Transformer-based weakly supervised semantic segmentation (WSSS) methods have been actively studied because of their strong ability to capture global context. However, because the activation function in the transformer's self-attention mechanism highlights only a few tokens, these methods still suffer from sparse attention maps, which lead to incomplete pseudo labels. In this paper, we propose a novel class activation scheme that uniformly highlights the whole object region. The key idea is to activate the object region under the guidance of clusters formed by grouping similar image features of the object. Specifically, the clustering-guided class activation map (ClusterCAM) is generated by the proposed clustering-based attention module, and highly responsive regions in this map are then used to activate target objects in the encoded feature space. This helps the model explore the entire region of the target object by exploiting the semantic proximity between patch tokens extracted from the same object. Based on this, we design an end-to-end WSSS framework that trains the classification and segmentation networks simultaneously in a single stage. Experimental results on benchmark datasets show that the proposed method significantly outperforms previous WSSS methods, including several multi-stage approaches. The code and model are publicly available at: https://github.com/DCVL-WSSS/ClusterCAM.
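To make the cluster-guidance idea concrete, the following is a minimal illustrative sketch, not the authors' ClusterCAM module. It assumes that patch tokens are given as a NumPy array, that a sparse per-token activation score is already available, and that plain k-means with cosine similarity stands in for the clustering step. The function names `kmeans_cosine` and `cluster_guided_activation` are hypothetical. The sketch spreads a sparse activation over every token in the same cluster, which is one simple way to obtain a more uniform response across an object.

```python
import numpy as np


def kmeans_cosine(tokens, k, iters=10):
    """Toy k-means on L2-normalised tokens (assumed stand-in for the
    paper's clustering step). Returns one cluster id per token."""
    x = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    # Deterministic farthest-point initialisation: start from token 0,
    # then repeatedly pick the token least similar to the chosen centers.
    idx = [0]
    for _ in range(k - 1):
        sim = (x @ x[idx].T).max(axis=1)
        idx.append(int(np.argmin(sim)))
    centers = x[idx].copy()
    for _ in range(iters):
        labels = np.argmax(x @ centers.T, axis=1)  # most similar center
        for c in range(k):
            members = x[labels == c]
            if len(members):
                mean = members.mean(axis=0)
                centers[c] = mean / (np.linalg.norm(mean) + 1e-8)
    return np.argmax(x @ centers.T, axis=1)


def cluster_guided_activation(tokens, cam, k=2):
    """Replace each token's activation with the mean activation of its
    cluster, so one strongly activated token lifts its whole cluster."""
    labels = kmeans_cosine(tokens, k)
    out = np.empty_like(cam, dtype=float)
    for c in np.unique(labels):
        out[labels == c] = cam[labels == c].mean()
    return out, labels


# Hypothetical example: three "object" tokens and three "background" tokens,
# with only the first object token activated by the sparse map.
tokens = np.array([[1.0, 0.1], [1.0, 0.0], [0.9, 0.1],
                   [0.0, 1.0], [0.1, 1.0], [0.05, 0.9]])
cam = np.array([0.9, 0.0, 0.0, 0.0, 0.0, 0.0])
dense_cam, labels = cluster_guided_activation(tokens, cam, k=2)
```

In this toy setting the activation of the single highlighted token is shared across the tokens clustered with it, while the other cluster stays inactive. The actual method generates ClusterCAM inside an attention module and trains it end to end, which this sketch does not attempt to reproduce.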
CITATION STYLE
Kim, Y. W., & Kim, W. (2024). Clustering-Guided Class Activation for Weakly Supervised Semantic Segmentation. IEEE Access, 12, 4871–4880. https://doi.org/10.1109/ACCESS.2024.3350176