Abstract
Weakly supervised semantic segmentation (WSSS) from image-level class labels is challenging because Class Activation Maps (CAMs) produced by convolutional neural networks (CNNs) often highlight only the most discriminative image regions. We propose the Hierarchical Multi-Class Token Attention Network (HMCTANet), which builds on a Conformer backbone that integrates CNN and Transformer branches. HMCTANet enhances CAMs through multi-class token attention and a Class-Aware Training (CAT) strategy that aligns class tokens with ground-truth labels. We also introduce a Class Token Regularization Module (CTRM) to improve the discriminative power of the class tokens. A Refinement Module (RM) then improves the segmentation by combining class-specific attention and patch-level affinity from the Transformer branch with the CAMs from the CNN branch. HMCTANet achieves state-of-the-art performance, with mIoU scores of 69.0% and 68.4% on the PASCAL VOC 2012 validation and test sets, respectively, demonstrating the effectiveness of the approach for WSSS.
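To make the refinement idea concrete, the following is a minimal NumPy sketch of one plausible way to combine the three signals named above: CNN-branch CAMs, class-specific (class-token-to-patch) attention, and patch-level affinity. It is an illustration under stated assumptions, not the RM as implemented in HMCTANet: the function name `refine_cams`, the tensor shapes, the element-wise fusion, the row normalization of the affinity, and the final min-max scaling are all hypothetical choices made for this example.

```python
import numpy as np


def refine_cams(cams, class_attn, patch_affinity, eps=1e-8):
    """Hypothetical CAM refinement sketch (not the paper's exact RM).

    cams:           (C, H, W) non-negative CAMs from the CNN branch.
    class_attn:     (C, H*W) class-token-to-patch attention (assumed shape).
    patch_affinity: (H*W, H*W) patch-to-patch attention (assumed shape).
    Returns refined maps of shape (C, H, W), scaled to [0, 1] per class.
    """
    num_classes, height, width = cams.shape
    num_patches = height * width

    # Assumption: fuse CAMs and class-specific attention element-wise.
    fused = cams.reshape(num_classes, num_patches) * class_attn

    # Row-normalize the affinity so each patch takes a weighted average
    # of the fused scores of the patches it attends to.
    row_sums = patch_affinity.sum(axis=1, keepdims=True)
    affinity = patch_affinity / np.maximum(row_sums, eps)

    # Propagate: refined[c, i] = sum_j affinity[i, j] * fused[c, j].
    refined = fused @ affinity.T

    # Min-max normalize each class map independently.
    lo = refined.min(axis=1, keepdims=True)
    hi = refined.max(axis=1, keepdims=True)
    refined = (refined - lo) / np.maximum(hi - lo, eps)

    return refined.reshape(num_classes, height, width)
```

With an identity affinity matrix and uniform class attention, the function reduces to per-class min-max normalization of the input CAMs, which is a convenient sanity check. In a real pipeline the attention and affinity would come from the Transformer branch's attention weights, and the patch grid would need to match the CAM resolution.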