Abstract
Visible-infrared person re-identification (VI-ReID) is an important and challenging task in targeted person retrieval, requiring models to bridge the distinct features observed in the visible and infrared modalities. Current methods predominantly adopt simple Convolutional Neural Network (CNN) backbones, which lose spatial information during training and complicate cross-modal feature alignment. To address these limitations, we propose a novel approach that uses Swin-TransformerV2 as the backbone together with staged feature mapping optimization learning for VI-ReID. First, we introduce a new Ratio Center Difference Loss (RCD) to counteract the scattering of positive samples from different modalities in the feature space, and we devise a Cross-modal Intra-class Denoising Loss (CID), which dynamically computes the average distance between positive and negative samples to enlarge inter-class differences and adjust the feature space at different stages. In addition, to accommodate the latest backbone models during the training phase, we design a Staged Modality-shared Loss Scheduler (SMS). Finally, we introduce a Channel Hybrid Filling Module (CHF), which enriches the datasets and mitigates low-level modality discrepancies. Extensive experiments on the SYSU-MM01 and RegDB datasets demonstrate that the proposed method outperforms current state-of-the-art visible-infrared person re-identification methods.