Abstract
To determine whether a pedestrian of interest has been captured by another distinct camera across a network of non-overlapping cameras, or by the same camera at a distinct time, is known as the problem of person re-identification and is considered one of the most fascinating challenges in computer vision. The problem becomes considerably more challenging when the query image of the person of interest is partially concealed or obstructed. Termed occluded person re-identification, this setting covers real-world crowded scenarios such as marketplaces, airports, commercial malls, and university campuses. Combining global pedestrian-level information with part-level local features has increasingly been shown to be a successful strategy for occluded person re-identification, as it captures fine-grained information from the non-occluded, visible parts.
This paper proposes a Swin Transformer with Part-Level Tokenization (SwinPLT) model that uses a Swin Transformer-based backbone enhanced with Singular Value Decomposition (SVD). The proposed model leverages the hierarchical representation learning capabilities of the Swin Transformer, combined with SVD to extract uncorrelated local tokens. Our approach enhances the model's discriminative ability by effectively handling occlusions in person images. Trained with a combination of hard triplet loss and cross-entropy loss, the proposed SwinPLT surpasses state-of-the-art results by at least 18.14% Rank-1 accuracy and 17.28% mAP on the Occluded-DukeMTMC-reID dataset. On the Occluded-ReID dataset, the proposed SwinPLT model outperforms alternative approaches by 9.06% Rank-1 accuracy and 7.71% mAP. On the P-DukeMTMC-reID dataset, our model shows an improvement of 1.7% Rank-1 accuracy and 2.4% mAP, while on Partial-iLIDS it shows an improvement of 11.8% Rank-1 accuracy and 4.26% mAP. We will make the code and the model publicly available at https://github.com/Ranjitkm2007/SwinPLT.
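The abstract mentions using SVD to extract uncorrelated local tokens. As a minimal, hypothetical sketch (the matrix shapes and the projection step below are illustrative assumptions, not the paper's actual configuration), the idea of decorrelating a set of part-level token features via SVD can be shown with NumPy:

```python
import numpy as np

# Illustrative sketch: decorrelating part-level tokens with SVD.
# Assumed toy shape: 4 part tokens, each an 8-dim feature vector
# (not the paper's real token count or embedding size).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))

# SVD factorizes tokens = U @ diag(S) @ Vt, where Vt's rows form an
# orthonormal basis of the feature space.
U, S, Vt = np.linalg.svd(tokens, full_matrices=False)

# Coordinates of each token in that basis; the resulting components
# are mutually orthogonal (uncorrelated) across tokens.
decorrelated = U * S

# Check: the Gram matrix of the decorrelated coordinates is diagonal.
gram = decorrelated.T @ decorrelated
off_diag = gram - np.diag(np.diag(gram))
print(np.allclose(off_diag, 0.0, atol=1e-8))  # → True
```

In a model, such a decorrelation step would be applied to the backbone's token features before the part-level heads, so that each local token carries non-redundant information.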