Abstract
To determine whether a pedestrian of interest has been captured by another distinct camera across a network of non-overlapping cameras, or by the same camera at a distinct time, is known as the problem of person re-identification and is considered one of the most fascinating challenges in computer vision. The problem becomes considerably more challenging when the query image of the person of interest is partially concealed or obstructed. Termed occluded person re-identification, this setting covers real-world crowded scenarios such as marketplaces, airports, commercial malls, and university campuses. Combining global pedestrian-level information with part-level local features has increasingly been shown to be a successful strategy for occluded person re-identification, as it captures fine-grained information from the non-occluded, visible parts.
This paper proposes a Swin Transformer with Part-Level Tokenization (SwinPLT) model that uses a Swin Transformer-based backbone enhanced with Singular Value Decomposition (SVD). The proposed model leverages the hierarchical representation learning capabilities of the Swin Transformer, combined with SVD to extract uncorrelated local tokens. Our approach enhances the model's discriminative ability by effectively handling occlusions in person images. Trained with a combination of hard triplet loss and cross-entropy loss, the proposed SwinPLT surpasses state-of-the-art results by at least 18.14% Rank-1 accuracy and 17.28% mAP on the Occluded-DukeMTMC-reID dataset. On the Occluded-ReID dataset, the proposed SwinPLT model outperforms alternative approaches by 9.06% Rank-1 accuracy and 7.71% mAP. On the P-DukeMTMC-reID dataset, our model shows an improvement of 1.7% Rank-1 accuracy and 2.4% mAP, while on Partial-iLIDS it shows an improvement of 11.8% Rank-1 accuracy and 4.26% mAP. We will make the code and the model publicly available at https://github.com/Ranjitkm2007/SwinPLT.
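The abstract mentions using SVD to extract uncorrelated local tokens. As a minimal, hypothetical sketch (the matrix shapes and the projection step below are illustrative assumptions, not the paper's actual configuration), the idea of decorrelating a set of part-level token features via SVD can be shown with NumPy:

```python
import numpy as np

# Illustrative sketch: decorrelating part-level tokens with SVD.
# Assumed toy shape: 4 part tokens, each an 8-dim feature vector
# (not the paper's real token count or embedding size).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))

# SVD factorizes tokens = U @ diag(S) @ Vt, where Vt's rows form an
# orthonormal basis of the feature space.
U, S, Vt = np.linalg.svd(tokens, full_matrices=False)

# Coordinates of each token in that basis; the resulting components
# are mutually orthogonal (uncorrelated) across tokens.
decorrelated = U * S

# Check: the Gram matrix of the decorrelated coordinates is diagonal.
gram = decorrelated.T @ decorrelated
off_diag = gram - np.diag(np.diag(gram))
print(np.allclose(off_diag, 0.0, atol=1e-8))  # → True
```

In a model, such a decorrelation step would be applied to the backbone's token features before the part-level heads, so that each local token carries non-redundant information.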