Multi-Granularity Aggregation with Spatiotemporal Consistency for Video-Based Person Re-Identification
Authors:
Lee Hean Sung 1, Kim Minjung 1, Jang Sungjun 1, Bae Han Byeol 2 and Lee Sangyoun 1
Affiliations:
1. School of Electrical and Electronic Engineering, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 03722, Republic of Korea
2. School of Computer Science and Engineering, Kunsan National University, 558 Daehak-ro, Gunsan-si 54150, Republic of Korea
Abstract
Video-based person re-identification (ReID) aims to exploit discriminative features from both spatial and temporal information. Widely used approaches include part- and attention-based methods that suppress irrelevant spatiotemporal features. However, it remains challenging to overcome inconsistencies across video frames caused by occlusion and imperfect detection. These mismatches make temporal processing ineffective and create an imbalance in crucial spatial information. To address these problems, we propose the Spatiotemporal Multi-Granularity Aggregation (ST-MGA) method, which is specifically designed to accumulate relevant features with spatiotemporally consistent cues. The proposed framework consists of three main stages: extraction, which extracts spatiotemporally consistent partial information; augmentation, which augments the partial information at different granularity levels; and aggregation, which effectively aggregates the augmented spatiotemporal information. We first introduce the consistent part-attention (CPA) module, which extracts spatiotemporally consistent and well-aligned attentive parts. Sub-parts derived from CPA provide temporally consistent semantic information, resolving misalignment in videos caused by occlusion or inaccurate detection, and maximize the efficiency of aggregation through uniform partial information. To enhance the diversity of spatial and temporal cues, we introduce the Multi-Attention Part Augmentation (MA-PA) block, which incorporates fine parts at various granular levels, and the Long-/Short-term Temporal Augmentation (LS-TA) block, designed to capture both long- and short-term temporal relations. Using densely separated part cues, ST-MGA fully exploits and aggregates spatiotemporal multi-granular patterns by comparing relations between parts and scales. In experiments, the proposed ST-MGA achieves state-of-the-art performance on several video-based ReID benchmarks (i.e., MARS, DukeMTMC-VideoReID, and LS-VID).
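The abstract describes the pipeline only at a component level. As a rough, self-contained illustration of its two central ideas, temporally shared part masks and pooling over several part granularities, the PyTorch sketch below implements a toy CPA-like module and a multi-granularity aggregator. Every name, shape, and pooling choice here is an assumption made for illustration; it is not the authors' implementation, and the MA-PA and LS-TA blocks are omitted entirely.

```python
import torch
import torch.nn as nn

class ConsistentPartAttention(nn.Module):
    """Toy stand-in for a CPA-style module: predicts K part-attention maps
    per frame, then averages the maps over time so that every frame is
    pooled with the same, temporally consistent part masks."""

    def __init__(self, in_channels: int, num_parts: int):
        super().__init__()
        self.part_logits = nn.Conv2d(in_channels, num_parts, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, C, H, W) frame-level feature maps from a backbone
        B, T, C, H, W = feats.shape
        logits = self.part_logits(feats.flatten(0, 1))               # (B*T, K, H, W)
        attn = logits.view(B, T, -1, H, W).softmax(dim=2)            # soft part assignment per pixel
        attn = attn.mean(dim=1, keepdim=True).expand(-1, T, -1, -1, -1)  # share masks across frames
        parts = torch.einsum('btchw,btkhw->btkc', feats, attn)       # attention-weighted part pooling
        mass = attn.flatten(3).sum(dim=-1).clamp(min=1e-6)           # (B, T, K) attention mass per part
        return parts / mass.unsqueeze(-1)                            # (B, T, K, C) normalized part features

def multi_granularity_aggregate(parts: torch.Tensor) -> torch.Tensor:
    """Pools part features at three granularities (whole body, halves,
    individual parts), then averages over time. K is assumed even."""
    B, T, K, C = parts.shape
    whole = parts.mean(dim=2, keepdim=True)                          # (B, T, 1, C)
    halves = parts.view(B, T, 2, K // 2, C).mean(dim=3)              # (B, T, 2, C)
    grains = torch.cat([whole, halves, parts], dim=2)                # (B, T, 3 + K, C)
    return grains.mean(dim=1).flatten(1)                             # (B, (3 + K) * C)

# Example: 2 tracklets of 8 frames, 256-channel 16x8 feature maps, 4 parts
feats = torch.randn(2, 8, 256, 16, 8)
embedding = multi_granularity_aggregate(ConsistentPartAttention(256, 4)(feats))
print(embedding.shape)  # torch.Size([2, 1792])
```

Averaging the per-frame attention maps before pooling is what enforces temporal consistency in this sketch: a frame with occlusion or a loose detection box is pooled with masks borrowed from the whole tracklet rather than with its own corrupted ones.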