MRG-T: Mask-Relation-Guided Transformer for Remote Vision-Based Pedestrian Attribute Recognition in Aerial Imagery-Reference-Cited by-同舟云学术

MRG-T: Mask-Relation-Guided Transformer for Remote Vision-Based Pedestrian Attribute Recognition in Aerial Imagery

Published:2024-03-29 Issue:7 Volume:16 Page:1216
ISSN:2072-4292
Container-title:Remote Sensing
language:en
Short-container-title:Remote Sensing

Author:

Zhang Shun¹^ORCID,Li Yupeng¹^ORCID,Wu Xiao¹,Chu Zunheng¹,Li Lingfei¹

Affiliation:

1. School of Electronic and Information, Northwestern Polytechnical University, Xi’an 710129, China

Abstract

Nowadays, with the rapid development of consumer Unmanned Aerial Vehicles (UAVs), utilizing UAV platforms for visual surveillance has become very attractive, and a key part of this is remote vision-based pedestrian attribute recognition. Pedestrian Attribute Recognition (PAR) is dedicated to predicting multiple attribute labels of a single pedestrian image extracted from surveillance videos and aerial imagery, which presents significant challenges in the computer vision community due to factors such as poor imaging quality and substantial pose variations. Despite recent studies demonstrating impressive advancements in utilizing complicated architectures and exploring relations, most of them may fail to fully and systematically consider the inter-region, inter-attribute, and region-attribute mapping relations simultaneously and be stuck in the dilemma of information redundancy, leading to the degradation of recognition accuracy. To address the issues, we construct a novel Mask-Relation-Guided Transformer (MRG-T) framework that consists of three relation modeling modules to fully exploit spatial and semantic relations in the model learning process. Specifically, we first propose a Masked Region Relation Module (MRRM) to focus on precise spatial attention regions to extract more robust features with masked random patch training. To explore the semantic association of attributes, we further present a Masked Attribute Relation Module (MARM) to extract intrinsic and semantic inter-attribute relations with an attribute label masking strategy. Based on the cross-attention mechanism, we finally design a Region and Attribute Mapping Module (RAMM) to learn the cross-modal alignment between spatial regions and semantic attributes. We conduct comprehensive experiments on three public benchmarks such as PETA, PA-100K, and RAPv1, and conduct inference on a large-scale airborne person dataset named PRAI-1581. The extensive experimental results demonstrate the superior performance of our method compared to state-of-the-art approaches and validate the effectiveness of mask-relation-guided modeling in the remote vision-based PAR task.

Funder

National Natural Science Foundation of China

Publisher

MDPI AG

Link

https://www.mdpi.com/2072-4292/16/7/1216/pdf

Reference71 articles.

1. Wang, X., Zheng, S., Yang, R., Zheng, A., Chen, Z., Tang, J., and Luo, B. (2022). Pedestrian attribute recognition: A survey. Pattern Recognit., 121.

2. Schumann, A., and Stiefelhagen, R. (2017, January 21–26). Person re-identification by deep learning attribute-complementary information. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA.

3. Improving person re-identification by attribute and identity learning;Lin;Pattern Recognit.,2019

4. Zhu, Y., Wang, T., and Zhu, S. (2022). Adaptive Multi-Pedestrian Tracking by Multi-Sensor: Track-to-Track Fusion Using Monocular 3D Detection and MMW Radar. Remote Sens., 14.

5. Tracking persons-of-interest via unsupervised representation adaptation;Zhang;Int. J. Comput. Vision,2020