Abstract
Pedestrian Attribute Recognition (PAR) poses a significant challenge in developing automatic systems that enhance visual surveillance and human interaction. In this study, we investigate using Visual Question Answering (VQA) models to address the zero-shot PAR problem. Inspired by the impressive results achieved by a zero-shot VQA strategy during the PAR Contest at the 20th International Conference on Computer Analysis of Images and Patterns in 2023, we conducted a comparative study across three state-of-the-art VQA models, two of them based on BLIP-2 and the third one based on the Plug-and-Play VQA framework. Our analysis focuses on performance, robustness, contextual question handling, processing time, and classification errors. Our findings demonstrate that both BLIP-2-based models are better suited for PAR, with nuances related to the adopted frozen Large Language Model. Specifically, the Open Pre-trained Transformers based model performs well in benchmark color estimation tasks, while FLANT5XL provides better results for the considered binary tasks. In summary, zero-shot PAR based on VQA models offers highly competitive results, with the advantage of avoiding training costs associated with multipurpose classifiers.
Funder
Ministerio de Ciencia e Innovación
Agencia Canaria de Investigación, Innovación y Sociedad de la Información
Universidad de las Palmas de Gran Canaria
Publisher
Springer Science and Business Media LLC