Visual Question Answering Models for Zero-Shot Pedestrian Attribute Recognition: A Comparative Study

Published: 2024-06-28
Volume: 5, Issue: 6
ISSN: 2661-8907
Journal: SN Computer Science (SN Comput. Sci.)
Language: English

Authors:

Castrillón-Santana, Modesto; Sánchez-Nielsen, Elena; Freire-Obregón, David; Santana, Oliverio J.; Hernández-Sosa, Daniel; Lorenzo-Navarro, Javier

Abstract

Pedestrian Attribute Recognition (PAR) poses a significant challenge in developing automatic systems that enhance visual surveillance and human interaction. In this study, we investigate the use of Visual Question Answering (VQA) models to address the zero-shot PAR problem. Inspired by the impressive results achieved by a zero-shot VQA strategy in the PAR Contest at the 20th International Conference on Computer Analysis of Images and Patterns (2023), we conducted a comparative study of three state-of-the-art VQA models: two based on BLIP-2 and one based on the Plug-and-Play VQA framework. Our analysis covers performance, robustness, handling of contextual questions, processing time, and classification errors. Our findings show that the two BLIP-2-based models are better suited to PAR, with differences that depend on the frozen Large Language Model each one adopts. Specifically, the model based on Open Pre-trained Transformers (OPT) performs well on the benchmark color-estimation tasks, while the one based on FLAN-T5 XL gives better results on the binary tasks considered. In summary, zero-shot PAR based on VQA models achieves highly competitive results while avoiding the training costs associated with multipurpose classifiers.

Funder

Ministerio de Ciencia e Innovación

Agencia Canaria de Investigación, Innovación y Sociedad de la Información

Universidad de las Palmas de Gran Canaria

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s42979-024-02985-0.pdf

References (25 articles)

1. Jain AK, Dass SC, Nandakumar K. Soft biometric traits for personal recognition systems. In: International conference on biometric authentication. Berlin, Heidelberg: Springer; 2004. p. 731–8.

2. Kumar N, Berg AC, Belhumeur PN, Nayar SK. Describable visual attributes for face verification and image search. IEEE Trans Pattern Anal Mach Intell. 2011;33(10):1962–77.

3. Dietlmeier J, Antony J, McGuinness K, O’Connor NE. How important are faces for person re-identification? In: Proceedings of the International Conference on Pattern Recognition. Milan: IEEE Computer Society; 2020.

4. Cheng Z, Zhu X, Gong S. Face re-identification challenge: are face recognition models good enough? Pattern Recognit. 2020;107:107422.

5. Li S, Xiao T, Li H, Zhou B, Yue D, Wang X. Person search with natural language description. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR); 2017. p. 5187–96. https://doi.org/10.1109/CVPR.2017.551.