A Vision–Language Model-Based Traffic Sign Detection Method for High-Resolution Drone Images: A Case Study in Guyuan, China
Author:
Yao Jianqun1, Li Jinming1, Li Yuxuan1, Zhang Mingzhu2, Zuo Chen2, Dong Shi2, Dai Zhe2
Affiliation:
1. CCCC Infrastructure Maintenance Group Co., Ltd., Beijing 100011, China
2. School of Transportation Engineering, Chang’an University, Xi’an 710064, China
Abstract
As a fundamental element of the transportation system, traffic signs are widely used to guide traffic behavior. In recent years, drones have emerged as an important tool for monitoring the condition of traffic signs. However, existing image processing techniques rely heavily on image annotations, and building a high-quality dataset with diverse training images and human annotations is time-consuming. In this paper, we introduce Vision–Language Models (VLMs) to the traffic sign detection task. Without the need for discrete image labels, rapid deployment is achieved through multi-modal learning and large-scale pretrained networks. First, we compile a keyword dictionary that describes traffic signs: the Chinese national standard supplies the shape and color information, and our program applies Bootstrapping Language-Image Pretraining v2 (BLIPv2) to translate representative images into text descriptions. Second, a Contrastive Language-Image Pretraining (CLIP) framework characterizes both the drone images and the text descriptions, using pretrained encoder networks to produce visual features and word embeddings. Third, the category of each traffic sign is predicted from the similarity between drone images and keywords: the cosine distance and a softmax function yield the class probability distribution. To evaluate the performance, we apply the proposed method in a practical application, using drone images captured in Guyuan, China, to record the condition of traffic signs. Further experiments cover two widely used public datasets. The results indicate that our vision–language model-based method achieves acceptable prediction accuracy at a low training cost.
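The third step of the abstract (cosine similarity between image and keyword embeddings, followed by a softmax) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding values are hand-made placeholders for the outputs of CLIP's image and text encoders, and the `temperature` parameter is an assumed stand-in for CLIP's learned logit scale.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def classify(image_embedding, text_embeddings, temperature=100.0):
    """Class probabilities from cosine similarity between one image
    embedding and one text embedding per traffic-sign keyword."""
    # L2-normalize so the dot product equals cosine similarity
    img = image_embedding / np.linalg.norm(image_embedding)
    txt = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    sims = txt @ img                    # one cosine similarity per class
    return softmax(temperature * sims)  # class probability distribution

# Toy embeddings (hypothetical stand-ins for real CLIP outputs):
# three keyword descriptions, and one drone-image crop closest to class 1.
text_embeddings = np.array([[1.0, 0.0, 0.0, 0.0],
                            [0.0, 1.0, 0.0, 0.0],
                            [0.0, 0.0, 1.0, 0.0]])
image_embedding = np.array([0.1, 0.9, 0.0, 0.1])
probs = classify(image_embedding, text_embeddings)
print(probs.argmax())  # → 1, the keyword most similar to the image
```

In a real zero-shot pipeline, `text_embeddings` would come from encoding the keyword dictionary once, so classifying each drone image reduces to one encoder pass plus the normalized dot products above.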
Funder
Chinese Ministry of Transportation, In-Service Trunk Highway Infrastructure and Safety Emergency Digitization Project; Transportation Research Project of the Department of Transport of Shaanxi Province