Drone-Based Visible–Thermal Object Detection with Transformers and Prompt Tuning
Author:
Chen Rui 1, Li Dongdong 1, Gao Zhinan 1, Kuai Yangliu 2, Wang Chengyuan 3
Affiliation:
1. College of Electronic Science and Technology, National University of Defense Technology, Changsha 410073, China
2. College of Intelligent Science and Technology, National University of Defense Technology, Changsha 410073, China
3. Information and Communication College, National University of Defense Technology, Wuhan 430010, China
Abstract
Visible–thermal object detection from unmanned aerial vehicles (UAVs) has emerged as a powerful technique for improving accuracy and robustness in challenging conditions such as low light and severe weather. However, most existing work relies on Convolutional Neural Network (CNN) frameworks, confining the Transformer’s attention mechanism to fusion modules and neglecting its potential for global feature modeling. To address this limitation, this study introduces a dual-modal object detection framework, Visual Prompt multi-modal Detection (VIP-Det), that adopts the Transformer architecture as the primary feature extractor and integrates vision prompts for refined feature fusion. Our approach first trains a single-modal baseline model to establish robust feature representations, then fine-tunes it with additional modal data and prompts. Experiments on the DroneVehicle dataset show that our algorithm achieves high accuracy, outperforming comparable Transformer-based methods. These findings indicate that the proposed method marks a significant advance in UAV-based object detection and holds considerable promise for autonomous surveillance and monitoring in varied and challenging environments.
Funder
National Natural Science Foundation of China
Scientific Research Project of National University of Defense Technology