Affiliation:
1. School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China
2. School of Integrated Circuits, Guangdong University of Technology, Guangzhou 510006, China
3. School of Automation, Guangdong University of Technology, Guangzhou 510006, China
Abstract
Infrared and visible image fusion integrates complementary information from different modalities into a single image, providing sufficient imaging information for scene interpretation and downstream target recognition tasks. However, existing fusion methods often focus only on highlighting salient targets or preserving scene details; they fail to combine the full range of features from the different modalities during fusion, leaving features underutilized and degrading the overall fusion result. To address these challenges, a global and local four-branch feature extraction image fusion network (GLFuse) is proposed. On one hand, a Super Token Transformer (STT) block, which rapidly samples and predicts super tokens, is used to capture global features of the scene. On the other hand, a Detail Extraction Block (DEB) is developed to extract local features of the scene. In addition, two feature fusion modules, the Attention-based Feature Selection Fusion Module (ASFM) and the Dual Attention Fusion Module (DAFM), are designed to selectively fuse features from the different modalities. More importantly, the perceptual information carried by feature maps learned from different modality images at different network layers is investigated, and a perceptual loss function is designed that treats this information separately so as to better restore scene detail and highlight salient targets. Extensive experiments confirm that GLFuse performs well in both subjective and objective evaluations. Notably, GLFuse also improves downstream target detection performance on a unified benchmark.
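The abstract describes a dual-modality, four-branch layout (global and local branches per modality, followed by two fusion modules and a reconstruction stage). The following is a minimal, hypothetical PyTorch sketch of that layout only; the class names (GLFuseSketch, ConvBlock) and all module internals are placeholders and do not reproduce the STT, DEB, ASFM, or DAFM designs from the paper.

```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Simple conv-BN-ReLU block used as a stand-in for the real sub-modules."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)


class GLFuseSketch(nn.Module):
    """Illustrative four-branch structure: global and local features are
    extracted from the infrared and visible inputs, fused per feature type
    (stand-ins for ASFM/DAFM), and decoded into a single fused image."""

    def __init__(self, ch=32):
        super().__init__()
        # Global branches (stand-ins for the Super Token Transformer blocks)
        self.global_ir = ConvBlock(1, ch)
        self.global_vis = ConvBlock(1, ch)
        # Local branches (stand-ins for the Detail Extraction Blocks)
        self.local_ir = ConvBlock(1, ch)
        self.local_vis = ConvBlock(1, ch)
        # Cross-modality fusion modules (stand-ins for ASFM and DAFM)
        self.fuse_global = ConvBlock(2 * ch, ch)
        self.fuse_local = ConvBlock(2 * ch, ch)
        # Reconstruction head producing the fused image
        self.decoder = nn.Conv2d(2 * ch, 1, kernel_size=3, padding=1)

    def forward(self, ir, vis):
        g = self.fuse_global(torch.cat([self.global_ir(ir), self.global_vis(vis)], dim=1))
        l = self.fuse_local(torch.cat([self.local_ir(ir), self.local_vis(vis)], dim=1))
        return torch.sigmoid(self.decoder(torch.cat([g, l], dim=1)))


if __name__ == "__main__":
    ir = torch.rand(1, 1, 256, 256)   # single-channel infrared input
    vis = torch.rand(1, 1, 256, 256)  # single-channel visible input (grayscale)
    fused = GLFuseSketch()(ir, vis)
    print(fused.shape)  # torch.Size([1, 1, 256, 256])
```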
Funder
GuangDong Basic and Applied Basic Research Foundation
National Natural Science Foundation of China
Guangzhou Municipal Science and Technology