Affiliation:
1. School of Rail Transportation, Shandong Jiaotong University, Jinan 250357, China
2. State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University, Beijing 100044, China
3. CRSC Research and Design Institute Group Co., Ltd., Beijing 100070, China
Abstract
Multi-modal image fusion is a methodology that combines image features from multiple types of sensors, effectively improving the quality and content of fused images. However, most existing deep learning fusion methods extract only global or only local features, which restricts the representation of feature information. To address this issue, a hybrid densely connected CNN and transformer (HDCCT) fusion framework is proposed. In the proposed HDCCT framework, the CNN-based blocks capture the local structure of the input data and the transformer-based blocks capture its global structure, significantly improving the feature representation. An encoder–decoder architecture is designed for both the CNN and transformer blocks to reduce feature loss during fusion while preserving the characterization of features at all levels. In addition, the cross-coupled framework facilitates the flow of feature structures, retains the uniqueness of information, and enables the transformer to model long-range dependencies on top of the local features already extracted by the CNN. Meanwhile, to retain the information of the source images, a hybrid loss combining structural similarity (SSIM) and mean squared error (MSE) is introduced. Qualitative and quantitative comparisons on grayscale infrared and visible image fusion indicate that the proposed method outperforms related works.
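The abstract describes each hybrid stage as a densely connected convolutional block for local structure followed by a transformer that models long-range dependencies over those CNN features. The following is a minimal PyTorch sketch of that idea, not the paper's actual architecture: the layer sizes, the two-layer dense block, and the use of `nn.TransformerEncoderLayer` with one token per spatial position are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class DenseConvBlock(nn.Module):
    """Two densely connected 3x3 conv layers: each layer sees all earlier outputs."""
    def __init__(self, channels: int = 32, growth: int = 16):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, growth, 3, padding=1)
        self.conv2 = nn.Conv2d(channels + growth, growth, 3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y1 = self.act(self.conv1(x))
        y2 = self.act(self.conv2(torch.cat([x, y1], dim=1)))
        return torch.cat([x, y1, y2], dim=1)  # dense connectivity: concat all features


class HybridStage(nn.Module):
    """CNN block for local features, then a transformer layer for global context."""
    def __init__(self, channels: int = 32, growth: int = 16, heads: int = 4):
        super().__init__()
        self.local = DenseConvBlock(channels, growth)
        dim = channels + 2 * growth  # channel count after dense concatenation
        self.globl = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.local(x)                      # (B, C', H, W) local structure
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)  # (B, H*W, C'), one token per pixel
        tokens = self.globl(tokens)            # self-attention over all positions
        return tokens.transpose(1, 2).reshape(b, c, h, w)
```

Stacking such stages inside an encoder–decoder, with cross-coupled connections between the CNN and transformer paths, would correspond to the structure the abstract outlines.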
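The hybrid SSIM + MSE loss can likewise be sketched concretely. Below is a minimal PyTorch version, assuming the fused image is penalized against both source images; the weighting factor `lambda_ssim` and the 11x11 Gaussian window are illustrative choices, not values taken from the paper.

```python
import torch
import torch.nn.functional as F


def gaussian_window(size: int = 11, sigma: float = 1.5) -> torch.Tensor:
    """Normalized 2-D Gaussian kernel used to compute local SSIM statistics."""
    coords = torch.arange(size, dtype=torch.float32) - size // 2
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    g = g / g.sum()
    return (g[:, None] * g[None, :]).unsqueeze(0).unsqueeze(0)  # (1, 1, size, size)


def ssim(x: torch.Tensor, y: torch.Tensor, window: torch.Tensor,
         c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    """Mean structural similarity between single-channel images scaled to [0, 1]."""
    pad = window.shape[-1] // 2
    mu_x = F.conv2d(x, window, padding=pad)
    mu_y = F.conv2d(y, window, padding=pad)
    var_x = F.conv2d(x * x, window, padding=pad) - mu_x ** 2
    var_y = F.conv2d(y * y, window, padding=pad) - mu_y ** 2
    cov = F.conv2d(x * y, window, padding=pad) - mu_x * mu_y
    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return ssim_map.mean()


def hybrid_fusion_loss(fused: torch.Tensor, ir: torch.Tensor,
                       vis: torch.Tensor, lambda_ssim: float = 0.5) -> torch.Tensor:
    """MSE plus (1 - SSIM) of the fused image against both source images."""
    window = gaussian_window().to(fused.device)
    mse_term = F.mse_loss(fused, ir) + F.mse_loss(fused, vis)
    ssim_term = (1 - ssim(fused, ir, window)) + (1 - ssim(fused, vis, window))
    return mse_term + lambda_ssim * ssim_term
```

The MSE term pulls the fused image toward the pixel intensities of the sources, while the SSIM term preserves local structure; balancing the two is what allows the fused result to retain information from both modalities.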