Affiliation:
1. School of Automation, Northwestern Polytechnical University, Xi’an 710072, China
2. Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, Shenzhen 518055, China
3. Radar Research Laboratory, School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
Abstract
Transformers have recently gained significant attention in low-level vision tasks, particularly remote sensing image super-resolution (RSISR). The vanilla vision transformer establishes long-range dependencies between image patches, but its global receptive field causes computational complexity to grow quadratically with spatial size, making it inefficient for RSISR tasks, which involve processing large images. To reduce this cost, recent studies have adopted local attention mechanisms inspired by convolutional neural networks (CNNs), restricting interactions to patches within small windows. These approaches, however, suffer from correspondingly small effective receptive fields, and their fixed window sizes prevent them from perceiving multi-scale information, which limits model performance. To address these challenges, we propose a hierarchical transformer named the Multi-Scale and Global Representation Enhancement-based Transformer (MSGFormer). We introduce an efficient attention mechanism, Dual Window-based Self-Attention (DWSA), which combines distributed and concentrated attention to balance computational complexity against receptive field range. We also incorporate a Multi-scale Depth-wise Convolution Attention (MDCA) module that captures multi-scale features through multi-branch convolution. Furthermore, we develop a new Tracing-Back Structure (TBS) that provides tracing-back mechanisms for both proposed attention modules, enhancing their feature representation capability. Extensive experiments demonstrate that MSGFormer outperforms state-of-the-art methods on multiple public RSISR datasets by 0.11–0.55 dB.
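The abstract's complexity argument — global self-attention is quadratic in spatial size, while window attention is linear for a fixed window — can be illustrated with a back-of-the-envelope FLOP count. This is a minimal sketch of the general trade-off, not code from the MSGFormer paper; it counts only the QKᵀ and attention–value matmuls and omits projections.

```python
def global_attention_flops(h, w, c):
    """Global self-attention over all h*w tokens: O((h*w)^2 * c)."""
    n = h * w
    return 2 * n * n * c  # QK^T matmul plus attention @ V matmul

def window_attention_flops(h, w, c, win):
    """Attention restricted to non-overlapping win x win windows:
    O(h*w * win^2 * c), i.e. linear in spatial size for a fixed win."""
    n_windows = (h // win) * (w // win)
    tokens = win * win
    return n_windows * 2 * tokens * tokens * c

# For a 256x256 feature map with 64 channels and 8x8 windows,
# global attention costs (h*w)/win^2 = 1024x more than window attention.
h, w, c, win = 256, 256, 64, 8
ratio = global_attention_flops(h, w, c) / window_attention_flops(h, w, c, win)
print(f"global/window cost ratio: {ratio:.0f}x")  # prints 1024x
```

The ratio grows linearly with image area, which is why fixed-window attention scales to the large inputs typical of RSISR, at the price of the reduced receptive field the abstract describes.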
Funders
Postdoctoral Science Foundation of China
Shaanxi Science Fund for Distinguished Young Scholars
Basic and Applied Basic Research Foundation of Guangdong Province