Affiliation:
1. School of Computer Science and Technology, Xinjiang University, Urumqi 830017, China
2. Xinjiang Key Laboratory of Multilingual Information Technology, Xinjiang University, Urumqi 830017, China
Abstract
Remote-sensing image captioning (RSIC) aims to generate descriptive sentences for images by capturing both local and global semantic information. The task is challenging because the images contain diverse object types and varying scenes. To address these challenges, we propose a positional-channel semantic fusion transformer (PCSFTr). The PCSFTr model first uses scene classification to extract visual features and learn semantic information. A novel positional-channel multi-headed self-attention (PCMSA) block then captures spatial and channel dependencies simultaneously, enriching the semantic information. The feature fusion (FF) module further enhances the understanding of semantic relationships. Experimental results show that PCSFTr significantly outperforms existing methods: the BLEU-4 score reached 78.42% on UCM-caption, 54.42% on RSICD, and 69.01% on NWPU-captions. This research provides new insights into RSIC by offering a more comprehensive understanding of the semantic information and relationships within images and by improving the performance of image captioning models.
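For readers unfamiliar with the BLEU-4 metric reported above, the following is a minimal sketch of the standard sentence-level definition: the geometric mean of clipped 1- to 4-gram precisions, multiplied by a brevity penalty. It is an illustration only. The abstract does not say which evaluation toolkit, tokenization, or smoothing the authors used, and published scores are normally computed at corpus level over several reference captions per image. The function names and the example captions are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, references):
    """Sentence-level BLEU-4 with uniform weights and no smoothing (illustrative)."""
    log_precisions = []
    for n in range(1, 5):
        cand_counts = ngrams(candidate, n)
        if not cand_counts:
            return 0.0  # candidate is shorter than n tokens
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precisions.append(math.log(clipped / sum(cand_counts.values())))
    # Brevity penalty: compare against the reference length closest to the candidate.
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / 4)


# Hypothetical captions, not taken from any of the datasets named above.
candidate = "many planes are parked near the terminal".split()
references = ["many planes are parked at the airport".split()]
score = bleu4(candidate, references)
```

A reported value such as 78.42% corresponds to a score of 0.7842 on this 0-to-1 scale.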
Funder
National Key Research and Development Program of China
Key Research and Development Program of the Autonomous Region
National Natural Science Foundation of China
Tianshan Science and Technology Innovation Leading Talent Project of the Autonomous Region
Central guidance for local special projects