Enhanced Transformer for Remote-Sensing Image Captioning with Positional-Channel Semantic Fusion

Published: 2024-09-11 Issue: 18 Volume: 13 Page: 3605
ISSN: 2079-9292
Container-title: Electronics
Language: en
Short-container-title: Electronics

Author:

Zhao An¹, Yang Wenzhong¹,², Chen Danny¹, Wei Fuyuan¹

Affiliation:

1. School of Computer Science and Technology, Xinjiang University, Urumqi 830017, China

2. Xinjiang Key Laboratory of Multilingual Information Technology, Xinjiang University, Urumqi 830017, China

Abstract

Remote-sensing image captioning (RSIC) aims to generate descriptive sentences for images by capturing both local and global semantic information. This task is challenging due to the diverse object types and varying scenes in images. To address these challenges, we propose a positional-channel semantic fusion transformer (PCSFTr). The PCSFTr model first employs scene classification to extract visual features and learn semantic information. A novel positional-channel multi-headed self-attention (PCMSA) block then captures spatial and channel dependencies simultaneously, enriching the semantic information. The feature fusion (FF) module further enhances the understanding of semantic relationships. Experimental results show that PCSFTr significantly outperforms existing methods. Specifically, the BLEU-4 score reached 78.42% on UCM-caption, 54.42% on RSICD, and 69.01% on NWPU-captions. This research provides new insights into RSIC by offering a more comprehensive understanding of semantic information and relationships within images and by improving the performance of image captioning models.
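
For readers unfamiliar with the BLEU-4 metric quoted in the abstract, the following is a minimal sketch of how such a score can be computed. It assumes an unsmoothed, sentence-level variant with whitespace-tokenized captions; the function names `ngrams` and `bleu4` and the example captions are hypothetical. The evaluation script used in the paper is not described in this record and may differ (for example, corpus-level aggregation or a different tokenizer).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, references):
    """Unsmoothed sentence-level BLEU-4.

    candidate: list of tokens; references: list of token lists.
    Returns a value in [0, 1]; multiply by 100 for a percentage.
    """
    log_precisions = []
    for n in range(1, 5):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        # Without smoothing, any zero precision makes the whole score zero.
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: compare against the reference length closest to the candidate's.
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    # Geometric mean of the four modified n-gram precisions.
    return brevity_penalty * math.exp(sum(log_precisions) / 4)


if __name__ == "__main__":
    # Made-up captions for illustration only, not taken from any dataset.
    refs = ["many planes are parked at the airport".split(),
            "several planes are parked near the terminal".split()]
    hyp = "many planes are parked near the airport".split()
    print(f"BLEU-4: {100 * bleu4(hyp, refs):.2f}%")
```

A candidate identical to one of its references scores 1.0 (100%), and a candidate sharing no 4-gram with any reference scores 0 under this unsmoothed variant, which is one reason published results are usually aggregated over a whole test set.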

Funder

National Key Research and Development Program of China

Key Research and Development Program of the Autonomous Region

National Natural Science Foundation of China

Tianshan Science and Technology Innovation Leading talent Project of the Autonomous Region

Central guidance for local special projects

Publisher

MDPI AG

Link

https://www.mdpi.com/2079-9292/13/18/3605/pdf
