Cross-Modal Retrieval and Semantic Refinement for Remote Sensing Image Captioning

Published:2024-01-03 Issue:1 Volume:16 Page:196
ISSN:2072-4292
Container-title:Remote Sensing
language:en
Short-container-title:Remote Sensing

Author:

Li Zhengxin¹²³, Zhao Wenzhe¹², Du Xuanyi¹³, Zhou Guangyao¹², Zhang Songlin¹²

Affiliation:

1. The Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China

2. Key Laboratory of Spatial Information Processing and Application System Technology, Chinese Academy of Sciences, Beijing 100190, China

3. School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 101408, China

Abstract

Two-stage remote sensing image captioning (RSIC) methods have achieved promising results by incorporating additional pre-trained remote sensing tasks to extract supplementary information and improve caption quality. However, these methods are limited in semantic comprehension: pre-trained detectors and classifiers are constrained by predefined labels, so they overlook the intricate and diverse details present in remote sensing images (RSIs). Additionally, handling the auxiliary remote sensing tasks separately makes it difficult to integrate and align them seamlessly with the captioning process. To address these problems, we propose a novel cross-modal retrieval and semantic refinement (CRSR) RSIC method. Specifically, we employ a cross-modal retrieval model to retrieve relevant sentences for each image. The words in these retrieved sentences are then treated as primary semantic information, providing valuable supplementary information for the captioning process. To further improve caption quality, we introduce a semantic refinement module that refines the primary semantic information, filtering out misleading information and emphasizing visually salient semantic information. A Transformer Mapper network with learnable queries is introduced to expand the representation of image features beyond the retrieved supplementary information. Both the refined semantic tokens and the visual features are integrated and fed into a cross-modal decoder for caption generation. Through extensive experiments, we demonstrate the superiority of our CRSR method over existing state-of-the-art approaches on the RSICD, UCM-Captions, and Sydney-Captions datasets.
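
The retrieval step described in the abstract (retrieve relevant sentences for an image, then treat their words as primary semantic information) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the retrieval model, the similarity measure, or the number of retrieved sentences. The example assumes cosine similarity over a shared embedding space, and the function names (`retrieve_sentences`, `primary_semantic_words`), the toy vectors, and `k=2` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_sentences(image_vec, sentence_vecs, sentences, k=2):
    """Return the k sentences whose embeddings are most similar to the image.

    Assumption: image and sentence embeddings live in one shared space,
    as a cross-modal retrieval model would provide.
    """
    ranked = sorted(
        zip(sentences, sentence_vecs),
        key=lambda pair: cosine(image_vec, pair[1]),
        reverse=True,
    )
    return [sentence for sentence, _ in ranked[:k]]


def primary_semantic_words(retrieved):
    """Collect the distinct words of the retrieved sentences, in order."""
    seen, words = set(), []
    for sentence in retrieved:
        for word in sentence.lower().split():
            if word not in seen:
                seen.add(word)
                words.append(word)
    return words


# Toy, hand-made embeddings (hypothetical; a real system would learn them).
sentences = [
    "many planes are parked at the airport",
    "a river runs through the forest",
    "several planes near a terminal building",
]
sentence_vecs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1]]
image_vec = [1.0, 0.2, 0.0]

retrieved = retrieve_sentences(image_vec, sentence_vecs, sentences, k=2)
words = primary_semantic_words(retrieved)
print(retrieved)
print(words)
```

In this sketch the word list still contains function words such as "the" and "at"; filtering misleading words and emphasizing visually salient ones is the role the abstract assigns to the semantic refinement module, which is not modeled here.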

Publisher

MDPI AG

Link

https://www.mdpi.com/2072-4292/16/1/196/pdf

Cited by 2 articles.

1. Incorporating object counts into remote sensing image captioning;International Journal of Digital Earth;2024-08-22

2. Utilising SkyScript for Open-Vocabulary Categorization, Extraction, and Captioning to Enhance Multi-Modal Tasks in Remote Sensing;Remote Sensing in Earth Systems Sciences;2024-07-25