VCC-DiffNet: Visual Conditional Control Diffusion Network for Remote Sensing Image Captioning

Published:2024-08-12 Issue:16 Volume:16 Page:2961
ISSN:2072-4292
Container-title:Remote Sensing
language:en
Short-container-title:Remote Sensing

Author:

Cheng Qimin¹, Xu Yuqi¹, Huang Ziyang¹

Affiliation:

1. School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China

Abstract

Pioneering remote sensing image captioning (RSIC) methods use autoregressive decoding to produce fluent and coherent sentences, but they suffer from high latency and high computation costs. In contrast, non-autoregressive approaches improve inference speed by predicting multiple tokens simultaneously, though at the cost of performance because they lack sequential dependencies. Recently, diffusion model-based non-autoregressive decoding has shown promise in natural image captioning through iterative refinement, but its effectiveness is limited by the intrinsic characteristics of remote sensing images (RSIs), which complicate robust input construction and affect the description accuracy. To overcome these challenges, we propose an innovative diffusion model for RSIC, named the Visual Conditional Control Diffusion Network (VCC-DiffNet). Specifically, we propose a Refined Multi-scale Feature Extraction (RMFE) module that extracts discernible visual context features from RSIs; these features are fed to the diffusion model-based non-autoregressive decoder to conditionally control a multi-step denoising process. Furthermore, we propose an Interactive Enhanced Decoder (IE-Decoder) that uses dual image–description interactions to generate descriptions finely aligned with the image content. Experiments on four representative RSIC datasets demonstrate that our non-autoregressive VCC-DiffNet performs comparably to, or even better than, popular autoregressive baselines on classic metrics, achieving speedups of around 8.22× on Sydney-Captions, 11.61× on UCM-Captions, 15.20× on RSICD, and 8.13× on NWPU-Captions.
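The latency contrast described in the abstract can be illustrated with a toy sketch. It is not the VCC-DiffNet implementation: the two "models" are hypothetical stand-ins that already know the caption, the caption itself is a made-up example, and all function names are assumptions introduced here. The sketch only counts sequential decoder calls. Autoregressive decoding needs one call per generated token, while iterative parallel decoding needs one call per refinement step, whatever the caption length.

```python
# Toy comparison of sequential decoder calls. The "models" below are
# hypothetical stand-ins that already know the caption, not trained networks.

TARGET = "many green trees are around a large baseball field".split()
EOS = "<eos>"
MASK = "<mask>"  # stand-in for the noisy starting sequence


def toy_next_token(prefix):
    """Hypothetical autoregressive model: returns the token after `prefix`."""
    return TARGET[len(prefix)] if len(prefix) < len(TARGET) else EOS


def toy_denoise(tokens, t, steps):
    """Hypothetical denoiser: updates every position in a single call and
    resolves more positions as the step index t falls towards 0."""
    return [TARGET[i] if i % steps >= t else tok for i, tok in enumerate(tokens)]


def autoregressive_decode(next_token, max_len):
    """Generate left to right: one model call per token, plus one for EOS."""
    tokens, calls = [], 0
    while len(tokens) < max_len:
        tok = next_token(tokens)
        calls += 1
        if tok == EOS:
            break
        tokens.append(tok)
    return tokens, calls


def iterative_parallel_decode(denoise, length, steps):
    """Refine all positions at once: one model call per refinement step."""
    tokens, calls = [MASK] * length, 0
    for t in reversed(range(steps)):
        tokens = denoise(tokens, t, steps)
        calls += 1
    return tokens, calls


ar_tokens, ar_calls = autoregressive_decode(toy_next_token, max_len=32)
nar_tokens, nar_calls = iterative_parallel_decode(toy_denoise, len(TARGET), steps=3)
print(ar_calls, nar_calls)
```

Call counts are only a proxy for wall-clock time, since the cost of each call differs between decoders. The speedups quoted in the abstract are the authors' measured results and are not derived from this sketch.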

Funder

National Key Research and Development Program of China

Publisher

MDPI AG

Link

https://www.mdpi.com/2072-4292/16/16/2961/pdf
