RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery

Published: 2024-04-23 Issue: 9 Volume: 16 Page: 1477
ISSN: 2072-4292
Container-title: Remote Sensing
Language: en
Short-container-title: Remote Sensing

Author:

Bazi Yakoub¹, Bashmal Laila¹, Al Rahhal Mohamad Mahmoud², Ricci Riccardo³, Melgani Farid³

Affiliation:

1. Computer Engineering Department, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia

2. Applied Computer Science Department, College of Applied Computer Science, King Saud University, Riyadh 11543, Saudi Arabia

3. Department of Information Engineering and Computer Science, University of Trento, 38123 Trento, Italy

Abstract

In this paper, we delve into the innovative application of large language models (LLMs) and their extension, large vision-language models (LVLMs), in the field of remote sensing (RS) image analysis. We particularly emphasize their multi-tasking potential with a focus on image captioning and visual question answering (VQA). In particular, we introduce an improved version of the Large Language and Vision Assistant Model (LLaVA), specifically adapted for RS imagery through a low-rank adaptation approach. To evaluate the model performance, we create the RS-instructions dataset, a comprehensive benchmark dataset that integrates four diverse single-task datasets related to captioning and VQA. The experimental results confirm the model’s effectiveness, marking a step forward toward the development of efficient multi-task models for RS image analysis.

Funder

King Saud University

Publisher

MDPI AG

Link

https://www.mdpi.com/2072-4292/16/9/1477/pdf

References: 67 articles.

1. Language Integration in Remote Sensing: Tasks, datasets, and future directions; Bashmal; IEEE Geosci. Remote Sens. Mag., 2023

2. Qu, B., Li, X., Tao, D., and Lu, X. (2016, January 6–8). Deep semantic understanding of high resolution remote sensing image. Proceedings of the 2016 International Conference on Computer, Information and Telecommunication Systems (CITS), Kunming, China.

3. RSVQA: Visual Question Answering for Remote Sensing Data; Lobry; IEEE Trans. Geosci. Remote Sens., 2020

4. Abdullah, T., Bazi, Y., Al Rahhal, M.M., Mekhalfi, M.L., Rangarajan, L., and Zuair, M. (2020). TextRS: Deep Bidirectional Triplet Network for Matching Text to Remote Sensing Images. Remote Sens., 12.

5. Text-to-Remote-Sensing-Image Generation with Structured Generative Adversarial Networks; Zhao; IEEE Geosci. Remote Sens. Lett., 2022