An End-to-End Framework Based on Vision-Language Fusion for Remote Sensing Cross-Modal Text-Image Retrieval

Authors:

He Liu 1, Liu Shuyan 1, An Ran 1, Zhuo Yudong 1, Tao Jian 1

Affiliation:

1. Department of Big Data Research and Application Technology, China Aero-Polytechnology Establishment, Beijing 100028, China

Abstract

Remote sensing cross-modal text-image retrieval (RSCTIR) has recently attracted extensive attention due to its fast extraction of remote sensing image information and flexible human–computer interaction. Traditional RSCTIR methods mainly focus on improving uni-modal feature extraction separately, and most rely on pre-trained object detectors to obtain better local feature representations; this not only lacks multi-modal interaction information but also introduces a training gap between the pre-trained object detector and the retrieval task. In this paper, we propose an end-to-end RSCTIR framework based on vision-language fusion (EnVLF), consisting of two uni-modal (vision and language) encoders and a multi-modal encoder, all of which can be optimized by multitask training. Specifically, to achieve an end-to-end training process, we introduce a vision transformer module for extracting image local features instead of a pre-trained object detector. Through semantic alignment of visual and textual features, the vision transformer module matches the local-feature performance of pre-trained object detectors. In addition, the trained multi-modal encoder improves the top-one and top-five ranking performance after the retrieval stage. Experiments on the common RSICD and RSITMD datasets demonstrate that EnVLF obtains state-of-the-art retrieval performance.
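To make the described architecture concrete, below is a minimal, hypothetical PyTorch sketch of the pipeline outlined in the abstract: a ViT-style vision encoder and a transformer text encoder produce global embeddings for coarse contrastive retrieval, and a multi-modal fusion encoder re-scores the top-k candidates. All module names, dimensions, and the fusion/re-ranking details here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an EnVLF-style retrieval pipeline (not the authors' code):
# uni-modal encoders for coarse retrieval + a multi-modal encoder for re-ranking.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisionEncoder(nn.Module):
    """ViT-style patch encoder standing in for a pre-trained object detector."""
    def __init__(self, dim=256, patch=16, img_size=224, layers=4, heads=4):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        num_patches = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, images):                                   # (B, 3, 224, 224)
        x = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, D) patch tokens
        return self.encoder(x + self.pos)                        # local (patch-level) features


class TextEncoder(nn.Module):
    """Transformer text encoder producing token-level features."""
    def __init__(self, vocab=30522, dim=256, layers=4, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, tokens):                                   # (B, L) token ids
        return self.encoder(self.embed(tokens))                  # (B, L, D)


class FusionEncoder(nn.Module):
    """Multi-modal encoder that scores how well an image/text pair matches."""
    def __init__(self, dim=256, layers=2, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.match_head = nn.Linear(dim, 1)

    def forward(self, img_feats, txt_feats):
        joint = torch.cat([img_feats, txt_feats], dim=1)         # simple concatenation fusion
        return self.match_head(self.encoder(joint).mean(dim=1)).squeeze(-1)


def retrieve(images, tokens, vis, txt, fuse, k=5):
    """Coarse contrastive retrieval, then fusion-based re-ranking of the top-k."""
    img_emb = F.normalize(vis(images).mean(dim=1), dim=-1)       # global image embeddings
    txt_emb = F.normalize(txt(tokens).mean(dim=1), dim=-1)       # global text embeddings
    sims = txt_emb @ img_emb.t()                                 # (num_texts, num_images)
    topk = sims.topk(k, dim=1).indices
    reranked = []
    for q, cand in enumerate(topk):                              # re-score each candidate pair
        scores = fuse(vis(images[cand]), txt(tokens[q:q + 1]).expand(k, -1, -1))
        reranked.append(cand[scores.argsort(descending=True)])
    return torch.stack(reranked)                                 # (num_texts, k) re-ranked indices
```

As a toy usage, `retrieve(torch.randn(8, 3, 224, 224), torch.randint(0, 30522, (8, 20)), VisionEncoder(), TextEncoder(), FusionEncoder())` returns re-ranked image indices for each query sentence; in the paper's setting the coarse stage would be trained contrastively and the fusion stage with an image-text matching objective, consistent with the multitask training mentioned in the abstract.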

Publisher

MDPI AG

Subject

General Mathematics, Engineering (miscellaneous), Computer Science (miscellaneous)

Cited by 4 articles.