A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text–Image Retrieval in Remote Sensing-Reference-Cited by-同舟云学术

A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text–Image Retrieval in Remote Sensing

Published:2023-09-21 Issue:18 Volume:15 Page:4637
ISSN:2072-4292
Container-title:Remote Sensing
language:en
Short-container-title:Remote Sensing

Author:

Zhang Xiong¹,Li Weipeng¹^ORCID,Wang Xu¹,Wang Luyao¹,Zheng Fuzhong¹,Wang Long¹,Zhang Haisu¹

Affiliation:

1. School of Information and Communication, National University of Defense Technology, Wuhan 430074, China

Abstract

In recent years, there has been a growing interest in remote sensing image–text cross-modal retrieval due to the rapid development of space information technology and the significant increase in the volume of remote sensing image data. Remote sensing images have unique characteristics that make the cross-modal retrieval task challenging. Firstly, the semantics of remote sensing images are fine-grained, meaning they can be divided into multiple basic units of semantic expression. Different combinations of basic units of semantic expression can generate diverse text descriptions. Additionally, these images exhibit variations in resolution, color, and perspective. To address these challenges, this paper proposes a multi-task guided fusion encoder (MTGFE) based on the multimodal fusion encoding method, the progressiveness of which has been proved in the cross-modal retrieval of natural images. By jointly training the model with three tasks: image–text matching (ITM), masked language modeling (MLM), and the newly introduced multi-view joint representations contrast (MVJRC), we enhance its capability to capture fine-grained correlations between remote sensing images and texts. Specifically, the MVJRC task is designed to improve the model’s consistency in joint representation expression and fine-grained correlation, particularly for remote sensing images with significant differences in resolution, color, and angle. Furthermore, to address the computational complexity associated with large-scale fusion models and improve retrieval efficiency, this paper proposes a retrieval filtering method, which achieves higher retrieval efficiency while minimizing accuracy loss. Extensive experiments were conducted on four public datasets to evaluate the proposed method, and the results validate its effectiveness.

Funder

National Natural Science Foundation of China

Publisher

MDPI AG

Subject

General Earth and Planetary Sciences

Link

https://www.mdpi.com/2072-4292/15/18/4637/pdf

Reference53 articles.

1. Simonyan, K., and Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv.

2. He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep Residual Learning for Image Recognition. arXiv.

3. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2021). An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale. arXiv.

4. LSTM: A Search Space Odyssey;Greff;IEEE Trans. Neural Networks Learn. Syst.,2017

5. Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. arXiv.

Cited by 2 articles. 订阅此论文施引文献订阅此论文施引文献，注册后可以免费订阅5篇论文的施引文献，订阅后可以查看论文全部施引文献

1. An Enhanced Feature Extraction Framework for Cross-Modal Image–Text Retrieval;Remote Sensing;2024-06-17

2. DMCH: A Deep Metric and Category-Level Semantic Hashing Network for Retrieval in Remote Sensing;Remote Sensing;2023-12-25