An Enhanced Feature Extraction Framework for Cross-Modal Image–Text Retrieval
Published: 2024-06-17
Issue: 12
Volume: 16
Page: 2201
ISSN: 2072-4292
Container-title: Remote Sensing
Language: en
Short-container-title: Remote Sensing
Author:
Zhang Jinzhi 1, Wang Luyao 1, Zheng Fuzhong 1, Wang Xu 1, Zhang Haisu 1
Affiliation:
1. School of Information and Communication, National University of Defense Technology, Wuhan 430030, China
Abstract
Remote sensing images generally depict intricate scenes. In cross-modal retrieval tasks involving such images, the accompanying text carries abundant information but emphasizes mainly large objects, which attract more attention, so features of small targets are often omitted. Although the conventional vision transformer (ViT) adeptly captures information about large global targets, its ability to extract features of small targets is limited. This limitation stems from the constrained receptive field of ViT's self-attention layer, in which interference from large targets hinders the extraction of information about small targets. To address this concern, this study introduces a patch classification framework based on feature similarity, which establishes distinct receptive fields in the feature space to mitigate interference from large targets on small ones, thereby enhancing the ability of the traditional ViT to extract features from small targets. We conducted evaluation experiments on two popular datasets, the Remote Sensing Image–Text Match Dataset (RSITMD) and the Remote Sensing Image Captioning Dataset (RSICD), obtaining mR indices of 35.6% and 19.47%, respectively. The proposed approach contributes to improving the detection accuracy of small targets and can be applied to more complex image–text retrieval tasks involving multi-scale ground objects.
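The core idea described above, classifying ViT patches by feature similarity so that self-attention operates within similarity-defined groups rather than over all patches, can be illustrated with a minimal sketch. The sketch below is an assumption-laden illustration, not the authors' implementation: the cosine k-means grouping, the number of groups, and all function names (group_patches_by_similarity, grouped_self_attention) are hypothetical choices introduced only to show the general mechanism.

# Minimal sketch (assumed details, not the authors' code): group ViT patch tokens
# by feature similarity and confine self-attention to each group, so small-object
# patches are not dominated by large-object patches in the attention softmax.
import torch
import torch.nn.functional as F


def group_patches_by_similarity(patch_tokens: torch.Tensor, num_groups: int = 4,
                                iters: int = 10) -> torch.Tensor:
    """Cluster patch embeddings (B, N, D) with a simple cosine k-means.

    Returns integer group labels of shape (B, N). The clustering method and
    num_groups are illustrative assumptions, not taken from the paper.
    """
    x = F.normalize(patch_tokens, dim=-1)                 # work in cosine space
    B, N, D = x.shape
    # Initialise centroids from evenly spaced patches.
    idx = torch.linspace(0, N - 1, num_groups).long()
    centroids = x[:, idx, :].clone()                      # (B, K, D)
    for _ in range(iters):
        sim = torch.einsum("bnd,bkd->bnk", x, centroids)  # (B, N, K) similarities
        labels = sim.argmax(dim=-1)                       # (B, N) group assignment
        for k in range(num_groups):
            mask = (labels == k).unsqueeze(-1).float()    # (B, N, 1)
            denom = mask.sum(dim=1).clamp_min(1.0)
            centroids[:, k, :] = F.normalize((x * mask).sum(dim=1) / denom, dim=-1)
    return labels


def grouped_self_attention(patch_tokens: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Single-head self-attention restricted to patches that share a group label."""
    B, N, D = patch_tokens.shape
    q = k = v = patch_tokens                              # projection weights omitted for brevity
    attn = torch.einsum("bnd,bmd->bnm", q, k) / (D ** 0.5)
    same_group = labels.unsqueeze(2) == labels.unsqueeze(1)   # (B, N, N) boolean mask
    attn = attn.masked_fill(~same_group, float("-inf"))       # block cross-group attention
    return torch.einsum("bnm,bmd->bnd", attn.softmax(dim=-1), v)


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 64)                      # e.g. 14x14 ViT patch tokens
    labels = group_patches_by_similarity(tokens, num_groups=4)
    out = grouped_self_attention(tokens, labels)
    print(out.shape)                                      # torch.Size([2, 196, 64])

Because each patch attends only to patches with similar features, a cluster of small-target patches forms its own receptive field and is not averaged away by the far more numerous large-target patches, which is the effect the abstract attributes to the proposed framework.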
Funder
National Natural Science Foundation of China