An Enhanced Feature Extraction Framework for Cross-Modal Image–Text Retrieval
Published: 2024-06-17
Issue: 12
Volume: 16
Page: 2201
ISSN: 2072-4292
Container-title: Remote Sensing
Language: en
Short-container-title: Remote Sensing
Author:
Zhang Jinzhi 1, Wang Luyao 1, Zheng Fuzhong 1, Wang Xu 1, Zhang Haisu 1
Affiliation:
1. School of Information and Communication, National University of Defense Technology, Wuhan 430030, China
Abstract
Remote sensing images generally depict intricate scenes. In cross-modal retrieval tasks involving such images, the accompanying text carries abundant information but emphasizes mainly large objects, which attract more attention, so features of small targets are often omitted. Although the conventional vision transformer (ViT) adeptly captures information about large global targets, its ability to extract features of small targets is limited. This limitation stems from the constrained receptive field of ViT's self-attention layer, in which interference from large targets hinders the extraction of information about small targets. To address this concern, this study introduces a patch classification framework based on feature similarity, which establishes distinct receptive fields in the feature space to mitigate interference from large targets on small ones, thereby enhancing the ability of the traditional ViT to extract features from small targets. We conducted evaluation experiments on two popular datasets, the Remote Sensing Image–Text Match Dataset (RSITMD) and the Remote Sensing Image Captioning Dataset (RSICD), obtaining mR indices of 35.6% and 19.47%, respectively. The proposed approach contributes to improving the detection accuracy of small targets and can be applied to more complex image–text retrieval tasks involving multi-scale ground objects.
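The core idea described above, classifying ViT patches by feature similarity so that self-attention operates within similarity-defined groups rather than over all patches, can be illustrated with a minimal sketch. The sketch below is an assumption-laden illustration, not the authors' implementation: the cosine k-means grouping, the number of groups, and all function names (group_patches_by_similarity, grouped_self_attention) are hypothetical choices introduced only to show the general mechanism.

# Minimal sketch (assumed details, not the authors' code): group ViT patch tokens
# by feature similarity and confine self-attention to each group, so small-object
# patches are not dominated by large-object patches in the attention softmax.
import torch
import torch.nn.functional as F


def group_patches_by_similarity(patch_tokens: torch.Tensor, num_groups: int = 4,
                                iters: int = 10) -> torch.Tensor:
    """Cluster patch embeddings (B, N, D) with a simple cosine k-means.

    Returns integer group labels of shape (B, N). The clustering method and
    num_groups are illustrative assumptions, not taken from the paper.
    """
    x = F.normalize(patch_tokens, dim=-1)                 # work in cosine space
    B, N, D = x.shape
    # Initialise centroids from evenly spaced patches.
    idx = torch.linspace(0, N - 1, num_groups).long()
    centroids = x[:, idx, :].clone()                      # (B, K, D)
    for _ in range(iters):
        sim = torch.einsum("bnd,bkd->bnk", x, centroids)  # (B, N, K) similarities
        labels = sim.argmax(dim=-1)                       # (B, N) group assignment
        for k in range(num_groups):
            mask = (labels == k).unsqueeze(-1).float()    # (B, N, 1)
            denom = mask.sum(dim=1).clamp_min(1.0)
            centroids[:, k, :] = F.normalize((x * mask).sum(dim=1) / denom, dim=-1)
    return labels


def grouped_self_attention(patch_tokens: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Single-head self-attention restricted to patches that share a group label."""
    B, N, D = patch_tokens.shape
    q = k = v = patch_tokens                              # projection weights omitted for brevity
    attn = torch.einsum("bnd,bmd->bnm", q, k) / (D ** 0.5)
    same_group = labels.unsqueeze(2) == labels.unsqueeze(1)   # (B, N, N) boolean mask
    attn = attn.masked_fill(~same_group, float("-inf"))       # block cross-group attention
    return torch.einsum("bnm,bmd->bnd", attn.softmax(dim=-1), v)


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 64)                      # e.g. 14x14 ViT patch tokens
    labels = group_patches_by_similarity(tokens, num_groups=4)
    out = grouped_self_attention(tokens, labels)
    print(out.shape)                                      # torch.Size([2, 196, 64])

Because each patch attends only to patches with similar features, a cluster of small-target patches forms its own receptive field and is not averaged away by the far more numerous large-target patches, which is the effect the abstract attributes to the proposed framework.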
Funder
National Natural Science Foundation of China