Affiliation:
1. Department of Big Data Research and Application Technology, China Aero-Polytechnology Establishment, Beijing 100028, China
Abstract
Remote sensing cross-modal text-image retrieval (RSCTIR) has recently attracted extensive attention because it enables fast extraction of remote sensing image information and flexible human–computer interaction. Traditional RSCTIR methods mainly focus on improving uni-modal feature extraction for each modality separately, and most rely on pre-trained object detectors to obtain better local feature representations. This approach not only lacks multi-modal interaction information, but also creates a training gap between the pre-trained object detector and the retrieval task. In this paper, we propose an end-to-end RSCTIR framework based on vision-language fusion (EnVLF), consisting of two uni-modal (vision and language) encoders and a multi-modal encoder, which can be optimized by multitask training. Specifically, to achieve an end-to-end training process, we use a vision transformer module, instead of a pre-trained object detector, to extract local image features. Through semantic alignment of visual and text features, the vision transformer module achieves the same performance as pre-trained object detectors for local image features. In addition, the trained multi-modal encoder can improve the top-one and top-five ranking performance after retrieval processing. Experiments on the widely used RSICD and RSITMD datasets demonstrate that EnVLF obtains state-of-the-art retrieval performance.
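The abstract does not say how the multi-modal encoder is applied after retrieval. One plausible reading is a two-stage scheme: rank all images by similarity between uni-modal embeddings, then re-order only the top k candidates with a fusion-based matching score. The following is a minimal sketch under that assumption, not the authors' implementation. The names `cosine`, `retrieve_then_rerank`, and `fusion_score`, as well as the toy vectors and scores, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(text_vec, image_vecs, fusion_score, k=5):
    """Return image indices ranked for one text query.

    Stage 1 (assumed): rank every image by cosine similarity between the
    text embedding and each image embedding from the uni-modal encoders.
    Stage 2 (assumed): re-order only the top-k candidates using
    fusion_score(index), a stand-in for a multi-modal matching score.
    """
    coarse = sorted(
        range(len(image_vecs)),
        key=lambda i: cosine(text_vec, image_vecs[i]),
        reverse=True,
    )
    head, tail = coarse[:k], coarse[k:]
    head = sorted(head, key=fusion_score, reverse=True)
    return head + tail


# Toy data: three 2-D image embeddings and one text embedding (hypothetical).
image_vecs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
text_vec = [1.0, 0.1]
# Hypothetical fusion scores keyed by image index.
toy_fusion = {0: 0.2, 1: 0.9, 2: 0.99}

ranking = retrieve_then_rerank(text_vec, image_vecs, lambda i: toy_fusion[i], k=2)
print(ranking)
```

In this toy run, the second stage swaps the first two candidates, while image 2 stays last despite its high fusion score because it fell outside the top k. This illustrates why such a scheme would mainly affect top-one and top-five metrics.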
Subject
General Mathematics, Engineering (miscellaneous), Computer Science (miscellaneous)
Cited by
5 articles.