Unified Transformer with Cross-Modal Mixture Experts for Remote-Sensing Visual Question Answering

Authors:

Liu Gang 1,2, He Jinlong 1,2, Li Pengfei 1,2, Zhong Shenjun 3, Li Hongyang 1,2, He Genrong 1,2

Affiliation:

1. College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China

2. National Engineering Laboratory of E-Government Modeling Simulation, Harbin Engineering University, Harbin 150001, China

3. Monash Biomedical Imaging and National Imaging Facility, Monash University, Victoria 3800, Australia

Abstract

Remote-sensing visual question answering (RSVQA) aims to accurately answer questions about remote-sensing images by leveraging both visual and textual information during inference. However, most existing methods overlook the interaction between visual and language features: they typically adopt simple feature fusion strategies, fail to adequately model cross-modal attention, and struggle to capture the complex semantic relationships between questions and images. In this study, we introduce a unified transformer with cross-modal mixture experts (TCMME) model to address the RSVQA problem. Specifically, we utilize the vision transformer (ViT) and BERT to extract visual and language features, respectively. Furthermore, we incorporate cross-modal mixture experts (CMMEs) to facilitate cross-modal representation learning. By leveraging the shared self-attention and cross-modal attention within CMMEs, together with the modality experts, we effectively capture the intricate interactions between visual and language features and better focus on their complex semantic relationships. Finally, we conduct qualitative and quantitative experiments on two benchmark datasets, RSVQA-LR and RSVQA-HR. The results demonstrate that the proposed method surpasses current state-of-the-art (SOTA) techniques. Additionally, we perform an extensive analysis to validate the effectiveness of the different components of our framework.
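To make the described fusion concrete, the following is a minimal, illustrative PyTorch sketch of a cross-modal mixture-of-experts block of the kind the abstract outlines, not the authors' implementation. The class name CrossModalMoEBlock, the hidden sizes, and the simple modality-based routing are assumptions chosen for illustration; it only conveys the idea of shared self-attention over concatenated ViT/BERT token sequences followed by per-modality expert feed-forward networks.

```python
# Hypothetical sketch of a cross-modal mixture-of-experts block (not the paper's code).
# Assumes visual tokens (e.g. from ViT) and language tokens (e.g. from BERT)
# have already been projected to a common hidden size d_model.
import torch
import torch.nn as nn


class CrossModalMoEBlock(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 12, d_ff: int = 3072):
        super().__init__()
        # Self-attention shared by both modalities; attending over the
        # concatenated sequence realizes cross-modal attention.
        self.shared_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_attn = nn.LayerNorm(d_model)
        # One feed-forward "expert" per modality; tokens are routed by modality.
        self.vision_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.language_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm_ffn = nn.LayerNorm(d_model)

    def forward(self, vis_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # Concatenate image patches and question tokens so attention mixes modalities.
        x = torch.cat([vis_tokens, txt_tokens], dim=1)
        attn_out, _ = self.shared_attn(x, x, x)
        x = self.norm_attn(x + attn_out)
        # Route each modality's tokens to its own expert, then recombine.
        n_vis = vis_tokens.size(1)
        vis, txt = x[:, :n_vis], x[:, n_vis:]
        expert_out = torch.cat(
            [self.vision_expert(vis), self.language_expert(txt)], dim=1
        )
        return self.norm_ffn(x + expert_out)


if __name__ == "__main__":
    block = CrossModalMoEBlock()
    vis = torch.randn(2, 197, 768)  # e.g. ViT patch embeddings (+ [CLS])
    txt = torch.randn(2, 32, 768)   # e.g. BERT token embeddings
    fused = block(vis, txt)         # fused representation, shape (2, 229, 768)
    print(fused.shape)
```

In such a sketch, the fused output (or its [CLS]-like token) would feed an answer-classification head; the actual TCMME routing and expert design should be taken from the paper itself.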

Publisher

MDPI AG

Subject

General Earth and Planetary Sciences


Cited by 2 articles.