P2FEViT: Plug-and-Play CNN Feature Embedded Hybrid Vision Transformer for Remote Sensing Image Classification

Author:

Wang Guanqun 1,2; Chen He 1,2; Chen Liang 1,2; Zhuang Yin 1,2; Zhang Shanghang 3; Zhang Tong 1,2; Dong Hao 3; Gao Peng 4

Affiliation:

1. School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China

2. Beijing Key Laboratory of Embedded Real-Time Information Processing Technology, Beijing 100081, China

3. School of Computer Science, Peking University, Beijing 100871, China

4. Shanghai AI Laboratory, Shanghai 200232, China

Abstract

Remote sensing image classification (RSIC) is a classical and fundamental task in the intelligent interpretation of remote sensing imagery, which provides unique labeling information for each acquired remote sensing image. Thanks to the strong global context extraction ability of the multi-head self-attention (MSA) mechanism, vision transformer (ViT)-based architectures have shown excellent capability in natural scene image classification. However, capturing global spatial information alone is insufficient for strong RSIC performance. In particular, for fine-grained target recognition tasks with high inter-class similarity, discriminative and effective local feature representations are key to correct classification. In addition, because ViT lacks inductive biases, its powerful global spatial context representation requires lengthy training and large-scale pre-training data. To address these problems, a hybrid architecture of convolutional neural network (CNN) and ViT, called P2FEViT, is proposed to improve RSIC; it integrates plug-and-play CNN features with ViT. In this paper, the feature representation capabilities of CNN and ViT as applied to RSIC are first analyzed. Second, to combine the advantages of CNN and ViT, a novel approach that embeds CNN features into the ViT architecture is proposed, which enables the model to simultaneously capture and fuse global context and local multimodal information, further improving the classification capability of ViT. Third, based on the hybrid structure, only a simple cross-entropy loss is employed for model training, and the model converges quickly and smoothly with relatively less training data than the original ViT. Finally, extensive experiments are conducted on the public and challenging remote sensing scene classification dataset NWPU-RESISC45 (NWPU-R45) and on a self-built fine-grained target classification dataset called BIT-AFGR50. The experimental results demonstrate that the proposed P2FEViT effectively improves feature description capability and achieves outstanding image classification performance, while significantly reducing the dependence of ViT on large-scale pre-training data and accelerating convergence. The code and self-built dataset will be released on the authors' webpage.
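The abstract states that only a simple cross-entropy loss is used for training. As a minimal illustration of that loss (a generic sketch, not the authors' training code; the function names `softmax`, `cross_entropy`, and `batch_cross_entropy` are hypothetical and chosen here for illustration), the computation for class logits can be written as:

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the true class index."""
    probs = softmax(logits)
    return -math.log(probs[target])

def batch_cross_entropy(batch_logits, targets):
    """Mean cross-entropy over a batch of (logits, target) pairs."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)
```

In this sketch, the logits would come from the classification head of whatever model is being trained; the loss is small when the true class receives a high score and large otherwise.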

Funder

National Science Foundation for Young Scientists of China

Space-based on-orbit real-time processing technology

National Natural Science Foundation of China

multi-source satellite data hardware acceleration computing method with low energy consumption

Publisher

MDPI AG

Subject

General Earth and Planetary Sciences

