P2FEViT: Plug-and-Play CNN Feature Embedded Hybrid Vision Transformer for Remote Sensing Image Classification

Published:2023-03-26 Issue:7 Volume:15 Page:1773
ISSN:2072-4292
Container-title:Remote Sensing
language:en
Short-container-title:Remote Sensing

Author:

Wang Guanqun¹², Chen He¹², Chen Liang¹², Zhuang Yin¹², Zhang Shanghang³, Zhang Tong¹², Dong Hao³, Gao Peng⁴

Affiliation:

1. School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China

2. Beijing Key Laboratory of Embedded Real-Time Information Processing Technology, Beijing 100081, China

3. School of Computer Science, Peking University, Beijing 100871, China

4. Shanghai AI Laboratory, Shanghai 200232, China

Abstract

Remote sensing image classification (RSIC) is a classical and fundamental task in the intelligent interpretation of remote sensing imagery, which assigns a label to each acquired remote sensing image. Thanks to the strong global context extraction ability of the multi-head self-attention (MSA) mechanism, vision transformer (ViT)-based architectures have shown excellent capability in natural scene image classification. However, capturing global spatial information alone is insufficient for strong RSIC performance. Specifically, for fine-grained target recognition tasks with high inter-class similarity, discriminative and effective local feature representations are key to correct classification. In addition, because ViT lacks inductive biases, its global spatial context representation capability requires lengthy training and large-scale pre-training data. To solve these problems, a hybrid architecture of convolutional neural network (CNN) and ViT, called P2FEViT, is proposed to improve RSIC ability; it integrates plug-and-play CNN features with ViT. In this paper, the feature representation capabilities of CNN and ViT as applied to RSIC are first analyzed. Second, to combine the advantages of CNN and ViT, a novel approach that embeds CNN features into the ViT architecture is proposed, which enables the model to simultaneously capture and fuse global context and local multimodal information, further improving the classification capability of ViT. Third, based on the hybrid structure, only a simple cross-entropy loss is employed for model training, and the model converges quickly and smoothly with less training data than the original ViT. Finally, extensive experiments are conducted on the public and challenging remote sensing scene classification dataset NWPU-RESISC45 (NWPU-R45) and the self-built fine-grained target classification dataset BIT-AFGR50. The experimental results demonstrate that the proposed P2FEViT effectively improves feature description capability and achieves outstanding image classification performance, while significantly reducing ViT's dependence on large-scale pre-training data and accelerating convergence. The code and self-built dataset will be released on our webpages.
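The abstract does not specify how the CNN features are embedded into the ViT, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the CNN feature map has already been flattened into one token per ViT patch position, that a learned linear projection maps each CNN token to the ViT width, and that fusion is an element-wise sum. The names `linear`, `fuse_tokens`, and `proj` are hypothetical. The cross-entropy function is the standard softmax cross-entropy, which the abstract names as the only training loss.

```python
import math
import random


def linear(x, weight):
    """Project vector x (length d_in) with a d_in x d_out weight matrix."""
    d_out = len(weight[0])
    return [sum(xi * row[j] for xi, row in zip(x, weight)) for j in range(d_out)]


def fuse_tokens(patch_tokens, cnn_tokens, proj):
    """Assumed fusion rule (not taken from the paper): project each CNN token
    to the ViT width and add it to the patch token at the same position."""
    if len(patch_tokens) != len(cnn_tokens):
        raise ValueError("expected one CNN token per ViT patch token")
    fused = []
    for patch, cnn in zip(patch_tokens, cnn_tokens):
        cnn_proj = linear(cnn, proj)
        fused.append([p + c for p, c in zip(patch, cnn_proj)])
    return fused


def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample, using the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# Toy sizes chosen for illustration only.
rng = random.Random(0)
num_tokens, vit_dim, cnn_dim = 4, 8, 6
patch_tokens = [[rng.gauss(0, 1) for _ in range(vit_dim)] for _ in range(num_tokens)]
cnn_tokens = [[rng.gauss(0, 1) for _ in range(cnn_dim)] for _ in range(num_tokens)]
proj = [[rng.gauss(0, 0.1) for _ in range(vit_dim)] for _ in range(cnn_dim)]

# The fused sequence keeps the ViT shape, so it could be fed to unchanged
# transformer encoder layers (omitted here).
fused = fuse_tokens(patch_tokens, cnn_tokens, proj)

# With uniform logits over the 45 NWPU-RESISC45 classes, the loss equals ln(45).
chance_loss = cross_entropy([0.0] * 45, target=7)
```

Because the fused sequence has the same length and width as the original patch tokens, a fusion of this kind leaves the transformer encoder untouched, which is one way to read "plug-and-play". Concatenating the projected CNN tokens to the sequence would be another plausible design.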

Funder

National Science Foundation for Young Scientists of China

Space-based on-orbit real-time processing technology

National Natural Science Foundation of China

multi-source satellite data hardware acceleration computing method with low energy consumption

Publisher

MDPI AG

Subject

General Earth and Planetary Sciences

Link

https://www.mdpi.com/2072-4292/15/7/1773/pdf

References: 74 articles.

1. Road recognition from remote sensing imagery using incremental learning;Zhang;IEEE Trans. Intell. Transp. Syst.,2017

2. Convolution neural network in precision agriculture for plant image recognition and classification;Abdullahi;Proceedings of the 2017 Seventh International Conference on Innovative Computing Technology (INTECH),2017

3. Remote sensing for urban planning and management: The use of window-independent context segmentation to extract urban features in Stockholm;Nielsen;Comput. Environ. Urban Syst.,2015

4. Multilayer feature extraction network for military ship detection from high-resolution optical remote sensing images;Qin;IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.,2021

5. Utility of Satellite and Aerial Images for Quantification of Canopy Cover and Infilling Rates of the Invasive Woody Species Honey Mesquite (Prosopis Glandulosa) on Rangeland;Mirik;Remote Sens.,2012

Cited by 16 articles.

1. Efficient knowledge distillation for hybrid models: A vision transformer‐convolutional neural network to convolutional neural network approach for classifying remote sensing images;IET Cyber-Systems and Robotics;2024-07-10

2. ERKT-Net: Implementing Efficient and Robust Knowledge Distillation for Remote Sensing Image Classification;EAI Endorsed Transactions on Industrial Networks and Intelligent Systems;2024-07-03

3. AUXG: Deep Feature Extraction and Classification of Remote Sensing Image Scene Using Attention Unet and XGBoost;Journal of the Indian Society of Remote Sensing;2024-06-15

4. A survey of multimodal hybrid deep learning for computer vision: Architectures, applications, trends, and challenges;Information Fusion;2024-05

5. Quantitative regularization in robust vision transformer for remote sensing image classification;The Photogrammetric Record;2024-04-24