PLG-ViT: Vision Transformer with Parallel Local and Global Self-Attention
Authors:
Nikolas Ebert 1,2, Didier Stricker 2, Oliver Wasenmüller 1
Affiliation:
1. Research and Transfer Center CeMOS, Mannheim University of Applied Sciences, 68163 Mannheim, Germany
2. Department of Computer Science, RPTU Kaiserslautern-Landau, 67663 Kaiserslautern, Germany
Abstract
Recently, transformer architectures have shown superior performance to their CNN counterparts in many computer vision tasks. The self-attention mechanism enables transformer networks to connect visual dependencies over short as well as long distances, generating a large, sometimes even global, receptive field. In this paper, we propose the Parallel Local-Global Vision Transformer (PLG-ViT), a general backbone model that fuses local window self-attention with global self-attention. By merging these local and global features, short- and long-range spatial interactions can be represented effectively and efficiently without costly computational operations such as shifted windows. In a comprehensive evaluation, we demonstrate that PLG-ViT outperforms CNN-based as well as state-of-the-art transformer-based architectures in image classification and in complex downstream tasks such as object detection, instance segmentation, and semantic segmentation. In particular, our PLG-ViT models outperform similarly sized networks such as ConvNeXt and Swin Transformer, achieving Top-1 accuracies of 83.4%, 84.0%, and 84.5% on ImageNet-1K with 27M, 52M, and 91M parameters, respectively.
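To make the parallel local-global idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the class name PLGAttention, the even channel split across the two branches, the pooled key/value tokens for the global branch, and the concatenation-based fusion are illustrative assumptions, and nn.MultiheadAttention stands in for the paper's attention layers.

```python
import torch
import torch.nn as nn


class PLGAttention(nn.Module):
    """Sketch of parallel local-global self-attention.

    Channels are split in half: one half runs window-based local
    self-attention, the other attends globally against a pooled
    (downsampled) set of key/value tokens. Both branch outputs are
    fused by concatenation followed by a linear projection.
    """

    def __init__(self, dim, num_heads=8, window_size=7, pool_size=7):
        super().__init__()
        assert dim % 2 == 0 and num_heads % 2 == 0
        self.window_size = window_size
        half = dim // 2
        self.local_attn = nn.MultiheadAttention(half, num_heads // 2, batch_first=True)
        self.global_attn = nn.MultiheadAttention(half, num_heads // 2, batch_first=True)
        self.pool = nn.AdaptiveAvgPool2d(pool_size)  # builds global key/value tokens
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, H, W, C); H and W must be divisible by window_size
        B, H, W, C = x.shape
        xl, xg = x.chunk(2, dim=-1)  # split channels across the two branches
        ws, c = self.window_size, C // 2

        # local branch: self-attention inside non-overlapping windows
        win = xl.reshape(B, H // ws, ws, W // ws, ws, c)
        win = win.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, c)
        loc, _ = self.local_attn(win, win, win)
        loc = loc.reshape(B, H // ws, W // ws, ws, ws, c)
        loc = loc.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, c)

        # global branch: every token queries a small pooled token set
        q = xg.reshape(B, H * W, c)
        kv = self.pool(xg.permute(0, 3, 1, 2)).flatten(2).transpose(1, 2)
        glo, _ = self.global_attn(q, kv, kv)
        glo = glo.reshape(B, H, W, c)

        # fuse both branches by channel concatenation and projection
        return self.proj(torch.cat([loc, glo], dim=-1))


# usage: a 56x56 feature map with 96 channels, as in a typical first stage
x = torch.randn(2, 56, 56, 96)
out = PLGAttention(dim=96, num_heads=8, window_size=7)(x)
print(out.shape)  # torch.Size([2, 56, 56, 96])
```

Because the global branch attends only to a pooled token set, its cost stays linear in the number of image tokens, which is what lets the two branches run in parallel without shifted-window bookkeeping.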
Funder
Albert and Anneliese Konanz Foundation; German Research Foundation; Federal Ministry of Education and Research Germany (project M2Aind-DeepLearning)
Subject
Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry
References: 69 articles.
Cited by
5 articles.
1. FEDAF: frequency enhanced decomposed attention free transformer for long time series forecasting; Neural Computing and Applications; 2024-05-25
2. FireViTNet: A hybrid model integrating ViT and CNNs for forest fire segmentation; Computers and Electronics in Agriculture; 2024-03
3. A Review of the Development of Vision Transformer; 2023 International Conference on Artificial Intelligence and Automation Control (AIAC); 2023-11-17
4. Transformer-based Detection of Microorganisms on High-Resolution Petri Dish Images; 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW); 2023-10-02
5. Light-Weight Vision Transformer with Parallel Local and Global Self-Attention; 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC); 2023-09-24