PLG-ViT: Vision Transformer with Parallel Local and Global Self-Attention
Authors:
Nikolas Ebert 1,2, Didier Stricker 2, Oliver Wasenmüller 1
Affiliation:
1. Research and Transfer Center CeMOS, Mannheim University of Applied Sciences, 68163 Mannheim, Germany
2. Department of Computer Science, RPTU Kaiserslautern-Landau, 67663 Kaiserslautern, Germany
Abstract
Recently, transformer architectures have shown superior performance to their CNN counterparts in many computer vision tasks. The self-attention mechanism enables transformer networks to connect visual dependencies over short as well as long distances, generating a large, sometimes even global, receptive field. In this paper, we propose the Parallel Local-Global Vision Transformer (PLG-ViT), a general backbone model that fuses local window self-attention with global self-attention. By merging these local and global features, short- and long-range spatial interactions can be represented effectively and efficiently without costly computational operations such as shifted windows. In a comprehensive evaluation, we demonstrate that PLG-ViT outperforms CNN-based as well as state-of-the-art transformer-based architectures in image classification and in complex downstream tasks such as object detection, instance segmentation, and semantic segmentation. In particular, our PLG-ViT models outperform similarly sized networks such as ConvNeXt and Swin Transformer, achieving Top-1 accuracies of 83.4%, 84.0%, and 84.5% on ImageNet-1K with 27M, 52M, and 91M parameters, respectively.
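To make the parallel local-global idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the class name PLGAttention, the even channel split across the two branches, the pooled key/value tokens for the global branch, and the concatenation-based fusion are illustrative assumptions, and nn.MultiheadAttention stands in for the paper's attention layers.

```python
import torch
import torch.nn as nn


class PLGAttention(nn.Module):
    """Sketch of parallel local-global self-attention.

    Channels are split in half: one half runs window-based local
    self-attention, the other attends globally against a pooled
    (downsampled) set of key/value tokens. Both branch outputs are
    fused by concatenation followed by a linear projection.
    """

    def __init__(self, dim, num_heads=8, window_size=7, pool_size=7):
        super().__init__()
        assert dim % 2 == 0 and num_heads % 2 == 0
        self.window_size = window_size
        half = dim // 2
        self.local_attn = nn.MultiheadAttention(half, num_heads // 2, batch_first=True)
        self.global_attn = nn.MultiheadAttention(half, num_heads // 2, batch_first=True)
        self.pool = nn.AdaptiveAvgPool2d(pool_size)  # builds global key/value tokens
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, H, W, C); H and W must be divisible by window_size
        B, H, W, C = x.shape
        xl, xg = x.chunk(2, dim=-1)  # split channels across the two branches
        ws, c = self.window_size, C // 2

        # local branch: self-attention inside non-overlapping windows
        win = xl.reshape(B, H // ws, ws, W // ws, ws, c)
        win = win.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, c)
        loc, _ = self.local_attn(win, win, win)
        loc = loc.reshape(B, H // ws, W // ws, ws, ws, c)
        loc = loc.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, c)

        # global branch: every token queries a small pooled token set
        q = xg.reshape(B, H * W, c)
        kv = self.pool(xg.permute(0, 3, 1, 2)).flatten(2).transpose(1, 2)
        glo, _ = self.global_attn(q, kv, kv)
        glo = glo.reshape(B, H, W, c)

        # fuse both branches by channel concatenation and projection
        return self.proj(torch.cat([loc, glo], dim=-1))


# usage: a 56x56 feature map with 96 channels, as in a typical first stage
x = torch.randn(2, 56, 56, 96)
out = PLGAttention(dim=96, num_heads=8, window_size=7)(x)
print(out.shape)  # torch.Size([2, 56, 56, 96])
```

Because the global branch attends only to a pooled token set, its cost stays linear in the number of image tokens, which is what lets the two branches run in parallel without shifted-window bookkeeping.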
Funder
Albert and Anneliese Konanz Foundation; German Research Foundation; Federal Ministry of Education and Research Germany (project M2Aind-DeepLearning)
Subject
Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry
References: 69 articles.
Cited by
5 articles.
1. FEDAF: frequency enhanced decomposed attention free transformer for long time series forecasting; Neural Computing and Applications; 2024-05-25
2. FireViTNet: A hybrid model integrating ViT and CNNs for forest fire segmentation; Computers and Electronics in Agriculture; 2024-03
3. A Review of the Development of Vision Transformer; 2023 International Conference on Artificial Intelligence and Automation Control (AIAC); 2023-11-17
4. Transformer-based Detection of Microorganisms on High-Resolution Petri Dish Images; 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW); 2023-10-02
5. Light-Weight Vision Transformer with Parallel Local and Global Self-Attention; 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC); 2023-09-24