PLG-ViT: Vision Transformer with Parallel Local and Global Self-Attention

Author:

Nikolas Ebert 1,2, Didier Stricker 2, Oliver Wasenmüller 1

Affiliation:

1. Research and Transfer Center CeMOS, Mannheim University of Applied Sciences, 68163 Mannheim, Germany

2. Department of Computer Science, RPTU Kaiserslautern-Landau, 67663 Kaiserslautern, Germany

Abstract

Recently, transformer architectures have shown superior performance to their CNN counterparts in many computer vision tasks. The self-attention mechanism enables transformer networks to connect visual dependencies over short as well as long distances, thus generating a large, sometimes even global, receptive field. In this paper, we propose the Parallel Local-Global Vision Transformer (PLG-ViT), a general backbone model that fuses local window self-attention with global self-attention. By merging these local and global features, short- and long-range spatial interactions can be represented effectively and efficiently without costly operations such as shifted windows. In a comprehensive evaluation, we demonstrate that PLG-ViT outperforms CNN-based as well as state-of-the-art transformer-based architectures in image classification and in complex downstream tasks such as object detection, instance segmentation, and semantic segmentation. In particular, our PLG-ViT models outperform similarly sized networks such as ConvNeXt and Swin Transformer, achieving Top-1 accuracies of 83.4%, 84.0%, and 84.5% on ImageNet-1K with 27M, 52M, and 91M parameters, respectively.
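The parallel local-global idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the channel split, the 2x2 windows, the average-pooled global tokens, and the concatenation fusion are all simplifying assumptions chosen only to show the two attention branches running side by side.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: (n, d) queries over (m, d) keys/values.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def plg_attention(x, window=2, pool=2):
    """Illustrative parallel local-global attention on an (H, W, C) feature map.

    Hypothetical simplification of PLG-ViT's block: the channels are split
    in half, one half attends within non-overlapping windows (local branch),
    the other attends over average-pooled tokens (global branch), and the
    two outputs are fused by channel concatenation.
    """
    H, W, C = x.shape
    half = C // 2
    xl, xg = x[..., :half], x[..., half:]

    # Local branch: self-attention inside each non-overlapping window.
    out_l = np.zeros_like(xl)
    for i in range(0, H, window):
        for j in range(0, W, window):
            win = xl[i:i + window, j:j + window].reshape(-1, half)
            out = attention(win, win, win)
            out_l[i:i + window, j:j + window] = out.reshape(window, window, half)

    # Global branch: average-pool to a coarse token set, then let every
    # position query the coarse tokens (cross-attention), giving each
    # location a global receptive field at low cost.
    coarse = xg.reshape(H // pool, pool, W // pool, pool, half).mean(axis=(1, 3))
    tokens = coarse.reshape(-1, half)
    queries = xg.reshape(-1, half)
    out_g = attention(queries, tokens, tokens).reshape(H, W, half)

    # Fuse short- and long-range features along the channel dimension.
    return np.concatenate([out_l, out_g], axis=-1)
```

Note how neither branch needs shifted windows: the local branch stays cheap because windows never interact directly, while the global branch supplies the cross-window information flow through its pooled tokens.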

Funder

Albert and Anneliese Konanz Foundation

German Research Foundation (DFG)

Research Germany in the project M2Aind-DeepLearning

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry

Cited by 5 articles.

1. FEDAF: frequency enhanced decomposed attention free transformer for long time series forecasting; Neural Computing and Applications; 2024-05-25

2. FireViTNet: A hybrid model integrating ViT and CNNs for forest fire segmentation; Computers and Electronics in Agriculture; 2024-03

3. A Review of the Development of Vision Transformer; 2023 International Conference on Artificial Intelligence and Automation Control (AIAC); 2023-11-17

4. Transformer-based Detection of Microorganisms on High-Resolution Petri Dish Images; 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW); 2023-10-02

5. Light-Weight Vision Transformer with Parallel Local and Global Self-Attention; 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC); 2023-09-24
