PVT v2: Improved baselines with Pyramid Vision Transformer

Published:2022-03-16 Issue:3 Volume:8 Page:415-424
ISSN:2096-0433
Container-title:Computational Visual Media
language:en
Short-container-title:Comp. Visual Media

Author:

Wang Wenhai, Xie Enze, Li Xiang, Fan Deng-Ping, Song Kaitao, Liang Ding, Lu Tong, Luo Ping, Shao Ling

Abstract

Transformers have recently led to encouraging progress in computer vision. In this work, we present new baselines that improve the original Pyramid Vision Transformer (PVT v1) with three designs: (i) a linear complexity attention layer, (ii) an overlapping patch embedding, and (iii) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and provides significant improvements on fundamental vision tasks such as classification, detection, and segmentation. In particular, PVT v2 achieves comparable or better performance than recent work such as the Swin transformer. We hope this work will facilitate state-of-the-art transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
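
As a rough illustration of design (ii), the sketch below compares which pixels each patch covers along one image axis when patches do not overlap (kernel equal to stride) and when they do (kernel larger than stride, with padding). The abstract does not give kernel, stride, or padding values, so the numbers here are assumptions chosen for illustration, and `patch_windows` and `overlaps` are hypothetical helper names, not functions from the PVT repository.

```python
def patch_windows(length, kernel, stride, padding):
    """Return the (start, end) pixel range covered by each patch along one axis.

    Ranges use unpadded image coordinates, so a negative start means the
    patch reaches into the zero-padded border.
    """
    # Standard output-size formula for a strided convolution.
    n_out = (length + 2 * padding - kernel) // stride + 1
    return [(i * stride - padding, i * stride - padding + kernel)
            for i in range(n_out)]


def overlaps(windows):
    """True if any two neighbouring patches share at least one pixel."""
    return any(a_end > b_start
               for (_, a_end), (b_start, _) in zip(windows, windows[1:]))


# Non-overlapping embedding: kernel == stride (illustrative values).
plain = patch_windows(length=16, kernel=4, stride=4, padding=0)

# Overlapping embedding: kernel > stride, padded so the token count matches
# (illustrative values, assumed rather than taken from the abstract).
overlapped = patch_windows(length=16, kernel=7, stride=4, padding=3)

print(len(plain), overlaps(plain))
print(len(overlapped), overlaps(overlapped))
```

Under these assumed settings both variants produce the same number of tokens per axis; the only difference is that neighbouring patches in the second variant share pixels.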

Publisher

Springer Science and Business Media LLC

Subject

Artificial Intelligence, Computer Graphics and Computer-Aided Design, Computer Vision and Pattern Recognition

Link

https://link.springer.com/content/pdf/10.1007/s41095-022-0274-8.pdf

Reference: 42 articles.

1. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16×16 words: Transformers for image recognition at scale. In: Proceedings of the International Conference on Learning Representations, 2021.

2. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Jegou, H. Training data-efficient image transformers & distillation through attention. In: Proceedings of the 38th International Conference on Machine Learning, 2021.

3. Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 568–578, 2021.

4. Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. CvT: Introducing convolutions to vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 22–31, 2021.

5. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 10012–10022, 2021.

Cited by 458 articles.

1. Enhanced encoder–decoder architecture for visual perception multitasking of autonomous driving;Expert Systems with Applications;2024-07

2. MSE-Net: A novel master–slave encoding network for remote sensing scene classification;Engineering Applications of Artificial Intelligence;2024-06

3. Vision transformer: To discover the “four secrets” of image patches;Information Fusion;2024-05

4. In-bed human pose estimation using multi-source information fusion for health monitoring in real-world scenarios;Information Fusion;2024-05

5. FCA-Net: Fully context-aware feature aggregation network for medical segmentation;Biomedical Signal Processing and Control;2024-05