SegViT v2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers

Published:2023-10-27
ISSN:0920-5691
Container-title:International Journal of Computer Vision
language:en
Short-container-title:Int J Comput Vis

Author:

Zhang Bowen,Liu Liyang,Phan Minh Hieu,Tian Zhi,Shen Chunhua,Liu Yifan

Abstract

This paper investigates the capability of plain Vision Transformers (ViTs) for semantic segmentation using the encoder–decoder framework and introduces SegViTv2. We introduce a novel Attention-to-Mask (ATM) module to design a lightweight decoder that is effective for plain ViTs. The proposed ATM converts the global attention map into semantic masks for high-quality segmentation results. Our decoder outperforms the popular UPerNet decoder across various ViT backbones while consuming only about 5% of the computational cost. For the encoder, we address the relatively high computational cost of ViT-based encoders and propose a Shrunk++ structure that incorporates edge-aware query-based down-sampling (EQD) and query-based up-sampling (QU) modules. The Shrunk++ structure reduces the computational cost of the encoder by up to 50% while maintaining competitive performance. Furthermore, we propose to adapt SegViT for continual semantic segmentation, demonstrating nearly zero forgetting of previously learned knowledge. Experiments show that our proposed SegViTv2 surpasses recent segmentation methods on three popular benchmarks: ADE20k, COCO-Stuff-10k, and PASCAL-Context. The code is available at https://github.com/zbwxp/SegVit.
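The abstract says only that ATM "converts the global attention map into semantic masks". The sketch below is one plausible reading of that sentence, not the authors' implementation, which lives in the linked repository. It assumes that each class has a query vector, that similarity is a scaled dot product between a query and each patch token, and that a sigmoid turns each similarity into a soft mask value. The function names `attention_to_mask` and `reshape_mask` are hypothetical and chosen for illustration.

```python
import math


def attention_to_mask(queries, tokens):
    """Turn query-token similarities into per-class soft masks.

    queries: list of C class-query vectors, each of length d
    tokens:  list of N patch-token vectors, each of length d
    Returns a C x N nested list with values in (0, 1).
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    masks = []
    for q in queries:
        row = []
        for k in tokens:
            # Scaled dot-product similarity between one class query and one token.
            s = scale * sum(qi * ki for qi, ki in zip(q, k))
            # Assumption: a sigmoid (rather than a softmax over tokens)
            # maps the similarity to an independent mask value.
            row.append(1.0 / (1.0 + math.exp(-s)))
        masks.append(row)
    return masks


def reshape_mask(mask, height, width):
    """Reshape a flat list of N = height * width values into a 2-D grid."""
    return [mask[r * width:(r + 1) * width] for r in range(height)]


# Toy example: 2 class queries, a 2 x 2 grid of patch tokens, d = 2.
queries = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[4.0, 0.0], [0.0, 4.0], [4.0, 0.0], [0.0, 4.0]]

masks = attention_to_mask(queries, tokens)
grid = reshape_mask(masks[0], 2, 2)
```

In this toy input, the first query aligns with tokens 0 and 2, so its mask is high at those positions and sits at 0.5 where the similarity is zero. A real decoder would also predict a class score per query and upsample the masks to image resolution. The abstract does not describe those steps, so they are left out.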

Funder

The University of Adelaide

Publisher

Springer Science and Business Media LLC

Subject

Artificial Intelligence,Computer Vision and Pattern Recognition,Software

Link

https://link.springer.com/content/pdf/10.1007/s11263-023-01894-8.pdf

Reference: 93 articles.

1. Bao, H., Dong, L., Piao, S., Wei, F. (2022). BEiT: BERT pre-training of image transformers, in International conference on learning representations, [Online]. Available: https://openreview.net/forum?id=p-BhZSz59o4

2. Bousselham, W., Thibault, G., Pagano, L., Machireddy, A., Gray, J., Chang, Y. H., Song, X. (2021). Efficient self-ensemble framework for semantic segmentation, arXiv preprint arXiv:2111.13280

3. Caesar, H., Uijlings, J., Ferrari, V. (2018). Coco-stuff: Thing and stuff classes in context, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1209–1218.

4. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S. (2020). End-to-end object detection with transformers, in Proceedings of the European conference on computer vision, pp. 213–229, Springer.

5. Cermelli, F., Mancini, M., Bulò, S. R., Ricci, E., Caputo, B. (2020). Modeling the background for incremental learning in semantic segmentation, in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9230–9239.

Cited by 5 articles.

1. Decoupling semantic and localization for semantic segmentation via magnitude-aware and phase-sensitive learning;Information Fusion;2024-07

2. Towards Robust Semantic Segmentation against Patch-Based Attack via Attention Refinement;International Journal of Computer Vision;2024-06-07

3. A Bio-Inspired Visual Perception Transformer for Cross-Domain Semantic Segmentation of High-Resolution Remote Sensing Images;Remote Sensing;2024-04-25

4. Few-shot semantic segmentation in complex industrial components;Multimedia Tools and Applications;2024-04-10

5. MCAT-UNet: Convolutional and Cross-Shaped Window Attention Enhanced UNet for Efficient High-Resolution Remote Sensing Image Segmentation;IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing;2024