SegViT v2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers
Published: 2023-10-27
Issue:
Volume:
Page:
ISSN: 0920-5691
Container-title: International Journal of Computer Vision
Language: en
Short-container-title: Int J Comput Vis
Author:
Zhang Bowen, Liu Liyang, Phan Minh Hieu, Tian Zhi, Shen Chunhua, Liu Yifan
Abstract
This paper investigates the capability of plain Vision Transformers (ViTs) for semantic segmentation within the encoder–decoder framework and introduces SegViT v2. In this study, we introduce a novel Attention-to-Mask (ATM) module to design a lightweight decoder that is effective for plain ViT backbones. The proposed ATM converts the global attention map into semantic masks, yielding high-quality segmentation results. Our decoder outperforms the popular UPerNet decoder with various ViT backbones while consuming only about 5% of the computational cost. For the encoder, we address the relatively high computational cost of ViT-based encoders and propose a Shrunk++ structure that incorporates edge-aware query-based down-sampling (EQD) and query-based up-sampling (QU) modules. The Shrunk++ structure reduces the computational cost of the encoder by up to 50% while maintaining competitive performance. Furthermore, we adapt SegViT for continual semantic segmentation, demonstrating nearly zero forgetting of previously learned knowledge. Experiments show that SegViT v2 surpasses recent segmentation methods on three popular benchmarks: ADE20K, COCO-Stuff-10k, and PASCAL-Context. The code is available at https://github.com/zbwxp/SegVit.
Funder
The University of Adelaide
Publisher
Springer Science and Business Media LLC
Subject
Artificial Intelligence, Computer Vision and Pattern Recognition, Software
Cited by
5 articles.