MResTNet: A Multi-Resolution Transformer Framework with CNN Extensions for Semantic Segmentation

Published:2024-05-21 Issue:6 Volume:10 Page:125
ISSN:2313-433X
Container-title:Journal of Imaging
language:en
Short-container-title:J. Imaging

Author:

Detsikas Nikolaos¹, Mitianoudis Nikolaos¹, Pratikakis Ioannis¹

Affiliation:

1. Electrical and Computer Engineering Department, Democritus University of Thrace, University Campus Xanthi-Kimmeria, 67100 Xanthi, Greece

Abstract

A fundamental task in computer vision is differentiating and identifying the objects or entities in a visual scene using semantic segmentation methods. Transformer networks have surpassed traditional convolutional neural network (CNN) architectures in segmentation performance. The continuous pursuit of the best results on popular evaluation metrics has led to very large architectures that require significant computational power to operate, making them prohibitive for real-time applications such as autonomous driving. In this paper, we propose a model that leverages a visual transformer encoder with a parallel twin decoder, consisting of a visual transformer decoder and a CNN decoder with multi-resolution connections working in parallel. The two decoders are merged with the aid of two trainable CNN blocks: the fuser, which combines the information from the two decoders, and the scaler, which scales the contribution of each decoder. The proposed model achieves state-of-the-art performance on the Cityscapes and ADE20K datasets while maintaining a low-complexity network that can be used in real-time applications.
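The abstract does not specify how the fuser and scaler are built, so the following is only a minimal NumPy sketch of one plausible reading, not the authors' implementation. It assumes both decoders emit per-class score maps of the same shape, that the scaler is a 1x1 convolution followed by a sigmoid giving a per-pixel gate, and that the fuser is a 1x1 convolution over the gated, concatenated maps. The names `conv1x1`, `merge_decoders`, `init_params`, and the parameter keys are hypothetical.

```python
import numpy as np


def conv1x1(x, weight, bias):
    """1x1 convolution: the same linear map applied at every pixel.

    x: (C_in, H, W), weight: (C_out, C_in), bias: (C_out,)
    """
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def merge_decoders(transformer_out, cnn_out, params):
    """Merge two decoder outputs of shape (C, H, W) into one (C, H, W) map."""
    stacked = np.concatenate([transformer_out, cnn_out], axis=0)  # (2C, H, W)

    # Assumed "scaler": a per-pixel gate in (0, 1) that weights the
    # transformer branch against the CNN branch.
    gate = sigmoid(conv1x1(stacked, params["scaler_w"], params["scaler_b"]))

    # Assumed "fuser": mixes the two gated branches back into C class scores.
    scaled = np.concatenate(
        [gate * transformer_out, (1.0 - gate) * cnn_out], axis=0
    )
    return conv1x1(scaled, params["fuser_w"], params["fuser_b"])


def init_params(num_classes, seed=0):
    """Random illustrative weights; a real model would learn these."""
    rng = np.random.default_rng(seed)
    return {
        "scaler_w": rng.normal(scale=0.1, size=(1, 2 * num_classes)),
        "scaler_b": np.zeros(1),
        "fuser_w": rng.normal(scale=0.1, size=(num_classes, 2 * num_classes)),
        "fuser_b": np.zeros(num_classes),
    }


# Illustrative sizes only: 19 classes on a tiny 4x8 feature map.
num_classes, height, width = 19, 4, 8
rng = np.random.default_rng(1)
t_out = rng.normal(size=(num_classes, height, width))  # stand-in transformer decoder output
c_out = rng.normal(size=(num_classes, height, width))  # stand-in CNN decoder output

logits = merge_decoders(t_out, c_out, init_params(num_classes))
labels = logits.argmax(axis=0)  # per-pixel class prediction, shape (H, W)
```

Under these assumptions both merge blocks are 1x1 convolutions, so the merge adds only a few thousand parameters, which fits the low-complexity goal stated in the abstract. The published blocks may well differ in depth, kernel size, and gating scheme.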

Publisher

MDPI AG

Link

https://www.mdpi.com/2313-433X/10/6/125/pdf
