A novel dual-granularity lightweight transformer for vision tasks
Published: 2024-01-18
Page: 1-16
ISSN: 1088-467X
Container-title: Intelligent Data Analysis
Short-container-title: IDA
Author:
Zhang Ji 1,2; Yu Mingxin 1,2; Lu Wenshuai 3; Dai Yuxiang 3; Shi Huiyu 3; You Rui 1,2
Affiliation:
1. School of Instrument Science and Opto-Electronics Engineering, Beijing Information Science and Technology University, Beijing, China
2. Laboratory of Intelligent Microsystems, Beijing Information Science and Technology University, Beijing, China
3. Department of Precision Instrument, Tsinghua University, Beijing, China
Abstract
Transformer-based networks have driven rapid progress across visual tasks. However, the widespread adoption of Vision Transformers (ViTs) is limited by their high computational and parameter requirements, which make them impractical for resource-constrained mobile and edge devices. Moreover, existing lightweight ViTs struggle to capture features at different granularities, to extract local features efficiently, and to incorporate the inductive bias inherent in convolutional neural networks, which degrades their overall performance. To address these limitations, we propose an efficient ViT called Dual-Granularity Former (DGFormer), which introduces two novel modules: Dual-Granularity Attention (DG Attention) and an Efficient Feed-Forward Network (Efficient FFN). In our experiments on ImageNet image recognition, DGFormer surpasses lightweight models such as PVTv2-B0 and Swin Transformer by 2.3% in Top-1 accuracy. On COCO object detection, DGFormer outperforms PVTv2-B0 and Swin Transformer by 0.5% and 2.4% in average precision (AP) under the RetinaNet framework, and by 0.4% and 1.8% AP under the Mask R-CNN framework, respectively. On ADE20K semantic segmentation, DGFormer achieves substantial improvements of 2.0% and 2.5% in mean Intersection over Union (mIoU) over PVTv2-B0 and Swin Transformer, respectively. The code is open-source and available at: https://github.com/ISCLab-Bistu/DGFormer.git.
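The abstract names DG Attention and Efficient FFN but does not describe their internals; for readers unfamiliar with the general idea, the sketch below illustrates one common way lightweight ViTs combine a fine-grained local branch with a coarse-grained global branch, and inject convolutional inductive bias into the feed-forward block. It is a minimal, hypothetical PyTorch illustration under those assumptions, not the authors' implementation; all class names, window/pool ratios, and dimensions are invented for the example (the actual modules are in the linked repository).

```python
# Hypothetical sketch (not the authors' code): pairing a fine-grained local
# branch with a coarse-grained global-attention branch, plus a feed-forward
# block that uses a depthwise convolution for local inductive bias.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualGranularityAttentionSketch(nn.Module):
    """Fine branch: depthwise conv over the token grid (local detail).
    Coarse branch: multi-head attention against spatially pooled tokens
    (global context at reduced cost). Names and ratios are assumptions."""

    def __init__(self, dim, num_heads=4, pool_ratio=4):
        super().__init__()
        self.fine = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.pool = nn.AvgPool2d(pool_ratio)            # coarse key/value tokens
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, h, w):
        # x: (B, N, C) with N == h * w
        b, n, c = x.shape
        grid = x.transpose(1, 2).reshape(b, c, h, w)
        fine = self.fine(grid).flatten(2).transpose(1, 2)        # (B, N, C)
        coarse_kv = self.pool(grid).flatten(2).transpose(1, 2)   # (B, N/r^2, C)
        coarse, _ = self.attn(x, coarse_kv, coarse_kv)           # (B, N, C)
        return self.proj(fine + coarse)


class EfficientFFNSketch(nn.Module):
    """Feed-forward block with a depthwise conv between the two linear
    layers, a pattern used by several lightweight ViTs; illustrative only."""

    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1, groups=hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x, h, w):
        b, n, _ = x.shape
        x = self.fc1(x)
        x = x.transpose(1, 2).reshape(b, -1, h, w)
        x = F.gelu(self.dwconv(x)).flatten(2).transpose(1, 2)
        return self.fc2(x)


if __name__ == "__main__":
    tokens = torch.randn(2, 14 * 14, 64)                # B=2, 14x14 grid, C=64
    attn = DualGranularityAttentionSketch(dim=64)
    ffn = EfficientFFNSketch(dim=64, hidden_dim=256)
    out = ffn(attn(tokens, 14, 14), 14, 14)
    print(out.shape)                                     # torch.Size([2, 196, 64])
```

The design intuition this sketch captures is that the fine branch and the depthwise convolutions handle local detail cheaply, while attention over pooled tokens supplies global context at a fraction of the quadratic cost of full self-attention; how DGFormer realizes this is specified in the paper and repository rather than here.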