Rethinking Attention Mechanisms in Vision Transformers with Graph Structures
Authors:
Kim Hyeongjin 1, Ko Byoung Chul 1
Affiliation:
1. Department of Computer Engineering, Keimyung University, Daegu 42601, Republic of Korea
Abstract
In this paper, we propose a new type of vision transformer (ViT) based on graph head attention (GHA). Because the multi-head attention (MHA) of a pure ViT requires a large number of parameters and tends to lose the locality of an image, we replace MHA with GHA by applying a graph structure to the attention heads of the transformer. Consequently, the proposed GHA maintains both the locality and the globality of the input patches and guarantees the diversity of the attention. The proposed GHA-ViT consistently outperforms pure ViT-based models on the small-scale CIFAR-10/100, MNIST, and MNIST-F datasets and the medium-scale ImageNet-1K dataset when trained from scratch. A Top-1 accuracy of 81.7% was achieved on ImageNet-1K using GHA-B, a base model with approximately 29 M parameters. In addition, on CIFAR-10/100, the number of parameters is reduced 17-fold compared with the existing ViT while accuracy increases by 0.4% and 4.3%, respectively. The proposed GHA-ViT shows promising results in terms of the number of parameters, the number of operations, and accuracy in comparison with other state-of-the-art lightweight ViT models.
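The abstract does not specify how the graph is constructed or applied inside each attention head, so the following is only a minimal illustrative sketch: it assumes a fixed k-nearest-neighbour adjacency over the patch grid that masks the attention scores of a single head, so each patch attends chiefly to its spatial neighbours (locality) while standard attention weighting is kept within that neighbourhood. The names (GraphHeadAttention, knn_patch_graph), the masking scheme, and the choice of k are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: a single graph-masked attention head over ViT patch
# tokens. The kNN patch graph and all names here are assumptions, not the
# paper's GHA implementation.
import torch
import torch.nn as nn


def knn_patch_graph(grid_size: int, k: int = 8) -> torch.Tensor:
    """Boolean adjacency over a grid_size x grid_size patch grid (self-loops included)."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(grid_size), torch.arange(grid_size), indexing="ij"), dim=-1)
    coords = coords.reshape(-1, 2).float()               # (N, 2) patch-centre coordinates
    dist = torch.cdist(coords, coords)                   # (N, N) pairwise distances
    idx = dist.topk(k + 1, largest=False).indices        # k nearest neighbours + self
    adj = torch.zeros(coords.shape[0], coords.shape[0], dtype=torch.bool)
    adj.scatter_(1, idx, True)                           # mark edges row by row
    return adj


class GraphHeadAttention(nn.Module):
    """One attention head whose scores are restricted to graph neighbours,
    preserving locality; globality could be recovered by mixing in a dense head."""

    def __init__(self, dim: int, grid_size: int, k: int = 8):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5
        self.register_buffer("adj", knn_patch_graph(grid_size, k))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, dim), N = grid_size**2
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale     # (B, N, N) raw scores
        attn = attn.masked_fill(~self.adj, float("-inf"))  # keep only graph edges
        attn = attn.softmax(dim=-1)
        return self.proj(attn @ v)
```

As a usage note under the same assumptions, a 224 × 224 input split into 16 × 16 patches gives grid_size = 14 and N = 196 patch tokens per image; the masked score matrix is then 196 × 196 with at most k + 1 active entries per row.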
Funder
Scholar Research Grant of Keimyung University