1. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, in: 9th International Conference on Learning Representations, ICLR, 2021.
2. H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, H. Jégou, Training data-efficient image transformers & distillation through attention, in: Proceedings of the 38th International Conference on Machine Learning, ICML, Vol. 139, 2021, pp. 10347–10357.
3. L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z. Jiang, F. E. H. Tay, J. Feng, S. Yan, Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV, 2021, pp. 558–567.
4. W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, L. Shao, Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV, 2021, pp. 568–578.
5. Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV, 2021, pp. 10012–10022.