1. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, et al (2021) An image is worth 16x16 words: Transformers for image recognition at scale. In: Proc. of the ICLR. OpenReview.net, Austria
2. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B (2021) Swin transformer: Hierarchical vision transformer using shifted windows. In: Proc. of the IEEE/CVF ICCV. IEEE, Montreal, QC, Canada, pp 10012–10022
3. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Proc. of the NeurIPS, vol 30. Curran Associates Inc., Long Beach, CA, USA
4. Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W, Krikun M, Cao Y, Gao Q, Macherey K, et al (2016) Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144
5. Shoeybi M, Patwary M, Puri R, LeGresley P, Casper J, Catanzaro B (2019) Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053