1. BERT: pre-training of deep bidirectional transformers for language understanding;Devlin,2019
2. RoBERTa: a robustly optimized BERT pretraining approach;Liu,2019
3. Attention is all you need;Vaswani,2017
4. An image is worth 16x16 words: transformers for image recognition at scale;Dosovitskiy,2020
5. Swin transformer: hierarchical vision transformer using shifted windows;Liu,2021