1. Attention is all you need;Vaswani;arXiv:1706.03762,2017
2. BERT: Pre-training of deep bidirectional transformers for language understanding;Devlin;arXiv:1810.04805,2018
3. Language models are few-shot learners;Brown;arXiv:2005.14165,2020
4. An image is worth 16 × 16 words: Transformers for image recognition at scale;Dosovitskiy;arXiv:2010.11929,2020
5. VisualBERT: A simple and performant baseline for vision and language;Li;arXiv:1908.03557,2019