1. Attention is all you need;Vaswani;Advances in Neural Information Processing Systems,2017
2. An image is worth 16x16 words: Transformers for image recognition at scale;Dosovitskiy;arXiv preprint,2020
3. Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition;Dong;IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),2018
4. Exploring the limits of transfer learning with a unified text-to-text transformer;Raffel;J. Mach. Learn. Res.,2020
5. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension;Lewis;arXiv preprint,2019