1. Dosovitskiy, A., et al.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: International Conference on Learning Representations (ICLR) (2021)
2. Child, R., Gray, S., Radford, A., Sutskever, I.: Generating long sequences with sparse transformers. arXiv:1904.10509v1 [cs.LG] (2019)
3. Jing, L., Tian, Y.: Self-supervised visual feature learning with deep neural networks: a survey. arXiv:1902.06162v1 [cs.CV] (2019)
4. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv:1503.02531v1 [stat.ML] (2015)
5. Liu, X., He, P., Chen, W., Gao, J.: Improving multi-task deep neural networks via knowledge distillation for natural language understanding. arXiv:1904.09482v1 [cs.CL] (2019)