1. ConViT: improving vision transformers with soft convolutional inductive biases*
2. An image is worth 16x16 words: Transformers for image recognition at scale;dosovitskiy;ArXiv Preprint,2020
3. Transformer in transformer;han;ArXiv Preprint,2021
4. LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference
5. Visual transformer pruning;zhu;ArXiv Preprint,2021