1. An image is worth 16x16 words: Transformers for image recognition at scale;dosovitskiy;International Conference on Learning Representations,2020
2. Unified perceptual parsing for scene understanding;xiao;Proceedings of the European Conference on Computer Vision (ECCV),2018
3. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping;dodge;ArXiv Preprint,2020
4. Towards Universal Object Detection by Domain Attention
5. Masked autoencoders as spatiotemporal learners;feichtenhofer;ArXiv Preprint,2022