1. XCiT: Cross-covariance image transformers;Ali;NeurIPS,2021
2. Layer normalization;Ba,2016
3. MultiMAE: Multi-modal Multi-task Masked Autoencoders
4. Data2vec: A general framework for self-supervised learning in speech, vision and language;Baevski,2022
5. BEiT: BERT pre-training of image transformers;Bao