1. Attention is all you need;Vaswani,2017
2. VATT: transformers for multimodal self-supervised learning from raw video, audio and text;Akbari,2021
3. Uni-perceiver: pre-training unified architecture for generic perception for zero-shot and few-shot tasks;Zhu,2022
4. Vision transformer slimming: multi-dimension searching in continuous optimization space;Chavan,2022