1. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions;wang;ICCV,2021
2. BERT: Pre-training of deep bidirectional transformers for language understanding;devlin;NAACL,2019
3. Attention is all you need;vaswani;NeurIPS,2017
4. The “Something Something” Video Database for Learning and Evaluating Visual Common Sense