1. VSE++: Improving visual-semantic embeddings with hard negatives;faghri;ArXiv Preprint,2017
2. BEiT v2: Masked image modeling with vector-quantized visual tokenizers;peng;ArXiv Preprint,2022
3. An image is worth 16x16 words: Transformers for image recognition at scale;dosovitskiy;ICLR,2021
4. Support-set bottlenecks for video-text representation learning;patrick;ArXiv Preprint,2020
5. Masked autoencoders as spatiotemporal learners;feichtenhofer;NeurIPS,2022