1. An image is worth 16×16 words: Transformers for image recognition at scale. Dosovitskiy et al. In ICLR.
2. RAFT: Recurrent all-pairs field transforms for optical flow. Teed et al. In ECCV.
3. ImageNet: A large-scale hierarchical image database
4. VideoBERT: A Joint Model for Video and Language Representation Learning
5. SAVi++: Towards end-to-end object-centric learning from real-world videos. Elsayed et al. In NeurIPS.