1. Unsupervised visual representation learning by context prediction;doersch;Proceedings of the IEEE International Conference on Computer Vision,0
2. BERT: Pre-training of deep bidirectional transformers for language understanding;devlin;ArXiv Preprint,2018
3. Scaling open-vocabulary image segmentation with image-level labels;ghiasi;European Conference on Computer Vision,0
4. An image is worth 16×16 words: Transformers for image recognition at scale;dosovitskiy;International Conference on Learning Representations,0
5. Extract free dense labels from CLIP;zhou;European Conference on Computer Vision,0