1. Per-pixel classification is not all you need for semantic segmentation;cheng;NeurIPS,2021
2. GLIPv2: Unifying localization and vision-language understanding;zhang;NeurIPS,2022
3. UNITER: Universal image-text representation learning;chen;ECCV,2020
4. Multi-grained vision language pre-training: Aligning texts with visual concepts;zeng;ICML,2022
5. ImageNet: A large-scale hierarchical image database