1. JHU-CROWD++: Large-scale crowd counting dataset and a benchmark method;Sindagi;IEEE Trans. Pattern Anal. Mach. Intell.,2020
2. Attention is all you need;Vaswani,2017
3. An image is worth 16x16 words: Transformers for image recognition at scale;Dosovitskiy,2021
4. Twins: Revisiting the design of spatial attention in vision transformers;Chu,2021
5. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions;Wang,2021