Author:
Miao Zhuangzhuang, Zhang Yong, Peng Yuan, Peng Haocheng, Yin Baocai
Abstract
Crowd counting provides an important foundation for public security and urban management. Because crowd images contain small targets and large density variations, crowd counting is a challenging task. Mainstream methods usually apply convolutional neural networks (CNNs) to regress a density map, which requires annotations of individual persons and counts. Weakly-supervised methods avoid detailed labeling and require only image-level counts as annotations, but existing methods fail to achieve satisfactory performance because a global perspective field and multi-level information are usually ignored. We propose a weakly-supervised method, DTCC, which effectively combines multi-level dilated convolution and transformer methods to realize end-to-end crowd counting. Its main components are a recursive Swin Transformer and a multi-level dilated convolution regression head. The recursive Swin Transformer combines a pyramid visual transformer with a fine-tuned recursive pyramid structure to capture deep multi-level crowd features, including global features. The multi-level dilated convolution regression head includes multi-level dilated convolution and a linear regression head for the feature extraction module. This module can capture both low- and high-level features simultaneously to enhance the receptive field. In addition, two regression head fusion mechanisms realize dynamic and mean fusion counting. Experiments on four well-known benchmark crowd counting datasets (UCF_CC_50, ShanghaiTech, UCF_QNRF, and JHU-Crowd++) show that DTCC achieves results superior to other weakly-supervised methods and comparable to fully-supervised methods.
Publisher
Springer Science and Business Media LLC
Subject
Artificial Intelligence, Computer Graphics and Computer-Aided Design, Computer Vision and Pattern Recognition
Cited by
2 articles.
1. CrowdGraph: Weakly supervised Crowd Counting via Pure Graph Neural Network;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-01-22
2. A Survey on Regression-Based Crowd Counting Techniques;Information Technology and Control;2023-09-26