Abstract
The goal of crowd-counting techniques is to estimate the number of people in an image or video accurately and in real time. In recent years, with the development of deep learning, the accuracy of crowd counting has steadily improved. However, accuracy in crowded scenes with large scale variations still leaves room for improvement. To address this, this paper proposes a novel crowd-counting network, the Context-Scaled Fusion Network (CSFNet). Its main contributions are: (1) a Multi-Scale Receptive Field Fusion Module (MRFF Module), which employs multiple dilated convolutional layers with different dilation rates and a fusion mechanism to obtain multi-scale hybrid information and generate higher-quality feature maps; (2) a Contextual Space Attention Module (CSA Module), which obtains pixel-level contextual information and combines it with an attention map, enabling the model to autonomously learn and attend to important regions, thereby reducing counting error. The model is trained and evaluated on five datasets: ShanghaiTech, UCF_CC_50, WorldExpo'10, BEIJING-BRT, and Mall. The experimental results show that CSFNet outperforms many state-of-the-art (SOTA) methods on these datasets, demonstrating its superior counting ability and robustness.
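The core idea behind the MRFF Module, parallel dilated convolutions whose outputs are fused, can be illustrated with a toy NumPy sketch. This is not the paper's implementation: the shared kernel, the specific dilation rates (1, 2, 3), the crop-to-common-size step, and the average fusion are all illustrative assumptions; the effective receptive field of a k×k kernel at dilation rate r is k + (k−1)(r−1), which is why larger rates see wider context at the same parameter cost.

```python
import numpy as np

def dilated_conv2d(x, kernel, rate):
    """Valid-mode 2D convolution with dilation `rate` (stride 1, no padding).

    A 3x3 kernel at rate r covers an effective window of
    size 3 + 2*(r-1), sampling the input on a strided grid.
    """
    k = kernel.shape[0]
    eff = k + (k - 1) * (rate - 1)  # effective receptive field size
    H, W = x.shape
    out = np.zeros((H - eff + 1, W - eff + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Sample the input with step `rate` inside the effective window.
            patch = x[i:i + eff:rate, j:j + eff:rate]
            out[i, j] = np.sum(patch * kernel)
    return out

# Hypothetical multi-scale fusion: run the same input through branches
# with dilation rates 1, 2, 3, crop to a common size, and average.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))       # toy single-channel feature map
kernel = rng.standard_normal((3, 3))    # shared 3x3 kernel (assumption)
branches = [dilated_conv2d(x, kernel, r) for r in (1, 2, 3)]
h = min(b.shape[0] for b in branches)   # smallest branch output: 10x10
fused = np.mean([b[:h, :h] for b in branches], axis=0)
```

In the actual network the branches would have learned, per-branch kernels and the fusion would operate on multi-channel feature maps (e.g. by channel concatenation followed by a 1×1 convolution), but the shape bookkeeping is the same.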