UCF – QNRF – A Large Crowd Counting Data Set
Automatic counting and localization in dense crowd scenes is of significant importance from socio-political and safety perspectives. Crowds gather around the world in a variety of scenarios, and counting the number of participants is often an important concern for organizers and law enforcement agencies.
Figure 1: Six images from our dataset
Data Set Details
We introduce the largest dataset to date (in terms of number of annotations) for training and evaluating crowd counting and localization methods. It contains 1535 images, divided into train and test sets of 1201 and 334 images, respectively. Our dataset is well suited for training very deep Convolutional Neural Networks (CNNs), since it contains an order of magnitude more annotated humans in dense crowd scenes than any other available crowd counting dataset. A summary of our dataset statistics and a comparison with other datasets are presented in Table 1, while Figure 1 shows six images randomly selected from our dataset.
Table 1: Comparison of dataset statistics
The UCF-QNRF dataset has the largest number of high-count crowd images and annotations, and a wider variety of scenes containing the most diverse set of viewpoints, densities, and lighting variations. Its resolution is higher than that of WorldExpo10 and ShanghaiTech. The average density, i.e., the number of people per pixel averaged over all images, is also the lowest, signifying high-quality, large images. The lower per-pixel density is partly due to the inclusion of background regions, so images contain many high-density regions as well as zero-density regions. Part A of the ShanghaiTech dataset also has high-count crowd images; however, those images are severely cropped to contain crowds only. In contrast, the new UCF-QNRF dataset contains buildings, vegetation, sky, and roads as they appear in realistic scenarios captured in the wild. This makes the dataset both more realistic and more difficult.
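The average density mentioned above is just the annotation count divided by the number of pixels, averaged over images. A minimal sketch with made-up image sizes and head counts (the real annotation files are not shown here):

```python
# Average crowd density = people per pixel, averaged over all images.
# Toy (width, height, head_count) triples standing in for real annotations.
images = [
    (2048, 1536, 815),    # hypothetical high-resolution, high-count image
    (1024, 768, 120),     # hypothetical medium-density image
    (4000, 3000, 4500),   # hypothetical very large, very dense image
]

# Per-image density: annotated people divided by total pixels.
densities = [count / (width * height) for (width, height, count) in images]

# Dataset-level average density, as reported in Table 1.
avg_density = sum(densities) / len(densities)
print(f"average density: {avg_density:.6f} people/pixel")
```

A dataset with large images and substantial background area drives this number down even when individual crowd regions are extremely dense.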
Moreover, since we collected our dataset from the web rather than from surveillance camera videos or simulated crowd scenes, it is very diverse in terms of perspective, image resolution, crowd density, and the scenarios in which crowds exist. We also took special care to ensure that images in the dataset come from all parts of the world. Figure 2 shows the geo-tags of images in our dataset, marked on the world map.
Figure 2: Locations of images in our dataset
Similarly, Figure 3(a) shows the diversity in counts among the datasets. The count distribution of our dataset is similar to that of UCF_CC_50; however, the new dataset is 30 and 20 times larger than UCF_CC_50 in terms of numbers of images and annotations, respectively. Furthermore, its resolution is higher than that of WorldExpo10 and ShanghaiTech, as can be seen in Figure 3(b). We hope the new dataset will significantly increase research activity in visual crowd analysis and will pave the way for building deployable, practical counting and localization systems for dense crowds.
Figure 3: Count distribution in our dataset
H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot, M. Shah, "Composition Loss for Counting, Density Map Estimation and Localization in Dense Crowds," in Proceedings of the European Conference on Computer Vision (ECCV 2018), Munich, Germany, September 8-14, 2018.
Q: What are the four distance thresholds used in Table 5?
A: 10, 25, 35, and 50 pixels. Note that although Sect. 5 describes using 1, 2, …, 100 as thresholds, in practice we used these four values for computing precision and recall.
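Localization precision and recall at a pixel-distance threshold require matching predicted head locations to ground-truth annotations. The paper's exact matching procedure is not reproduced here; the `match_points` helper below is an illustrative sketch using greedy one-to-one nearest matching, with toy points:

```python
import math

def match_points(pred, gt, thresh):
    """Greedily pair each predicted point with the nearest unmatched
    ground-truth point within `thresh` pixels (one-to-one matching).
    Returns (true_positives, false_positives, false_negatives)."""
    unmatched = list(gt)
    tp = 0
    for p in pred:
        best_i, best_d = None, thresh
        for i, g in enumerate(unmatched):
            d = math.dist(p, g)   # Euclidean pixel distance
            if d <= best_d:
                best_i, best_d = i, d
        if best_i is not None:
            unmatched.pop(best_i)  # consume the matched ground-truth point
            tp += 1
    fp = len(pred) - tp        # predictions with no ground-truth partner
    fn = len(unmatched)        # ground-truth points left unmatched
    return tp, fp, fn

# Toy example at the 10-pixel threshold (one of the four used in Table 5).
pred = [(10, 10), (52, 48), (200, 200)]
gt = [(12, 11), (50, 50)]
tp, fp, fn = match_points(pred, gt, thresh=10)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
```

Repeating this at each of the four thresholds yields the per-threshold precision/recall values; stricter (smaller) thresholds demand more precise localization and so give lower scores.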
Q: What is the value of \tau in Sec 3.1?