Counting in Extremely Dense Crowd Images
We propose to leverage multiple sources of information to estimate the number of individuals present in an extremely dense crowd visible in a single image. Due to problems including perspective, occlusion, clutter, and few pixels per person, counting by human detection in such images is almost impossible. Instead, our approach relies on multiple sources, such as low-confidence head detections, repetition of texture elements (using SIFT), and frequency-domain analysis, to estimate counts in an image region, along with a confidence associated with observing individuals there. In addition, we employ a global consistency constraint on counts using a Markov Random Field. This caters for disparity in counts in local neighborhoods and across scales. We tested our approach on a new dataset of fifty crowd images containing 64K annotated humans, with head counts ranging from 94 to 4543. This is in stark contrast to the datasets used by existing methods, which contain no more than tens of individuals. We experimentally demonstrate the efficacy and reliability of the proposed approach by quantifying the counting performance.
Crowds occur in a variety of situations, for instance at concerts, political speeches, rallies, marathons, and in stadiums. Crowd counting, or density estimation, helps in the management of crowds for safety and surveillance, for example in the deployment of law enforcement personnel and in unusual behavior detection. It is also helpful in estimating the volume of commuters, which can be important for the development of public transportation infrastructure. Furthermore, it can be used to gauge the political significance of rallies or protests, as conflicting estimates are often reported for the same event. Since in many cases counting through turnstiles or counting by humans is not possible or is too cumbersome, we need to resort to Computer Vision based approaches to obtain count estimates for dense crowds.
The existing datasets used in Computer Vision are low- to medium-density and use temporal information in the form of videos. Examples are:
– UCSD: 11-46 per frame
– Mall: 13-53 per frame
– PETS: 3-40 per frame
In this work, we restrict ourselves to high density crowds in still images. Some of the images, with annotated points, are shown below:
Since this is a 2D counting problem, it can be modeled using a spatial Poisson counting process. However, perspective effects in real images make this process non-homogeneous, i.e., the parameter lambda is not constant but is a function of the spatial coordinates. To alleviate this issue, we divide the image into small patches, so that the process can be treated as homogeneous within each patch. The expected count in each patch is then just the predicted lambda times the size of the patch. Dense crowd images are inherently difficult, so a single method of predicting lambda does not always yield good results. We therefore use multiple sources of information to predict the counts in each patch. Furthermore, if we assume independence among counts in patches, then the count for the entire image is just the sum of the counts from the individual patches. Although the independence assumption makes counting in images a simple task, it is not necessarily correct, since there is a strong dependence between the counts of neighboring patches. Hence, we first compute counts in individual patches and then place them in a Markov Random Field with a smoothness prior to compute counts in the entire image. This framework is shown in the figure below.
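The patch-wise Poisson view can be made concrete with a minimal sketch. It assumes non-overlapping patches and treats the per-patch lambda values as already predicted; the function names and the example numbers are hypothetical and are not taken from the paper.

```python
def expected_patch_count(lam, width, height):
    """Expected count of a homogeneous Poisson process on a patch:
    the density lam (people per pixel) times the patch area."""
    return lam * width * height


def image_count(patches):
    """Sum the expected counts of non-overlapping patches.

    This is the naive estimate under the independence assumption;
    it ignores the dependence between neighboring patches.
    """
    return sum(expected_patch_count(lam, w, h) for lam, w, h in patches)


# Hypothetical (lambda, width, height) triples: density grows toward the
# far end of the scene because of perspective.
patches = [(0.002, 100, 100), (0.010, 100, 100), (0.025, 100, 100)]
total = image_count(patches)
print(f"estimated count: {total:.1f}")
```

The sum is only the independent baseline; the smoothness prior described above is what ties neighboring patches together.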
The simplest approach to estimating counts is through human detections. However, a quick glance at images of dense crowds reveals that the bodies are almost entirely occluded, leaving only heads for counting and analysis. We therefore used a Deformable Parts Model trained on the INRIA Person dataset, and applied only the filter corresponding to the head to the images. Often, the heads are partially occluded, so we used a much lower threshold for detection. There are many false negatives and positives since the images are inherently difficult (see figure below).
When a crowd image contains thousands of individuals, with each individual occupying only tens of pixels, especially those far away from the camera in an image with perspective distortion, histograms of gradients (which are employed for head detection) do not impart any useful information. However, a crowd is inherently repetitive in nature, since all humans appear the same from a distance. The repetitions, as long as they occur consistently in space, i.e., crowd density in the patch is uniform, can be captured by the Fourier Transform, where the periodic occurrence of heads shows up as peaks in the frequency domain.
The first column in the figure below shows three original patches, the second column shows the gradient images, and the third column shows the corresponding reconstructed patches. The positive correlation between the number of local maxima in the reconstructed patch and the ground truth counts shown in the last column is evident.
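The idea of counting local maxima in a frequency-filtered reconstruction can be illustrated with a one-dimensional simplification. This sketch assumes a single row profile instead of a 2D gradient patch, and it keeps only the strongest non-DC frequency; both choices are assumptions made for illustration, not the filtering used in the paper. All function names are hypothetical.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of the reconstruction."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def dominant_reconstruction(signal, keep=1):
    """Zero every bin except the `keep` strongest non-DC frequencies
    (and their conjugate bins), then reconstruct the signal."""
    n = len(signal)
    spectrum = dft(signal)
    ranked = sorted(range(1, n // 2 + 1), key=lambda k: abs(spectrum[k]), reverse=True)
    kept = set()
    for k in ranked[:keep]:
        kept.add(k)
        kept.add((n - k) % n)
    filtered = [spectrum[k] if k in kept else 0 for k in range(n)]
    return idft(filtered)


def count_local_maxima(values):
    """Count strict local maxima, treating the sequence as circular."""
    n = len(values)
    return sum(1 for i in range(n)
               if values[i] > values[i - 1] and values[i] > values[(i + 1) % n])


# Synthetic profile: 8 evenly spaced "heads" plus a weaker low-frequency component.
n = 64
signal = [math.cos(2 * math.pi * 8 * t / n) + 0.2 * math.cos(2 * math.pi * 3 * t / n + 0.5)
          for t in range(n)]
reconstructed = dominant_reconstruction(signal, keep=1)
estimated_heads = count_local_maxima(reconstructed)
```

In this synthetic case the dominant frequency corresponds to the eight repetitions, so the reconstruction has eight peaks. Real patches are noisier, which is one reason the estimate is fused with other sources.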
Finally, we use interest points not only to estimate counts but also to obtain a confidence of whether a patch represents a crowd or not. Since sky, buildings, and trees naturally occur in outdoor images, head detection gives false positives in such regions, and Fourier analysis is crowd-blind, it is important to discard counts from such patches. For both counting and confidence, we obtain SIFT features and cluster them into a codebook of fixed size. To obtain counts or densities from sparse SIFT features, we use Support Vector Regression trained on the counts computed at each patch from ground truth. Furthermore, due to the sparse nature of SIFT features, the frequency of a particular feature in a patch can also be modeled as a Poisson random variable. Given a set of positive and negative examples, the relative densities (frequencies normalized by area) of the feature differ between positive and negative images, and can be used to distinguish crowd patches from non-crowd ones. We use this to obtain the confidence that a particular patch depicts a crowd. In the figure below, the images in the first column have the confidence of crowd likelihood shown in the second column and ground truth counts in the third column. In the top image, the gap between stadium tiers gets low confidence of crowd presence. Similarly, patches containing the sky and flood lights in the bottom image have low probability of crowd.
For learning and fusion at the patch level, we densely sample overlapping patches from the training images and, using the annotations, obtain counts for the corresponding patches. After computing counts and confidences from the three sources, we scale the individual features and regress using epsilon-SVR against the counts computed from the annotations.
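One plausible way to turn the Poisson model of codeword frequencies into a crowd confidence is a log-likelihood ratio between crowd and non-crowd densities. The sketch below assumes that codewords are independent and that per-codeword densities (`pos_rates`, `neg_rates`, in features per pixel) have already been estimated from positive and negative examples. The names, the independence assumption, and the numbers are illustrative and may differ from the paper's formulation.

```python
import math


def crowd_log_likelihood_ratio(counts, area, pos_rates, neg_rates):
    """Log-likelihood ratio of crowd versus non-crowd for one patch.

    Each codeword count k is modeled as Poisson with mean rate * area.
    The k! and area**k factors are identical in both hypotheses and cancel,
    leaving k * log(rate_pos / rate_neg) - area * (rate_pos - rate_neg).
    """
    llr = 0.0
    for k, rate_pos, rate_neg in zip(counts, pos_rates, neg_rates):
        llr += k * (math.log(rate_pos) - math.log(rate_neg)) - area * (rate_pos - rate_neg)
    return llr


# Hypothetical two-word codebook: word 0 is common in crowds, word 1 in background.
pos_rates = [0.02, 0.001]
neg_rates = [0.002, 0.01]
area = 100  # pixels

crowd_like = crowd_log_likelihood_ratio([2, 0], area, pos_rates, neg_rates)
sky_like = crowd_log_likelihood_ratio([0, 1], area, pos_rates, neg_rates)
```

A positive value favors the crowd hypothesis and a negative value favors background. How such a score would be mapped to a confidence and thresholded is left open here.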
In order to impose smoothness among counts from different patches, we place them in an MRF framework with a grid structure. Furthermore, although small patches have consistent density, they have fewer repetitions or periods and can easily be affected by low-frequency noise. Larger patches, if they have consistent density, contain more people, and therefore have more periods and a better relevant-to-irrelevant frequency ratio. Moreover, it is difficult to ascertain in advance the right scale of analysis for a particular image. This problem lends itself to a multi-scale MRF, an example of which is shown in the figure below. The graph can be represented as (V, E), where the neighborhood N of a node consists of its four neighbors at the same level and the intermediate nodes that connect a patch to the layers above and below it. Note that this multi-scale MRF differs from other hierarchical models used for images in that the data term (unary cost) for a patch is evaluated independently of the patches at the layers above and below it, whereas in image restoration and stereo the data cost for a patch at a higher level is computed from the layer directly below. The patches in each layer have independent data terms and thus require a simultaneous solution for all layers. There exists a relationship between patches in adjacent layers, i.e., the sum of the counts of the 2×2 patches below should equal the count of the patch above. While inferring, we try to maintain this relationship.
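The relationship between adjacent layers can be checked with a small sketch. It assumes non-overlapping patches and a child layer whose dimensions are exactly twice those of the parent layer; the function names and the absolute-difference penalty are assumptions for illustration, and the actual energy function and inference procedure are not reproduced here.

```python
def aggregate_2x2(child):
    """Sum each non-overlapping 2x2 block of child counts into one parent cell."""
    rows, cols = len(child), len(child[0])
    return [[child[r][c] + child[r][c + 1] + child[r + 1][c] + child[r + 1][c + 1]
             for c in range(0, cols, 2)]
            for r in range(0, rows, 2)]


def consistency_penalty(parent, child):
    """Total absolute disagreement between parent counts and aggregated child counts."""
    aggregated = aggregate_2x2(child)
    return sum(abs(p - a)
               for parent_row, agg_row in zip(parent, aggregated)
               for p, a in zip(parent_row, agg_row))


# Hypothetical per-patch counts at two adjacent scales.
child = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
parent = [[14, 20],
          [46, 60]]

penalty = consistency_penalty(parent, child)
```

A penalty of zero means the two layers agree exactly. A term of this kind could serve as a cross-scale cost alongside the same-level smoothness cost.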
We collected the dataset from publicly available web images, including Flickr. As mentioned in the introduction, it consists of 50 images with counts ranging between 94 and 4543, with an average of 1280 individuals per image. Much like the range of counts, the scenes in these images also belong to a diverse set of events: concerts, protests, stadiums, marathons, and pilgrimages. One of the images is a painting, while another is an abstract depiction of a crowd. Using a simple tool for marking the ground truth positions of individuals, we obtained 63705 annotations in the fifty images. For experiments, we randomly divided the dataset into five sets of 10 images, reduced the maximum dimension to 1024 for computational efficiency, and performed 5-fold cross-validation. We used two simple measures to quantify the results: the mean and standard deviation of Absolute Difference (AD), and the mean and standard deviation of Normalized Absolute Difference (NAD), which was obtained by normalizing the absolute difference by the actual count for each image.
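The two error measures can be written down directly. In this sketch the function names and the example counts are hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def absolute_differences(estimated, actual):
    """Per-image Absolute Difference (AD) between estimated and actual counts."""
    return [abs(e - a) for e, a in zip(estimated, actual)]


def normalized_absolute_differences(estimated, actual):
    """Per-image Normalized Absolute Difference (NAD): AD divided by the actual count."""
    return [abs(e - a) / a for e, a in zip(estimated, actual)]


# Hypothetical counts for three images (not results from the paper).
actual = [100, 1000, 4000]
estimated = [90, 1100, 3600]

ad = absolute_differences(estimated, actual)
nad = normalized_absolute_differences(estimated, actual)

print("AD  mean/std:", mean(ad), pstdev(ad))
print("NAD mean/std:", mean(nad), pstdev(nad))
```

Because counts span 94 to 4543, AD is dominated by the largest crowds, while NAD weights every image equally relative to its size. Reporting both gives a more balanced picture.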
Figure: Quantitative results of the proposed approach and comparison with Rodriguez et al. (ICCV 2011) and Lempitsky and Zisserman (NIPS 2010) using mean and standard deviation of Absolute Difference and Normalized Absolute Difference from ground truth. The influence of the individual sources is also quantified.
Figure: Analysis of patch estimates in terms of absolute and normalized absolute differences. The x-axis shows the image number, sorted with respect to actual count. Means are shown with black asterisks and standard deviations with red bars.
Figure: Analysis of comparison: Bars and lines in red depict the results of Rodriguez et al., green those of Lempitsky and Zisserman, and blue those of the proposed approach, while ground truth is shown in black. The graph on the left shows Normalized Absolute Difference (an error measure) and the graph on the right shows the actual and estimated counts.
Figure: Original images are shown with counts from the individual sources, as well as the results of fusion and MRF (Proposed) and the ground truth. The best estimate in each image is provided by a different one of the three sources, which points to the complementary nature of these sources.
Haroon Idrees, Imran Saleemi, Cody Seibert, and Mubarak Shah, Multi-Source Multi-Scale Counting in Extremely Dense Crowd Images, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, Oregon, June 25-27, 2013.