Geometric Constraints for Human Detection in Aerial Imagery
Figure 1. Frames from some of the sequences
Detecting humans in imagery taken from a UAV is a challenging problem due to the small number of pixels on target, which makes it difficult to distinguish people from background clutter and results in a much larger search space. We propose a method for human detection based on a number of geometric constraints obtained from the metadata. Specifically, we obtain the orientation of the ground-plane normal, the orientation of shadows cast by humans in the scene, and the relationship between human heights and the sizes of their corresponding shadows. In cases where metadata is not available, we propose a method for automatically estimating the shadow orientation from image data. We utilize the above information in a geometry-based shadow and human blob detector, which provides an initial estimate of the locations of humans in the scene. These candidate locations are then classified as either human or clutter using a combination of wavelet features and a Support Vector Machine. Our method works on a single frame; unlike motion-detection-based methods, it bypasses global motion compensation and allows detection of stationary and slow-moving humans, while avoiding a search across the entire image, which makes it both accurate and very fast. We show results on sequences from the VIVID dataset and our own data, and provide a comparative analysis.
Ground-Plane Normal and Shadow Constraints
The imagery obtained from the UAV has metadata associated with most frames. It contains a set of aircraft parameters (latitude, longitude, altitude), which define the position of the aircraft in the world, as well as (pitch, yaw, roll), which define the orientation of the aircraft within the world. The metadata also contains a set of camera parameters (scan, elevation, twist), which define the rotation of the camera with respect to the aircraft, as well as the focal length and the time of capture. We use this information to derive a set of world constraints and then project them into the original image.
We employ three world constraints.
- The person is standing upright, perpendicular to the ground plane.
- The person is casting a shadow.
- There is a geometric relationship between a person’s height and the length of their shadow.
Given the latitude, longitude, and time, we obtain the position of the sun relative to an observer on the ground. It is defined by the azimuth angle (measured from north) and the zenith angle (measured from the vertical). Assuming a nominal height for a person in the world, we find the length of the shadow from the zenith angle of the sun. Using the azimuth angle, we find the ground-plane projection of the vector pointing to the sun.
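The shadow-length relationship follows from similar triangles. A minimal sketch, assuming the sun's zenith and azimuth angles have already been obtained from an ephemeris given latitude, longitude, and time (the function name and the 1.8 m height are illustrative assumptions):

```python
import math

def shadow_from_sun(person_height_m, zenith_deg, azimuth_deg):
    """Given the sun's zenith and azimuth angles, derive the shadow length
    and its ground-plane bearing for an upright person. A shadow only
    exists when the sun is above the horizon (zenith < 90 degrees)."""
    zenith = math.radians(zenith_deg)
    # Similar triangles: shadow length L = h * tan(zenith).
    length = person_height_m * math.tan(zenith)
    # The shadow points away from the sun: rotate the north-referenced
    # solar azimuth by 180 degrees to get the shadow's bearing.
    shadow_bearing = (azimuth_deg + 180.0) % 360.0
    return length, shadow_bearing
```

For example, with the sun at a 45-degree zenith angle, a 1.8 m person casts a shadow of roughly the same length.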
Before we can use our world constraints for human detection, we have to transform them from the world coordinates to the image coordinates. To do this we use the metadata to obtain the projective homography transformation that relates image coordinates to the ground plane coordinates. In addition, we compute the ratio between the projected shadow length and the projected person height.
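A sketch of the ground-plane-to-image projection step, assuming the metadata-derived homography is available; the matrix values below are placeholders, not quantities from the paper:

```python
import numpy as np

def project_points(H, pts):
    """Apply a 3x3 homography H to Nx2 ground-plane points."""
    pts = np.asarray(pts, dtype=float)
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # homogeneous coords
    proj = (H @ pts_h.T).T
    return proj[:, :2] / proj[:, 2:3]                 # perspective divide

# Hypothetical metadata-derived homography (placeholder values).
H = np.array([[2.0, 0.1, 320.0],
              [0.0, 1.8, 240.0],
              [0.0, 0.0, 1.0]])

# Project a ground-plane shadow segment (foot point and shadow tip, in
# metres) into the image, then measure its pixel length and orientation.
foot_img, tip_img = project_points(H, [[0.0, 0.0], [1.3, 1.3]])
dx, dy = tip_img - foot_img
pixel_length = np.hypot(dx, dy)
orientation_deg = np.degrees(np.arctan2(dy, dx))
```

The same projection applied to the person's foot and head points gives the projected height, and the ratio of the two projected lengths is the constraint used for candidate filtering.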
To avoid a search over the entire frame, the first step of our human detection process is to constrain the search space of potential human candidates. We define the search space as a set of blobs oriented in the direction of the shadow and the direction of the normal. To do so, we utilize the image projections of the world constraints derived previously: the projected orientation of the normal to the ground plane, the projected orientation of the shadow, and the ratio between the projected person height and the projected shadow length. See Figure 3.
Figure 3. This figure illustrates the pipeline of applying image constraints to obtain an initial set of human candidates.
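One way such an oriented blob search can be set up is with structuring elements elongated along the projected shadow and normal directions. A minimal numpy sketch of building such a kernel; the sizes and the downstream matching step are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def oriented_kernel(length, width, angle_deg):
    """Binary structuring element elongated along `angle_deg`, meant to
    respond to blobs oriented along the projected shadow (or ground-plane
    normal) direction when correlated with a candidate map."""
    half = length // 2
    k = np.zeros((2 * half + 1, 2 * half + 1), dtype=bool)
    theta = np.radians(angle_deg)
    dx, dy = np.cos(theta), np.sin(theta)
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    # Decompose each pixel offset into components along and across the axis.
    along = xs * dx + ys * dy
    across = -xs * dy + ys * dx
    k[(np.abs(along) <= length / 2) & (np.abs(across) <= width / 2)] = True
    return k
```

A kernel built at the projected shadow angle, with length set by the projected shadow length, picks out elongated dark blobs consistent with the geometry.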
Our next step is to relate the shadow and human blob maps, and to remove shadow-human configurations that do not satisfy the image geometry derived from the metadata. For every shadow blob, we try to pair it with a potential object blob; if a shadow blob fails to match any object blob, it is removed. An object blob that is never assigned to a shadow blob is also removed.
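The pairing step can be sketched as a greedy match over blob centroids; the thresholds and the direction test below are illustrative assumptions standing in for the paper's geometric consistency checks:

```python
import math

def pair_blobs(shadow_blobs, object_blobs, shadow_dir, max_gap=10.0):
    """Greedy pairing sketch: keep only shadow/object pairs whose relative
    displacement is consistent with the projected shadow direction.
    Blobs are (cx, cy) centroids; shadow_dir is a unit 2-vector pointing
    from an object towards its shadow. Thresholds are illustrative."""
    kept_pairs = []
    for s in shadow_blobs:
        best, best_dist = None, float("inf")
        for i, o in enumerate(object_blobs):
            dx, dy = s[0] - o[0], s[1] - o[1]
            dist = math.hypot(dx, dy)
            if dist == 0 or dist > max_gap:
                continue
            # Alignment of the object->shadow offset with the expected direction.
            cos_sim = (dx * shadow_dir[0] + dy * shadow_dir[1]) / dist
            if cos_sim > 0.9 and dist < best_dist:
                best, best_dist = i, dist
        if best is not None:
            kept_pairs.append((s, object_blobs[best]))
    return kept_pairs  # unmatched shadows and objects are discarded
```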
Wavelets have been shown to be useful for extracting distinguishing features from imagery, so in the final step of our method we classify each object candidate as either human or non-human using a combination of wavelet features and an SVM (Figure 5). We chose wavelet features over HOG because they gave a higher classification rate on a validation set. We suspect this is because, in the case of HOG, the small size of the chips does not allow the use of optimal overlapping grid parameters, resulting in overly coarse sampling. We apply a Daubechies 2 wavelet filter to each chip.
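A sketch of this classification stage using PyWavelets and scikit-learn; the chip size, subband handling, SVM kernel, and the random training data are placeholders, not the paper's settings:

```python
import numpy as np
import pywt
from sklearn.svm import SVC

def wavelet_features(chip):
    """Single-level Daubechies-2 decomposition of a grayscale chip;
    concatenate the approximation and detail subbands into one vector."""
    cA, (cH, cV, cD) = pywt.dwt2(chip.astype(float), "db2")
    return np.concatenate([c.ravel() for c in (cA, cH, cV, cD)])

# Hypothetical training sketch: `chips` stands in for candidate patches,
# `labels` for human (1) vs clutter (0) annotations.
rng = np.random.default_rng(0)
chips = rng.random((20, 24, 24))          # placeholder data
labels = np.array([0, 1] * 10)            # placeholder labels
X = np.stack([wavelet_features(c) for c in chips])
clf = SVC(kernel="rbf").fit(X, labels)
pred = clf.predict(X[:1])
```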
Qualitative evaluation is done on sequences from VIVID3 and VIVID4, as well as some of our own data. The data contains both stationary and moving vehicles and people, and, in the case of VIVID4, various clutter. Vehicles cast shadows and are usually detected as candidates; these are currently filtered out in the classification stage. For quantitative evaluation, we ran our detection methods on three sequences from the DARPA VIVID3 dataset at 640×480 resolution and compared the detections against manually obtained groundtruth, using the Recall vs. False Positives Per Frame (FPPF) criteria. To evaluate the accuracy of the geometry-based human candidate detector, we require the centroid of the object candidate blob to be within w pixels of the centroid of the groundtruth blob, where w = 15.
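The per-frame scoring described above can be sketched as a greedy centroid match; the one-detection-per-groundtruth policy is an assumption about the protocol:

```python
import math

def evaluate_frame(detections, groundtruth, w=15.0):
    """Greedy matching of detected centroids to groundtruth centroids.
    A detection is a true positive if it lies within w pixels of a
    still-unmatched groundtruth centroid; the rest count as false
    positives for this frame."""
    unmatched = list(groundtruth)
    tp = fp = 0
    for d in detections:
        hit = None
        for i, g in enumerate(unmatched):
            if math.hypot(d[0] - g[0], d[1] - g[1]) <= w:
                hit = i
                break
        if hit is None:
            fp += 1
        else:
            tp += 1
            unmatched.pop(hit)       # each groundtruth matches at most once
    recall = tp / len(groundtruth) if groundtruth else 1.0
    return recall, fp
```

Averaging recall and summing false positives over all frames, while sweeping the detector threshold, yields the Recall vs. FPPF curves.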
Figure 6 compares ROC curves for our geometry-based method with and without object-shadow relationship refinement and centroid localization, a conventional full-frame detection method, and a standard motion-detection pipeline of registration, detection, and tracking. Figure 7 shows qualitative detection results.
Power Point Presentation (.pptx) (14.9 MB)
Vladimir Reilly, Berkan Solmaz and Mubarak Shah, Geometric Constraints for Human Detection in Aerial Imagery, The 11th European Conference on Computer Vision (ECCV), 2010.