Modeling Traffic Patterns for Vision-Based Surveillance Applications
With the proliferation of wide-area video sensor networks, video surveillance, especially in public areas, is gaining importance at an unprecedented rate. From closed-circuit security systems that monitor individuals at airports, subways, concerts, and densely populated urban areas in general, to video sensor networks blanketing important locations within a city, automated vision-based surveillance is the tool required for processing these continuous streams of data. Over the years, a major effort in the vision community has been concentrated on developing fully automated surveillance, monitoring, and security systems. Such systems have the advantage of providing 24-hour active warning capabilities and are especially useful in the areas of law enforcement, national defense, border control, and airport security. Current systems are efficient and robust in their handling of common issues such as illumination changes, shadows, weather conditions, and noise in the imaging process. However, most of these systems have short or no memory in terms of observables in the scene. Due to this memoryless behavior, they lack the capability to learn environment parameters and to reason intelligently based on those parameters. Such learning and reasoning is an important characteristic of all cognitive systems, and it increases their adaptability and thus their practicality. A number of studies have provided strong psychological evidence of the importance of context for scene understanding in humans, for tasks such as handling long-term occlusions, detecting anomalous behavior, and even improving the low-level vision tasks of object detection and tracking.
We argue that over the period of its operation, an intelligent tracking system should be able to learn the scene from its observables and improve its performance based on this model. The high-level knowledge necessary to make such inferences derives from domain knowledge, past experience, scene geometry, learned traffic and target behavior patterns in the area, etc. This argument forms the basis of this project, in which we model and learn the scene activity observed by a static camera. The motion patterns of objects in the scene are modeled as a multivariate non-parametric probability density function of spatio-temporal variables. Kernel density estimation is used to learn this model in a completely unsupervised fashion, by observing the trajectories of objects over extended periods of time.
The scene model is learned by observing object trajectories over a long period of time. These trajectories may contain errors due to clutter and may be broken by short- and long-term occlusions. However, by observing enough tracks, one can acquire a fairly good understanding of the scene and infer scene properties and salient features such as commonly adopted paths, frequently visited areas, occlusion areas, and entry/exit points. It is assumed that tracks of moving objects are available for training. The KNIGHT object detection and tracking system, developed at the UCF Vision Lab, is used for obtaining the tracks. These tracks are then used in a training phase to discover the correlation in the observations by learning the motion pattern model in the form of a multivariate pdf of spatio-temporal parameters (i.e., the joint probability density of pairs of observations of an object occurring within certain time intervals). Kernel density estimation is used to learn the form of this probability density function.
After the learning phase, a unified Markov chain Monte Carlo (MCMC) sampling based framework is used to generate the most likely paths in the scene, to decide whether a given path is an anomaly to the learned model, and to estimate the probability density of the next state of a random walk based on its previous states. These predictions based on the model are then used to improve the detection of foreground objects as well as to persistently track objects through short-term and long-term occlusions.
The work performed in this project is original in the following ways:
- A novel motion model is proposed that learns not only the scene semantics but also the behavior of traffic through arbitrary paths. Unlike other approaches, the learning is not limited to well-defined paths such as roads and walkways.
- The learning is accomplished using a joint five-dimensional model, unlike pixel-wise models and mixture or chain models. The proposed model represents the joint probability of a transition from any point in the image to any other point, together with the time taken to complete that transition.
- The temporal dimension of traffic patterns is explicitly modeled and included in the feature vector, enabling us to distinguish patterns of activity that correspond to the same trajectory cluster but deviate strongly in the temporal dimension. This is more general than modeling pixel-wise velocities.
- Instead of fitting parametric models to the data, we propose learning track information using kernel density estimation. This allows for a richer model, and the density retained at each point in the feature space accurately reflects the training data.
- Rather than exhaustively searching the feature space for the most probable predictions, we propose to use stochastic methods to sample from the learned distribution and use each sample as a prediction with a computed probability. Sampling thus serves as the process propagation function in our state estimation framework.
- Unlike most of the previous work reported in this section, which is targeted towards one or two similar applications, we apply the proposed probabilistic framework to solve a variety of problems that are commonly encountered in surveillance and scene analysis.
Learning the Transition Distribution using KDE
The object transition model uses a single five-dimensional feature z = (X, Y, Δt), where X is the two-dimensional initial location of the object in image coordinates, Y is the two-dimensional final location in image coordinates, and Δt is the time taken to complete the transition in milliseconds. The KNIGHT system outputs object trajectories as a series of observations, where each observation is associated with the location of the object centroid and the time at which the object was observed. After obtaining these trajectories, each distinct pair of observations belonging to the same object is added to the kernel density estimate as the five-dimensional data point z, where the centroid location of the object in the first observation becomes X, the location in the second observation becomes Y, and Δt is the time difference between the two observations. Δt is assumed to be at most 5000 milliseconds, both to bound the number of data points per trajectory and because observations, even of the same object, occurring more than 5 seconds apart are assumed to be uncorrelated. Kernel density estimation is used as the learning methodology.
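The pair-extraction step and the density estimate can be sketched as follows. The toy trajectories, the helper name `extract_transitions`, and the use of SciPy's `gaussian_kde` with its default bandwidth are illustrative assumptions, not the project's actual implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde

MAX_DT_MS = 5000  # observations further apart are treated as uncorrelated

def extract_transitions(trajectory):
    """Turn one trajectory [(x, y, t_ms), ...] into 5-D transition samples
    z = (x1, y1, x2, y2, dt) for every ordered pair within MAX_DT_MS."""
    samples = []
    for i in range(len(trajectory)):
        x1, y1, t1 = trajectory[i]
        for j in range(i + 1, len(trajectory)):
            x2, y2, t2 = trajectory[j]
            dt = t2 - t1
            if dt > MAX_DT_MS:
                break  # observations are time-ordered, so later pairs are wider
            samples.append((x1, y1, x2, y2, dt))
    return samples

# Hypothetical tracker output: two short trajectories (x, y, t in ms).
tracks = [
    [(10, 20, 0), (12, 23, 1000), (15, 25, 2500)],
    [(40, 40, 0), (42, 41, 800), (45, 43, 1900), (50, 47, 3200)],
]

data = np.array([z for trk in tracks for z in extract_transitions(trk)],
                dtype=float)
kde = gaussian_kde(data.T)  # one Gaussian kernel per transition sample
# Evaluate the density of a candidate transition (column-vector input):
density = kde(np.array([[11.0, 21.0, 14.0, 24.0, 2000.0]]).T)
```

With real training data the sample count is much larger, and a hand-tuned bandwidth per dimension would normally replace the default rule-of-thumb bandwidth used here.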
Fig: Two scenes used for testing. Tracks observed during training are shown in blue.
Fig: Maps representing marginal probability of an object, (a) reaching each point in the image, (b) starting from each point in the image, in any duration of time. (c) and (d) show similar maps for a different scene.
Fig: Regions of maps showing probability of reaching any point in the map starting from the point G.
Applications of Proposed Model
After learning, a joint MCMC-based framework is used to sample from the model and generate predictions for future locations of objects given the current state. These predictions are then used to address diverse problems that are commonly encountered in surveillance scenarios.
– Generating Likely Tracks
Generation of likely paths is an important aspect of modeling traffic patterns. Given the current location of an object, such a simulated path amounts to a prediction of the object's future behavior. We sample from the learned model of transition patterns to generate behavior predictions, expecting that a small number of paths should adequately reflect observed trajectories through walkways, roads, etc. Starting at random initial states in the image, sampling from the distribution yields paths that are usually followed by the traffic. The figure below shows some of these random walks. It should be noted that no observations from the tracking algorithm were used in this experiment; the likely paths are purely simulated from the learned distribution.
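The sampling step can be sketched with a minimal Metropolis-Hastings chain. Here `transition_density` is a hypothetical stand-in for the learned KDE (the real model is the five-dimensional density described earlier), and the step sizes and burn-in length are arbitrary illustrative choices.

```python
import math
import random

def transition_density(x1, y1, x2, y2, dt):
    # Stand-in for the learned KDE: favors diagonal moves covering about
    # 1 pixel per 100 ms. Purely illustrative.
    dx, dy = x2 - x1, y2 - y1
    step = math.hypot(dx, dy)
    return math.exp(-((step - dt / 100.0) ** 2) / 8.0 - ((dx - dy) ** 2) / 8.0)

def sample_next(x, y, dt, rng, burn=50, sigma=5.0):
    """Draw one next location from p(Y | X=(x, y), dt) by running a short
    Metropolis-Hastings chain with a symmetric Gaussian proposal."""
    cx, cy = x, y  # chain starts at the current location
    for _ in range(burn):
        px, py = cx + rng.gauss(0, sigma), cy + rng.gauss(0, sigma)
        p_new = transition_density(x, y, px, py, dt)
        p_old = transition_density(x, y, cx, cy, dt)
        if p_old == 0 or rng.random() < min(1.0, p_new / p_old):
            cx, cy = px, py  # accept the proposal
    return cx, cy

def simulate_path(start, steps, dt=1000, seed=7):
    """Chain sampled transitions into one simulated likely path."""
    rng = random.Random(seed)
    path = [start]
    x, y = start
    for _ in range(steps):
        x, y = sample_next(x, y, dt, rng)
        path.append((x, y))
    return path

path = simulate_path((0.0, 0.0), steps=5)
```

Because the proposal is symmetric, the acceptance ratio reduces to the ratio of target densities, which is why the unnormalized KDE value can be used directly.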
Fig: Examples of simulated likely paths generated by the proposed algorithm using Metropolis-Hastings sampling. Tracks are initialized by manually selecting random points.
– Improvement of Foreground Detection
The intensity difference of objects from the background has been a widely used criterion for object detection, but temporal persistence is also an intrinsic property of foreground objects: unless an object exits the scene or becomes occluded, it must either stay in place or move to a location within the spatial vicinity of its current observation. Since the proposed transition model incorporates the probabilities of movement of objects from one location to another, it can be used to improve the foreground models. It should be noted, however, that this method alone cannot model the foreground; it must be used in conjunction with an appearance-based model such as a mixture of Gaussians. Essentially, the transition probabilities of objects from one point to another are used as evidence of temporal persistence of the foreground, for each pixel in the foreground blob in the previous time instant.
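The idea can be sketched on a tiny grid. The `transition_prob` kernel below is an illustrative stand-in for the learned model (marginalized over Δt), and the simple blend in `combined_foreground` is one plausible way to fuse it with an appearance likelihood, not the project's actual formulation.

```python
import math
import numpy as np

def transition_prob(x0, y0, x1, y1, sigma=1.5):
    # Illustrative stand-in for the learned transition model: probability
    # falls off with displacement between consecutive frames.
    d2 = (x1 - x0) ** 2 + (y1 - y0) ** 2
    return math.exp(-d2 / (2 * sigma ** 2))

def persistence_map(prev_fg_mask, shape):
    """For each pixel, record the strongest evidence that some foreground
    pixel from the previous frame could have transitioned to it."""
    evidence = np.zeros(shape)
    for y0, x0 in zip(*np.nonzero(prev_fg_mask)):
        for y1 in range(shape[0]):
            for x1 in range(shape[1]):
                evidence[y1, x1] = max(evidence[y1, x1],
                                       transition_prob(x0, y0, x1, y1))
    return evidence

def combined_foreground(appearance_prob, persistence, alpha=0.5):
    """Blend an appearance-based likelihood (e.g. mixture of Gaussians)
    with the temporal-persistence evidence."""
    return alpha * appearance_prob + (1 - alpha) * persistence

prev_mask = np.zeros((7, 7), dtype=bool)
prev_mask[3, 3] = True  # one foreground pixel in the previous frame
evidence = persistence_map(prev_mask, prev_mask.shape)
```

Pixels near the previous foreground blob receive high persistence evidence, so weak appearance responses there are reinforced while isolated noise far from any blob is suppressed.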
Fig: Foreground Modeling Results: Columns (a) and (b) show results of object detection. The images in top row are without and bottom row are with using the proposed model. Green and red bounding boxes show true and false detections respectively. (c) and (d) show results of blob tracking. Red tracks are original broken tracks and black ones are after improved foreground modeling.
– Anomaly Detection
If the tracking data used to model the state transition distribution spans a sufficiently long period of time, a sampling algorithm will be very unlikely to generate a track that is anomalous with respect to the usual patterns of activity and motion in that scene. This observation forms the basis of our anomaly detection algorithm. The algorithm generates its own predictions for future states using MCMC sampling, without using the current observation from the tracker. It then compares the actual measurements of objects with the predicted tracks and computes a difference measure between them.
This approach suffices to find a sequence of transitions significantly different from the predictions of the state transition distribution, and can easily identify an anomalous event in terms of motion patterns. Using this formulation, trajectories that are spatially incoherent or temporally inconsistent with normal behavior can be identified, e.g., the presence of objects in unusual areas or significant speed variation, respectively.
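The comparison step can be sketched as follows. The mean pointwise Euclidean distance and the fixed threshold are illustrative assumptions; the predicted tracks would come from the MCMC sampler rather than being hard-coded.

```python
import math

def track_distance(observed, predicted):
    """Mean Euclidean distance between time-aligned track points."""
    n = min(len(observed), len(predicted))
    return sum(math.dist(observed[i], predicted[i]) for i in range(n)) / n

def is_anomalous(observed, predicted_tracks, threshold):
    """Flag the observed track if it is far from every sampled prediction."""
    return min(track_distance(observed, p) for p in predicted_tracks) > threshold

# Hypothetical predictions sampled from the model: traffic moves along y ≈ 0.
predictions = [[(i, 0.0) for i in range(10)], [(i, 1.0) for i in range(10)]]
typical  = [(i, 0.5)  for i in range(10)]  # close to the predictions
atypical = [(i, 20.0) for i in range(10)]  # spatially anomalous

print(is_anomalous(typical, predictions, threshold=5.0))   # False
print(is_anomalous(atypical, predictions, threshold=5.0))  # True
```

Taking the minimum over predictions means a track is only flagged when no plausible sampled behavior explains it, which keeps multi-modal scenes (several valid paths) from triggering false alarms.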
Fig: Results of Anomaly detection: (a) Spatially anomalous, (b) and (c) Temporally anomalous, and (d) Suspicious behavior due to presence over large distance or extended period. Blue track represents the actual (observed) track. Red and black tracks correspond to typical and atypical (anomalous) predicted paths respectively.
– Persistent Tracking through Occlusions
Persistent tracking requires modeling of the spatio-temporal and appearance properties of the targets. Traditionally, parametric motion models, such as constant velocity or constant acceleration, are used to enforce spatio-temporal constraints. These models usually fail when the paths adopted by objects are arbitrary. The proposed model of learning traffic parameters handles these shortcomings when occlusions are not permanently present in the scene and the patterns of motion through them have previously been learned, e.g., person-to-person occlusions, or large objects such as vehicles that hide smaller moving objects from view. We use the proposed distribution to describe a solution to these problems.
Essentially, once the tracking algorithm realizes that an object has been lost, it starts generating predicted locations for that object and continues until the object becomes visible again. This way, when the object reappears after the occlusion, the current prediction is much closer to its actual position than the last observed position is, and the correspondence problem becomes simpler and more robust to solve.
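The roll-forward-and-reassociate logic can be sketched as below. The `predict_next` drift is a hypothetical stand-in for sampling the learned transition distribution, and the nearest-detection matching is a deliberately simplified correspondence rule.

```python
import math

def roll_forward(last_seen, n_missing, predict_next):
    """While measurements are missing, propagate the track with the model's
    predicted next locations instead of freezing at the last observation."""
    state = last_seen
    for _ in range(n_missing):
        state = predict_next(state)
    return state

def reassociate(prediction, detections):
    """Match the reappearing object to the detection nearest the prediction."""
    return min(detections, key=lambda d: math.dist(prediction, d))

# Illustrative learned drift: objects on this path move ~2 px right per frame.
predict_next = lambda p: (p[0] + 2.0, p[1])

last_seen = (0.0, 0.0)
prediction = roll_forward(last_seen, n_missing=5, predict_next=predict_next)
detections = [(11.0, 1.0), (2.0, 0.0)]  # candidates after the occlusion ends
print(reassociate(prediction, detections))  # (11.0, 1.0), near the prediction
print(reassociate(last_seen, detections))   # (2.0, 0.0): nearest-to-last-seen
                                            # picks the wrong detection
```

The contrast in the last two lines is the point of the section: matching against the rolled-forward prediction recovers the correct object, whereas matching against the stale last observation does not.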
Fig: For each row, (a) shows the observed tracks, in blue and red, that have been wrongly labeled; (b) and (c) show the stitched parts of the tracks in black and the actual tracks in red and blue, respectively.
Fig: Example of persistent tracking for multiple simultaneous objects with overlapping or intersecting tracks undergoing occlusion. (Left) Actual original tracks (ground truth) (Right) Broken tracks due to simulated occlusion shown as black region.
Fig: Results for scenario shown in previous figure. Green track is the ground truth. Tracking through occlusion using Kalman filter is shown in white and yellow tracks are generated using the proposed approach. Notice that both methods can recover well once the measurement is available again, but during occlusion the proposed method stays closer to ground truth.