Part-based Multiple-Person Tracking
Single camera-based multiple-person tracking is often hindered by difficulties such as occlusion and changes in appearance. In this paper, we address such problems by proposing a robust part-based tracking-by-detection framework. Human detection using part models has become quite popular, yet its extension in tracking has not been fully explored. Our approach learns part-based person-specific SVM classifiers which capture the articulations of the human bodies in dynamically changing appearance and background. With the part-based model, our approach is able to handle partial occlusions in both the detection and the tracking stages. In the detection stage, we select the subset of parts which maximizes the probability of detection, which significantly improves the detection performance in crowded scenes. In the tracking stage, we dynamically handle occlusions by distributing the score of the learned person classifier among its corresponding parts, which allows us to detect and predict partial occlusions, and prevent the performance of the classifiers from being degraded. Extensive experiments using the proposed method on several challenging sequences demonstrate state-of-the-art performance in multiple-people tracking.
Our tracking framework consists of the steps illustrated in figure 3. First, we apply an extended part-based human detector to every frame and extract part features from all detections. Person-specific SVM classifiers are trained on these detections and subsequently used to classify new detections. We use a greedy bipartite algorithm to associate detections with trajectories, where each association is evaluated using three affinity terms: position, size, and the score of the person-specific classifier. Additionally, during tracking, we reason about the partial occlusion of a person using a dynamic occlusion model. In particular, partial occlusions are learned by examining the contribution of each individual part through a linear SVM. This inferred occlusion information is used in two ways. First, the classifier is adaptively updated with only the non-occluded parts, which prevents it from degrading over the occlusion period. Second, the discovered occlusion information is passed to the next frame in order to penalize the contribution of the occluded parts when applying the person classifier.
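The greedy bipartite association can be sketched as follows. The specific affinity functions, weights, and gating threshold below are illustrative assumptions, not the paper's exact formulation; only the overall scheme (score every trajectory-detection pair by position, size, and classifier-score terms, then match greedily in order of affinity) follows the description above.

```python
import numpy as np

def affinity(track, det, classifier_score):
    """Combined affinity of a detection to a trajectory.

    Position and size terms use hypothetical exponential kernels;
    the third term is the person-specific classifier score.
    """
    pos = np.exp(-np.linalg.norm(np.subtract(track["pos"], det["pos"])) / 50.0)
    size = np.exp(-abs(track["size"] - det["size"]) / track["size"])
    return pos * size * classifier_score

def greedy_bipartite(tracks, dets, scores, threshold=0.1):
    """Greedily match the highest-affinity (track, detection) pairs first."""
    pairs = [(affinity(t, d, scores[ti][di]), ti, di)
             for ti, t in enumerate(tracks) for di, d in enumerate(dets)]
    pairs.sort(reverse=True)  # best affinities first
    used_t, used_d, matches = set(), set(), []
    for a, ti, di in pairs:
        if a < threshold:          # remaining pairs are too weak to associate
            break
        if ti not in used_t and di not in used_d:
            matches.append((ti, di))
            used_t.add(ti)
            used_d.add(di)
    return matches
```

Unmatched detections can then seed new trajectories, and unmatched trajectories can be held over for occlusion reasoning.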
We employ the deformable part-based model for human detection. However, such a detector suffers when the human is occluded: the final score is aggregated over all the parts, without considering that some parts may be occluded. To address this problem, we propose to infer occlusion information from the scores of the parts and consequently utilize only the parts detected with high confidence. Instead of aggregating the scores of all the parts, we select the subset of parts S* which maximizes the detection score

S* = argmax_{S_m} [ b + (1/|S_m|) Σ_{p_i ∈ S_m} 1/(1 + exp(A·s(p_i) + B)) ]

where b is a bias term, s(p_i) is the score of part i, |S_m| is the cardinality of the subset, and the sigmoid function normalizes the scores of the parts. The parameters A and B are learned by the sigmoid fitting approach. Note that this equation corresponds to the average normalized score of the parts in the subset; since the average is sensitive to outliers, it readily exposes mis-detected parts. By maximizing this equation we therefore obtain the most reliable subset of parts and its corresponding probability of detection, which we use as the final detection score. We consider only three candidate subsets of parts, namely head only, upper body, and full body; we found these subsets representative enough for most realistic scenarios.
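The subset-selection rule can be sketched in Python. The part indices assigned to each subset and the sigmoid parameters A and B below are illustrative assumptions (in the paper A and B are learned by sigmoid fitting); only the maximize-the-average-normalized-score logic follows the description above.

```python
import math

def sigmoid(score, A=-1.5, B=0.0):
    """Platt-style normalization; A and B would be learned by sigmoid
    fitting (the values here are illustrative assumptions)."""
    return 1.0 / (1.0 + math.exp(A * score + B))

# Candidate subsets over 8 DPM parts (the index assignments are hypothetical):
SUBSETS = {
    "head only":  [0],
    "upper body": [0, 1, 2, 3],
    "full body":  list(range(8)),
}

def best_subset(part_scores, bias=0.0):
    """Return the subset maximizing bias + average sigmoid-normalized score."""
    best_name, best_val = None, -float("inf")
    for name, idx in SUBSETS.items():
        val = bias + sum(sigmoid(part_scores[i]) for i in idx) / len(idx)
        if val > best_val:
            best_name, best_val = name, val
    return best_name, best_val
```

Because occluded parts receive strongly negative scores, their near-zero normalized scores drag the average down, so the maximization naturally excludes them.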
This figure compares human detection results using the traditional Deformable Part-based Model (left) with our approach (right), where red boxes mark humans detected as full bodies, green boxes mark humans detected as upper bodies, and yellow boxes mark humans detected as heads only. The traditional DPM fails to detect occluded humans since it lacks an explicit occlusion model, while our approach identifies the occluded parts and excludes them from the detection score, achieving significant improvements especially in crowded scenes.
If a partially occluded person is detected and associated to a trajectory, the classifier will be updated with noisy features and its performance will gradually degrade. We therefore employ an occlusion reasoning method to handle this problem. For detections with a low classifier score, we first infer which parts are occluded: a part with a negative score is most likely occluded. We then adaptively update the classifier by extracting features only from the high-confidence parts, which are likely non-occluded, while the features for the occluded parts are carried over from the feature vectors of previous frames. In this way, the occluded parts are excluded from the classifier update. Moreover, occlusions are highly correlated across adjacent frames: a part occluded in one frame has a high probability of remaining occluded in the following frames. We exploit this temporal smoothness by introducing an occlusion prediction step into the data association, which further improves accuracy.
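A simplified sketch of this occlusion-aware update, assuming one feature vector per part and the negative-score occlusion rule described above (the array shapes and the zero threshold are illustrative):

```python
import numpy as np

def occluded_mask(part_scores, threshold=0.0):
    """Parts scoring below the threshold are treated as likely occluded
    (following the rule that a part with a negative score is most likely
    occluded)."""
    return np.asarray(part_scores) < threshold

def update_features(new_feats, prev_feats, part_scores):
    """Build the update vector for the person-specific classifier.

    new_feats / prev_feats: (num_parts, feat_dim) arrays (hypothetical
    layout). Occluded rows are replaced with the previous frame's
    features so only visible parts contribute fresh information.
    """
    occ = occluded_mask(part_scores)
    out = np.where(occ[:, None], prev_feats, new_feats)
    return out, occ
```

The returned mask can also be propagated to the next frame to penalize the same parts during data association, implementing the occlusion prediction step.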
This figure shows example results of our dynamic occlusion handling approach. The top row shows the original images, and the bottom row shows the detected humans and their corresponding parts, where the occluded parts are shown in red.
We extensively evaluated the proposed method on the Oxford Town Center data set and on two new data sets that we collected: the Parking Lot data set and the Airport data set. These data sets pose a wide range of significant challenges, including occlusion, crowded scenes, and cluttered backgrounds. In all sequences we use only the visual information, without any scene knowledge such as camera calibration or the locations of static obstacles. Note that we selected these data sets because they provide high-quality imagery, which suits our approach since the part-based model requires detailed body information.
This figure compares the tracking performance obtained with different features. We evaluate our tracking results using the standard CLEAR MOT metrics. MOTA accounts for false negatives, false positives, and identity switches, and is therefore widely accepted as the main gauge of performance for tracking methods.
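For reference, MOTA combines the three error types into a single score. A minimal computation following the standard CLEAR MOT definition:

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple Object Tracking Accuracy:
    1 - (FN + FP + IDSW) / total number of ground-truth objects."""
    return 1.0 - (false_negatives + false_positives + id_switches) / float(num_gt)
```

A perfect tracker scores 1.0; note MOTA can go negative when the error count exceeds the number of ground-truth objects.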
The PNNL Parking Lot sequence is a moderately crowded scene containing groups of pedestrians walking in queues. Its challenges include long-term inter-object occlusions, camera jitter, and similar appearance among the humans in the scene. The sequence consists of 1,000 frames with up to 14 pedestrians, at a frame resolution of 1920 x 1080 and a frame rate of 29 fps.
To download the data set click here.
Guang Shu, Afshin Dehghan, Omar Oreifej, Emily Hand, and Mubarak Shah, "Part-based Multiple-Person Tracking with Partial Occlusion Handling," IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2012, Providence, RI, June 16-21, 2012.