(MP)2T: Multiple People Multiple Parts Tracker
We present a method for multi-target tracking that exploits the persistence in detection of object parts. While the implicit representation and detection of body parts have recently been leveraged for improved human detection, ours is the first method that attempts to temporally constrain the location of human body parts with the express purpose of improving pedestrian tracking. We pose the problem of simultaneous tracking of multiple targets and their parts in a network flow optimization framework and show that parts of this network need to be optimized separately and iteratively, due to inter-dependencies of node and edge costs. Given potential detections of humans and their parts separately, an initial set of pedestrian tracklets is first obtained, followed by explicit tracking of human parts as constrained by initial human tracking. A merging step is then performed whereby we attempt to include part-only detections for which the entire human is not observable. This step employs a selective appearance model, which allows us to skip occluded parts in description of positive training samples. The result is high confidence, robust trajectories of pedestrians as well as their parts, which essentially constrain each other’s locations and associations, thus improving human tracking and parts detection. We test our algorithm on multiple real datasets and show that the proposed algorithm is an improvement over the state-of-the-art.
Highlights of the paper:
- Simultaneously accomplishes tracking of multiple people and their multiple parts.
- Does not rely on DPM part results; instead uses densely detected parts, postponing the person-track association until temporal consistency of the parts is reached.
- Takes advantage of as few as a single part's trajectory to infer the presence/absence and approximate trajectory of the whole person under severe occlusion.
Proposed Framework and Key Steps
We propose a framework which attempts to solve the pedestrian tracking problem by simultaneously constraining the detection, temporal persistence, and appearance similarity of the detected humans, as well as the observable or inferable parts they are composed of. This method follows three key steps:
- Step 1: Associate pedestrian detections to obtain several short tracklets.
- Step 2: Associate part detections by computing the likelihood leveraging pedestrian tracklets. Revert pedestrian associations that do not conform to part tracklets.
- Step 3: Perform simultaneous association between all tracklets jointly.
Given a video, we begin by applying a state-of-the-art human detector to detect pedestrians in all frames, and also obtain the detector confidence. We then create a flow network for these detections. The successive shortest paths algorithm is then used to obtain tracklets for each set of pedestrian detections. Due to problems inherent in the surveillance task, including mis-detections, merged detections, occlusions, clutter, and false positives, these tracklets are less than ideal: they break at points of occlusion and mis-detection, while merged detections and false positives allow connections between unrelated shorter tracklets. In order to break these wrong associations, we attempt to leverage the temporal persistence of pedestrian parts. We therefore first perform explicit tracking of these parts.
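As a concrete illustration of Step 1, the sketch below builds such a flow network over detections and greedily extracts negative-cost source-to-sink paths as tracklets. This is not the authors' implementation: the detection format (`frame`, `box`, `conf`), the cost constants, and the greedy node removal (in place of the residual-graph updates of true successive shortest paths) are all simplifying assumptions.

```python
# Minimal sketch of network-flow tracklet extraction (Step 1).
# Node costs come from detector confidence; edge costs from box overlap
# between temporally close detections. The graph is a DAG, so Bellman-Ford
# safely handles the negative edge weights.
import math
import networkx as nx

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def tracklets_from_detections(dets, entry_cost=5.0, max_gap=2):
    """dets: list of dicts {'frame': int, 'box': (x,y,w,h), 'conf': float in (0,1)}."""
    G = nx.DiGraph()
    for i, d in enumerate(dets):
        # Split each detection into (in, out) with a negative "reward" cost,
        # so confident detections pull a path's total cost below zero.
        G.add_edge(('in', i), ('out', i),
                   weight=math.log((1 - d['conf']) / d['conf']))
        G.add_edge('S', ('in', i), weight=entry_cost)   # track birth
        G.add_edge(('out', i), 'T', weight=entry_cost)  # track death
    for i, di in enumerate(dets):
        for j, dj in enumerate(dets):
            gap = dj['frame'] - di['frame']
            if 1 <= gap <= max_gap and iou(di['box'], dj['box']) > 0.3:
                # Transition cost: low for spatially consistent, nearby frames.
                G.add_edge(('out', i), ('in', j),
                           weight=-math.log(iou(di['box'], dj['box'])) + gap - 1)
    tracks = []
    while True:
        try:
            path = nx.shortest_path(G, 'S', 'T', weight='weight',
                                    method='bellman-ford')
        except nx.NetworkXNoPath:
            break
        if nx.path_weight(G, path, weight='weight') >= 0:
            break  # no more beneficial (negative-cost) tracks
        ids = [n[1] for n in path if isinstance(n, tuple) and n[0] == 'in']
        tracks.append(ids)
        # Greedy simplification: consume the detections used by this track.
        G.remove_nodes_from([('in', i) for i in ids] + [('out', i) for i in ids])
    return tracks
```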
For part tracking, we take the detections for all parts within the spatiotemporal tubes representing a pedestrian tracklet. Again, we employ the k-shortest paths algorithm, for which we need the node and edge weights in the flow network of body parts. The following figure illustrates the part tracking.
The following figure shows Gaussian models of relative part locations.
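To illustrate how such a relative-location model might be built and used to score candidate part detections inside a tracklet's tube, here is a minimal sketch. Normalizing part offsets by the person box size and the use of a full 2D covariance are assumptions for the example, not the paper's exact formulation.

```python
# Sketch: fit a 2D Gaussian over a part's location relative to the person
# box, then evaluate the likelihood of a new candidate part detection.
import numpy as np
from scipy.stats import multivariate_normal

def fit_relative_location_model(person_boxes, part_centers):
    """person_boxes: (N, 4) array of (x, y, w, h); part_centers: (N, 2) array.
    Offsets are normalized by box size so the model is scale-invariant."""
    boxes = np.asarray(person_boxes, dtype=float)
    parts = np.asarray(part_centers, dtype=float)
    rel = (parts - boxes[:, :2]) / boxes[:, 2:4]
    mean, cov = rel.mean(axis=0), np.cov(rel.T) + 1e-6 * np.eye(2)
    return multivariate_normal(mean, cov)

def part_location_likelihood(model, person_box, part_center):
    """Likelihood of a candidate part center given the person's box."""
    x, y, w, h = person_box
    return model.pdf(((part_center[0] - x) / w, (part_center[1] - y) / h))
```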
We use pedestrian-specific appearance modeling to feed the SVM model; the following figure shows three different models for learning pedestrian appearance. The first row demonstrates a simple average over the entire person bounding box, the middle row uses detected DPM parts, and our tracked parts are used for the examples in the bottom row. Using the entire person bounding box, the model is very vague. Since a certain minimum number of parts is always detected in DPM, that model contains background when the person is partially occluded. Our handling of occlusion and clutter for parts, while temporally aligning them, makes our model more accurate.
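The sketch below illustrates the idea behind the selective appearance model: occluded parts are simply skipped (zeroed out) in the descriptor rather than contaminating it with background. The per-part hue-saturation histograms, bin counts, and the OpenCV/sklearn stand-ins are assumptions for illustration, not the paper's exact features.

```python
# Sketch of a selective per-part appearance descriptor for SVM training.
# Visible parts contribute a color histogram; occluded parts a zero block.
import numpy as np
import cv2
from sklearn.svm import LinearSVC

N_PARTS, BINS = 8, 8  # 8 tracked parts, 8x8 hue-saturation histogram each

def part_histogram(frame_bgr, box):
    """Normalized HS histogram of one part patch; box = (x, y, w, h),
    assumed to lie inside the frame."""
    x, y, w, h = map(int, box)
    patch = cv2.cvtColor(frame_bgr[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([patch], [0, 1], None, [BINS, BINS], [0, 180, 0, 256])
    return cv2.normalize(hist, None).flatten()

def selective_descriptor(frame_bgr, part_boxes):
    """part_boxes: list of length N_PARTS; None marks an occluded part."""
    blocks = [part_histogram(frame_bgr, b) if b is not None
              else np.zeros(BINS * BINS) for b in part_boxes]
    return np.concatenate(blocks)

# Training (illustrative): positives from the target's own tracklet,
# negatives from other tracklets, e.g.
#   X = np.stack([selective_descriptor(f, boxes) for f, boxes in samples])
#   clf = LinearSVC().fit(X, y)
```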
Merging of Pedestrian and Part Tracklets
Finally, we employ pedestrian parts to merge the tracklets into correct, high-confidence trajectories, again by leveraging the flow network optimization algorithm. Entire pedestrian tracklets now act as nodes, whose costs are the output of the previous network flow (the chosen nodes and edges). Two tracklets to be associated may have a temporal gap between them due to problems with pedestrian detections within that gap. The figure below illustrates this final step of merging pedestrian tracklets.
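As a hedged sketch of what a tracklet-to-tracklet link cost across such a gap could look like, the example below combines constant-velocity extrapolation with an appearance score. The tracklet summary fields and weighting constants are illustrative assumptions, not values from the paper.

```python
# Sketch: edge cost between a tracklet's tail and another tracklet's head.
import numpy as np

def merge_cost(tail, head, appearance_score, max_gap=30, alpha=1.0, beta=1.0):
    """tail/head: dicts with 'frame', 'center' (x, y), 'velocity' (vx, vy).
    appearance_score: SVM score of head's descriptor under tail's model."""
    gap = head['frame'] - tail['frame']
    if not (0 < gap <= max_gap):
        return np.inf  # temporally incompatible tracklets are never linked
    # Extrapolate the tail's motion across the gap; penalize the miss distance.
    predicted = np.asarray(tail['center']) + gap * np.asarray(tail['velocity'])
    motion_err = np.linalg.norm(predicted - np.asarray(head['center']))
    return alpha * motion_err - beta * appearance_score
```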
We evaluated our method using four challenging datasets: the publicly available Town Center and PETS2009 datasets, and two new sequences, Parking Lot and Airport.
The quantitative comparisons (all values in percent) are shown in the following table, followed by representative frames showing detection and tracking results on these four datasets.
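For reference, MOTP and MOTA in the tables below are the standard CLEAR MOT metrics (Bernardin and Stiefelhagen, 2008):

$$\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t\right)}{\sum_t \mathrm{GT}_t}, \qquad \mathrm{MOTP} = \frac{\sum_{i,t} d_{i,t}}{\sum_t c_t},$$

where FN, FP, and IDSW count missed targets, false positives, and identity switches, GT_t is the number of ground truth targets in frame t, d_{i,t} is the bounding-box overlap of matched pair i, and c_t is the number of matches in frame t.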
Data set | Method | MOTP | MOTA | Prec | Rec |
---|---|---|---|---|---|
Town Center | Benfold & Reid | 80.3 | 61.3 | 82 | 79 |
 | Yamaguchi | 70.9 | 63.3 | 71.1 | 64 |
 | Pellegrini | 70.7 | 63.4 | 70.8 | 64.1 |
 | Zhang | 71.5 | 65.7 | 71.5 | 66.1 |
 | Leal-Taixe | 71.5 | 67.3 | 71.6 | 67.6 |
 | Our baseline | 68.8 | 63.5 | 84.9 | 78.9 |
 | Proposed | 71.6 | 75.7 | 93.6 | 81.8 |
PETS 2009 | Breitenstein | 59 | 74 | 89 | 60 |
 | Berclaz | 62 | 78 | 78 | 62 |
 | Conte | 57 | 81 | 85 | 58 |
 | Berclaz | 52 | 83 | 82 | 53 |
 | Alahi | 52 | 83 | 69 | 53 |
 | Our baseline | 73.7 | 84.6 | 96.8 | 93.2 |
 | Proposed | 76 | 90.7 | 96.8 | 95.2 |
Parking Lot | Our baseline | 72.5 | 83.5 | 92.6 | 95.1 |
 | Proposed | 77.5 | 88.9 | 93.6 | 96.5 |
Airport | Our baseline | 67.7 | 32.7 | 76.5 | 54.9 |
 | Proposed | 67.9 | 46.6 | 89.9 | 55.4 |
Moreover, we explicitly ground-truthed human parts and quantified our part tracking results as follows.
Part ID | Method | MOTP | MOTA | Prec | Rec |
---|---|---|---|---|---|
Part 1 (Head) | HOG detection | 45.8 | – | 35 | 52.7 |
 | Benfold & Reid | 50.8 | 45.4 | 73.8 | 71 |
 | Our tracking | 55.4 | 62.1 | 87.5 | 73.1 |
Part 2 (Left shoulder) | HOG detection | 42.5 | – | 29.8 | 43.6 |
 | Our tracking | 56.8 | 44.2 | 76.9 | 64.2 |
Part 3 (Right shoulder) | HOG detection | 42.1 | – | 37.3 | 54.6 |
 | Our tracking | 61.2 | 59.9 | 86.2 | 72.2 |
Part 4 (Left bottom torso) | HOG detection | 35.5 | – | 8 | 11.7 |
 | Our tracking | 5.2 | 56.6 | 84.6 | 70 |
Part 5 (Right bottom torso) | HOG detection | 26.2 | – | 6.4 | 9.4 |
 | Our tracking | 52.7 | 56.2 | 84.3 | 69.7 |
Part 6 (Upper legs) | HOG detection | 34.9 | – | 13.7 | 20 |
 | Our tracking | 49.1 | 50.8 | 82.6 | 65.2 |
Part 7 (Left lower leg) | HOG detection | 31.7 | – | 31.9 | 46.7 |
 | Our tracking | 59.4 | 47.6 | 80.3 | 63.9 |
Part 8 (Right lower leg) | HOG detection | 26.5 | – | 28.4 | 41.6 |
 | Our tracking | 56 | 37.5 | 74.8 | 57.5 |
Hamid Izadinia, Imran Saleemi, Wenhui Li and Mubarak Shah, “(MP)2T: Multiple People Multiple Parts Tracker”, European Conference on Computer Vision 2012, Florence, Italy, October 7-13, 2012. [PDF][PPT(203MB)][BibTeX]
The following files provide the part annotations for the Town Center dataset:
ReadMe file (1KB): ReadMe file
Parts annotations (547KB): Parts annotations
Demo of parts annotations (12.4MB): Video showing overlaid annotated parts