
(MP)2T: Multiple People Multiple Parts Tracker

Introduction

We present a method for multi-target tracking that exploits the persistence in detection of object parts. While the implicit representation and detection of body parts have recently been leveraged for improved human detection, ours is the first method that attempts to temporally constrain the location of human body parts with the express purpose of improving pedestrian tracking. We pose the problem of simultaneous tracking of multiple targets and their parts in a network flow optimization framework and show that parts of this network need to be optimized separately and iteratively, due to inter-dependencies of node and edge costs. Given potential detections of humans and their parts separately, an initial set of pedestrian tracklets is first obtained, followed by explicit tracking of human parts as constrained by initial human tracking. A merging step is then performed whereby we attempt to include part-only detections for which the entire human is not observable. This step employs a selective appearance model, which allows us to skip occluded parts in description of positive training samples. The result is high confidence, robust trajectories of pedestrians as well as their parts, which essentially constrain each other’s locations and associations, thus improving human tracking and parts detection. We test our algorithm on multiple real datasets and show that the proposed algorithm is an improvement over the state-of-the-art.

Highlights of the paper:

  • Simultaneously tracks multiple people and their multiple parts.
  • Does not rely on DPM part results; instead, it uses dense part detections and postpones the part-to-person association until temporal consistency of the parts is established.
  • Can take advantage of as few as a single part’s trajectory to infer the presence/absence and approximate trajectory of the whole person under severe occlusion.

Proposed Framework and Key Steps

We propose a framework which attempts to solve the pedestrian tracking problem by simultaneously constraining the detection, temporal persistence, and appearance similarity of the detected humans, as well as the observable or inferable parts they are composed of. This method follows three key steps:

  • Step 1: Associate pedestrian detections to obtain several short tracklets.
  • Step 2: Associate part detections, computing their likelihoods by leveraging the pedestrian tracklets. Revert any pedestrian associations that do not conform to the part tracklets.
  • Step 3: Perform simultaneous association between all tracklets jointly.

Pedestrian Tracklets

Given a video, we begin by applying a state-of-the-art human detector to detect pedestrians in all frames and to obtain the detector confidence for each detection. We then create a flow network for these detections, and use the successive shortest paths algorithm to obtain tracklets for each set of pedestrian detections. Due to problems inherent in the surveillance task, including mis-detections, merged detections, occlusions, clutter, and false positives, these tracklets are less than ideal: they break at points of occlusion and mis-detection, while merged detections and false positives create connections between unrelated shorter tracklets. In order to break these wrong associations, we attempt to leverage the temporal persistence of pedestrian parts. We therefore first perform explicit tracking of these parts.
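To make this step concrete, the sketch below builds a small detection flow network and extracts tracklets with a greedy variant of successive shortest paths (removing each extracted path and re-solving, rather than maintaining a residual graph). The cost values, the IoU gating threshold, and the data layout are illustrative assumptions, not the parameters used in the paper.

```python
# A minimal sketch of tracklet generation (not the paper's exact implementation).
# Detections are dicts: {'frame': int, 'box': (x1, y1, x2, y2), 'score': float}.
import networkx as nx

ENTRY_COST = 2.0   # hypothetical cost to start a new track
EXIT_COST = 2.0    # hypothetical cost to end a track

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def build_network(dets):
    """Flow network: each detection is split into an in/out node pair whose
    connecting edge carries a negative cost for confident detections."""
    G = nx.DiGraph()
    G.add_nodes_from(['S', 'T'])
    for i, d in enumerate(dets):
        G.add_edge(('u', i), ('v', i), weight=-d['score'])   # detection edge
        G.add_edge('S', ('u', i), weight=ENTRY_COST)
        G.add_edge(('v', i), 'T', weight=EXIT_COST)
        for j, e in enumerate(dets):
            # Transition edges between consecutive frames, gated by overlap.
            if e['frame'] == d['frame'] + 1 and iou(d['box'], e['box']) > 0.3:
                G.add_edge(('v', i), ('u', j),
                           weight=1.0 - iou(d['box'], e['box']))
    return G

def greedy_tracklets(G):
    """Pull out the cheapest source-to-sink path while its total cost is
    negative; a greedy stand-in for successive shortest paths."""
    tracklets = []
    while True:
        try:
            path = nx.bellman_ford_path(G, 'S', 'T', weight='weight')
        except (nx.NetworkXNoPath, nx.NodeNotFound):
            break
        if nx.path_weight(G, path, weight='weight') >= 0:
            break
        tracklets.append([n[1] for n in path
                          if n not in ('S', 'T') and n[0] == 'u'])
        G.remove_nodes_from([n for n in path if n not in ('S', 'T')])
    return tracklets
```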

Part Tracking

For part tracking, we take the detections of all parts within the spatiotemporal tube representing a pedestrian tracklet. Again, we employ the k-shortest paths algorithm, for which we need the node and edge weights of the flow network of body parts. The following figure shows an example of part tracking.
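As a sketch of how the part flow network is restricted to a tracklet's spatiotemporal tube, only part detections whose centers fall inside the tracked person box of the corresponding frame are kept; the same association machinery as above is then run on this reduced set. The containment test and data layout below are illustrative assumptions.

```python
def parts_in_tube(part_dets, tracklet_boxes):
    """Keep part detections inside a pedestrian tracklet's spatiotemporal tube.
    part_dets: list of {'frame': int, 'box': (x1, y1, x2, y2), 'score': float};
    tracklet_boxes: dict mapping frame -> tracked person box in that frame."""
    kept = []
    for d in part_dets:
        person = tracklet_boxes.get(d['frame'])
        if person is None:
            continue                      # frame not covered by this tracklet
        cx = (d['box'][0] + d['box'][2]) / 2
        cy = (d['box'][1] + d['box'][3]) / 2
        if person[0] <= cx <= person[2] and person[1] <= cy <= person[3]:
            kept.append(d)
    return kept
```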

The following figure shows Gaussian models of relative part locations.
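One way to realize such a model (the box-normalized offset convention below is our assumption for illustration) is to fit a 2-D Gaussian to each part's center offsets and use its log-likelihood when weighting candidate part detections:

```python
import numpy as np

def normalized_offset(part_box, person_box):
    """Part-center offset relative to the person-box center, normalized by
    the person-box width and height."""
    pw = person_box[2] - person_box[0]
    ph = person_box[3] - person_box[1]
    dx = (part_box[0] + part_box[2]) / 2 - (person_box[0] + person_box[2]) / 2
    dy = (part_box[1] + part_box[3]) / 2 - (person_box[1] + person_box[3]) / 2
    return dx / pw, dy / ph

def fit_relative_location(offsets):
    """Fit a 2-D Gaussian (mean, covariance) to observed offsets."""
    X = np.asarray(offsets)                 # shape (n, 2)
    return X.mean(axis=0), np.cov(X, rowvar=False)

def location_log_likelihood(offset, mu, cov):
    """Gaussian log-likelihood of an offset; usable (negated) as a node cost
    in the part flow network."""
    d = np.asarray(offset) - mu
    return -0.5 * (d @ np.linalg.inv(cov) @ d
                   + np.log(np.linalg.det(cov)) + 2 * np.log(2 * np.pi))
```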

Part-based Appearance Model

We use pedestrian-specific appearance modeling to train the SVM; the following figure shows three different models for learning pedestrian appearance. The first row shows a simple average over the entire person bounding box, the middle row uses detected DPM parts, and the bottom row uses our tracked parts. Using the entire person bounding box, the model is very vague. Since a certain minimum number of parts is always detected by DPM, that model includes background whenever the person is partially occluded. Our model handles occlusion and clutter at the part level while temporally aligning the parts, which makes it more accurate.
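A minimal sketch of the selective appearance idea follows: parts that are occluded in a given sample are simply left out of the descriptor, so they never contaminate the positive training examples. The per-part intensity histograms and the linear SVM are illustrative stand-ins, not the paper's exact features.

```python
import numpy as np
from sklearn.svm import LinearSVC

def selective_descriptor(part_patches, n_parts=8, bins=16):
    """Concatenate per-part intensity histograms; an occluded part (None)
    contributes an all-zero block instead of background pixels."""
    feats = []
    for p in range(n_parts):
        patch = part_patches.get(p)
        if patch is None:
            feats.append(np.zeros(bins))      # occluded part: skipped
        else:
            h, _ = np.histogram(patch, bins=bins, range=(0, 256), density=True)
            feats.append(h)
    return np.concatenate(feats)

# Hypothetical usage: positives are samples of one pedestrian's tracked
# parts, negatives come from other pedestrians; the trained SVM later
# scores candidate tracklet merges.
# X = np.stack([selective_descriptor(s) for s in samples])
# clf = LinearSVC().fit(X, labels)
```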

Merging of Pedestrian and Part Tracklets

Finally, we employ pedestrian parts to merge the tracklets into correct, high-confidence trajectories, again by leveraging the flow network optimization algorithm. Entire pedestrian tracklets now act as nodes, whose costs are given by the output of the previous network flow (the chosen nodes and edges). Two tracklets to be associated may be separated by a temporal gap caused by missing or unreliable pedestrian detections within that gap. The figure below illustrates this final step of merging pedestrian tracklets.
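One plausible form for the tracklet-linking edge cost is sketched below; the linear gap penalty and its trade-off against the appearance score are our assumptions for illustration, not the paper's exact terms.

```python
def merge_cost(t1, t2, appearance_score, max_gap=30):
    """Hypothetical cost of linking tracklet t1 to tracklet t2 in the final
    flow network. Linking is infeasible when t2 does not start after t1
    ends, or when the temporal gap is too long; otherwise a gap penalty is
    traded off against appearance similarity (e.g., the part-based SVM score
    of t2's descriptor under t1's appearance model)."""
    gap = t2['start_frame'] - t1['end_frame']
    if gap <= 0 or gap > max_gap:
        return float('inf')
    return gap / max_gap - appearance_score(t1, t2)
```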

Experiments and Results

We evaluated our method on four challenging datasets: the publicly available Town Center and PETS2009 datasets, and two new datasets, the Parking Lot and Airport sequences.

The quantitative comparisons are shown in the following table, followed by example frames with detection and tracking results on these four datasets.


Dataset       Method            MOTP   MOTA   Prec   Rec
Town Center   Benfold & Reid    80.3   61.3   82     79
              Yamaguchi         70.9   63.3   71.1   64
              Pellegrini        70.7   63.4   70.8   64.1
              Zhang             71.5   65.7   71.5   66.1
              Leal-Taixe        71.5   67.3   71.6   67.6
              Our baseline      68.8   63.5   84.9   78.9
              Proposed          71.6   75.7   93.6   81.8
PETS 2009     Breitenstein     59     74     89     60
              Berclaz           62     78     78     62
              Conte             57     81     85     58
              Berclaz           52     83     82     53
              Alahi             52     83     69     53
              Our baseline      73.7   84.6   96.8   93.2
              Proposed          76     90.7   96.8   95.2
Parking Lot   Our baseline      72.5   83.5   92.6   95.1
              Proposed          77.5   88.9   93.6   96.5
Airport       Our baseline      67.7   32.7   76.5   54.9
              Proposed          67.9   46.6   89.9   55.4
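
For reference, MOTP and MOTA in these tables are the standard CLEAR MOT metrics. Below is a minimal sketch of their computation from per-frame counts (the ground-truth-to-hypothesis matching is assumed to be done beforehand):

```python
def clear_mot(frames):
    """CLEAR MOT metrics from per-frame bookkeeping. Each frame dict holds
    'fp' (false positives), 'fn' (misses), 'idsw' (identity switches),
    'gt' (ground-truth objects), 'matches' (matched pairs) and 'overlap'
    (summed overlap of the matched pairs)."""
    gt = sum(f['gt'] for f in frames)
    errors = sum(f['fp'] + f['fn'] + f['idsw'] for f in frames)
    matches = sum(f['matches'] for f in frames)
    overlap = sum(f['overlap'] for f in frames)
    mota = 1.0 - errors / gt      # accuracy: folds in misses, FPs, ID switches
    motp = overlap / matches      # precision: mean localization quality
    return mota, motp
```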


Moreover, we explicitly ground-truthed human parts and quantified our part tracking results as follows.


Part                          Method           MOTP   MOTA   Prec   Rec
Part 1 (Head)                 HOG detection    45.8   –      35     52.7
                              Benfold & Reid   50.8   45.4   73.8   71
                              Our tracking     55.4   62.1   87.5   73.1
Part 2 (Left shoulder)        HOG detection    42.5   –      29.8   43.6
                              Our tracking     56.8   44.2   76.9   64.2
Part 3 (Right shoulder)       HOG detection    42.1   –      37.3   54.6
                              Our tracking     61.2   59.9   86.2   72.2
Part 4 (Left bottom torso)    HOG detection    35.5   –      8      11.7
                              Our tracking     5.2    56.6   84.6   70
Part 5 (Right bottom torso)   HOG detection    26.2   –      6.4    9.4
                              Our tracking     52.7   56.2   84.3   69.7
Part 6 (Upper legs)           HOG detection    34.9   –      13.7   20
                              Our tracking     49.1   50.8   82.6   65.2
Part 7 (Left lower leg)       HOG detection    31.7   –      31.9   46.7
                              Our tracking     59.4   47.6   80.3   63.9
Part 8 (Right lower leg)      HOG detection    26.5   –      28.4   41.6
                              Our tracking     56     37.5   74.8   57.5
(– : not reported)


Related Publication

Hamid Izadinia, Imran Saleemi, Wenhui Li and Mubarak Shah, “(MP)2T: Multiple People Multiple Parts Tracker”, European Conference on Computer Vision 2012, Florence, Italy, October 7-13, 2012. [PDF][PPT(203MB)][BibTex]

Part Annotation

The following files provide the part annotations for the Town Center dataset:
ReadMe file (1KB): ReadMe file
Parts annotations (547KB): Parts annotations
Demo of parts annotations (12.4MB): Video showing overlaid annotated parts

YouTube Video Presentation