Re-identification of Humans in Crowds using Personal, Social and Environmental Constraints
This paper addresses the problem of human re-identification across non-overlapping cameras in crowds. Re-identification in crowded scenes is challenging because of the large number of people and frequent occlusions, coupled with changes in appearance caused by the different properties and exposure of the cameras. To solve this problem, we model multiple Personal, Social and Environmental (PSE) constraints on human motion across cameras. The personal constraints include the appearance and preferred speed of each individual, which are assumed to be similar across the non-overlapping cameras. The social influences (constraints) are quadratic in nature, i.e. they occur between pairs of individuals, and are modeled through grouping and collision avoidance. Finally, the environmental constraints capture the transition probabilities between gates (entrances / exits) in different cameras, defined as multi-modal distributions of transition time and destination between all pairs of gates. We incorporate these constraints into an energy minimization framework for solving human re-identification. Assigning 1-1 correspondence while modeling PSE constraints is NP-hard. We present a stochastic local search algorithm that restricts the search space of hypotheses and obtains a 1-1 solution in the presence of linear and quadratic PSE constraints. Moreover, we present an alternate optimization using the Frank-Wolfe algorithm, which solves a convex approximation of the objective function with a linear relaxation of the binary variables and yields an order-of-magnitude speedup over stochastic local search with a minor drop in performance. We evaluate our approach using Cumulative Matching Curves as well as 1-1 assignment on several thousand frames of the Grand Central, PRID and DukeMTMC datasets, and obtain significantly better results than existing re-identification methods.
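To make the structure of the objective concrete, the following is a minimal sketch of an energy with one linear (personal) term and one quadratic (social) term over a 1-1 assignment. It is an illustration under stated assumptions, not the authors' implementation: the toy data, the cost functions `linear_cost` and `pair_cost`, the `weight` parameter, and the simple swap-based search are all hypothetical. The exhaustive search enumerates every permutation, which hints at why the exact problem becomes intractable as the number of people grows.

```python
import random
from itertools import permutations

# Hypothetical toy data (not from the paper): each entry is
# (appearance feature, exit time in camera A or entry time in camera B, seconds).
cam_a = [(0.20, 0.0), (0.25, 1.0), (0.80, 5.0)]
cam_b = [(0.78, 35.0), (0.24, 30.0), (0.22, 31.0)]
groups_a = [(0, 1)]  # assumed: persons 0 and 1 walk as a group in camera A


def linear_cost(i, a):
    """Personal (linear) term: appearance difference for matching i -> a."""
    return abs(cam_a[i][0] - cam_b[a][0])


def pair_cost(i, a, j, b):
    """Social (quadratic) term: grouped people should keep their time gap."""
    gap_a = cam_a[j][1] - cam_a[i][1]
    gap_b = cam_b[b][1] - cam_b[a][1]
    return abs(gap_a - gap_b)


def energy(assign, weight=0.1):
    """Total energy of a 1-1 assignment; assign[i] is the match of person i."""
    lin = sum(linear_cost(i, a) for i, a in enumerate(assign))
    quad = sum(pair_cost(i, assign[i], j, assign[j]) for i, j in groups_a)
    return lin + weight * quad


def exhaustive_search(weight=0.1):
    """Exact minimum by trying every permutation (n! candidates)."""
    return min(permutations(range(len(cam_a))), key=lambda p: energy(p, weight))


def swap_local_search(steps=200, seed=0, weight=0.1):
    """Generic stochastic local search: keep random swaps that lower the energy."""
    rng = random.Random(seed)
    assign = list(range(len(cam_a)))
    for _ in range(steps):
        i, j = rng.sample(range(len(assign)), 2)
        candidate = assign[:]
        candidate[i], candidate[j] = candidate[j], candidate[i]
        if energy(candidate, weight) < energy(assign, weight):
            assign = candidate
    return tuple(assign)


appearance_only = exhaustive_search(weight=0.0)
with_social = exhaustive_search(weight=0.1)
local = swap_local_search()
print(appearance_only, with_social, local)
```

In this toy setup, appearance alone confuses the two similar-looking group members, while adding the pairwise grouping term favors the assignment that preserves their time gap. Every swap keeps the assignment a valid permutation, so the search never leaves the space of 1-1 solutions.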
Traditionally, re-identification has been primarily concerned with matching static snapshots of people from multiple cameras. Although a few works have modeled social effects such as grouping behavior for re-identification, they mostly deal with static images. In this paper, we study the use of time and video information for this task, and propose to consider the dynamic spatio-temporal context of individuals and the environment to improve the performance of human re-identification. The primary contribution of our work is to explicitly address the influence of personal goals, neighboring people and the environment on human re-identification through high-order relationships. We complement appearance, typically employed for re-identification, with multiple personal, social and environmental (PSE) constraints. This figure compares the results of using only visual information with those of using the PSE constraints. Due to the
Transition time and destination improvement:
PSE Constraints effects:
Grand Central Dataset:
Grand Central is a dense crowd dataset that is particularly challenging for the task of human re-identification. The dataset contains 120,000 frames with a resolution of 1920×1080 pixels. Recently, Yi et al. used a portion of the dataset for detecting stationary crowd groups. They released annotations for the trajectories of 12,684 individuals over 6,000 frames at 1.5 fps. We rectified the perspective distortion from the camera and placed bounding boxes at the correct scales using the trajectories provided by Yi et al. However, the locations of the annotated points were not consistent for any single person, or across different people. Consequently, we manually adjusted the bounding boxes for 1,500 frames at 1.5 fps, resulting in ground truth for 17 minutes of video data. We divide the scene into three horizontal sections, where two of them become separate cameras and the middle section is treated as an invisible or unobserved region. The locations of people in each camera are in independent coordinate systems. Dividing the scene in this way is meaningful, as the two cameras have different illumination due to external lighting effects, and the size of individuals differs due to perspective effects. Furthermore, due to the wide field of view in the scene, there are multiple entrances and exits in each camera, so that a person exiting the first camera at a particular location has the choice of entering the second camera from multiple different locations.
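The split into two virtual cameras can be sketched as follows. This is a hypothetical illustration only: it assumes the three sections are bands stacked along the vertical image axis, that the band boundaries fall at thirds of the 1080-pixel frame height, and that a trajectory is a list of `(frame, x, y)` points. The actual boundaries and coordinate handling used for the annotations are not stated in the text.

```python
FRAME_HEIGHT = 1080  # Grand Central frames are 1920x1080 pixels

# Assumed band boundaries (thirds of the frame); not taken from the paper.
TOP_END = FRAME_HEIGHT // 3           # camera 1 covers y in [0, 360)
BOTTOM_START = 2 * FRAME_HEIGHT // 3  # camera 2 covers y in [720, 1080)


def split_trajectory(points, top_end=TOP_END, bottom_start=BOTTOM_START):
    """Split one trajectory into per-camera observations.

    Points in the middle band are dropped (unobserved region). Camera 2
    points are shifted so that each camera has its own coordinate system.
    """
    cam1, cam2 = [], []
    for frame, x, y in points:
        if y < top_end:
            cam1.append((frame, x, y))
        elif y >= bottom_start:
            cam2.append((frame, x, y - bottom_start))
    return cam1, cam2


# A person walking from the top band, through the gap, into the bottom band.
track = [(0, 100, 100), (1, 110, 400), (2, 120, 800)]
cam1_obs, cam2_obs = split_trajectory(track)
print(cam1_obs, cam2_obs)
```

Dropping the middle band is what creates the re-identification problem: the person disappears from one camera and reappears in the other after an unobserved transition.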
Recently, the DukeMTMC dataset was released to quantify and evaluate the performance of multi-target, multi-camera tracking systems. It is a high-resolution 1080p, 60 fps dataset that includes surveillance footage from 8 cameras, with approximately 85 minutes of video per camera. There are cameras with both overlapping and non-overlapping fields of view. The dataset is of low density, with 0 to 54 people per frame. Since only the ground truth for the training set has been released so far, which constitutes the first 50 minutes of video for each camera, we report performance on the training set only. Cameras 2 and 5, which are disjoint and have the largest number of people (934 in total, with 311 individuals appearing in both cameras), were selected for the experiments. To remain consistent with the other datasets, we perform evaluation in terms of Cumulative Matching Curves (CMC) and F-Score on 1-1 assignment.
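The two evaluation measures named above can be sketched as below. This is a generic illustration with a made-up distance matrix, not the evaluation code used for the reported results. The function names `cmc` and `f_score` and the input formats are assumptions. Note that when only some people appear in both cameras, as in DukeMTMC, the number of predicted matches and the number of true matches can differ, which is why precision and recall are computed separately.

```python
def cmc(dist, true_match):
    """Cumulative Matching Curve.

    dist[q][g] is the distance between query q and gallery item g;
    true_match[q] is the gallery index of the correct match for query q.
    Returns a list whose k-th entry is the fraction of queries whose
    correct match is among the k+1 closest gallery items.
    """
    n_gallery = len(dist[0])
    hits = [0] * n_gallery
    for q, row in enumerate(dist):
        order = sorted(range(n_gallery), key=lambda g: row[g])
        rank = order.index(true_match[q])
        for k in range(rank, n_gallery):
            hits[k] += 1
    return [h / len(dist) for h in hits]


def f_score(predicted, ground_truth):
    """F-score of a 1-1 assignment given as (query, gallery) index pairs."""
    pred, gt = set(predicted), set(ground_truth)
    correct = len(pred & gt)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gt)
    return 2 * precision * recall / (precision + recall)


# Hypothetical 3x3 distance matrix; query i truly matches gallery i.
dist = [[0.1, 0.5, 0.9],
        [0.4, 0.3, 0.2],
        [0.7, 0.6, 0.8]]
curve = cmc(dist, true_match=[0, 1, 2])

# Query 2 has no true match here, so its predicted pair counts against precision.
score = f_score(predicted=[(0, 0), (1, 2), (2, 1)],
                ground_truth=[(0, 0), (1, 1)])
print(curve, score)
```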
Email me for the source code and Grand Central annotations. s…@gmail.com
Shayan Modiri Assari, Haroon Idrees, and Mubarak Shah, Human Re-identification in Crowd Videos Using Personal, Social and Environmental Constraints, European Conference on Computer Vision, Springer International Publishing, 2016. [Pdf] [BibTeX]
Shayan Modiri Assari, Haroon Idrees, and Mubarak Shah, Re-identification of Humans in Crowds using Personal, Social and Environmental Constraints, arXiv preprint arXiv:1612.02155 [Pdf] [BibTeX]