Action Localization in Videos through Context Walk
Figure 1: Illustration of the idea: We over-segment videos into supervoxels and learn spatio-temporal relations (yellow arrows) from all the supervoxels to those that lie within the ground truth (green) bounding boxes. In this figure, the arrows show such relations from some of the supervoxels across the video to only one set of supervoxels at a particular temporal location.
This paper presents an efficient approach for localizing actions by learning contextual relations, in the form of relative locations between different video regions. We begin by over-segmenting the videos into supervoxels, which preserve action boundaries and reduce the complexity of the problem. Context relations, learned during training, capture displacements from all the supervoxels in a video to those belonging to foreground actions. Then, given a testing video, we select a supervoxel randomly and use the context information acquired during training to estimate the probability of each supervoxel belonging to the foreground action. The walk proceeds to a new supervoxel and the process is repeated for a few steps. This “context walk” generates a conditional distribution of an action over all the supervoxels. A Conditional Random Field is then used to find action proposals in the video, whose confidences are obtained using SVMs. We validate the proposed approach on several datasets and show that context in the form of relative displacements between supervoxels can be extremely useful for action localization. This also results in significantly fewer evaluations of the classifier, in sharp contrast to the alternative sliding-window approaches.
Many existing approaches learn an action detector on trimmed training videos and then exhaustively search for each action through the testing videos. However, with realistic videos having longer durations and higher resolutions, it becomes impractical to use a sliding-window approach to look for actions or interesting events. Analyzing the videos of datasets used for evaluation of action localization, such as UCF-Sports, JHMDB, and THUMOS, reveals that, on average, the volume occupied by an action (in pixels) is small compared to the spatio-temporal volume of the entire video (around 17%, using ground truth). Therefore, it is important that action localization be performed with efficient techniques that classify and localize actions without evaluating all possible combinations of spatio-temporal volumes.
For the proposed approach, we over-segment the videos into supervoxels and use context as a spatial relation between supervoxels relative to foreground actions. The relations are modeled using three-dimensional displacement vectors which capture the intra-action (foreground-foreground) and action-to-scene (background-foreground) dependencies (see Figure 1). These contextual relations are represented by a graph for each video, where supervoxels form the nodes and directed edges capture the spatial relations between them (see Figure 2). During testing, we perform a context walk where each step is guided by the context relations learned during training, resulting in a probability distribution of an action over all the supervoxels.
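The context graph described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each supervoxel is summarized by an (x, y, t) centre, and the function name `context_edges`, the tuple layout of an edge, and the choice to skip self-loops are all hypothetical.

```python
def context_edges(centroids, is_foreground):
    """Build directed context edges for one training video.

    centroids: list of (x, y, t) supervoxel centres (assumed representation).
    is_foreground: list of bools, True if the supervoxel lies inside the
        ground truth.
    Returns a list of (source, target, displacement) tuples.
    """
    edges = []
    for i, (xi, yi, ti) in enumerate(centroids):
        for j, (xj, yj, tj) in enumerate(centroids):
            # Edges only point at foreground supervoxels. Skipping self-loops
            # is an assumption of this sketch, not something the text states.
            if is_foreground[j] and i != j:
                edges.append((i, j, (xj - xi, yj - yi, tj - ti)))
    return edges


# One background supervoxel (index 0) and two foreground ones (1 and 2).
edges = context_edges([(0, 0, 0), (2, 0, 0), (2, 2, 1)], [False, True, True])
```

Here the edge from supervoxel 0 is a background-foreground relation, while the edges between supervoxels 1 and 2 are foreground-foreground relations.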
The proposed approach for action localization begins by over-segmenting the training videos into supervoxels and computing the local features in the videos. For each training video, a graph is constructed that captures relations from all the supervoxels to those belonging to the action foreground (ground truth) (see Figure 2). Then, given a testing video, we initialize the context walk with a randomly selected supervoxel and find its nearest neighbors using appearance and motion features. The displacement relations from training supervoxels are then used to predict the location of an action in the testing video. This gives, for each supervoxel in the video, a conditional probability of belonging to the action. By selecting the supervoxel with the highest probability, we again make predictions about the location of the action and update the distribution. This context walk is executed for several steps and is followed by inferring the action proposals through a Conditional Random Field (CRF). The confidences for the localized action segments (proposals) are then obtained from a Support Vector Machine (SVM) trained on the labeled training videos (see Figure 3).
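The walk itself can be sketched as follows. This is a toy version under stated assumptions, not the method as published: nearest neighbors are found by brute force rather than a kd-tree, votes use an isotropic Gaussian with a hypothetical bandwidth `sigma`, already-visited supervoxels are not revisited, and the names `nearest` and `context_walk` are illustrative. The CRF and SVM stages are omitted.

```python
import math
import random


def nearest(feature, train_features, k):
    """Indices of the k training supervoxels closest to `feature` (brute force)."""
    order = sorted(range(len(train_features)),
                   key=lambda n: math.dist(feature, train_features[n]))
    return order[:k]


def context_walk(centroids, features, train_features, train_disps,
                 steps=5, k=3, sigma=1.0, seed=0):
    """Return a normalized per-supervoxel foreground score after the walk.

    centroids / features: (x, y, t) centre and descriptor of each test supervoxel.
    train_features[n] / train_disps[n]: descriptor of a training supervoxel and
        one learned displacement from it to a foreground supervoxel.
    """
    rng = random.Random(seed)
    scores = [0.0] * len(centroids)
    visited = set()
    current = rng.randrange(len(centroids))  # random starting supervoxel
    for _ in range(steps):
        visited.add(current)
        for n in nearest(features[current], train_features, k):
            # Project the learned displacement from the current supervoxel.
            predicted = tuple(c + d for c, d in zip(centroids[current], train_disps[n]))
            # Gaussian vote: supervoxels near the predicted location gain score.
            for v, centre in enumerate(centroids):
                scores[v] += math.exp(-math.dist(centre, predicted) ** 2 / (2 * sigma ** 2))
        unvisited = [v for v in range(len(centroids)) if v not in visited]
        if not unvisited:
            break
        # Move to the most confident supervoxel not yet visited (an assumption).
        current = max(unvisited, key=scores.__getitem__)
    total = sum(scores) or 1.0
    return [s / total for s in scores]


# Toy video: three supervoxels on a line; training data points everything at x = 10.
probs = context_walk(
    centroids=[(0, 0, 0), (5, 0, 0), (10, 0, 0)],
    features=[(0.0,), (1.0,), (2.0,)],
    train_features=[(0.0,), (1.0,), (2.0,)],
    train_disps=[(10, 0, 0), (5, 0, 0), (0, 0, 0)],
    steps=3, k=1,
)
```

In this toy setup every step votes for the supervoxel at x = 10, so it ends up with nearly all of the probability mass regardless of where the walk starts.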
Figure 2: This figure illustrates the idea of using context in the form of spatio-temporal displacements between supervoxels. (a) Given training videos for an action c which have been over-segmented into supervoxels, we construct a context graph for each video as shown in (b). Each graph has edges emanating from all the supervoxels to those that belong to the foreground action (circumscribed with dashed green contours). The color of each node in (b) is the same as that of the corresponding supervoxel in (a). Finally, a composite graph (Ξ) is constructed from all the context graphs and implemented efficiently using a kd-tree.
Figure 3: This figure depicts the testing procedure of the proposed approach. (a) Given a testing video, we perform supervoxel (SV) segmentation. (b) A graph G is constructed using the supervoxels as nodes. (c) We find the nearest neighbors of the selected supervoxel (vτ; initially selected randomly) in the composite graph Ξ, which returns the displacement vectors learned during training. The displacement vectors are projected in the testing video as shown with yellow arrows. (d) We update the foreground/action confidences of all supervoxels using all the NNs and their displacement vectors. (e) The supervoxel with the highest confidence is selected as vτ+1. (f) The walk is repeated for T steps. (g) Finally, a CRF gives action proposals whose action confidences are computed using an SVM.
We evaluate the proposed approach on three challenging action localization datasets: UCF-Sports, sub-JHMDB and THUMOS’13. The qualitative and quantitative results can be seen below:
Figure 4: This figure shows qualitative results of the proposed approach (yellow contours) against ground truth (green boxes) on selected frames of testing videos. The first two rows are from UCF-Sports, the third and fourth are from sub-JHMDB, and the fifth and sixth are from THUMOS’13. The last row shows two failure cases from sub-JHMDB.
Figure 5: The ROC and AUC curves on the UCF-Sports dataset are shown in (a) and (b), respectively. (c) shows the AUC for the THUMOS’13 dataset, for which we are the first to report results.
Figure 6: The ROC and AUC curves for the sub-JHMDB dataset are shown in (a) and (b), respectively.
Khurram Soomro, Haroon Idrees and Mubarak Shah, Predicting the Where and What of actors and actions through Online Action Localization, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.