Predicting the Where and What of actors and actions through Online Action Localization

Figure 1: This figure illustrates the problem we address in this paper. The top row shows the case when we have an entire video to detect and recognize actions, i.e., offline action localization. The bottom row is an example of online action localization, which involves predicting the action class (e.g. Kicking) as well as the location of the actor in every frame, as the video is streamed.
This paper proposes a novel approach to the challenging problem of ‘online action localization’, which entails predicting actions and their locations as they happen in a video. Typically, action localization or recognition is performed in an offline manner, where all the frames in the video are processed together and action labels are not predicted for the future. This prevents timely localization of actions, an important consideration for surveillance tasks.
In our approach, given a batch of frames from the immediate past in a video, we estimate poses and over-segment the current frame into superpixels. Next, we discriminatively train an actor foreground model on the superpixels using the pose bounding boxes. A Conditional Random Field (CRF) with superpixels as nodes, and edges connecting spatio-temporal neighbors, is used to obtain action segments. The action confidence is predicted using dynamic programming on SVM scores obtained on short video segments, thereby capturing the sequential information of actions. Visual drift is handled by updating the appearance model and refining poses in an online manner. Lastly, we introduce a new measure to quantify the performance of action prediction (i.e., online action localization), which analyzes how prediction accuracy varies as a function of the observed portion of the video. Our experiments suggest that, despite using only a few frames to localize actions at each time instant, we are able to predict the action and obtain results competitive with state-of-the-art offline methods.
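As a rough illustration of the prediction measure mentioned above, the sketch below computes label-prediction accuracy over a test set as a function of the observed fraction of each video. The accessors `v.num_frames`, `v.predicted_label(t)`, `v.true_label` and the choice of ten evaluation points are hypothetical placeholders, not the paper's actual protocol.

```python
import numpy as np

def prediction_accuracy_curve(videos, num_points=10):
    """Accuracy of the predicted action label as a function of the observed
    fraction of each test video. `v.num_frames`, `v.predicted_label(t)` and
    `v.true_label` are assumed accessors, not part of the original code."""
    fractions = np.linspace(0.1, 1.0, num_points)
    accuracy = np.zeros(num_points)
    for i, f in enumerate(fractions):
        correct = 0
        for v in videos:
            t = max(1, int(round(f * v.num_frames)))  # number of frames seen so far
            if v.predicted_label(t) == v.true_label:  # label predicted after t frames
                correct += 1
        accuracy[i] = correct / len(videos)
    return fractions, accuracy
```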
Predicting what and where an action will occur is an important and challenging computer vision problem for automatic video analysis. In many applications associated with monitoring and security, it is crucial to detect and localize actions in a timely fashion. A particular example is the detection and localization of undesirable or malicious actions. There have been recent efforts to predict activities through early recognition. These methods only attempt to predict the label of the action, i.e., the what of an action, without any localization. Thus, the important question of where an action is being performed cannot be answered easily.
Existing action localization methods classify and localize actions after completely observing an entire video sequence (top row in Fig. 1). The goal is to localize an action by finding the volume that encompasses an entire action. Some approaches are based on sliding windows, while others segment the video into supervoxels, which are merged into action proposals. The action proposals from either approach are then labeled using a classifier. Essentially, an action segment is classified after it has been localized. Since offline methods have the whole video at their disposal, they can take advantage of observing the entire motion of action instances. In this paper, we address the problem of Online Action Localization, which aims at localizing an action and predicting its class label in a streaming video (see bottom row in Fig. 1). Online action localization involves the use of limited motion information in partially observed videos for frame-by-frame action localization and label prediction.
Our approach (Fig. 2) begins by segmenting the testing video frames into superpixels and detecting pose hypotheses within each frame. The features computed for each superpixel are used to learn a superpixel-based appearance model, which distinguishes the foreground from the background. Simultaneously, the conditional probability of pose hypotheses at the current time step (frame) is computed using pose confidences and consistency with poses in previous frames. The superpixel and pose-based foreground probability is used to infer the action location at each frame through a Conditional Random Field (CRF). The action label is predicted within the localized action bounding box through dynamic programming using scores from Support Vector Machines (SVMs) on short video clips. These SVMs are trained on temporal segments of the training videos. After localizing the action at each time step (frame), we refine poses in a batch of a few frames by imposing spatio-temporal consistency. Similarly, the appearance model is updated to avoid visual drift. This process is repeated for every frame in an online manner, yielding action localization and prediction at every frame.
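The label-prediction step can be pictured as follows: each training video is divided into K temporal segments, an SVM is trained per segment and per class, and at test time dynamic programming accumulates the per-clip SVM scores while keeping the segment index non-decreasing over time. The sketch below is a minimal illustration under these assumptions; the exact recursion and score normalization used in the paper may differ.

```python
import numpy as np

def dp_class_score(seg_scores):
    """seg_scores: (K, T) array with the score of each of the K temporal
    segment SVMs of one class on each of the T observed clips. Clips are
    assigned to segments in temporal order (the segment index never
    decreases), and the best accumulated score is returned."""
    K, T = seg_scores.shape
    dp = np.full((K, T), -np.inf)
    dp[0, 0] = seg_scores[0, 0]                               # first clip starts at segment 0
    for t in range(1, T):
        for k in range(K):
            stay = dp[k, t - 1]                               # remain in segment k
            advance = dp[k - 1, t - 1] if k > 0 else -np.inf  # advance to segment k
            dp[k, t] = max(stay, advance) + seg_scores[k, t]
    return dp[:, -1].max()

def predict_action(scores_per_class):
    """scores_per_class: dict mapping class name -> (K, T) score matrix;
    the class with the highest accumulated DP score is predicted."""
    return max(scores_per_class, key=lambda c: dp_class_score(scores_per_class[c]))
```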

Figure 2: This figure shows the framework of the approach proposed in this paper. (a) Given an input video, (b) we over-segment each frame into superpixels and detect poses using an off-the-shelf method. (c) An appearance model is learned using all the superpixels inside a pose bounding box as positive samples and those outside as negative samples. (d) In a new frame, the appearance model is applied on each superpixel of the frame to obtain a foreground likelihood. (e) To handle the issue of visual drift, poses are refined using spatio-temporal smoothness constraints on motion and appearance. (f) Finally, a CRF is used to obtain local action proposals, which are then utilized to predict the action through dynamic programming on SVM scores.
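Steps (c) and (d) of the figure amount to a binary foreground/background classifier over superpixels. The sketch below uses logistic regression over precomputed superpixel features and treats a superpixel as positive when its centroid lies inside a pose bounding box; the feature representation, classifier choice, and centroid-in-box test are simplifying assumptions rather than the paper's exact formulation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_appearance_model(sp_features, sp_centers, pose_boxes):
    """Train a superpixel-level foreground model (step (c), as a sketch).
    sp_features: (N, D) appearance features of N superpixels (e.g. color histograms);
    sp_centers:  (N, 2) superpixel centroids (x, y);
    pose_boxes:  list of (x1, y1, x2, y2) pose bounding boxes.
    Superpixels inside any pose box are positives, the rest are negatives."""
    labels = np.zeros(len(sp_features), dtype=int)
    for (x1, y1, x2, y2) in pose_boxes:
        inside = ((sp_centers[:, 0] >= x1) & (sp_centers[:, 0] <= x2) &
                  (sp_centers[:, 1] >= y1) & (sp_centers[:, 1] <= y2))
        labels[inside] = 1
    model = LogisticRegression(max_iter=1000)
    model.fit(sp_features, labels)
    return model

def foreground_likelihood(model, sp_features):
    """Apply the learned model to the superpixels of a new frame (step (d))."""
    return model.predict_proba(sp_features)[:, 1]
```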
We evaluate our online action localization approach on two challenging datasets: 1) JHMDB and 2) UCF Sports. The qualitative and quantitative results are shown below:

Figure 3: This figure shows qualitative results of the proposed approach, where each action segment is shown with a yellow contour and the ground truth with a green bounding box. Results in the top three rows are from the JHMDB dataset, and the bottom three rows are from the UCF Sports dataset.

Figure 4: This figure shows action prediction and localization performance as a function of observed video percentage. (a) shows prediction accuracy for JHMDB and UCF Sports datasets; (b) and (c) show localization accuracy for JHMDB and UCF Sports, respectively. Different curves show evaluations at different overlap thresholds: 10% (red), 30% (green) and 60% (pink).
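For the localization curves in (b) and (c), a frame's localization is typically counted as correct when the predicted box overlaps the ground truth by at least the given threshold. Below is a minimal sketch of such an evaluation, assuming per-video predicted and ground-truth boxes are available at each observed fraction; it illustrates the protocol and is not the paper's evaluation code.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter + 1e-9)

def localization_accuracy(pred_boxes, gt_boxes, fractions, threshold=0.3):
    """Fraction of videos whose predicted box overlaps the ground truth by at
    least `threshold`, evaluated at each observed-video fraction.
    pred_boxes[v][f], gt_boxes[v][f]: boxes for video v at observed fraction f."""
    acc = []
    for f in range(len(fractions)):
        hits = [iou(pred_boxes[v][f], gt_boxes[v][f]) >= threshold
                for v in range(len(pred_boxes))]
        acc.append(np.mean(hits))
    return acc
```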

Figure 5: This figure shows per-action prediction accuracy as a function of observed video percentage for (a) JHMDB and (b) UCF Sports datasets.

Figure 6: This figure shows localization results of the proposed method along with existing methods on the JHMDB and UCF Sports datasets. (a) shows AUC curves for JHMDB, while (b) and (c) show AUC and ROC @ 20%, respectively, for the UCF Sports dataset. The curves for the proposed method are shown in red, with the baseline for online localization in gray.
[PDF] [PowerPoint]
[Download]
Khurram Soomro, Haroon Idrees and Mubarak Shah, Predicting the Where and What of actors and actions through Online Action Localization, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.