Unsupervised Action Discovery and Localization in Videos
Figure 1. We tackle the problem of Unsupervised Action Localization without any action class labels or bounding box annotations, where a given collection of unlabeled videos contains multiple action classes. First, the proposed method discovers action classes by discriminative clustering using dominant sets (e.g. green and purple contours show clusters for kicking and diving actions, respectively) and then applies a variant of the knapsack problem to determine spatio-temporal annotations of the discovered actions (yellow bounding boxes). These annotations and action classes are then used together to train an action classifier and perform Unsupervised Action Localization.
This paper is the first to address the problem of unsupervised action localization in videos. Given unlabeled data without bounding box annotations, we propose a novel approach that: 1) discovers action class labels and 2) spatio-temporally localizes actions in videos. It begins by computing local video features and applying spectral clustering to a set of unlabeled training videos. For each cluster of videos, an undirected graph is constructed to extract a dominant set, which is known for high internal homogeneity and for inhomogeneity with respect to the vertices outside it. Next, a discriminative clustering approach is applied: a classifier is trained for each cluster and used to iteratively select videos from the non-dominant set, yielding complete video action classes. Once classes are discovered, training videos within each cluster are selected for automatic spatio-temporal annotation, by first over-segmenting the videos in each discovered class into supervoxels and then constructing a directed graph on which a variant of the knapsack problem with temporal constraints is solved. Knapsack optimization jointly collects a subset of supervoxels, enforcing the annotated action to be spatio-temporally connected and its volume to be the size of an actor. These annotations are used to train SVM action classifiers. During testing, actions are localized using a similar knapsack approach, where supervoxels are grouped together and an SVM, learned using videos from the discovered action classes, is used to recognize these actions. We evaluate our approach on the UCF-Sports, Sub-JHMDB, JHMDB, THUMOS13 and UCF101 datasets. Our experiments suggest that, despite using no action class labels and no bounding box annotations, we obtain results competitive with state-of-the-art supervised methods.
The problem of action recognition is to classify a video by assigning a label from a given set of annotated action classes, whereas in action localization the spatio-temporal extent of an action is detected and the action is also recognized. Existing action recognition and localization approaches rely heavily on strong supervision, in the form of training videos that have been manually collected, labeled and annotated. These approaches learn to detect an action using manually annotated bounding boxes and to recognize it using action class labels from the training data. Since supervised methods have the spatio-temporally annotated ground truth at their disposal, they can take advantage of learning detectors and classifiers by fine-tuning over the training data.
However, supervised algorithms have some disadvantages compared to unsupervised approaches, due to the difficulty of video annotation. First, a video may contain several actions against a complex, cluttered background. Second, video-level annotation in a supervised setting involves manually labeling the location (bounding box), the class and the temporal boundaries of each action, which is quite time consuming. Third, actions vary spatio-temporally (i.e. in height, width, spatial location and temporal length), resulting in various tubelet deformations. Fourth, different people may have a different understanding of the temporal extent of an action, which results in unwanted biases and errors. Collecting large amounts of accurately annotated action videos for developing a supervised action localization approach is very expensive, considering the growth of video datasets with a large number of action classes. In contrast, training an unsupervised system requires neither action class labels nor bounding box annotations. Given the abundance of unlabeled videos available on the Internet, unsupervised learning approaches provide a promising direction.
In our proposed approach, we first aim to discover action classes from a set of unlabeled videos. We start by computing local feature similarity between videos and applying spectral clustering. Then, within each cluster, we construct an undirected graph to extract a dominant set. This subset is used to train a Support Vector Machine (SVM) classifier for each cluster, which discriminatively selects videos from the non-dominant set and assigns each of them to one of the clusters in an iterative manner (see Alg. 1).
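To make the dominant-set step concrete, here is a minimal sketch using replicator dynamics, a standard technique for extracting a dominant set from a pairwise affinity matrix. The text above does not state which solver is used, so this choice is an assumption; the function name `dominant_set`, the `cutoff` parameter and the toy affinity matrix are hypothetical and only illustrate the idea.

```python
def dominant_set(A, iters=1000, tol=1e-8, cutoff=1e-4):
    """Return indices of a dominant set of the graph with affinity matrix A.

    A is a symmetric list-of-lists of non-negative similarities with a zero
    diagonal. Replicator dynamics is an assumed solver, not necessarily the
    one used in the paper.
    """
    n = len(A)
    x = [1.0 / n] * n  # start from the barycenter of the simplex
    for _ in range(iters):
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        xAx = sum(x[i] * Ax[i] for i in range(n))
        if xAx == 0:
            break  # no edges with positive weight: nothing to extract
        new_x = [x[i] * Ax[i] / xAx for i in range(n)]
        delta = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if delta < tol:
            break
    # Vertices that keep non-negligible mass form the dominant set.
    return [i for i in range(n) if x[i] > cutoff]


# Toy example: videos 0-2 are mutually similar, video 3 is an outlier.
A = [
    [0.0, 0.9, 0.9, 0.1],
    [0.9, 0.0, 0.9, 0.1],
    [0.9, 0.9, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]
print(dominant_set(A))  # [0, 1, 2]
```

In this toy setting the outlier video would remain in the non-dominant set and would be assigned to a cluster later by the per-cluster classifiers.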
Given the action classes discovered by our discriminative clustering approach, our aim is to annotate the action within each training video in every cluster. We begin by oversegmenting a video into supervoxels, where every supervoxel belongs either to the foreground action or to the background. Our goal is to select a group of supervoxels that collectively represent an action. We achieve this goal by solving the 0-1 Knapsack problem: given a set of items (supervoxels), each with a weight (volume of a supervoxel) and a value (score of a supervoxel belonging to an action), determine the subset of items to include in a collection so that the total weight does not exceed a given limit and the total value is as high as possible. Solved as is, this combinatorial optimization problem would select supervoxels based only on their individual scores, resulting in a degenerate solution where the selected supervoxels are not spatio-temporally connected throughout the video. Therefore, we propose a variant of the knapsack problem with temporal constraints that enforces the annotated action to be well-connected, while the weight limit ensures that the detected volume is the size of an actor in the video. Since the solution to the knapsack problem yields a single action annotation, we solve this problem iteratively to generate multiple annotations that satisfy the given constraints (see Fig. 2).
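As an illustration of the underlying optimization, the following sketch solves the plain 0-1 knapsack problem by dynamic programming. It deliberately omits the temporal connectivity constraints of the proposed variant, so it shows only the baseline that would produce the degenerate, possibly disconnected selection described above. The function name `knapsack_01`, the integer volumes and the scores are hypothetical.

```python
def knapsack_01(weights, values, capacity):
    """Plain 0-1 knapsack via dynamic programming (no temporal constraints).

    weights: integer supervoxel volumes; values: foreground scores;
    capacity: integer volume budget (e.g. the expected actor volume).
    Returns (indices of the chosen items, total value).
    """
    n = len(weights)
    # best[i][c] = best value using the first i items within capacity c
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, v = weights[i - 1], values[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # skip item i-1
            if w <= c and best[i - 1][c - w] + v > best[i][c]:
                best[i][c] = best[i - 1][c - w] + v  # take item i-1
    # Backtrack to recover which items were taken.
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= weights[i - 1]
    chosen.reverse()
    return chosen, best[n][capacity]


# Toy example: four supervoxels with volumes and foreground scores.
volumes = [4, 3, 2, 5]
scores = [0.9, 0.8, 0.2, 0.7]
chosen, total = knapsack_01(volumes, scores, capacity=7)
print(chosen)  # [0, 1]
```

Nothing in this baseline prevents the chosen supervoxels from lying in unrelated parts of the video, which is the motivation for adding temporal constraints on a supervoxel graph.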
Figure 2. This figure shows the proposed knapsack approach in this paper. (a) Given an input video we extract supervoxel (SV) segmentation. (b) Each supervoxel is assigned a weight (spatio-temporal volume) and a value (score of belonging to the foreground action). (c) A graph Gn is constructed using supervoxels as nodes. (d) Temporal constraints are defined for the graph to ensure contiguous selection of supervoxels from start (σ) to end (τ) of an action. (e) Knapsack optimization is applied to select a subset of supervoxels having maximum value, constrained by total weight (volume of the action) and temporal connectedness. (f) The knapsack process is repeated for more action annotations. (g) Annotations represented by action contours.
We evaluate our Unsupervised Action Discovery and Localization approach on five challenging datasets: 1) UCF Sports, 2) JHMDB, 3) Sub-JHMDB, 4) THUMOS13, and 5) UCF101. The qualitative and quantitative results are shown below:
Table 1. This table shows action discovery results using C3D on training videos of: 1) UCF Sports, 2) Sub-JHMDB, 3) JHMDB, 4) THUMOS13, and 5) UCF101. We also show a comparison of C3D and iDTF features on UCF Sports.
Figure 3. This figure shows qualitative results for the proposed approach on the UCF Sports, Sub-JHMDB, JHMDB, and THUMOS13 datasets (top four rows). The last row shows a failure case from the JHMDB dataset. The action localization is shown by a yellow contour and the ground truth bounding box in green.
Figure 6. This figure shows the AUC of the proposed Unsupervised Action Localization approach, along with existing supervised methods, on (a) UCF Sports, (b) JHMDB, (c) Sub-JHMDB and (d) THUMOS13. The curve for the [P]roposed method is shown in red and the supervised [B]aseline in black, while other supervised localization methods, including [L]an et al., [T]ian et al., [W]ang et al., [G]kioxari and Malik, [J]ain et al. and [S]oomro et al., are presented in different colors. For UCF Sports we also report our proposed ([P]-i) localization approach, obtained by learning a classifier on action discovery using iDTF features.
Khurram Soomro and Mubarak Shah, Unsupervised Action Discovery and Localization in Videos, IEEE International Conference on Computer Vision (ICCV), 2017.