Discovering Motion Primitives for Unsupervised Grouping and One-shot Learning of Human Actions, Gestures, and Expressions


Given the very large volume of existing research on action recognition, we observe that action representations range from spatiotemporally local to global features. At one end of this spectrum are interest point based representations, where a single descriptor encodes the appearance or motion in a very small x-y-t volume. At the other end, features based on actor silhouettes, contours, and space-time volumes attempt to capture the entire action in a single feature, ignoring the intuitive compositional hierarchy of action primitives. This observation forms the basis of the proposed representation, with the underlying idea that intermediate features (action primitives) should: (a) span x-y-t volumes that are as large as possible but contiguous, with smoothly varying motion, and (b) be flexible enough to allow deformations arising from the articulation of body parts. A byproduct of these properties is that the intermediate representation is conducive to human understanding. In other words, a meaningful action primitive is one that can be illustrated visually and described textually, e.g., ‘left arm moving upwards’ or ‘right leg moving outwards and upwards’.

Figure 1: The proposed probabilistic representation of primitive actions.

This project proposes such a representation based on motion primitives, examples of which are shown in Figure 1. The main goals of the proposed approach are: (1) to enable recognition using very few examples, i.e., one-shot or k-shot learning, and (2) to organize unlabelled data sets meaningfully by unsupervised clustering. The proposed representation is obtained by automatically discovering high-level sub-actions, or motion primitives, through hierarchical clustering of the observed optical flow in the four-dimensional space of spatial location and motion flow.

Motion Primitives Discovery

The goal of the proposed human action representation is twofold: (i) to automatically discover discriminative and meaningful sub-actions (or primitives) within videos of articulated human actions, without assuming priors on their number, type, or scale, and (ii) to learn the parameters of a statistical distribution that best describes the location, shape, and motion flow of these primitives, in addition to their probability of occurrence. We model action primitives by estimating a statistical description of regions of similar optical flow using Gaussian mixture distributions. The details of our framework are described in the following subsections. Figure 2 shows the process flow of the proposed approach for action representation, recognition, and clustering.

Figure 2: The process flow of the proposed approach for action representation, recognition and clustering.

Low Level Feature Computation

We first employ a simple process that computes intensity difference images for consecutive frames and thresholds each difference image to obtain the motion blob, which is then represented as a coarse bounding box. The bounding boxes obtained in successive frames are stacked together to obtain a sequence of cropped image frames for the entire training data set. Lucas-Kanade optical flow is then computed for these centralized training videos. Some of the noise in the optical flow is eliminated by removing flow vectors with magnitude below a small threshold. The resulting optical flow captures articulated motion, as shown in Fig. 3. We propose that an action primitive be described as a Gaussian mixture distribution. The goal of the training (or learning) phase, then, is to estimate the parameters of each such mixture distribution, where the number of primitive actions (motion patterns), as well as the number of components in each pattern’s mixture distribution, are unknown.
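
The frame differencing, bounding box, and flow thresholding steps above can be illustrated with a minimal sketch. This is not the implementation used in the project: frames are assumed to be lists of rows of grayscale intensities, flow samples are assumed to be (x, y, u, v) tuples, and the function names and both threshold values are hypothetical choices made only for this example. The Lucas-Kanade computation itself is not shown.

```python
def motion_bounding_box(prev_frame, next_frame, diff_threshold=20):
    """Return (top, left, bottom, right) of the pixels whose absolute
    intensity difference exceeds diff_threshold, or None if nothing moved.
    diff_threshold is an assumed value, not one taken from the source."""
    rows, cols = [], []
    for r, (row_a, row_b) in enumerate(zip(prev_frame, next_frame)):
        for c, (a, b) in enumerate(zip(row_a, row_b)):
            if abs(a - b) > diff_threshold:
                rows.append(r)
                cols.append(c)
    if not rows:
        return None
    return min(rows), min(cols), max(rows), max(cols)


def drop_weak_flow(flow_vectors, min_magnitude=0.5):
    """Keep (x, y, u, v) samples whose flow magnitude reaches min_magnitude.
    min_magnitude is an assumed value, not one taken from the source."""
    return [
        (x, y, u, v)
        for (x, y, u, v) in flow_vectors
        if (u * u + v * v) ** 0.5 >= min_magnitude
    ]
```

Cropping each frame to the returned box and stacking the crops would give the centralized clip on which optical flow is computed.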

Figure 3: Process of optical flow computation for 4 frames from Weizmann ‘Side’ action.

Gaussian Mixture Learning

We begin by performing K-means clustering of all the 4D feature vectors (optical flow (u, v) and location (x, y)) obtained, as shown in Fig. 4(b). The value of K is not crucial; the goal is to obtain many low-variance clusters, which will become the Gaussian components in the motion patterns’ mixture distributions. The clustering is performed separately for D short, disjoint video clips, each of which contains k frames.
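
As a rough illustration of this step, the sketch below runs plain Lloyd's K-means on 4D samples and summarizes each cluster as a candidate Gaussian component. Several things here are assumptions rather than details from the source: the (x, y, u, v) ordering of each sample, the diagonal covariance used to keep the code short, the fixed iteration count, and the function names.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm on equal-length tuples (here x, y, u, v)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise from k sampled points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center in squared Euclidean distance.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])),
            )
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster went empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment


def component_stats(points, assignment, k):
    """Mean and per-dimension variance of each non-empty cluster
    (a diagonal-covariance simplification assumed for this sketch)."""
    stats = []
    for j in range(k):
        members = [p for p, a in zip(points, assignment) if a == j]
        if not members:
            continue
        mean = tuple(sum(c) / len(members) for c in zip(*members))
        var = tuple(
            sum((x - m) ** 2 for x in c) / len(members)
            for c, m in zip(zip(*members), mean)
        )
        stats.append((mean, var))
    return stats
```

In this sketch, `kmeans` would be called once per k-frame clip, and the resulting (mean, variance) pairs play the role of the low-variance Gaussian components.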

Figure 4: Illustration of primitives discovery from optical flow.


The eventual goal is to find out which of these components belong to each primitive action’s distribution. We notice that an action primitive, represented as a motion pattern, repeats itself within the video of an action (because most action videos are cyclic), as well as within the training data set (because there are multiple examples of each action). Therefore, we first attempt to further group the Gaussian components, such that each repetition of a primitive is represented by one such high-level group. We employ a Mahalanobis distance based measure to define a weighted graph, G = {C, E, W}, where C is the set of Gaussian components, and E and W are Z × Z matrices corresponding to edges and their weights. Whenever two components Ci and Cj occur in consecutive k-frame video clips, an edge exists between them, i.e., the element eij is 1. The weighted graph G is then converted into an unweighted one by removing edges with weights below a certain threshold. A connected components analysis of this unweighted graph gives P sequences (mixtures) of Gaussian components, each of which is assumed to be a single occurrence of an action primitive, e.g., one instance of ‘torso moving down’. Each such sequence of components (action primitive instance) is a Gaussian mixture. We observe that these action primitives are shared among multiple similar actions, e.g., ‘right arm going upwards’ is shared by ‘one hand waving’ as well as ‘both hands waving’.
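
The thresholding and connected components analysis can be sketched as follows. The edge weights are taken as given inputs, since the exact Mahalanobis-based measure is not spelled out here; the union-find structure, the function name, and the example threshold are assumptions made for illustration.

```python
def group_components(num_components, weighted_edges, min_weight):
    """Drop edges with weight below min_weight, then return the connected
    components of what remains as sorted lists of component indices.

    weighted_edges is an iterable of (i, j, weight) triples."""
    parent = list(range(num_components))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j, weight in weighted_edges:
        if weight >= min_weight:  # edge survives the thresholding
            parent[find(i)] = find(j)

    groups = {}
    for node in range(num_components):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())
```

For instance, with five components and edges (0, 1, 0.9), (1, 2, 0.8), (2, 3, 0.1), (3, 4, 0.7), a threshold of 0.5 removes the weak edge between components 2 and 3 and leaves two groups, each standing for one primitive instance.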

Figure 5: Illustration of Graph G for components: (left) spatial means and covariances shown as colored dots and ellipses, with color corresponding to mean optical flow; (right) edge weights depicted by shades of gray.

Action Representation

Given the automatic discovery and statistical representation of action primitives, the next step in our proposed framework is to obtain a representation of the entire action video. For unseen test videos, this process is similar to the primitive discovery phase. However, since a test video is typically short and contains at most a few cycles of the action, we do not perform the final step of primitive instance merging. This is because, for most test videos, only a single instance of each action primitive is observed. We therefore obtain a set of motion primitives for a test video, and our goal is to relate these primitive instances to the previously learned representation, i.e., the action primitives learned from the entire training set, which form the action vocabulary. This relationship is established by computing the KL divergence between each motion pattern observed in the test video and all learned action primitives, and assigning the pattern the label (or index) of the primitive with the least divergence. This process is illustrated in Fig. 6, where the second row shows patterns observed in a particular video, and the third row shows the corresponding primitives that each pattern was assigned to. The end result of this process, then, is the representation of the given video as a temporal sequence of action primitive labels, e.g., T = ({19, 23}, {20, 24}, {27, 23}) in the example in Fig. 6.
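
The labeling rule can be illustrated with a deliberately simplified sketch. The KL divergence between two Gaussian mixtures has no closed form, so the example below assumes that each pattern and each learned primitive is summarized by a single Gaussian with diagonal covariance, for which the divergence is exact. That simplification, the function names, and the data layout are assumptions of this sketch and not the method used in the project.

```python
import math


def kl_diag_gaussians(mean_p, var_p, mean_q, var_q):
    """KL(p || q) for two Gaussians with diagonal covariances."""
    return 0.5 * sum(
        vp / vq + (mq - mp) ** 2 / vq - 1.0 + math.log(vq / vp)
        for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q)
    )


def label_pattern(pattern, primitives):
    """Return the label of the learned primitive with the least divergence.

    pattern is a (mean, variance) pair; primitives maps label -> (mean, variance)."""
    mean_p, var_p = pattern
    return min(
        primitives,
        key=lambda label: kl_diag_gaussians(mean_p, var_p, *primitives[label]),
    )
```

Applying `label_pattern` to every pattern observed in a test video, in temporal order, would yield a label sequence of the kind denoted T above.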

Figure 6: Process of obtaining test video representation: (row 1): 6 frames (1.5 cycles) from Kecklab action ‘go back’. (row 2): 3 pairs of co-occurring primitive instances for the test video (colors correspond to mean direction and brightness to mean magnitude). Horizontal braces ({) on top indicate co-occurring pairs. (row 3): results of primitive labeling using KL-divergence. Learned primitive with least divergence picked as label and shown at bottom. Downward arrows indicate correctness of labeling per primitive. The action model is represented by the sequence T = ({19, 23}, {20, 24}). The only incorrect label is of the 5th detected primitive, labeled as 27 instead of 19.

For evaluation of the quality and discriminative nature of our proposed primitive actions, we put forth three different high level representations of an action video (Histogram of Action Primitives, Strings of AP and Temporal Model of AP), all of which employ the observed primitives.

Experiments and Results

The proposed primitive representation has been evaluated for five human action datasets, as well as a composite dataset, two human gestures datasets, and a facial expressions database. We tested our representation using three different high level representations (strings, histograms, and HMMs), for three distinct applications (unsupervised clustering, 1/k-shot recognition, and conventional supervised recognition), to show representation and recognition quality and performance. Our extensive experiments on a variety of datasets provide insight into not only how our framework compares with state-of-the-art, but also into the very nature of the action recognition problem.

Unsupervised Clustering

In this experiment, all videos in the dataset are used to learn the action primitives representation, and the videos are represented as strings of primitive labels. A value of 50 was used for K (in K-means) for all datasets except the Cohn-Kanade face expressions database, where K = 30. A string matching similarity matrix of all videos is then constructed, and clustering is performed by thresholding and graph connected components to obtain groups of videos. Each video in a cluster is then assigned the label of the dominating class within the cluster, and comparing the assigned label with the ground truth label provides the classification accuracy. The results of these experiments on most datasets are summarized in Table 1.
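
The clustering and scoring procedure described above can be sketched as follows, assuming the pairwise similarity matrix has already been computed and is symmetric. The function names and the threshold value are hypothetical, and the depth-first traversal is just one way to extract connected components.

```python
from collections import Counter


def cluster_by_threshold(similarity, threshold):
    """Link videos whose similarity reaches the threshold and return the
    connected components as sorted lists of video indices."""
    unvisited = set(range(len(similarity)))
    clusters = []
    while unvisited:
        start = min(unvisited)  # deterministic starting point
        unvisited.remove(start)
        stack, cluster = [start], []
        while stack:
            i = stack.pop()
            cluster.append(i)
            neighbors = [j for j in unvisited if similarity[i][j] >= threshold]
            for j in neighbors:
                unvisited.remove(j)
                stack.append(j)
        clusters.append(sorted(cluster))
    return clusters


def majority_vote_accuracy(clusters, true_labels):
    """Assign each cluster its dominating ground-truth class and return the
    fraction of videos whose assigned label matches the ground truth."""
    correct = 0
    for cluster in clusters:
        counts = Counter(true_labels[i] for i in cluster)
        correct += counts.most_common(1)[0][1]
    return correct / len(true_labels)
```

Note that this accuracy depends on the threshold: a very high threshold puts every video in its own cluster and trivially scores 1.0, so the number of clusters should be read alongside the accuracy.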

Table 1: Quantitative comparison of different representations for unsupervised clustering with and without actor centralization. BBx in column 2 denotes ‘bounding box’.

One-shot and K-shot Learning

This experiment was performed for the Kecklab Gesture, Weizmann, and UCF YouTube datasets, as well as the Cohn-Kanade face expressions database and the recently posed ChaLearn Gesture Challenge dataset. Fig. 7 shows the performance of the proposed representation using a variable number of training examples, as well as a comparison to a BoVW framework with the same settings using Dollar features.

Figure 7: Classification of actions using a variable number of training examples. The values corresponding to 1 on the X-axis are essentially results of one-shot learning.

Supervised Recognition

Finally, we present our results using the traditional supervised learning approach. In this experiment, a dataset is divided into training and testing sets. The primitive action representation is learned from examples in the training sets. The training videos are then represented as strings of the primitive labels. Given a test video, pattern instances are estimated using the proposed approach, which are then represented as Gaussian mixtures. These distributions are then compared against the learned primitive distributions using KL divergence, and labeled. The test video is thus also represented as a string of learned primitives. Finally, a string matching based nearest neighbor classifier is employed to assign an action label to the test video. The results on different datasets using this approach are reported in Table 2.
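
A string matching nearest neighbor classifier of this kind can be sketched as below. The specific string matching measure is not stated here, so Levenshtein edit distance is assumed; each video is also assumed to be flattened into a plain sequence of primitive labels, which ignores co-occurring pairs. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of primitive labels."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def classify(test_string, training_set):
    """Return the action label of the closest training string.

    training_set is a list of (primitive_string, action_label) pairs."""
    return min(
        training_set,
        key=lambda item: edit_distance(test_string, item[0]),
    )[1]
```

With only one training string per action, the same classifier corresponds to the one-shot setting; adding more strings per action gives the k-shot and fully supervised cases.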

Table 2: Quantitative comparison of the proposed action primitives with some of the state-of-the-art techniques for supervised action recognition.

Related Publication

Yang Yang, Imran Saleemi, and Mubarak Shah, Discovering Motion Primitives for Unsupervised Grouping and One-shot Learning of Human Actions, Gestures, and Expressions, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2012.