Learning Human Actions via Information Maximization


We present a novel approach for automatically learning a compact yet discriminative appearance-based human action model. A video sequence is represented by a bag of spatiotemporal features, called video-words, obtained by quantizing the 3D interest points (cuboids) extracted from the videos. Our proposed approach automatically discovers the optimal number of video-word clusters by utilizing Maximization of Mutual Information (MMI). Unlike the k-means algorithm, which is typically used to cluster spatiotemporal cuboids into video-words based on their appearance similarity, MMI clustering further groups the video-words that are highly correlated with some group of actions. To capture the structural information of the learnt optimal video-word clusters, we explore the correlation of the compact video-word clusters. We use a modified correlogram, which is not only translation and rotation invariant, but also somewhat scale invariant. We extensively test our proposed approach on two publicly available challenging datasets: the KTH dataset and the IXMAS multiview dataset. To the best of our knowledge, we are the first to apply a bag of video-words approach to the multiview dataset. We have obtained very impressive results on both datasets.

Bag of video-words procedure

There are four steps in bag of video-words modeling. First, detect 3D interest points. Second, extract the cuboids surrounding each interest point and compute the corresponding descriptors. Third, group the cuboids (descriptors) into video-words. Finally, represent the entire video by the histogram of the video-words.
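Assuming a codebook of video-word centroids has already been learned (the function names and toy data below are illustrative, not from the paper), the final quantization and histogram steps can be sketched as:

```python
import numpy as np

def quantize(descriptors, codebook):
    """Assign each cuboid descriptor to its nearest video-word (codebook centroid)."""
    # squared Euclidean distance from every descriptor to every centroid
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def bovw_histogram(descriptors, codebook):
    """Represent a whole video as a normalized histogram over video-words."""
    words = quantize(descriptors, codebook)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

# toy example: 5 descriptors in 2-D, codebook of 3 video-words
codebook = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
descriptors = np.array([[0.1, -0.2], [4.8, 5.1], [9.7, 0.3],
                        [0.0, 0.4], [5.2, 4.9]])
h = bovw_histogram(descriptors, codebook)
print(h)  # -> [0.4 0.4 0.2]
```

In practice the codebook itself would come from clustering a large sample of cuboid descriptors (e.g., with k-means), and real descriptors are high-dimensional gradient or flow features rather than 2-D points.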


Problems in bag of video-words methods

  • How large must the codebook be to achieve good performance?
    • In this project, we automatically find the optimal number of video-word clusters using an information-maximization approach.
  • Can video-words be further grouped into semantically similar video-word clusters?
    • Because mutual information with the action classes drives the grouping, the new video-word clusters are able to capture some of the semantic meaning of the video-words.
  • How can structural (spatial and/or temporal) information be integrated into the orderless bag of video-words model?
    • We explored two approaches: the spatial correlogram and spatiotemporal pyramid matching.

Objective Function

We define the mutual information between X (video-words) and Y (action videos) as follows,

    I(X; Y) = sum_{x in X} sum_{y in Y} p(x, y) log [ p(x, y) / ( p(x) p(y) ) ]

Then, given a clustering mapping from the video-words X to video-word clusters X̂, the objective of the clustering is to preserve as much mutual information as possible, i.e., to minimize the information loss,

    ΔI = I(X; Y) − I(X̂; Y)

Therefore, we are looking for a clustering that achieves a good tradeoff between discrimination (preserving the mutual information with the actions) and compactness (a small number of clusters).
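As a concrete illustration, the mutual information I(X; Y) can be computed directly from a joint distribution table p(x, y); a minimal sketch (the function name and toy table are our own):

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) = sum_{x,y} p(x,y) * log2[ p(x,y) / (p(x) p(y)) ], in bits."""
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    nz = pxy > 0                          # skip zero cells (0 * log 0 = 0)
    return float((pxy[nz] * np.log2(pxy[nz] / (px * py)[nz])).sum())

# X fully determines Y, so I(X;Y) equals H(X) = 1 bit
pxy = np.array([[0.5, 0.0],
                [0.0, 0.5]])
print(mutual_information(pxy))  # -> 1.0
```

In the clustering setting, p(x, y) would be estimated from the co-occurrence counts of video-words and action videos in the training set.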

The greedy solution

    • Start with the trivial partition, in which each video-word is its own cluster;
    • Greedily merge the two clusters whose merge minimizes the loss of mutual information. The loss due to merging clusters x̂_i and x̂_j can be expressed as,

      δI(x̂_i, x̂_j) = ( p(x̂_i) + p(x̂_j) ) · JS_π( p(y|x̂_i), p(y|x̂_j) )

      where JS_π is the Jensen–Shannon divergence with weights π_i = p(x̂_i) / (p(x̂_i) + p(x̂_j)) and π_j = p(x̂_j) / (p(x̂_i) + p(x̂_j)).

  • Repeat the merging until the desired number of video-word clusters is reached.
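A single greedy step of this procedure can be sketched as follows, using the standard agglomerative formulation in which the merge cost is a weighted Jensen–Shannon divergence between the clusters' conditional action distributions (the toy numbers and function names are illustrative):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits (assumes q > 0 where p > 0)."""
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / q[nz])).sum())

def merge_cost(p_x, p_y_given_x, i, j):
    """Mutual-information loss from merging clusters i and j:
    (p_i + p_j) times the weighted Jensen-Shannon divergence of their
    conditional action distributions p(y|i) and p(y|j)."""
    w = p_x[i] + p_x[j]
    wi, wj = p_x[i] / w, p_x[j] / w
    m = wi * p_y_given_x[i] + wj * p_y_given_x[j]  # conditional of the merged cluster
    return w * (wi * kl(p_y_given_x[i], m) + wj * kl(p_y_given_x[j], m))

def best_merge(p_x, p_y_given_x):
    """Scan all cluster pairs and return (cost, i, j) with minimal information loss."""
    n = len(p_x)
    return min((merge_cost(p_x, p_y_given_x, i, j), i, j)
               for i in range(n) for j in range(i + 1, n))

# toy example: clusters 0 and 1 induce identical action distributions,
# so merging them loses no mutual information
p_x = np.array([0.3, 0.3, 0.4])                    # cluster priors p(x̂)
p_y_given_x = np.array([[0.9, 0.1],
                        [0.9, 0.1],
                        [0.1, 0.9]])               # p(action | cluster)
cost, i, j = best_merge(p_x, p_y_given_x)
print(i, j, cost)  # -> 0 1 0.0
```

The full algorithm would apply `best_merge` repeatedly, pooling the merged clusters' probabilities after each step, until the target number of clusters remains.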

Experiments on KTH dataset

The KTH dataset is a widely used action dataset containing 6 actions in almost 600 videos performed by 25 people.


  1. The classification performance comparison between the initial codebook and the optimized codebook under different initial codebook sizes.
  2. The performance comparison between MMI clustering and directly applying the k-means algorithm.

    >> MMI clustering reduces the initial 1,000 video-words to the corresponding number of clusters.
    >> MMI discovered Nc = 177 as the optimal number of VWCs, with an average accuracy of 91.3%.
  3. The importance of the video-words clusters and their structural information

    The first row shows examples of the six actions. The following two rows show, respectively, the distribution of the optimal 20 video-word clusters obtained by our approach and of 20 video-words obtained by k-means. We superimpose the 3D interest points from all frames into one image; different clusters are represented by different color codes. Note that our model is more compact, e.g., see the “waving” and “running” actions.
  4. Spatial correlogram
    Given cuboids P_i with cluster labels l_i and quantized distance bins D_k, the spatial correlogram can be defined as a probability,

      γ^{(k)}_{l_i, l_j} = Pr( P_2 has label l_j and ||P_1 − P_2|| ∈ D_k | P_1 has label l_i )

    We show the confusion table of the performance of the video-word-cluster correlogram as follows,

    The number of VWCs is 60, and 3 quantized distances are used (average accuracy: 94.15%).
  5. The performance of the different bag of video-words related approaches.
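The spatial correlogram described in item 4 above can be sketched as follows (a minimal version assuming 2-D point locations and hard distance bins; the function name and toy data are our own):

```python
import numpy as np

def correlogram(points, labels, n_labels, dist_bins):
    """gamma[l1, l2, k]: probability that a point at quantized distance bin k
    from a reference point labelled l1 carries label l2."""
    n = len(points)
    counts = np.zeros((n_labels, n_labels, len(dist_bins) - 1))
    for a in range(n):
        for b in range(n):
            if a == b:
                continue
            d = np.linalg.norm(points[a] - points[b])
            k = np.searchsorted(dist_bins, d) - 1   # which distance bin d falls into
            if 0 <= k < len(dist_bins) - 1:
                counts[labels[a], labels[b], k] += 1
    # normalize over (l2, k) per reference label l1 -> conditional probabilities
    totals = counts.sum(axis=(1, 2), keepdims=True)
    return np.where(totals > 0, counts / np.maximum(totals, 1), 0.0)

# toy example: three interest points on a line, two cluster labels,
# distance bins [0, 2) and [2, 5)
points = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]])
labels = np.array([0, 0, 1])
g = correlogram(points, labels, n_labels=2, dist_bins=[0.0, 2.0, 5.0])
print(g[0, 0, 0], g[0, 1, 1], g[1, 0, 1])  # -> 0.5 0.5 1.0
```

For video, the same counting would run over the spatial (and possibly temporal) positions of the detected cuboids, with labels given by the learned video-word clusters.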

Experiments on IXMAS multiview dataset

We used 4 views and 13 actions performed by 12 actors (the top view was not included in our experiments). Some examples are shown in the following picture.


  1. The performance of learning with four views, while testing on a single view

    The 189 video-word clusters are learned from 1,000 video-words. The performance is better than Weinland’s ICCV 2007 result, which is {65.4, 70.0, 54.3, 66.0}%. Moreover, their method requires 3D reconstruction. The following figure gives the detailed classification performance for each view.
  2. The performance of learning from three views, and testing on the remaining view.

Related Publication

Jingen Liu and Mubarak Shah, Learning Human Actions via Information Maximization, IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2008.