Center for Research in Computer Vision

Action Recognition Using Multiple Features


Fusing multiple features is important for recognizing actions, because a representation based on a single feature cannot capture both imaging variations (viewpoint, illumination, etc.) and attributes of individuals (size, age, gender, etc.). We propose to use two types of features. The first is a quantized vocabulary of local spatio-temporal (ST) volumes (or cuboids) centered around 3D interest points in the video. The second is a quantized vocabulary of spin-images, which aims to capture the 3D shape of the actor by treating actions as 3D objects. To combine these features optimally, we develop a mathematical framework that treats the different features as nodes in a graph, where the weighted edges between nodes represent the strength of the relationship between the corresponding entities. The graph is then embedded into a k-dimensional space, subject to the criterion that similar nodes have Euclidean coordinates that are close to each other. This constraint is converted into a minimization problem whose solution is given by the eigenvectors of the graph Laplacian matrix. Embedding into a common space allows relationships among features to be discovered using Euclidean distances. The proposed framework is tested on publicly available data sets, and the results demonstrate that fusing multiple features helps achieve improved performance.
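The embedding step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the unnormalized Laplacian L = D - W, and the function name `fiedler_embedding` and the toy weight matrix are hypothetical. The coordinates are taken from the eigenvectors with the smallest non-zero eigenvalues.

```python
import numpy as np

def fiedler_embedding(W, k):
    """Embed the n nodes of a weighted graph into k dimensions.

    W is a symmetric, non-negative (n x n) weight matrix; W[i, j] is the
    strength of the relationship between nodes i and j.
    Assumption: unnormalized Laplacian L = D - W and a connected graph.
    """
    W = np.asarray(W, dtype=float)
    D = np.diag(W.sum(axis=1))        # degree matrix
    L = D - W                         # graph Laplacian
    # eigh returns eigenvalues in ascending order for symmetric matrices
    _, eigvecs = np.linalg.eigh(L)
    # Skip the first eigenvector (constant, eigenvalue 0) and keep the next k
    return eigvecs[:, 1:k + 1]

# Toy graph: nodes 0-2 are strongly related, nodes 3-5 are strongly related,
# and a single weak edge links node 2 to node 3.
W = np.array([
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
])

coords = fiedler_embedding(W, k=2)    # one row of coordinates per node
print(coords.shape)
```

In this toy graph the first coordinate separates the two strongly connected groups by sign, which is the behaviour the minimization is meant to produce: strongly related nodes end up close together.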


  1. Generating multiple features

  2. Building a Laplacian Graph for embedding

Fiedler Embedding



Feature Generation

Spatiotemporal Features

  1. Apply two separate linear filters to the spatial and temporal dimensions, respectively.
  2. Apply PCA to the gradient-based descriptors to reduce their dimensionality.
  3. Quantize the features into video-words.
  4. Some examples of video-words.
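Steps 2 and 3 above can be sketched as follows. This is an illustration under stated assumptions rather than the method used on this page: the descriptors are random stand-ins, the vocabulary is learned with a plain k-means loop (the page does not name the clustering method), and all function names and sizes are hypothetical.

```python
import numpy as np

def pca_project(X, n_components):
    """Project the rows of X onto their top principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def assign(X, centers):
    """Index of the nearest center (video-word) for each row of X."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, k, n_iter=20, seed=0):
    """Plain k-means; an assumed stand-in for the vocabulary learner."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(X, centers)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:      # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers

def bag_of_words(X, centers):
    """Normalized histogram of video-word occurrences."""
    labels = assign(X, centers)
    hist = np.bincount(labels, minlength=len(centers)).astype(float)
    return hist / hist.sum()

# Stand-in data: 200 gradient descriptors of length 50 (sizes are arbitrary)
rng = np.random.default_rng(1)
descriptors = rng.normal(size=(200, 50))

reduced = pca_project(descriptors, n_components=10)
vocabulary = kmeans(reduced, k=8)
# Treat the first 40 descriptors as if they came from one video
video_hist = bag_of_words(reduced[:40], vocabulary)
print(video_hist)
```

Each video is then represented by its histogram over the vocabulary, regardless of how many cuboids were extracted from it.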

Spin-image Features

  1. Create Action Volume from contours.
  2. Generate spin-images from the action volume.
    Four elements: the oriented point O, the tangent plane at O, and the surface-point indices (alpha, beta).
  3. Bag of Spin-Image features
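The spin-image coordinates in step 2 can be sketched as follows, using the standard spin-image definition: for an oriented point O with unit normal n, a surface point x maps to beta = n . (x - O), its signed height above the tangent plane, and alpha, its distance from the line through O along n. The bin size, image width, and function names are assumptions chosen for illustration.

```python
import numpy as np

def spin_coordinates(points, origin, normal):
    """Return (alpha, beta) for each surface point relative to (origin, normal)."""
    n = normal / np.linalg.norm(normal)
    d = points - origin
    beta = d @ n                                  # signed height above the tangent plane
    radial2 = (d * d).sum(axis=1) - beta ** 2     # squared distance from the normal line
    alpha = np.sqrt(np.maximum(radial2, 0.0))     # clamp tiny negatives from round-off
    return alpha, beta

def spin_image(points, origin, normal, bin_size=1.0, image_width=12):
    """Accumulate a 2D histogram over (beta, alpha) for one oriented point."""
    alpha, beta = spin_coordinates(points, origin, normal)
    half = image_width * bin_size / 2.0
    rows = np.floor((half - beta) / bin_size).astype(int)   # beta, centered vertically
    cols = np.floor(alpha / bin_size).astype(int)           # alpha, starting at 0
    keep = (rows >= 0) & (rows < image_width) & (cols >= 0) & (cols < image_width)
    image = np.zeros((image_width, image_width))
    np.add.at(image, (rows[keep], cols[keep]), 1.0)         # handles repeated bins
    return image

# Two hypothetical surface points seen from an oriented point at the origin
surface = np.array([[3.0, 4.0, 0.0],
                    [0.0, 0.0, 2.0]])
O = np.zeros(3)
n = np.array([0.0, 0.0, 1.0])

alpha, beta = spin_coordinates(surface, O, n)
image = spin_image(surface, O, n)
print(alpha, beta, image.sum())
```

Because alpha discards the angle around the normal, the resulting image does not change when the surface is rotated about n. Step 3 would then quantize such images into a vocabulary, in the same way the ST descriptors are quantized into video-words.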


We applied our method to a nine-action dataset and the IXMAS multi-view dataset.

1. Qualitative Results

a. Query action videos by spin-image features
b. Query action by ST features
c. Query features by ST feature
d. Query features by spin-image feature
e. Query features by action video
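Queries of this kind can be sketched as a nearest-neighbour search in the common embedding space, since videos and both feature types share the same Euclidean coordinates after embedding. The coordinates, node layout, and function name below are hypothetical and serve only to illustrate the ranking.

```python
import numpy as np

def query(coords, query_index, candidate_indices, top=3):
    """Rank candidate nodes by Euclidean distance to the query node."""
    diffs = coords[candidate_indices] - coords[query_index]
    dist = np.sqrt((diffs ** 2).sum(axis=1))
    order = np.argsort(dist)[:top]
    return [(int(candidate_indices[i]), float(dist[i])) for i in order]

# Hypothetical 2-D embedding: rows 0-1 are action videos,
# rows 2-3 are ST video-words, rows 4-5 are spin-image words.
coords = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.1, 0.0],
    [5.0, 5.0],
    [0.9, 0.1],
    [4.0, 4.0],
])

# "Query features by action video": which ST words lie closest to video 0?
nearest_st_words = query(coords, query_index=0, candidate_indices=[2, 3], top=1)
print(nearest_st_words)
```

The same function covers the other query types by changing which rows act as the query and which as candidates.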

2. Quantitative Results

1) Comparison of the original bag-of-words method and our weighted bag-of-words method

2) Comparison of Fiedler Embedding (nonlinear method) with LSA (linear method)

3) Contribution of different features towards classification

4) Effect of the embedding dimension on performance

3. Experiment Results on IXMAS multiple view dataset

1) Action Volume (Checking watch)

2) Learning from four views and testing with single views (with different features)

3) Learning from four views and recognizing with single view in detail

4) Learning from four views and testing on four views (confusion table)

Related Publication

Jingen Liu, Saad Ali and Mubarak Shah, Recognizing Human Actions Using Multiple Features, IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
