Action Recognition Using Multiple Features
Introduction
The fusion of multiple features is important for recognizing actions, since a representation based on a single feature is not sufficient to capture imaging variations (viewpoint, illumination, etc.) and attributes of individuals (size, age, gender, etc.). We propose to use two types of features. The first is a quantized vocabulary of local spatiotemporal (ST) volumes (or cuboids) centered around 3D interest points in the video. The second is a quantized vocabulary of spin-images, which aims to capture the 3D shape of the actor by treating actions as 3D objects.

To optimally combine these features, we develop a mathematical framework that treats the different features as nodes in a graph, where weighted edges between nodes represent the strength of the relationship between entities. The graph is then embedded into a k-dimensional space, subject to the criterion that similar nodes have Euclidean coordinates that are close to each other. This is achieved by converting the constraint into a minimization problem whose solution is given by the eigenvectors of the graph Laplacian matrix. Embedding into a common space allows the discovery of relationships among features using Euclidean distances. The performance of the proposed framework is tested on publicly available data sets. The results demonstrate that the fusion of multiple features helps achieve improved performance.

Framework

Generating multiple features

Building a graph for Laplacian embedding
 Videos, spatiotemporal features, and spin-image features are the nodes of a single monolithic graph.
 Edges encode coarse similarity measures.
 Embed this graph into a common k-dimensional space and find latent relationships.
 Fiedler Embedding
Fiedler Embedding
Intuition: Semantically similar vertices have a strong intrinsic relationship.
 Geometrically, these vertices lie on a manifold embedded in a higher-dimensional space.
 Place the vertices into a k-dimensional space such that the strength of the relationship between two vertices is reflected by the Euclidean distance between them.
Suppose p_r and p_s are the locations of vertices r and s in the k-dimensional space. Then we need to minimize

    min_P  sum_{(r,s)} w_{rs} ||p_r - p_s||^2  =  trace(P^T L P)

with the constraints

    P^T 1 = 0,   P^T P = I,

where L = D - W is the graph Laplacian, D is the diagonal degree matrix, and W is the edge-weight matrix. The minimizer is given by the eigenvectors of L corresponding to the k smallest nonzero eigenvalues.
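The embedding step can be sketched numerically as follows. This is a minimal illustration assuming a dense symmetric weight matrix; the toy graph and its weights are illustrative, not taken from the experiments.

```python
import numpy as np

def fiedler_embedding(W, k):
    """Embed graph nodes into k dimensions using the eigenvectors of the
    graph Laplacian L = D - W for the k smallest nonzero eigenvalues.
    W: symmetric (n, n) nonnegative edge-weight matrix."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    # Skip the trivial constant eigenvector (eigenvalue ~0); the next k
    # eigenvectors give the embedding coordinates.
    return eigvecs[:, 1:k + 1]

# Toy graph: two strongly connected pairs (0,1) and (2,3),
# linked by weak cross edges.
W = np.array([[0.0, 1.0, 0.1, 0.0],
              [1.0, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 1.0],
              [0.0, 0.1, 1.0, 0.0]])
P = fiedler_embedding(W, 1)

# Strongly related nodes land close together in the embedding:
d_same = abs(P[0, 0] - P[1, 0])   # nodes 0 and 1 (edge weight 1.0)
d_diff = abs(P[0, 0] - P[2, 0])   # nodes 0 and 2 (edge weight 0.1)
```

After embedding, nodes connected by heavy edges have nearly identical coordinates, so latent relationships can be read off from Euclidean distances.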
Algorithm
Feature Generation
Spatiotemporal Features
 Apply two separate linear filters to the spatial and temporal dimensions, respectively: a spatial Gaussian smoothing filter and a quadrature pair of 1D temporal Gabor filters; interest points are local maxima of the resulting response function.
 Apply PCA to the gradient-based descriptors to reduce their dimensionality.
 Quantize the features into videowords.
 Some examples of videowords.
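The separable filtering step can be sketched as follows. This is a minimal illustration assuming a Dollar-style response function R = (I * g * h_ev)^2 + (I * g * h_od)^2; the frequency-scale coupling omega = 4/tau used here is a common convention and an assumption, not stated in the text.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d

def response_function(video, sigma=1.0, tau=2.0):
    """Separable spatiotemporal filtering: spatial Gaussian smoothing
    followed by a quadrature pair of 1D temporal Gabor filters.
    video: (T, H, W) float array; sigma/tau are spatial/temporal scales."""
    # Spatial smoothing frame by frame (no smoothing along the time axis).
    smoothed = gaussian_filter(video, sigma=(0, sigma, sigma))
    # 1D temporal Gabor quadrature pair (even and odd phase).
    half = 2 * int(np.ceil(tau))
    t = np.arange(-half, half + 1)
    omega = 4.0 / tau  # assumed coupling of frequency to temporal scale
    h_ev = np.cos(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)
    h_od = np.sin(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)
    r_ev = convolve1d(smoothed, h_ev, axis=0)
    r_od = convolve1d(smoothed, h_od, axis=0)
    # Interest points are local maxima of this nonnegative response map.
    return r_ev**2 + r_od**2

video = np.random.rand(20, 16, 16)  # synthetic clip for illustration
R = response_function(video)
```

Cuboids would then be extracted around the local maxima of R and described by their quantized gradient descriptors.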
Spin-image Features
 Create an action volume from the contours.
 Generate spin-images from the action volume.
Four elements: the oriented point O, the tangent plane through O, the surface points, and their indices (alpha, beta).
 Bag of spin-image features
 Apply PCA to reduce the dimensionality of the spin-images.
 Use k-means to quantize the spin-image features; a cluster of spin-images is taken as a videoword.
 An action is represented by a bag of spin-image videowords.
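The (alpha, beta) indexing behind a spin-image can be sketched as follows. This is a minimal illustration of the standard spin-image construction; the bin count and support size are arbitrary choices, not taken from the text.

```python
import numpy as np

def spin_image(p, n, points, bins=8, size=1.0):
    """Spin-image for an oriented point (p, n): each surface point x maps to
    alpha = radial distance from the axis through p along normal n,
    beta  = signed height above the tangent plane at p.
    Returns a 2D histogram over (alpha, beta)."""
    n = n / np.linalg.norm(n)
    d = points - p
    beta = d @ n                                           # signed height
    alpha = np.sqrt(np.maximum((d**2).sum(axis=1) - beta**2, 0.0))
    hist, _, _ = np.histogram2d(alpha, beta, bins=bins,
                                range=[[0, size], [-size, size]])
    return hist

# Oriented point at the origin with a z-axis normal, two surface points.
p = np.zeros(3)
n = np.array([0.0, 0.0, 1.0])
points = np.array([[0.5, 0.0, 0.0],   # on the tangent plane: beta = 0
                   [0.0, 0.0, 0.5]])  # on the axis: alpha = 0
h = spin_image(p, n, points)
```

Stacking such histograms for oriented points sampled over the action volume, then applying PCA and k-means, yields the spin-image videowords described above.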
Results
We applied our method to a nine-action dataset and the IXMAS multiple-view dataset.
 Nine actions performed by 9 actors; 82 video sequences in total.
 Actions: bend, jack, jump, pjump, run, side walk, walk, one-hand wave (wave1), two-hand wave (wave2).
 200 interest cuboids extracted from each video with sigma = 1 and tau = 2.
 Codebook sizes of 200 and 1,000.
 Leave-one-out cross-validation scheme.
1. Qualitative Results
a. Query action videos by spin-image features
c. Query features by ST features
e. Query features by action videos
2. Quantitative Results
1) Comparison of the original bag-of-words method and our weighted bag-of-words method
 Goal: meaningful groupings of videowords can help the classification.
 Suppose the original histogram of videoword frequencies is given.
 The weighted feature frequency of term i is then computed from its nearest neighbors in the embedded space, where f(i, j) is a function that returns the j-th nearest neighbor of feature i.
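One way to realize this weighting can be sketched as follows. Since the exact weighting formula is not given in the text, this is a hypothetical sketch: each videoword's count is augmented by the counts of its K nearest neighbors f(i, 1..K) in the embedded space, with inverse-distance weights chosen for illustration.

```python
import numpy as np

def weighted_histogram(h, coords, K=3):
    """Hypothetical weighted bag-of-words: augment the count of each
    videoword i with the counts of its K nearest neighbor words in the
    embedded space, weighted by inverse distance (an assumed choice).
    h: (n,) original histogram; coords: (n, k) embedded word coordinates."""
    n = len(h)
    # Pairwise Euclidean distances between embedded videowords.
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    hw = h.astype(float).copy()
    for i in range(n):
        order = np.argsort(dists[i])
        neighbors = order[1:K + 1]          # f(i, 1..K): K nearest words
        w = 1.0 / (1.0 + dists[i, neighbors])
        hw[i] += (w * h[neighbors]).sum()   # nonnegative contributions
    return hw

# Toy example: four videowords embedded on a line.
h = np.array([1, 2, 3, 4])
coords = np.array([[0.0], [1.0], [2.0], [10.0]])
hw = weighted_histogram(h, coords, K=1)
```

Words that are close in the embedded space reinforce each other's counts, which is the sense in which meaningful groupings can aid classification.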
2) Comparison of Fiedler Embedding (nonlinear method) with LSA (linear method)
 Average accuracy of Fiedler Embedding: 89.26%
 Average accuracy of LSA: 85.11%
3) Contribution of different features towards classification
4) Effect of varying the embedding dimension on performance
3. Experiment Results on IXMAS multiple view dataset
 From each video, 200 cuboids are extracted
 1,000 videowords
 6-fold cross-validation scheme: videos of 10 actors are used for learning and the rest for testing.
 Our approach does not require 3D reconstruction.