Center for Research in Computer Vision

KEYS - Incremental Action Recognition Using Feature-Tree


Human action recognition has been one of the most challenging problems in computer vision over the last two decades, and reliable, effective solutions are still being sought. Most current approaches suffer from several drawbacks in practice:

(1) The inability to cope with incremental recognition problems.

(2) The requirement of an intensive training stage to obtain good performance.

(3) The inability to recognize simultaneous multiple actions.

(4) Difficulty in performing recognition frame by frame.

In order to overcome all these drawbacks with a single method, we propose a novel framework, the feature-tree, which indexes large-scale motion features using the Sphere/Rectangle-tree (SR-tree). Recognition consists of two steps: 1) recognizing the local features by non-parametric nearest neighbor (NN) search, and 2) labeling the action with a simple voting strategy. The proposed method can also localize the action. Since our method does not require feature quantization, the feature-tree can be efficiently grown by adding features from new training examples of actions or categories. Our method therefore provides an effective way to perform practical incremental action recognition. Furthermore, it can handle large-scale datasets because the SR-tree is a disk-based data structure. We have tested our approach on two publicly available datasets, the KTH and the IXMAS multi-view datasets, and obtained promising results.

Proposed Method

Figure 1 shows the main steps of our method for action recognition using the feature-tree. There are two phases: Figure 1 (a) depicts the construction of the feature-tree, and Figure 1 (b) describes the recognition steps. In the feature-tree construction stage, local spatiotemporal features are detected and extracted from each labeled video, and each feature is represented by a pair [d, l], where d is the feature descriptor and l is the class label of the feature. Finally, we index all the labeled features using the SR-tree. In the recognition stage, given an unknown action video, we first detect and extract local spatiotemporal features, and then launch a query into the feature-tree for each feature. A set of nearest neighbor features and their corresponding labels is returned for each query, and each returned nearest neighbor votes for its label. This process is repeated for every feature, and the votes can be weighted based on the importance of each nearest neighbor. Finally, the video is assigned the label that receives the most votes. The entire procedure requires no intensive training, so we can easily perform incremental action recognition using the feature-tree. In addition, we can easily localize the actions in the video and detect multiple actions occurring simultaneously in the same video.


Feature Extraction

In order to detect an action in a given video, we use the spatiotemporal interest point detector proposed by Dollar et al. Compared to the 3-D Harris corner detector, it produces denser features, which significantly improves recognition performance in most cases. This detector applies a 2-D Gaussian filter in the spatial dimensions and a 1-D Gabor filter in the temporal dimension. A response value is given by the following function at every position (x, y, t):

R = (I * g * h_ev)^2 + (I * g * h_od)^2

where I is the video, * denotes convolution, g(x, y; σ) is the spatial Gaussian filter, and h_ev and h_od are a quadrature pair of 1-D Gabor filters applied in time. The function produces high responses at points of temporal intensity change. N interest points are selected at the locations of local maxima of the response, and 3-D cuboids are extracted around them. For simplicity, we use flattened gradient vectors to describe the cuboids and apply PCA to reduce the descriptor dimension, so a video is represented by a set of cuboid descriptors.
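As an illustration, the response function above can be computed as follows. This is a minimal sketch: the function name and the parameter defaults (sigma, tau, omega) are illustrative choices, not the values used in the original implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dollar_response(video, sigma=2.0, tau=3.0, omega=0.25):
    """Compute R = (I*g*h_ev)^2 + (I*g*h_od)^2 at every (x, y, t).

    video: array of shape (T, H, W). sigma is the spatial Gaussian scale;
    tau and omega parameterize the temporal Gabor quadrature pair.
    """
    # Spatial smoothing with the 2-D Gaussian g(x, y; sigma), per frame.
    smoothed = gaussian_filter(video.astype(float), sigma=(0, sigma, sigma))

    # Quadrature pair of 1-D Gabor filters h_ev, h_od along the time axis.
    t = np.arange(-2 * int(tau), 2 * int(tau) + 1)
    h_ev = -np.cos(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)
    h_od = -np.sin(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)

    # Convolve each pixel's temporal signal with both filters.
    ev = np.apply_along_axis(lambda s: np.convolve(s, h_ev, mode="same"), 0, smoothed)
    od = np.apply_along_axis(lambda s: np.convolve(s, h_od, mode="same"), 0, smoothed)
    return ev**2 + od**2
```

Interest points would then be taken at local maxima of the returned response volume.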

Constructing Feature-Tree

Given a set of action videos V = {v1, v2, …, vM}, each with a corresponding class label li ∈ {1, 2, …, C}, we extract n spatiotemporal features dij (1 ≤ j ≤ n) from video vi. We associate each feature with the class label of its video to obtain a two-element feature tuple [dij, li]. We thus obtain a collection of labeled features T = {[dij, li]} from which to construct our feature-tree. We use the SR-tree for feature-tree construction, which is explained later.
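The labeled-feature construction can be sketched as below. Since the SR-tree has no standard Python implementation, this sketch indexes the collection with scipy's cKDTree as a stand-in; the function name is illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def build_feature_tree(videos_features, labels):
    """Build the labeled feature collection T = {[d, l]} and index it.

    videos_features: list of (n_i, dim) descriptor arrays, one per
    training video; labels: the class label l_i of each video. A KD-tree
    stands in for the SR-tree; the labeled-tuple idea is the same.
    """
    descriptors, feat_labels = [], []
    for feats, l in zip(videos_features, labels):
        for d in feats:
            descriptors.append(d)   # feature descriptor d
            feat_labels.append(l)   # inherits its video's class label
    return cKDTree(np.asarray(descriptors)), np.asarray(feat_labels)
```

New training videos can later be accommodated by re-indexing their features, mimicking the incremental growth of the real feature-tree.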

Action Recognition

Given the feature-tree constructed during training, action recognition starts by extracting spatiotemporal features from an unknown action video. Suppose the query video is represented by a set of spatiotemporal features Q = {dq} (1 ≤ q ≤ M); our task is to find its class label. Instead of using a complicated learning method, we adopt a simple voting scheme.

For each query feature dq, we retrieve the labels of its K nearest neighbors from the feature-tree, and each of the K returned features votes for its class label. The label of Q is then decided by the combined votes over all query features in Q. Note that the query features of Q generally contain both good features (meaning discriminative features) and bad ones. In order to distinguish their votes, we assign a class label to Q using the following equation:

C* = arg max_c Σ_{dq ∈ Q} Σ_{dr ∈ NN(dq)} w(dq, dr) · δc(lr)   (1)

where NN(dq) is the set of K nearest neighbors of dq, δc(lr) is an indicator function equal to 1 if the label lr of dr is c and zero otherwise, and w(dq, dr) is a weight, controlled by constant parameters, that decreases as the distance between dq and dr grows. The contribution of each feature therefore depends on how well it matches the query feature; noisy (bad) features usually do not have good matches.
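The weighted voting above can be sketched as follows. The inverse-distance weight used here is one plausible choice, not necessarily the exact weighting of equation (1), and the function name is illustrative; the index is assumed to expose a KD-tree-style query interface.

```python
import numpy as np

def classify_video(query_feats, tree, feat_labels, k=5):
    """Weighted K-NN voting over all query features of a video.

    query_feats: (M, dim) array of query descriptors; tree: a spatial
    index with a .query(points, k) method (e.g. scipy cKDTree);
    feat_labels: class label of each indexed feature.
    """
    dists, idx = tree.query(query_feats, k=k)
    votes = {}
    for drow, irow in zip(np.atleast_2d(dists), np.atleast_2d(idx)):
        for dist, i in zip(drow, irow):
            w = 1.0 / (1.0 + dist)           # closer match -> stronger vote
            c = feat_labels[i]
            votes[c] = votes.get(c, 0.0) + w
    # The video receives the label with the largest accumulated vote.
    return max(votes, key=votes.get)
```

Because the weight shrinks with distance, noisy query features, which rarely have close matches in the tree, contribute little to the final decision.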

The nearest neighbor search algorithm implements an ordered depth-first traversal. First, it collects a candidate set; second, it revises the candidate set by visiting each leaf whose region intersects the candidate set. The search terminates when there are no more leaves to visit, and the final candidate set is the search result. The traversal computes the distance from the search point to the region of each child and visits the closer child first. Since the region of an SR-tree is the intersection of a bounding sphere and a bounding rectangle, the minimum distance from a search point to a region is defined as the maximum of the minimum distance to the bounding sphere and the minimum distance to the bounding rectangle. We summarize the action recognition steps and the nearest neighbor search procedure in Table 1 and Table 2, respectively.

Table 1. Main steps for action recognition using the feature-tree

Objective: Given a query video Q, assign it a class label.

Feature Extraction:

1. For the given video Q, detect the spatiotemporal interest points using the Dollar detector and extract the cuboids around them.

2. Compute the gradient for each cuboid, use PCA to reduce the dimension, and represent it by a descriptor dq.

3. Represent the given video Q as a bag of features {dq}.

Action Recognition: Given the query features {dq} of Q:

1. For each query feature dq in Q, retrieve its K nearest neighbors from the feature-tree.

2. Decide the class label of Q by equation (1).

The SR-tree is chosen to index the labeled feature collection T. It is a disk-based data structure that successfully integrates the advantages of the R-tree and SS-tree data structures, so it remains fast for large-scale datasets that cannot fit into main memory. We briefly review its structure as follows. Fig 2 shows the nested hierarchy with a root node, internal nodes, and leaves. Instead of splitting the feature space by hyperplanes, the SR-tree organizes the data into hierarchical regions. Each region is defined by the intersection of a bounding sphere (the smallest sphere covering a set of points) and a bounding rectangle. This design balances a larger region diameter against a smaller region volume, which improves the performance of nearest neighbor search.

Each leaf of an SR-tree has a minimum and maximum limit on the number of entries it can hold; entries in a leaf store the feature points with their labels. A leaf is split and converted into a node when it reaches the maximum number of allowed entries. Nodes likewise have a limit on the number of children they can hold. Each entry in a node has four components: a bounding sphere, a bounding rectangle, the number of points in the sub-tree, and pointers to all its children. We can view each region in our feature-tree as a cluster of feature points that are spatially close to each other. Since videos of the same action type share many feature patterns, each leaf tends to be dominated by one class label. The feature-tree can grow in three steps:

1. Inserting a new feature point into the tree.
2. Splitting a leaf when it is overcrowded.
3. Reinserting the feature points from step 2.
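The three growth steps can be sketched with the toy class below, which keeps a flat set of leaves rather than a full SR-tree hierarchy; the class name, capacity constant, and split heuristic are illustrative simplifications.

```python
import numpy as np

class GrowableIndex:
    """Toy sketch of the growth steps: (1) insert the new labeled feature
    into the leaf with the nearest centroid, (2) split an overcrowded
    leaf, (3) redistribute its points. A real SR-tree also maintains
    bounding spheres and rectangles, omitted here for brevity."""
    MAX_LEAF = 8

    def __init__(self):
        self.leaves = []                      # each leaf: list of (d, l)

    def insert(self, d, l):
        d = np.asarray(d, float)
        if not self.leaves:
            self.leaves.append([(d, l)])
            return
        # Step 1: centroid-based choice of the leaf to accommodate d.
        cents = [np.mean([p for p, _ in leaf], axis=0) for leaf in self.leaves]
        j = int(np.argmin([np.linalg.norm(d - c) for c in cents]))
        self.leaves[j].append((d, l))
        # Steps 2-3: split an overcrowded leaf and redistribute its points.
        if len(self.leaves[j]) > self.MAX_LEAF:
            pts = self.leaves.pop(j)
            pts.sort(key=lambda e: e[0][0])   # crude split along first dim
            half = len(pts) // 2
            self.leaves.append(pts[:half])
            self.leaves.append(pts[half:])
```

Because insertion never discards points, the tree can absorb features from new training videos or new action categories at any time, which is what enables incremental recognition.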


A centroid-based algorithm is applied in the insertion step due to its effectiveness for nearest neighbor queries. The insertion algorithm finds the best sub-tree to accommodate the new spatiotemporal feature point from T: the sub-tree whose centroid (i.e., the center of the feature points it covers) is nearest to the new point is selected. After insertion, the regions (the bounding sphere and the bounding rectangle of the parents) are updated; in fact, the region updates are propagated upwards until the root is reached. The bounding rectangle is updated the same way as in the R-tree. The bounding sphere, however, is updated using both the bounding sphere and the bounding rectangle of its children. Using both helps keep the radius of the parent's bounding sphere smaller than in the SS-tree, which in turn reduces the overlap of the bounding spheres.
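A sketch of the parent-region update is given below. The representations (rect as a (lo, hi) pair, sphere as a (center, radius) pair) and the function name are illustrative assumptions; the key point it demonstrates is that the parent radius is fit using both the children's spheres and their rectangles.

```python
import numpy as np

def update_region(child_rects, child_spheres):
    """Recompute a parent's bounding rectangle and bounding sphere from
    its children. rect = (lo, hi) corner arrays; sphere = (center, radius).
    """
    # Rectangle: enclose all child rectangles, as in the R-tree.
    los = np.min([lo for lo, _ in child_rects], axis=0)
    his = np.max([hi for _, hi in child_rects], axis=0)

    # Sphere: center at the mean of child centers; radius just large
    # enough to cover, for each child, the smaller of its sphere extent
    # and its rectangle's farthest corner. Using the rectangle bound as
    # well keeps the radius smaller than a spheres-only (SS-tree) update.
    centers = np.array([c for c, _ in child_spheres])
    center = centers.mean(axis=0)
    radius = 0.0
    for (c, r), (lo, hi) in zip(child_spheres, child_rects):
        d_sphere = np.linalg.norm(center - c) + r
        far = np.where(np.abs(lo - center) > np.abs(hi - center), lo, hi)
        d_rect = np.linalg.norm(far - center)
        radius = max(radius, min(d_sphere, d_rect))
    return (los, his), (center, radius)
```

After an insertion, this update would be applied to each ancestor in turn, propagating the change up to the root.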


If a leaf is overcrowded, a portion of its entries are reinserted. If the reinsertion fails, it is split and converted into a node and the entries are reinserted.


An entry is simply removed if the deletion of the entry causes no under-utilization of any leaf or node. If the deletion does cause under-utilization of a leaf or node, the leaf or node is removed from the tree and the orphaned entries are reinserted into the tree.

Table 2. Nearest neighbor search

Objective: Given a query feature point dq, find its K nearest neighbors.

1. Start with the candidate set nearest to the query feature point dq.

2. Search each leaf whose region intersects the candidate set, starting with the leaf closest to dq.

3. If the K nearest neighbors dr have been found, terminate.

4. Otherwise, move to the next closest leaf whose region intersects the candidate set, and repeat until all K nearest neighbors are found.
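The minimum-distance rule used to order and prune leaf visits follows directly from the region definition: since an SR-tree region is the intersection of a bounding sphere and a bounding rectangle, the minimum distance to the region is the maximum of the two minimum distances. A small sketch (function name illustrative):

```python
import numpy as np

def min_dist_to_region(q, center, radius, lo, hi):
    """Minimum distance from query point q to an SR-tree region, i.e. the
    intersection of a bounding sphere (center, radius) and a bounding
    rectangle with corners lo, hi."""
    q, center, lo, hi = map(np.asarray, (q, center, lo, hi))
    # Distance to the sphere: distance to its center, less the radius.
    d_sphere = max(0.0, np.linalg.norm(q - center) - radius)
    # Distance to the rectangle: clamp q into the box, then measure.
    d_rect = np.linalg.norm(q - np.clip(q, lo, hi))
    # The region is the intersection, so take the larger lower bound.
    return max(d_sphere, d_rect)
```

Taking the maximum gives a tighter lower bound than either shape alone, so more leaves can be skipped during the ordered traversal.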


Our proposed method has many advantages compared to the existing action recognition approaches.

(1) First, we can effectively and efficiently integrate indexing and recognition in the proposed feature-tree. As a result, we avoid any intensive training stage (vocabulary construction and category model training).

(2) Second, since the feature-tree grows when additional training features are added from the new training examples or a new category, the resultant tree is very useful for incremental action recognition in many realistic applications.

(3) Third, the tree provides a disk-based data structure, which makes our recognition system scalable for large scale datasets.

Finally, recognition can be performed nearly in real time. For example, on the KTH dataset each feature query takes 0.02 s on a single-core 2 GHz Pentium processor; if each frame has about 3 features, frame-based recognition takes only 0.06 s, and the performance remains competitive. Our system can also detect multiple actions happening simultaneously and localize them.

Experiments and Results

We tested our approach on two publicly available datasets: the KTH dataset and the IXMAS multi-view dataset. The default experimental settings are as follows, unless otherwise specified. For each video we extract 200 interest points and the corresponding cuboids. The gradient-based feature descriptor is used in all experiments.

Experiments on the KTH Dataset

We applied our approach to the KTH dataset, which contains six actions: boxing, clapping, waving, jogging, walking, and running. They are performed by 25 actors under four different scenarios of illumination, appearance, and scale changes. In total the dataset contains 598 video sequences.

Confusion table for the KTH dataset using the feature-tree and random forest: sequences corresponding to 5 persons were used to construct the tree.

Effect of number of Features, feature dimensions and Nearest Neighbors: The performance analysis is done by varying the number of features from 10 to 200 and the number of nearest neighbors from 1 to 50.

Incremental action recognition: We first build the tree with 5 people performing 4 actions (boxing, clapping, waving, walking). We then expand the feature-tree with action 5 (jogging), previously not included in the training set, adding one person at a time up to 5 more persons, and analyze the impact. From our experiments we observe that performance increases as the tree is expanded with new training data.

Classification and localization of two actions happening in the same video: The features assigned boxing labels are shown by red, and the ones assigned waving are shown by blue.

Action Recognition in Frame-by-Frame Mode

Experiments on IXMAS Multi-view dataset

We also applied our method to the IXMAS multi-view dataset. The dataset contains 13 daily-life actions, each performed three times by 12 actors and recorded from four different views. In total it contains 1848 video sequences. Here we conduct two different sets of experiments.

Confusion table by learning from four views: In the first experiment, learning is done using four views.
Performance by learning from three views: In the second experiment, we used videos from three camera views to construct the feature-tree, and the fourth camera view to test the system.

Related Publication

Kishore K. Reddy, Jingen Liu and Mubarak Shah, Incremental Action Recognition Using Feature-Tree, International Conference on Computer Vision (ICCV), September 2009.
