
Recognizing Realistic Actions from Videos “in the Wild”


In this paper, we present a systematic framework for recognizing realistic actions from videos “in the wild.” Such unconstrained videos are abundant in personal collections as well as on the web. Recognizing action from such videos has not been addressed extensively, primarily due to the tremendous variations that result from camera motion, background clutter, changes in object appearance, and scale, etc. The main challenge is how to extract reliable and informative features from the unconstrained videos. We extract both motion and static features from the videos. Since the raw features of both types are dense yet noisy, we propose strategies to prune these features. We use motion statistics to acquire stable motion features and clean static features. Furthermore, PageRank is used to mine the most informative static features. In order to further construct compact yet discriminative visual vocabularies, a divisive information-theoretic algorithm is employed to group semantically related features. Finally, AdaBoost is chosen to integrate all the heterogeneous yet complementary features for recognition. We have tested the framework on the KTH dataset and our own dataset consisting of 11 categories of actions collected from YouTube and personal videos, and have obtained impressive results for action recognition and action localization.


  • Recognizing realistic actions from unconstrained videos, such as personal videos, TV news videos, web videos (YouTube videos), etc.
  • Mining bags of informative hybrid features
    • Obtain good motion features by motion statistics
    • Acquire informative static features by motion statistics and PageRank
    • Construct semantic visual vocabulary
    • Boost heterogeneous features
  • Performing action recognition without explicitly detecting and tracking the subject or its joints, thanks to our feature mining strategies

YouTube Action Dataset

We collected 11 realistic action categories from YouTube with about 1,600 videos in total. For the details of this dataset, please click here.

  • Challenges in unconstrained videos: large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, illumination conditions, etc.

The flowchart of our system

  • Feature extraction
    • Motion feature: apply two separate filters to the input video, i.e., a Gaussian filter along the spatial dimensions and a 1D Gabor filter along the time dimension; apply PCA to reduce the dimension of the gradient-based feature descriptor.
    • Static feature: apply three types of feature detectors to the sampled frames: Harris-Laplacian, Hessian-Laplacian, and MSER detectors; represent the detected regions with SIFT descriptors.
  • Motion feature pruning by motion statistics.
  • Static feature pruning by motion statistics and PageRank technique.
  • Apply an information-theoretic divisive algorithm to obtain discriminative yet compact vocabularies.
  • Select good motion and static features by boosting.
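
The motion feature step above (Gaussian filter in space, 1D Gabor filter in time) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the use of an even/odd Gabor pair, the squared-sum response, and the parameter values `sigma`, `tau`, and `omega` are assumptions borrowed from common spatio-temporal interest point detectors. The gradient-based descriptor and the PCA step are omitted.

```python
import math

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel with a radius of about 3 sigma."""
    radius = max(1, int(round(3 * sigma)))
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def gabor_pair(tau, omega):
    """Even and odd 1D Gabor kernels over time (assumed quadrature pair)."""
    radius = max(1, int(round(2 * tau)))
    ts = range(-radius, radius + 1)
    even = [-math.cos(2 * math.pi * t * omega) * math.exp(-(t * t) / (tau * tau))
            for t in ts]
    odd = [-math.sin(2 * math.pi * t * omega) * math.exp(-(t * t) / (tau * tau))
           for t in ts]
    return even, odd

def filter_1d(signal, kernel):
    """Same-length correlation with zero padding at the borders."""
    radius = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out

def smooth_frame(frame, kernel):
    """Separable spatial smoothing: filter rows first, then columns."""
    rows = [filter_1d(row, kernel) for row in frame]
    cols = [filter_1d(list(col), kernel) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

def motion_response(video, sigma=1.0, tau=2.0, omega=0.25):
    """video[t][y][x] holds gray levels; returns a response of the same shape.

    sigma, tau, and omega (cycles per frame) are illustrative values only.
    """
    g = gaussian_kernel(sigma)
    even, odd = gabor_pair(tau, omega)
    smoothed = [smooth_frame(frame, g) for frame in video]
    n_t, n_y, n_x = len(video), len(video[0]), len(video[0][0])
    response = [[[0.0] * n_x for _ in range(n_y)] for _ in range(n_t)]
    for y in range(n_y):
        for x in range(n_x):
            series = [smoothed[t][y][x] for t in range(n_t)]
            ev = filter_1d(series, even)
            od = filter_1d(series, odd)
            for t in range(n_t):
                response[t][y][x] = ev[t] ** 2 + od[t] ** 2
    return response
```

In such a scheme, local maxima of the response would be taken as candidate motion features; pixels whose intensity changes periodically over time respond more strongly than pixels that stay nearly constant.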

Motion feature pruning

  • Remove abrupt camera motion.
    • Compute a histogram of the number of motion features in each frame, and remove the frames that contain far more features than the average.
  • Estimate good features using neighborhood information
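
The frame-removal rule above can be sketched in a few lines. The page does not say how much above the average a frame must be before it is dropped, so the `ratio` threshold here is a hypothetical parameter chosen only for illustration.

```python
def prune_abrupt_motion_frames(feature_counts, ratio=2.0):
    """Return the indices of frames to keep.

    feature_counts[i] is the number of motion features detected in frame i.
    A frame is dropped when its count exceeds `ratio` times the mean count;
    `ratio` is an assumed threshold, not a value given in the source.
    """
    if not feature_counts:
        return []
    mean_count = sum(feature_counts) / len(feature_counts)
    return [i for i, count in enumerate(feature_counts)
            if count <= ratio * mean_count]

# Frame 3 has a burst of features, as an abrupt camera motion might cause.
kept = prune_abrupt_motion_frames([5, 6, 4, 40, 5])
print(kept)
```

The intuition is that abrupt camera motion changes almost every pixel at once, so the affected frames produce a burst of spurious motion features that stands out in the per-frame histogram.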


Static feature pruning

  • Why static features? Static features complement motion features.
    • The local shape context of key poses is useful for recognition, as in the “boxing” and “clapping” actions in the KTH dataset.
    • In unconstrained videos, the appearance of correlated objects is useful for recognition, such as the “horse” in “horseback riding” or the “racket” in “tennis playing”.
    • Unlike motion features, which may be unreliable due to camera motion, the detection and extraction of static features are not affected by camera motion.
  • Why not global context? As the following “soccer juggling” example demonstrates, in unconstrained videos the background (global context) is very diverse even among videos of the same action.

Therefore, we want to select informative features from the foreground, as highlighted in the following “walking dog” sequences.

  • Estimate regions of interest using motion statistics.
  • Use PageRank to obtain consistent features from the foreground.
      • Subject and camera moving left
      • Subject and camera moving right
      • Camera following subject from behind
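
One way to picture the PageRank step is as a ranking over a feature-matching graph. The sketch below is an assumption-laden illustration: the graph construction (linking features from different frames with weights based on descriptor distance), the `sigma` scale, and the damping factor are hypothetical choices, since this page does not describe how the graph is built.

```python
import math

def similarity_graph(descriptors, frame_ids, sigma=1.0):
    """Weighted graph linking features that come from different frames."""
    n = len(descriptors)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if frame_ids[i] != frame_ids[j]:
                d2 = sum((a - b) ** 2
                         for a, b in zip(descriptors[i], descriptors[j]))
                weights[i][j] = math.exp(-d2 / (2.0 * sigma * sigma))
    return weights

def pagerank(weights, damping=0.85, iterations=100, tol=1e-9):
    """Power iteration on the row-normalized weight matrix."""
    n = len(weights)
    out_sum = [sum(row) for row in weights]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = [(1.0 - damping) / n] * n
        for i in range(n):
            if out_sum[i] == 0.0:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[i] / n
                for j in range(n):
                    new_rank[j] += share
            else:
                for j in range(n):
                    if weights[i][j]:
                        new_rank[j] += damping * rank[i] * weights[i][j] / out_sum[i]
        delta = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:
            break
    return rank

# Three similar features seen in frames 0, 1, 2 and one outlier in frame 1.
descriptors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
frame_ids = [0, 1, 2, 1]
ranks = pagerank(similarity_graph(descriptors, frame_ids))
```

Under these assumptions, features that match consistently across frames reinforce each other and receive high rank, while isolated features, which are more likely to come from changing background, receive low rank and can be pruned.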


Learning semantic visual vocabularies

    • Information-theoretic divisive algorithm
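
As a rough illustration of information-theoretic divisive clustering, the sketch below groups initial visual words by the similarity of their class distributions, using KL divergence to a prior-weighted cluster centroid. It is a simplified sketch under stated assumptions: the initialization rule, the fixed number of clusters `k`, and the input layout (`priors`, `cond`) are hypothetical and may differ from the procedure used in the paper.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; q is clamped away from zero."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def divisive_clustering(priors, cond, k, iterations=20):
    """Group visual words into k clusters.

    priors[w] is the prior probability of word w.
    cond[w][c] is p(class c | word w).
    Returns a list with the cluster index assigned to each word.
    """
    n, n_classes = len(cond), len(cond[0])
    # Assumed initialization: seed clusters by each word's most likely class.
    assign = [max(range(n_classes), key=lambda c: cond[w][c]) % k
              for w in range(n)]
    for _ in range(iterations):
        centroids = []
        for j in range(k):
            members = [w for w in range(n) if assign[w] == j]
            mass = sum(priors[w] for w in members)
            if mass == 0.0:
                # Empty cluster: fall back to a uniform class distribution.
                centroids.append([1.0 / n_classes] * n_classes)
            else:
                centroids.append([
                    sum(priors[w] * cond[w][c] for w in members) / mass
                    for c in range(n_classes)
                ])
        new_assign = [min(range(k),
                          key=lambda j: kl_divergence(cond[w], centroids[j]))
                      for w in range(n)]
        if new_assign == assign:
            break
        assign = new_assign
    return assign

# Four words, two action classes: words 0-1 favor class 0, words 2-3 class 1.
cond = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
clusters = divisive_clustering([0.25] * 4, cond, k=2)
```

Words that vote for the same actions end up in the same cluster, which is what makes the merged vocabulary compact while keeping it discriminative.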


Experiments on KTH dataset

The KTH dataset is a widely used action dataset containing 6 actions and almost 600 videos performed by 25 people.

We verify the effect of combining motion and static features on the KTH dataset.

Experiments on YouTube action dataset

1. The effect of motion feature pruning

2. The effect of static feature pruning

3. The effect of the hybrid of motion and static features

4. Some recognition results with localization

“M”, “S”, and “H” in the images indicate that the judgments are made on the “motion”, “static”, and “hybrid of motion and static” features, respectively.

Related Publication

Jingen Liu, Jiebo Luo and Mubarak Shah, Recognizing Realistic Actions from Videos “in the Wild”, IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2009.