Recognizing 50 Human Action Categories of Web Videos
Most state-of-the-art methods developed for action recognition are tested on datasets like KTH, IXMAS, and Hollywood (HOHA), which are largely limited to a few action categories and are typically recorded in constrained settings. Table 1 lists the action datasets.
In this work we study the effect of large datasets on performance, and propose a framework that can address issues with real life action recognition datasets (UCF50). The main contributions of this work are as follows:
(1) We provide insight into the challenges of large and complex datasets like UCF50.
(2) We propose using moving and stationary pixel information obtained from optical flow to build our scene context descriptor.
(3) We show that as the number of actions to be categorized increases, the scene context plays a more important role in action classification.
(4) We propose an early fusion scheme for the descriptors obtained from moving and stationary pixels to capture the scene context, and finally perform a probabilistic fusion of the scene context descriptor and the motion descriptor.
Table 1: Action Datasets
|Datasets|Number of Actions|Camera motion|Background|
Analysis on large scale dataset
UCF50 is the largest action recognition dataset publicly available, after excluding the non-articulated actions from the HMDB51 dataset. UCF50 has 50 action categories with a total of 6676 videos, with a minimum of 100 videos for each action class. Samples of video screenshots from UCF50 are shown in Figure 1. This dataset is an extension of UCF11. In this section we run a baseline experiment on UCF50 by extracting the motion descriptor and using a bag of video words approach. We use two classification approaches:
(1) BoVW-SVM: Support vector machines to do classification.
(2) BoVW-NN: Nearest neighbor approach using SR-Tree to do classification.
Which motion descriptor to use?
Due to the large scale of the dataset, we prefer a motion descriptor that is fast to compute and reasonably accurate. To decide on the motion descriptor, we performed experiments on a smaller dataset, KTH, with different motion descriptors extracted from the interest points detected using Dollar’s detector. At every interest point location (x,y,t), we extract the following motion descriptors:
Gradient: At any given interest point location in a video (x,y,t), a 3D cuboid is extracted. The brightness gradient is computed in this 3D cuboid, which gives rise to 3 channels (Gx,Gy,Gt) which are flattened into a vector, and later PCA is applied to reduce the dimension.
Optical Flow: Similarly, Lucas-Kanade optical flow is computed between consecutive frames in the 3D cuboid at (x,y,t) location to obtain 2 channels (Vx,Vy). The two channels are flattened and PCA is utilized to reduce the dimension.
3D-SIFT: 3-Dimensional SIFT proposed by Scovanner et al., is an extension of SIFT descriptor to spatio-temporal data. We extract 3D-SIFT around the spatio-temporal region of a given interest point (x,y,t).
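As a rough illustration of the gradient descriptor described above, the sketch below cuts a 3D cuboid around an interest point, computes the brightness gradient along x, y and t, and flattens the three channels into one vector. The function name, the cuboid half-sizes and the (frames, height, width) array layout are assumptions made for this example, not values from the paper, and the PCA step is left out.

```python
import numpy as np

def gradient_descriptor(video, x, y, t, half_xy=2, half_t=1):
    """Flatten (Gx, Gy, Gt) of a cuboid centred at (x, y, t).

    `video` is assumed to be an array of shape (frames, height, width).
    The cuboid sizes are illustrative, not the values used in the paper.
    """
    cuboid = video[t - half_t:t + half_t + 1,
                   y - half_xy:y + half_xy + 1,
                   x - half_xy:x + half_xy + 1]
    # np.gradient returns one array per axis, in axis order (t, y, x).
    gt, gy, gx = np.gradient(cuboid.astype(float))
    return np.concatenate([gx.ravel(), gy.ravel(), gt.ravel()])

# Toy video: brightness grows by 1 per pixel along x and by 10 per frame along t.
frames, height, width = 5, 8, 8
tt, yy, xx = np.meshgrid(np.arange(frames), np.arange(height),
                         np.arange(width), indexing="ij")
video = xx + 10.0 * tt
desc = gradient_descriptor(video, x=4, y=4, t=2)
```

The optical flow descriptor follows the same pattern with two channels (Vx, Vy) in place of the three gradient channels.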
Table 2: Performance of different motion descriptors on the KTH Dataset.
|Method|Codebook 100|Codebook 200|Codebook 500|
All of the above descriptors are extracted from the same locations in the video, and the experimental setup is identical. We use the bag-of-features (BOF) paradigm and an SVM to evaluate the performance of each descriptor. From Table 2, one can notice that 3D-SIFT outperforms the other two descriptors for a codebook of size 500, whereas the gradient and optical flow descriptors perform the same. Computationally, the gradient descriptor is the fastest and 3D-SIFT is the slowest. Because of the time factor, we use the gradient descriptor as our motion descriptor for all further experiments.
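The bag-of-features step used in this evaluation can be sketched as follows: each descriptor is assigned to its nearest codeword and the assignments are counted into a normalized histogram. This is a minimal sketch, assuming the codebook is already given (for example from clustering training descriptors) and using Euclidean distance; the function name and the toy values are hypothetical.

```python
import numpy as np

def bof_histogram(descriptors, codebook):
    """Quantize descriptors to the nearest codeword; return an L1-normalized histogram."""
    # Squared Euclidean distance between every descriptor and every codeword.
    diffs = descriptors[:, None, :] - codebook[None, :, :]
    dists = (diffs ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)
    hist = np.bincount(nearest, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

# Toy example: a 3-word codebook and four 2-D descriptors.
codebook = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
descriptors = np.array([[0.1, 0.2], [9.0, 9.5], [0.5, -0.3], [1.0, 9.0]])
hist = bof_histogram(descriptors, codebook)
```

The resulting histogram, one per video, is what the SVM would be trained on.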
We also tested our framework on the recently proposed motion descriptor MBH by Wang et al. The MBH descriptor encodes the motion boundaries along the trajectories obtained by tracking densely sampled points using optical flow fields. Using the code provided by the authors, MBH descriptors are extracted for the UCF11 and UCF50 datasets and used in place of the above-mentioned motion descriptor for comparison of results with Wang et al.
Effect of increasing the action classes
In this experiment, we show that increasing the number of action classes affects the recognition accuracy of a particular action class. Since the UCF11 dataset is a subset of UCF50, we start with the 11 actions from the UCF11 dataset and randomly add new actions from the remaining 39 actions in the UCF50 dataset. Each time a new action is added, a complete leave-one-out cross-validation is performed on the incremented dataset, using a bag of video words approach on the motion descriptor with a 500 dimension codebook and an SVM for classification.
Figure 2: The effect of increasing the number of actions on the UCF YouTube Action dataset’s 11 actions by adding new actions from UCF50 using only the motion descriptor. Standard Deviation (SD) and Mean are also shown next to the action name.
Figure 2 shows the change in performance of BoVW-SVM on the initial 11 actions as we add the 39 new actions, one at a time. Increasing the number of actions in the dataset affects some actions more than others. Overall, the performance on the 11 actions from UCF11 dropped by ~13.18 percentage points, from 55.45% to 42.27%, after adding the 39 new actions from UCF50. This shows that the motion feature alone is not discriminative enough to handle more action categories.
To address the above concerns, we propose a new scene context descriptor which is more discriminative and performs well in huge action datasets with a high number of action categories. From the experiments on UCF50, we show that the confusion between actions is drastically reduced and the performance of the individual categories increased by fusing the proposed scene context descriptor.
In order to overcome the challenges of unconstrained web videos, and handle a large dataset with lots of confusing actions, we propose using the scene context information in which the action is happening. For example, skiing and skate boarding, horse riding and biking, and indoor rock climbing and rope climbing have similar motion patterns with high confusion, but these actions take place in different scenes and contexts. Skiing happens on snow, which is very different from where skate boarding is done. Similarly, horse riding and biking happen in very different locations. Furthermore, scene context also plays an important role in increasing the performance on individual actions. Actions are generally associated with places, e.g., diving and breast stroke occur in water, and golf and javelin throw are outdoor sports. In order to increase the classification rate of a single action, or to reduce the confusion between similar actions, the scene information is crucial, along with the motion information. We refer to these places or locations as scene context in our paper.
As the number of categories increases, the scene context becomes important, as it helps reduce the confusion with other actions having similar kinds of motion. In our work, we define scene context as the place where a particular motion is happening (stationary pixels), and also include the object that is creating this motion (moving pixels).
It has been shown that humans tend to focus on objects that are salient; this capability helps improve object detection, tracking, and recognition. In general, humans tend to focus on the things that are moving in their field of view. We try to mimic this by forming groups of moving pixels, which can be roughly treated as salient regions, and groups of stationary pixels, which approximate the non-salient regions in a given video.
Moving and Stationary Pixels: Optical flow gives a rough estimate of velocity at each pixel given two consecutive frames. We compute the optical flow (u,v) at each pixel using the Lucas-Kanade method and apply a threshold on the magnitude of the optical flow to decide whether the pixel is moving or stationary. Figure 3 shows the moving and stationary pixels in several sample key frames. We extract dense CSIFT at pixels from both groups, and use the BOF paradigm to get a histogram descriptor for each group separately. We performed experiments using the CSIFT descriptor, extracted on a dense sampling of moving pixels (MPv) and stationary pixels (SPv). For a 200 dimension codebook, the moving pixels CSIFT histogram alone resulted in 56.63% performance, while the stationary pixels CSIFT histogram achieved 56.47% performance on UCF11. If we ignore the moving/stationary distinction and treat the whole image as one group, we obtain a performance of 55.06%. Our experiments show that concatenating the histogram descriptors of moving and stationary pixels using CSIFT gives the best performance of 60.06%. From our results, we conclude that concatenating MPv and SPv into one descriptor, SCv, is an effective way to encode the scene context information. For example, in a diving video, the moving pixels are mostly from the person diving, and the stationary pixels are mostly from the water (pool); since diving occurs only in water, this distinctive scene context helps detect the diving action.
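The pixel split and the concatenation step can be sketched as below. This is only an illustration under stated assumptions: the flow fields `u` and `v` are taken as given (the Lucas-Kanade computation is not shown), the threshold value of 1.0 is arbitrary, and the two input histograms stand in for the CSIFT bag-of-features histograms of each pixel group. All function names are hypothetical.

```python
import numpy as np

def split_moving_stationary(u, v, threshold=1.0):
    """Return boolean masks (moving, stationary) from per-pixel flow (u, v).

    The threshold is an assumed value; the paper does not state the one it used.
    """
    magnitude = np.hypot(u, v)
    moving = magnitude > threshold
    return moving, ~moving

def scene_context_descriptor(hist_moving, hist_stationary):
    """Early fusion: concatenate the two histograms into one vector (SCv)."""
    return np.concatenate([hist_moving, hist_stationary])

# Toy 2x2 flow field: the right column moves, the left column is nearly still.
u = np.array([[0.0, 3.0], [0.2, 0.0]])
v = np.array([[0.0, 4.0], [0.1, 2.0]])
moving, stationary = split_moving_stationary(u, v, threshold=1.0)

# Placeholder histograms standing in for the MPv and SPv descriptors.
sc = scene_context_descriptor(np.array([0.7, 0.3]), np.array([0.1, 0.9]))
```

Keeping the two histograms side by side, rather than pooling all pixels into one histogram, is what lets the classifier distinguish what is moving from where it is moving.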
Figure 3: Moving and stationary pixels obtained using optical flow.
Key frames: Instead of computing the moving and stationary pixels and their corresponding descriptors on all the frames in the video, we uniformly sample k frames from a given video, as shown in Figure 4. This reduces the time taken to compute the descriptors, as the majority of the frames in the video are redundant. We did not implement any kind of key frame detection, which could be done by computing the color histogram of the frames in the video and treating a certain level of change in the color histogram as a key frame. We tested on the UCF11 dataset with different numbers of key frames sampled evenly along the video. Figure 5 shows that the performance on the dataset is almost stable after 3 key frames. In our final experiments on the datasets, we use 3 key frames equally sampled along the video to speed up the experiments. In this experiment, a codebook of dimension 500 is used.
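Uniform key frame sampling can be sketched in a few lines. The exact placement of the samples is an assumption here: this version takes the centre frame of each of k equal-length segments, which is one reasonable reading of "sampled evenly along the video". The function name is hypothetical.

```python
def uniform_key_frames(num_frames, k):
    """Return k frame indices spaced evenly along a video of num_frames frames."""
    if k <= 0 or num_frames <= 0:
        return []
    k = min(k, num_frames)
    # Take the centre of each of k equal-length segments.
    return [int((i + 0.5) * num_frames / k) for i in range(k)]

# A 90-frame video with 3 key frames, as in the final experiments.
indices = uniform_key_frames(num_frames=90, k=3)
```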
Figure 4: Key frame selection from a given video.
Figure 5: Performance of scene context descriptor on different number of key frames.
In this experiment the proposed scene context descriptors are extracted, and a bag of video words paradigm followed by SVM classification is employed to study the proposed descriptor. As in the previous experiment, one new action from UCF50 is added to UCF11 at a time, and leave-one-out cross-validation is performed at each increment. The average performance on the initial 11 actions of UCF11 is 60.09%; after adding 39 new actions from UCF50, the performance on the 11 actions dropped to 52.36%. That is a ~7.72% decrease in performance, compared to a ~13.18% decrease for the motion descriptor. The average standard deviation of the performance of the initial 11 actions over the entire experimental setup is ~2.25%, compared to ~4.18% for the motion descriptor. Figure 6 shows that the scene context descriptor is more stable and discriminative than the motion descriptor as the number of action categories increases.
Figure 6: Effect of increasing the number of actions on the UCF YouTube Action dataset’s 11 actions by adding new actions from UCF50, using only the scene context descriptor. Standard Deviation (SD) and Mean are shown next to the action name. The performance on the initial 11 actions decreases as new actions are added, but with significantly less standard deviation compared to using the motion descriptor, as shown in Figure 2.
A wide variety of visual features can be extracted from a single video, such as motion features (e.g., 3DSIFT, spatio-temporal features), scene features (e.g., GIST), or color features (e.g., color histogram). In order to do the classification using all these different features, the information has to be fused eventually. According to Snoek et al., fusion schemes can be classified into early fusion and late fusion based on when the information is combined.
Early Fusion: In this scheme, the information is combined before training a classifier. This can be done by concatenating the different types of descriptors and then training a classifier.
Late Fusion: In this scheme, classifiers are trained for each type of descriptor, then the classification results are fused. Classifiers, such as support vector machines (SVM), can provide a probability estimate for all the classes rather than a hard classification decision. The concept of fusing this probability estimate is called Probabilistic Fusion. For probabilistic fusion, the different descriptors are considered to be conditionally independent. This is a fair assumption for the visual features that we use in this paper, i.e., motion descriptor using gradients and Color SIFT. In probabilistic fusion the individual probabilities are multiplied and normalized. In late fusion, the individual strengths of the descriptors are retained.
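The multiply-and-normalize rule of probabilistic fusion can be sketched as follows. The class names are taken from UCF50, but the probability values are made-up numbers chosen for illustration, not results from the paper, and the function name is hypothetical.

```python
def probabilistic_fusion(*estimates):
    """Multiply per-class probabilities from several classifiers and renormalize."""
    fused = [1.0] * len(estimates[0])
    for probs in estimates:
        fused = [f * p for f, p in zip(fused, probs)]
    total = sum(fused)
    return [f / total for f in fused]

classes = ["Horse Race", "Horse Riding", "Biking"]
motion = [0.5, 0.4, 0.1]   # hypothetical SVM estimates from the motion descriptor
scene = [0.1, 0.8, 0.1]    # hypothetical SVM estimates from the scene context descriptor
fused = probabilistic_fusion(motion, scene)
predicted = classes[fused.index(max(fused))]
```

In this toy case the motion classifier alone would pick "Horse Race", but the product favors "Horse Riding" because the scene classifier is confident about it. This is the sense in which a confident descriptor can correct a mistaken one.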
Figure 7: Performance on different ways to fuse scene context and motion descriptors on UCF50 dataset.
To perform action recognition, we extract the following information from the video: 1) scene context information in key frames and 2) motion features in the entire video, as shown in Figure 8. The probability estimates of the individual SVMs are fused to get the final classification.
Figure 8: Proposed approach.
Confusion Tables for UCF50 dataset
Using the proposed approach, i.e., probabilistic fusion of the motion descriptor and the proposed scene context descriptor, the performance is 68.20%.
Bag of Words approach on motion descriptor with a codebook of size 1000 gave a classification performance of 53.06%.
Scene context descriptor
Bag of Words approach on the proposed Scene Context descriptor with a codebook of size 1000 gave a classification performance of 47.56%.
Confusion Tables for UCF50 dataset
These confusion tables are related to the experimental results on the UCF11 dataset in the original paper.
In this section we show more results on the UCF50 dataset by showing the SVM probability estimates for the motion descriptor and the proposed scene context descriptor, as well as the confidence after the probabilistic fusion of both descriptors. We also test our approach on more videos downloaded from YouTube.
Analysis of the proposed approach
In the paper we claim that the motion descriptor and the proposed scene context descriptor are complementary to each other, and that the results are better when they are used in the proposed framework, i.e., the probabilistic fusion. To demonstrate this claim, we provide the SVM confidence results on two test videos taken from the UCF50 dataset.
v_BaseballPitch_g25_c01 (both descriptors perform well)
This video has been correctly classified as ‘Baseball Pitch’ by both the motion descriptor and the scene context descriptor. When the two probability estimates are fused, the video gives high confidence for ‘Baseball Pitch’, as expected.
v_HorseRiding_g25_c04 (one descriptor fails, but the other descriptor helps)
Here the motion descriptor wrongly classifies the video as ‘Horse Race’, but the scene context descriptor correctly classifies it as ‘Horse Riding’. When the SVM probability estimates of both descriptors are fused, the video is classified as ‘Horse Riding’ with high confidence.
Testing the proposed method on Unusual/Abnormal/Imitating videos
In this section we test the proposed approach on videos that are abnormal, unusual or fake. To do so, we downloaded some unusual or abnormal videos from YouTube, such as skiing on sand, punching on the streets and playing an ice grand piano. We do not claim to perform abnormal event detection; we are only reporting some observations made during our experiments.
In the following examples the motion descriptor categorizes the video with high confidence for a particular action, but the scene context descriptor does not agree with the motion descriptor; such a disagreement can be taken as a sign of an unusual or abnormal video. Training is done using the videos from UCF50, and testing is done on new videos downloaded from YouTube. Here we present results on 4 videos.
Example 1: Skiing on Sand
In this video the motion descriptor gives a very high confidence for skiing, but the scene context descriptor is not confident about any action class and does not agree with the motion descriptor. This kind of situation indicates an unusual video. The scene context descriptor expects skiing to occur on snow.
Example 2: Skiing on Sand
This video is similar to the previous example. The scene context descriptor expects skiing to occur on snow and not on sand.
Example 3: Imitation playing on an ice piano
In this video the motion descriptor gives a very high confidence for ‘playing piano’, but the scene context descriptor is not confident about any action class and does not agree with the motion descriptor. This kind of situation indicates an unusual video. The scene context descriptor expects a grand piano in the scene, which is missing in this video. The person is imitating playing the piano on an ice piano.
Example 4: Punching in wilderness
In this video the motion descriptor gives high confidence for ‘punch’, but the scene context descriptor is not confident about any action class and does not agree with the motion descriptor. For a ‘punch’ action in the UCF50 dataset, the scene context descriptor expects the action to happen in a boxing ring, but in this video the two people are punching in the wilderness.
Kishore K. Reddy and Mubarak Shah, Recognizing 50 Human Action Categories of Web Videos, Machine Vision and Applications (MVAP), September 2012.