
Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos

Figure 1: Illustration of the idea: We divide the video into equal-length clips and generate action tube proposals through the Tube Proposal Network. After linking proposals from different clips together, Tube-of-Interest pooling is applied to the linked proposals. Finally, the labels are predicted through the classification model and bounding boxes are generated through the regression model.

Introduction

Deep learning has been demonstrated to achieve excellent results for image classification and object detection. However, the impact of deep learning on video analysis has been limited due to the complexity of video data and the lack of annotations. Previous convolutional neural network (CNN) based video action detection approaches usually consist of two major steps: frame-level action proposal generation and association of proposals across frames. Moreover, most of these methods employ a two-stream CNN framework to handle spatial and temporal features separately. In this paper, we propose an end-to-end deep network called Tube Convolutional Neural Network (T-CNN) for action detection in videos. The proposed architecture is a unified deep network that is able to recognize and localize actions based on 3D convolution features. A video is first divided into equal-length clips, and for each clip a set of tube proposals is generated based on 3D Convolutional Network (ConvNet) features. Finally, the tube proposals of different clips are linked together using network flow, and spatio-temporal action detection is performed using these linked video proposals. Extensive experiments on several video datasets demonstrate the superior performance of T-CNN for classifying and localizing actions in both trimmed and untrimmed videos compared to state-of-the-art methods.
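To make the first step of the pipeline concrete, below is a minimal sketch (not the authors' code) of splitting a video into equal-length clips. The clip length of 8 frames is an illustrative assumption; in practice it is a hyperparameter.

```python
# A minimal sketch of splitting a video into equal-length clips.
# The clip length of 8 frames is an assumption for illustration.
import numpy as np

def split_into_clips(video: np.ndarray, clip_len: int = 8) -> list:
    """Split a video of shape (T, H, W, C) into equal-length clips.

    Trailing frames that do not fill a complete clip are dropped here
    for simplicity; padding is an alternative choice.
    """
    num_clips = video.shape[0] // clip_len
    return [video[i * clip_len:(i + 1) * clip_len] for i in range(num_clips)]

# Example: a 100-frame 112x112 RGB video yields 12 clips of 8 frames.
video = np.zeros((100, 112, 112, 3), dtype=np.float32)
clips = split_into_clips(video)
assert len(clips) == 12 and clips[0].shape == (8, 112, 112, 3)
```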

Motivation

Faster R-CNN introduced a region proposal network and has been extensively used to produce excellent results for object detection in images. A natural generalization of R-CNN from 2D images to 3D spatio-temporal volumes is to study its effectiveness for the problem of action detection in videos. A straightforward spatio-temporal generalization of the R-CNN approach would be to treat action detection in videos as a set of 2D image detections using Faster R-CNN. Unfortunately, this approach does not take temporal information into account and is not sufficiently expressive to distinguish between actions.

Inspired by the pioneering work of Faster R-CNN, we propose the Tube Convolutional Neural Network (T-CNN) for action detection. To better capture the spatio-temporal information of video, we exploit 3D ConvNets for action detection, since they are able to capture motion characteristics in videos and show promising results on video action recognition. We propose a novel framework that leverages the descriptive power of 3D ConvNets. In our approach, an input video is first divided into equal-length clips. The clips are then fed into the Tube Proposal Network (TPN) to obtain a set of tube proposals. Next, tube proposals from each video clip are linked according to their actionness scores and the overlap between adjacent proposals, forming a complete tube proposal for spatio-temporal action localization in the video (a sketch of this linking step is given below). Finally, Tube-of-Interest (ToI) pooling is applied to the linked action tube proposal to generate a fixed-length feature vector for action label prediction.
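The sketch below illustrates the linking idea under stated assumptions: tube proposals from adjacent clips are combined by a score that sums per-proposal actionness with the spatial overlap (IoU) between the last box of one proposal and the first box of the next. The field names and the exhaustive search are illustrative assumptions, not the paper's exact algorithm.

```python
# A hedged sketch of linking tube proposals across clips by actionness
# and overlap. Proposal fields ("actionness", "boxes") are assumed names.
import itertools

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def link_score(tubes):
    """Score one candidate linking: sum of per-tube actionness plus
    IoU between temporally adjacent tube proposals."""
    score = sum(t["actionness"] for t in tubes)
    for prev, cur in zip(tubes, tubes[1:]):
        score += iou(prev["boxes"][-1], cur["boxes"][0])
    return score

def best_link(proposals_per_clip):
    """Exhaustively pick one proposal per clip maximizing the link
    score. Fine for a few proposals; dynamic programming scales better."""
    return max(itertools.product(*proposals_per_clip), key=link_score)
```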

Framework

The proposed approach for action localization begins with dividing the video into equal-length clips. For each clip, several clip-level action proposals are generated through the Tube Proposal Network (see Figure 2). Then, the Tube-of-Interest pooling layer is used to normalize the shape of the 3D ConvNet feature map (see Figure 3). Finally, the detection prediction is generated through a classification model and a regression model.
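The following is a minimal NumPy sketch of ToI pooling as described above: each frame's (possibly differently sized) box region is max-pooled onto a fixed spatial grid, and the resulting frames are then max-pooled over time to a fixed temporal length. The output sizes and integer box coordinates are illustrative assumptions.

```python
# A minimal sketch of Tube-of-Interest (ToI) pooling: variable-size
# per-frame regions -> fixed-size spatio-temporal feature.
import numpy as np

def pool_2d(region: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Max-pool a (H, W, C) region onto an (out_h, out_w, C) grid."""
    h, w, c = region.shape
    out = np.empty((out_h, out_w, c), dtype=region.dtype)
    ys = np.linspace(0, h, out_h + 1).astype(int)
    xs = np.linspace(0, w, out_w + 1).astype(int)
    for i in range(out_h):
        for j in range(out_w):
            # Guard against empty cells when the region is very small.
            cell = region[ys[i]:max(ys[i + 1], ys[i] + 1),
                          xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[i, j] = cell.max(axis=(0, 1))
    return out

def toi_pool(feature, boxes, out_d=1, out_h=4, out_w=4):
    """feature: (D, H, W, C) clip feature map; boxes: one integer
    (x1, y1, x2, y2) tube box per frame. Returns (out_d, out_h, out_w, C)."""
    pooled = np.stack([
        pool_2d(feature[t, y1:y2, x1:x2], out_h, out_w)
        for t, (x1, y1, x2, y2) in enumerate(boxes)
    ])
    # Temporal max pooling to a fixed number of output frames.
    ts = np.linspace(0, pooled.shape[0], out_d + 1).astype(int)
    return np.stack([pooled[ts[i]:max(ts[i + 1], ts[i] + 1)].max(axis=0)
                     for i in range(out_d)])
```

Because the output shape is fixed regardless of the tube's spatial and temporal extent, the pooled feature can be fed directly to fully connected layers for classification and regression.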


Figure 2: Tube Proposal Network

Figure 3: Tube of interest pooling.


Results

We evaluate the proposed approach on four challenging action localization datasets: UCF-Sports, JHMDB, UCF-101 (THUMOS'13), and THUMOS'14. Qualitative and quantitative results are shown below:


Figure 4: Action detection results by T-CNN on UCF-Sports, JHMDB, UCF-101 and THUMOS’14. Red boxes indicate the detections in the corresponding frames, and green boxes denote ground truth. The predicted label is overlaid.


Figure 5: The ROC and AUC curves for UCF-Sports Dataset are shown in (a) and (b), respectively. The results are shown for Jain et al. (green), Tian et al. (purple), Soomro et al. (blue), Wang et al. (yellow), Gkioxari et al. (cyan) and Proposed Method (red). (c) shows the mean ROC curves for four actions of THUMOS’14. The results are shown for Sultani et al. (green), proposed method (red) and proposed method without negative mining (blue).

YouTube Presentation

[YouTube]

Code

[https://github.com/ruihou/mtcnn]


Related Publication

Rui Hou, Chen Chen and Mubarak Shah, "Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos," International Conference on Computer Vision (ICCV), 2017. bibtex