
VideoCapsuleNet: A Simplified Network for Action Detection



Duarte, K., Rawat, Y., & Shah, M. (2018). VideoCapsuleNet: A Simplified Network for Action Detection. In Advances in Neural Information Processing Systems (pp. 7610-7619).


In this paper, we present a novel capsule network for action detection in video. Given a video with a human performing an action, we attempt to both classify and spatiotemporally localize the action. In contrast to current action detection approaches, which use complex pipelines involving multiple tasks such as tube proposals, optical flow, and tube classification, we propose a simple end-to-end 3D capsule network which jointly performs pixel-wise action segmentation and action classification. The routing-by-agreement in the network inherently models the action representations, and various action characteristics are captured by the predicted capsules. This allows us to use capsules for action localization: the class-specific capsules predicted by the network are used to determine a pixel-wise localization of actions.

The proposed network is a generalization of the capsule network from 2D to 3D. It takes a sequence of video frames as input and outputs a segmentation map for the action as well as a predicted action class. The 3D generalization drastically increases the number of capsules in the network, making capsule routing computationally expensive. We introduce capsule-pooling in the convolutional capsule layer to address this issue and make the voting algorithm tractable.
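To give a feel for why routing cost grows in 3D, the sketch below counts the votes cast toward a single output position of a convolutional capsule layer. It assumes that capsule-pooling reduces each lower-level capsule type to one capsule per receptive field (here by averaging) before voting; this page does not spell out the mechanism, so treat that as an assumption. The kernel size and capsule-type counts are hypothetical illustration values, and the function names are ours, not part of any released code.

```python
from math import prod


def votes_per_output_position(kernel, in_types, out_types, pooled):
    """Count votes cast toward one output position of a conv capsule layer.

    kernel:    (kt, kh, kw) receptive-field size (hypothetical values below)
    in_types:  number of capsule types in the lower layer
    out_types: number of capsule types in the higher layer
    pooled:    if True, ASSUME each input type is reduced to a single capsule
               per receptive field before voting
    """
    capsules_per_type = 1 if pooled else prod(kernel)
    return capsules_per_type * in_types * out_types


def mean_pool(capsules):
    """Average a list of equal-length capsule vectors into one vector.

    This is one plausible pooling rule (an assumption), not a confirmed one.
    """
    n = len(capsules)
    return [sum(values) / n for values in zip(*capsules)]


# Hypothetical layer: 3x3x3 receptive field, 32 input and 32 output types.
full = votes_per_output_position((3, 3, 3), 32, 32, pooled=False)
pooled = votes_per_output_position((3, 3, 3), 32, 32, pooled=True)
print(full, pooled)  # 27648 1024

# Pooling two 2-dimensional capsules of the same type into one.
print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```

Under these assumptions, the number of votes per output position no longer depends on the receptive-field volume, which is the term that grows when moving from 2D to 3D kernels.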

We evaluate our method on three datasets: UCF-Sports, J-HMDB, and UCF-101 (24 classes). Our network outperforms state-of-the-art methods on all three datasets in v-mAP scores, especially at higher IoU thresholds. Since our network generates segmentations for multiple frames simultaneously, it produces localizations which are more temporally consistent than recent action localization methods. We also perform several ablation studies to understand the various components of our method.
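As a rough illustration of what an IoU threshold means for a video-level detection, the sketch below computes a spatio-temporal IoU between a predicted and a ground-truth localization, each represented as a set of (frame, row, col) foreground voxels. The exact evaluation protocol is not given on this page, so the voxel-set representation, the `>=` comparison, and the function names are assumptions made for illustration only.

```python
def voxel_iou(pred, truth):
    """IoU between two collections of (frame, row, col) foreground voxels."""
    pred, truth = set(pred), set(truth)
    union = pred | truth
    if not union:
        # Both localizations are empty; define the IoU as 0 by convention.
        return 0.0
    return len(pred & truth) / len(union)


def is_correct_detection(pred, truth, pred_class, true_class, threshold=0.5):
    """A detection counts as correct if the class matches and the IoU
    reaches the threshold (the >= convention is an assumption)."""
    return pred_class == true_class and voxel_iou(pred, truth) >= threshold


# Toy two-frame clip: the prediction agrees with the ground truth on two
# voxels and disagrees on one voxel in the second frame.
pred = {(0, 0, 0), (0, 0, 1), (1, 0, 0)}
truth = {(0, 0, 0), (0, 0, 1), (1, 0, 1)}
print(voxel_iou(pred, truth))  # 0.5
print(is_correct_detection(pred, truth, "diving", "diving"))  # True
```

Because the overlap is measured over all frames at once, a localization that drifts in some frames loses IoU, which is why higher thresholds favor temporally consistent predictions.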


  • Action Localization Accuracies
  • The results reported in the row VideoCapsuleNet* use the ground-truth classification labels when generating the localization maps, so they should not be directly compared with the other state-of-the-art results.

  • Ablation Experiments
  • All ablation experiments are run on UCF-101. The f-mAP and v-mAP use an IoU threshold of α = 0.5. (Lc: classification loss, Ls: localization loss, Lr: reconstruction loss, SC: skip connections from convolutional layers, NCA: no coordinate addition, 4Conv: 4 convolution layers, 8Conv: 8 convolution layers, and Full: the full network.) Unless otherwise specified, the network uses only the classification and localization losses.

  • Qualitative Results