Multi-view Action Recognition using Cross-view Video Prediction

Publication

Shruti Vyas, Yogesh Rawat and Mubarak Shah. “Multi-view Action Recognition using Cross-view Video Prediction.” 16th European Conference on Computer Vision, 2020. [Supplementary]

Overview

In this work, we address the problem of action recognition in a multi-view environment. Most existing approaches utilize pose information for multi-view action recognition. We instead focus on the RGB modality and propose an unsupervised representation learning framework that encodes the scene dynamics of videos captured from multiple viewpoints by predicting videos from unseen views. The framework takes multiple short video clips from different viewpoints and times as input and learns a holistic internal representation, which is used to predict a video clip from an unseen viewpoint and time. The ability of the proposed network to render unseen video frames enables it to learn a meaningful and robust representation of the scene dynamics.

An overview of the proposed representation learning framework. An action is captured from different viewpoints (v1, v2, v3, …, vn) providing observations (o1, o2, o3, …, on). Video clips from two viewpoints (v1 and v2) at arbitrary times (t1 and t2) are used to learn a representation (r) for this action, employing the proposed representation learning network (RL-NET). The learned representation (r) is then used to render a video from an arbitrary query viewpoint (v3) and time (t3) using the proposed video rendering network (VR-NET). The representation thus learned is used for action recognition using a classification network (CL-NET).
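To make the training signal concrete, the sketch below (Python/PyTorch) illustrates the kind of unsupervised objective this setup implies: the clip rendered by VR-NET for the query viewpoint and time is compared against the ground-truth clip from that view. The function name, the generic rl_net/vr_net arguments, and the plain L1 reconstruction loss are illustrative assumptions; the exact loss terms used in the paper may differ.

    # Hedged sketch of the cross-view rendering objective. The encoder (RL-NET)
    # and renderer (VR-NET) are passed in as generic callables; the L1 loss is a
    # stand-in for whatever reconstruction terms the paper actually uses.
    import torch.nn.functional as F

    def rendering_loss(rl_net, vr_net, input_clips, input_conds, target_clip, query_cond):
        r = rl_net(input_clips, input_conds)   # holistic representation r from the observed clips
        predicted = vr_net(r, query_cond)      # clip rendered for the query viewpoint/time (v^q, t^q)
        return F.l1_loss(predicted, target_clip)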

Outline of the proposed unsupervised cross-view video rendering framework, shown below. A: A collection of observations (o) for a given action from different viewpoints. B: Training clips drawn from the set of observations, captured from different viewpoints and at different times. C: The representation learning network (RL-NET), which takes video clips from different viewpoints and times as input and learns a representation r. D: ENC-NET learns an individual video encoding e^k conditioned on its viewpoint v^k and time t^k. E: The blending network (BL-NET) combines the encodings learned from the different video clips into a unified representation r. F: The representation r is used to predict a video from a query viewpoint v^q and time t^q using VR-NET. G: The representation r can also be used for action classification using CL-NET. 3D-U refers to 3D convolutions combined with upsampling, and U refers to upsampling.
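The following minimal PyTorch sketch mirrors the data flow in panels C-G. The module names follow the caption (ENC-NET, BL-NET, VR-NET, CL-NET), but the layer sizes, the way viewpoint/time conditioning is injected, and the simple averaging used for blending are assumptions for illustration, not the published architecture.

    # Illustrative data-flow sketch only; channel counts, conditioning, and the
    # blending rule are assumptions, not the architecture from the paper.
    import torch
    import torch.nn as nn

    class EncNet(nn.Module):
        """ENC-NET: encodes one clip conditioned on its viewpoint v^k and time t^k."""
        def __init__(self, cond_dim=8, feat=64):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv3d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv3d(32, feat, 3, stride=2, padding=1), nn.ReLU(),
            )
            self.cond = nn.Linear(cond_dim, feat)

        def forward(self, clip, view_time):
            # clip: (B, 3, T, H, W); view_time: (B, cond_dim) encoding of (v^k, t^k)
            e = self.backbone(clip)
            return e + self.cond(view_time)[:, :, None, None, None]

    class BlNet(nn.Module):
        """BL-NET: blends per-clip encodings into a unified representation r."""
        def forward(self, encodings):
            return torch.stack(encodings, dim=0).mean(dim=0)  # simple average as a stand-in

    class VrNet(nn.Module):
        """VR-NET: renders a clip for a query viewpoint/time from r (3D conv + upsampling, "3D-U")."""
        def __init__(self, cond_dim=8, feat=64):
            super().__init__()
            self.cond = nn.Linear(cond_dim, feat)
            self.decoder = nn.Sequential(
                nn.Upsample(scale_factor=2), nn.Conv3d(feat, 32, 3, padding=1), nn.ReLU(),
                nn.Upsample(scale_factor=2), nn.Conv3d(32, 3, 3, padding=1),
            )

        def forward(self, r, query_view_time):
            return self.decoder(r + self.cond(query_view_time)[:, :, None, None, None])

    class ClNet(nn.Module):
        """CL-NET: classifies the action from the representation r."""
        def __init__(self, feat=64, num_classes=60):
            super().__init__()
            self.head = nn.Linear(feat, num_classes)

        def forward(self, r):
            return self.head(r.mean(dim=(2, 3, 4)))  # global average pool, then linear classifier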

Details of the different training strategies (M-1, M-2, and M-3) used to study the effect of video rendering on representation learning for action classification. All three variations use the same testing strategy.

Details of the training and testing configuration (Strategy M-1) used for the cross-view and cross-subject experiments on the NTU-RGB+D dataset.
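As a point of reference, the sketch below shows the standard NTU RGB+D partitioning that cross-view and cross-subject experiments typically rely on: cross-view trains on cameras 2 and 3 and tests on camera 1, while cross-subject splits by performer ID. The filename parsing assumes the usual SsssCcccPpppRrrrAaaa naming convention; the exact details of Strategy M-1 are given in the paper.

    # Sketch of the standard NTU RGB+D splits (not the paper's exact pipeline).
    import re

    def parse_ntu_name(name):
        """Extract setup, camera, performer (subject), replication, and action IDs."""
        m = re.match(r"S(\d{3})C(\d{3})P(\d{3})R(\d{3})A(\d{3})", name)
        setup, camera, subject, rep, action = (int(g) for g in m.groups())
        return {"setup": setup, "camera": camera, "subject": subject,
                "replication": rep, "action": action}

    def split(names, protocol, train_subjects=None):
        """Return (train, test) sample-name lists for 'CV' or 'CS' (CS needs train_subjects)."""
        train, test = [], []
        for n in names:
            info = parse_ntu_name(n)
            if protocol == "CV":       # cross-view: cameras 2 and 3 train, camera 1 test
                is_train = info["camera"] in (2, 3)
            else:                      # cross-subject: split by performer ID
                is_train = info["subject"] in train_subjects
            (train if is_train else test).append(n)
        return train, test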

Evaluations

We evaluate the effectiveness of the learned representation for multi-view video action recognition in a supervised setting. We observe a significant improvement in performance with the RGB modality on the NTU-RGB+D dataset, the largest dataset for multi-view action recognition. The proposed framework also achieves state-of-the-art results with the depth modality, which validates the generalization capability of the approach to other data modalities.
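For concreteness, a hedged sketch of such a supervised evaluation is given below: a classification head is trained with cross-entropy on top of the representation produced by the learned encoder. The optimizer, learning rate, and whether the encoder is frozen or fine-tuned are assumptions; the training strategies M-1, M-2, and M-3 above differ precisely in details of this kind.

    # Illustrative supervised evaluation loop; hyperparameters and the frozen-encoder
    # choice are assumptions, not the paper's settings.
    import torch
    import torch.nn as nn

    def train_classifier(rl_net, cl_net, loader, epochs=10, freeze_encoder=True, device="cpu"):
        if freeze_encoder:
            for p in rl_net.parameters():
                p.requires_grad_(False)
        params = list(cl_net.parameters()) + ([] if freeze_encoder else list(rl_net.parameters()))
        opt = torch.optim.Adam(params, lr=1e-4)
        loss_fn = nn.CrossEntropyLoss()
        rl_net.to(device); cl_net.to(device)
        for _ in range(epochs):
            for clips, view_times, labels in loader:   # batched multi-view clips and conditions
                r = rl_net(clips.to(device), view_times.to(device))  # representation r
                logits = cl_net(r)
                loss = loss_fn(logits, labels.to(device))
                opt.zero_grad(); loss.backward(); opt.step()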

A comparison of cross-subject (CS) and cross-view (CV) action recognition performance on the NTU-RGB+D dataset for the RGB modality. RGB-S: using both RGB and skeleton modalities; RGB-DS: using RGB, depth, and skeleton modalities.

A comparison of cross-subject (CS) and cross-view (CV) action recognition performance on the NTU-RGB+D dataset.

A comparison of cross-subject (CS) and cross-view (CV) action recognition performance on the N-UCLA Multiview Action 3D dataset.

A comparison of t-SNE visualizations of representations learned with: a) a variational autoencoder (VAE) and b) the proposed RL-NET, for a subset of 10 activities on the NTU-RGB+D dataset. The images shown are the first frames of the video clips. We observe that the VAE is not view-aware and mostly clusters videos with similar visual content, whereas the proposed method clusters instances of the same activity close together even when they are captured from different viewpoints. Effect of multi-view learning: t-SNE visualization of activity representations for a subset of 10 activities on the full NTU-RGB+D dataset using: c) one input view and d) all three views. The learned representation improves with the availability of multiple views using the same network.
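The visualization itself can be reproduced along the following lines: pool each video's representation into a vector, project to two dimensions with t-SNE, and color the points by action label. The pooling step, perplexity, and plotting choices below are assumptions rather than the paper's exact settings.

    # Minimal t-SNE sketch using scikit-learn and matplotlib; settings are illustrative.
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    def plot_tsne(representations, labels, perplexity=30):
        # representations: (N, D) pooled per-video features; labels: (N,) action IDs
        xy = TSNE(n_components=2, perplexity=perplexity, init="pca").fit_transform(representations)
        plt.scatter(xy[:, 0], xy[:, 1], c=labels, cmap="tab10", s=8)
        plt.title("t-SNE of learned activity representations")
        plt.show()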

Related Publications

[1] Vyas, Shruti, Yogesh S. Rawat, and Mubarak Shah. “Time-Aware and View-Aware Video Rendering for Unsupervised Representation Learning.” arXiv preprint arXiv:1811.10699 (2018).

[2] Rao, Cen, and Mubarak Shah. “View-invariance in action recognition.” Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001. Vol. 2. IEEE, 2001.

[3] Rao, Cen, Alper Yilmaz, and Mubarak Shah. “View-invariant representation and recognition of actions.” International Journal of Computer Vision 50.2 (2002): 203-226.

[4] Liu, Jingen, et al. “Cross-view action recognition via view knowledge transfer.” CVPR 2011. IEEE, 2011.