A Recurrent Transformer Network for Novel View Action Synthesis



Kara Marie Schatz, Erik Quintanilla, Shruti Vyas and Yogesh S Rawat. “A Recurrent Transformer Network for Novel View Action Synthesis.” 16th European Conference on Computer Vision, 2020.


In this work, we address the problem of synthesizing human actions from novel views. Given an input video of an actor performing some action, we aim to synthesize a video with the same action performed from a novel view with the help of an appearance prior. We propose an end-to-end deep network to solve this problem. The proposed network utilizes the change in viewpoint to transform the action from the input view to the novel view in feature space. The transformed action is integrated with the target appearance using the proposed recurrent transformer network, which provides a transformed appearance for each time-step in the action sequence. The encoded action features are also used to determine action key-points in an unsupervised manner, which helps the network focus on the action region of the video. We demonstrate the effectiveness of the proposed method through extensive experiments conducted on NTU-RGB+D, a large-scale multi-view action recognition dataset. In addition, we show that our framework can also synthesize a video from a novel viewpoint with an entirely different background scene or actor.

Given a source video Vi and target prior Pi, the proposed framework transforms the source action features mi to target view action features mj and uses them to transform the target prior aj for synthesizing the target view action video Vj. The network also utilizes action key-points KPj, which are predicted via an unsupervised approach, to focus on activity regions in the video.
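The page does not say how the key-points KPj are used to focus the network on activity regions. One common way to turn a set of points into a spatial focus signal is to render each point as a Gaussian heatmap and use the result as a per-pixel weight. The sketch below illustrates that idea only and is not the authors' implementation; the function names, the (row, col) pixel convention, and the `sigma` value are all assumptions made for this example.

```python
import numpy as np


def keypoint_heatmaps(keypoints, height, width, sigma=2.0):
    """Render K key-points, given as (row, col) pixels, into (K, H, W) Gaussian maps.

    Hypothetical helper: the paper's actual key-point representation may differ.
    """
    keypoints = np.asarray(keypoints, dtype=np.float64)
    rows = np.arange(height, dtype=np.float64)[None, :, None]
    cols = np.arange(width, dtype=np.float64)[None, None, :]
    kr = keypoints[:, 0][:, None, None]
    kc = keypoints[:, 1][:, None, None]
    # Squared pixel distance from every location to each key-point.
    sq_dist = (rows - kr) ** 2 + (cols - kc) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def attention_mask(keypoints, height, width, sigma=2.0):
    """Combine the per-key-point maps into one (H, W) mask with values in [0, 1]."""
    return keypoint_heatmaps(keypoints, height, width, sigma).max(axis=0)
```

A mask like this could, for instance, weight a per-pixel reconstruction loss so that errors near the key-points count more than errors in the static background.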


The synthesized video frames from a novel view with a different actor and a different background. For each sample, the top row shows 8 frames of the ground truth video for the novel view and the bottom row shows a prior from another view followed by synthesized video frames for the novel view. For each of these, frames 1, 3, 5, 7, 9, 11, 13, and 15 are shown.

Synthesized video frames using the proposed model. Two samples are shown, and for each, the top row contains 8 frames of the ground truth video for the novel view and the bottom row contains the same 8 frames of the generated video for the novel view. Our model predicts 16 frames in a video and for each of these examples, frames 1, 3, 5, 7, 9, 11, 13, and 15 are shown. More examples are provided in the supplementary material.

A comparison of SSIM scores with existing approaches for all combinations of the three views, along with the average score. The scores for VDG [12] and PG2 [21] are shown as reported by the authors of VDNet [17].
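For readers unfamiliar with the metric, SSIM (structural similarity) compares two images through their means, variances, and covariance; it equals 1 for identical images. The sketch below is a simplified version that uses whole-frame statistics and averages over the frames of a video. The standard metric is computed over local windows and then averaged, and the exact evaluation protocol used for the table is not described on this page, so the function names and the single-window simplification are assumptions for illustration.

```python
import numpy as np


def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM from whole-image statistics (a simplification)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    if x.shape != y.shape:
        raise ValueError("images must have the same shape")
    # Stabilizing constants from the usual SSIM definition.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def video_ssim(generated, ground_truth, data_range=255.0):
    """Mean per-frame SSIM for two videos shaped (frames, height, width)."""
    scores = [global_ssim(g, t, data_range)
              for g, t in zip(generated, ground_truth)]
    return float(np.mean(scores))
```

For reported numbers, a windowed implementation from an established image-processing library is the safer choice; this version is only meant to show what the score measures.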

Ablation experiments to study the impact of various components of the network on video synthesis. AC-Trans: action transformation, HI-Trans: hierarchical transformation, and AP-Trans: appearance transformation.

Related Publications

[1] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, et al. Photo-realistic single image super-resolution using a generative adversarial network. In IEEE Conference on CVPR, 2017.

[2] Krishna Regmi and Ali Borji. Cross-view image synthesis using conditional gans. In IEEE Conference on CVPR, 2018.

[3] SM Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S Morcos, Marta Garnelo, Avraham Ruderman, Andrei A Rusu, Ivo Danihelka, Karol Gregor, et al. Neural scene representation and rendering. Science, 2018.

[4] Mohamed Ilyes Lakhal, Oswald Lanz, and Andrea Cavallaro. View-LSTM: Novel-view video synthesis through view decomposition. In The IEEE International Conference on Computer Vision (ICCV), October 2019.