
LARNet: Latent Action Representation for Human Action Synthesis



Naman Biyani, Aayush J Rana, Shruti Vyas, Yogesh S Rawat, LARNet: Latent Action Representation for Human Action Synthesis, The 32nd British Machine Vision Conference, 2021.


We present LARNet, a novel end-to-end approach for generating human action videos. Jointly modeling appearance and dynamics to synthesize a video is very challenging, so recent work in video synthesis decomposes these two factors. However, these methods require a driving video to model the video dynamics. In this work, we propose a generative approach instead, which explicitly learns action dynamics in latent space and avoids the need for a driving video during inference. The generated action dynamics are integrated with the appearance using a recurrent hierarchical structure that induces motion at different scales to capture both coarse and fine action details. In addition, we propose a novel mix-adversarial loss function that aims to improve the temporal coherency of the synthesized videos. We evaluate the proposed approach on four real-world human action datasets and demonstrate its effectiveness in generating human actions.

The figure above gives an overview of the proposed LARNet framework. Given an actor image x0, action class ya, position encoding pe, and noise z, the network generates the corresponding action video v. In the action dynamics module, the motion generator Gm generates the action representation em in latent space. Next, the motion integrator MI recurrently integrates em with the appearance ea in latent space to produce video features ev, which are used to synthesize the action video v. The complete network is trained end-to-end with multiple objective functions.
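The data flow described above can be sketched with toy NumPy stand-ins. This is only an illustration of how the named quantities (ya, pe, z, em, ea, ev) relate to one another, not the architecture from the paper: the single-layer maps, the sinusoidal form of pe, every dimension, and the function names are assumptions made for the example, and the hierarchical multi-scale integration and the video decoder are left out.

```python
import numpy as np

# All sizes below are illustrative assumptions, not values from the paper.
T, NUM_CLASSES, D_PE, D_Z, D_LATENT = 8, 5, 4, 16, 32
rng = np.random.default_rng(0)


def positional_encoding(num_steps, dim):
    """Sinusoidal encoding, one row per time step (an assumed choice for pe)."""
    pos = np.arange(num_steps)[:, None]
    freq = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    pe = np.zeros((num_steps, dim))
    pe[:, 0::2] = np.sin(pos * freq)
    pe[:, 1::2] = np.cos(pos * freq)
    return pe


def motion_generator(y_a, p_e, z, W_g):
    """Toy stand-in for Gm: map (action class, position, noise) to latent motion em."""
    one_hot = np.eye(NUM_CLASSES)[y_a]
    steps = p_e.shape[0]
    inputs = np.concatenate(
        [np.tile(one_hot, (steps, 1)), p_e, np.tile(z, (steps, 1))], axis=1
    )
    return np.tanh(inputs @ W_g)  # shape (T, D_LATENT)


def motion_integrator(e_a, e_m, W_h, W_m):
    """Toy stand-in for MI: recurrently fold em into a state initialised from ea."""
    h = e_a
    features = []
    for e_t in e_m:
        h = np.tanh(h @ W_h + e_t @ W_m)
        features.append(h)
    return np.stack(features)  # video features ev, shape (T, D_LATENT)


# Randomly initialised weights; a real model would learn these end-to-end.
W_g = rng.normal(scale=0.1, size=(NUM_CLASSES + D_PE + D_Z, D_LATENT))
W_h = rng.normal(scale=0.1, size=(D_LATENT, D_LATENT))
W_m = rng.normal(scale=0.1, size=(D_LATENT, D_LATENT))

e_a = rng.normal(size=D_LATENT)  # placeholder for the encoded actor image x0
z = rng.normal(size=D_Z)         # noise
p_e = positional_encoding(T, D_PE)

e_m = motion_generator(2, p_e, z, W_g)       # action class 2, chosen arbitrarily
e_v = motion_integrator(e_a, e_m, W_h, W_m)  # would be decoded into the video v
print(e_m.shape, e_v.shape)
```

Note that no driving video appears anywhere in the inputs: the motion latent is produced from the class label, the position encoding, and noise alone, which is the property the framework relies on at inference time.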

LARNet explicitly models the action dynamics in latent space by learning to approximate the motion observed in real action videos. This enables an effective decomposition of appearance and motion while avoiding the need for any driving video during inference. LARNet is trained end-to-end in an adversarial framework, optimizing multiple objectives. We make the following novel contributions in this work:
1. We propose a generative approach for human action synthesis that leverages the decomposition of content and motion by explicit modeling of action dynamics.
2. We propose a hierarchical recurrent motion integration approach which operates at multiple scales focusing on both coarse level and fine level details.
3. We propose mix-adversarial loss, a novel objective function for video synthesis which aims at improving the temporal coherency in the synthesized videos.


We demonstrate the effectiveness of LARNet and highlight the benefits of its main components (action representation learning, hierarchical motion integrator, and mix-adversarial loss) via quantitative and qualitative evaluation. We experiment on four real-world human action datasets (NTU-RGB+D, Penn Action, KTH and UTD-MHAD) at a resolution of 112×112. We evaluate the quality of the generated videos against the ground-truth video using the frame-level Structural Similarity Index Measure (SSIM) and Peak Signal-to-Noise Ratio (PSNR). In addition, we evaluate the realism of the generated videos using video-level FVD and frame-level FID scores.
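As a concrete reference for the frame-level metrics, the following minimal sketch computes PSNR as 10 · log10(MAX² / MSE) for one frame and averages it over the frames of a video. The function names, the (T, H, W, C) layout, the 8-bit pixel range, and the choice to average per-frame scores are assumptions for illustration; the exact evaluation protocol used in the paper is not stated here.

```python
import numpy as np


def psnr(reference, generated, max_value=255.0):
    """Peak Signal-to-Noise Ratio between two frames, in decibels."""
    reference = np.asarray(reference, dtype=np.float64)
    generated = np.asarray(generated, dtype=np.float64)
    mse = np.mean((reference - generated) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return float(10.0 * np.log10(max_value ** 2 / mse))


def video_psnr(ref_video, gen_video, max_value=255.0):
    """Mean frame-level PSNR over two videos of shape (T, H, W, C)."""
    scores = [psnr(r, g, max_value) for r, g in zip(ref_video, gen_video)]
    return float(np.mean(scores))


# Toy example: a black ground-truth clip and a clip with a constant pixel error.
ground_truth = np.zeros((4, 112, 112, 3))
generated = np.full((4, 112, 112, 3), 25.5)
score = video_psnr(ground_truth, generated)
print(f"{score:.2f} dB")
```

Higher PSNR means the generated frame is closer to the ground truth pixel by pixel. It says nothing about whether the motion across frames is plausible, which is why video-level measures are reported alongside it.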

  • Comparison with existing conditional video synthesis methods on the NTU-RGB+D dataset. Variants marked + and ++ use motion from a driving video: ++ uses a driving video instead of the generated action during inference, while + is trained using a driving video without the action dynamics module.
  • Comparison with existing conditional video synthesis methods on the Penn Action, UTD-MHAD and KTH datasets. We compare LARNet with G3AN and Imaginator and show that LARNet consistently outperforms these prior methods across datasets.

On the NTU-RGB+D dataset, we observe that Imaginator [66] performs slightly better in terms of FID score, which could be due to its use of a frame-level adversarial loss. However, it is important to note that FID only measures frame-level quality, whereas FVD is more focused on video dynamics. LARNet outperforms all other methods in terms of FVD score. Next, we compare performance on the smaller datasets (Penn Action, UTD-MHAD and KTH) to evaluate the generalization capability of LARNet. Even on these small datasets, LARNet consistently outperforms G3AN and Imaginator on all four metrics.
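FID and FVD share the same underlying computation: both fit a Gaussian to features of real and generated samples and report the Fréchet distance between the two Gaussians, ||μ1 − μ2||² + Tr(Σ1 + Σ2 − 2(Σ1Σ2)^(1/2)). What differs is the feature extractor, which operates on single frames for FID and on whole clips for FVD. The sketch below implements only the distance; the function names are hypothetical, the feature extractors are omitted, and the trace of the matrix square root is obtained from the eigenvalues of Σ1Σ2 rather than from an explicit matrix square root.

```python
import numpy as np


def feature_statistics(features):
    """Mean and covariance of a feature matrix with one sample per row."""
    features = np.asarray(features, dtype=np.float64)
    return features.mean(axis=0), np.cov(features, rowvar=False)


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between the Gaussians N(mu1, sigma1) and N(mu2, sigma2)."""
    mu1 = np.asarray(mu1, dtype=np.float64)
    mu2 = np.asarray(mu2, dtype=np.float64)
    sigma1 = np.asarray(sigma1, dtype=np.float64)
    sigma2 = np.asarray(sigma2, dtype=np.float64)
    diff = mu1 - mu2
    # Tr((sigma1 sigma2)^(1/2)) equals the sum of square roots of the
    # eigenvalues of sigma1 @ sigma2, which are real and non-negative for
    # covariance matrices; clip small negative values caused by round-off.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    trace_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)


# Toy check: equal covariances, so the distance reduces to the squared mean shift.
identity = np.eye(3)
shifted = frechet_distance(np.zeros(3), identity, np.array([1.0, 2.0, 2.0]), identity)
same = frechet_distance(np.zeros(3), identity, np.zeros(3), identity)
print(shifted, same)
```

In practice the means and covariances would come from `feature_statistics` applied to features of real and generated samples. Because a clip-level extractor sees several frames at once, a method can score well on the frame-level distance while still scoring poorly on the video-level one if its motion is implausible.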

We also show videos generated by LARNet on the four datasets. We observe that the generated videos capture the action dynamics for a wide range of human actions. This holds even for actions that involve only a slight movement of the arms, such as ‘hand waving’ and ‘eating’. These results show that LARNet consistently reproduces the static background content while synthesizing plausible action dynamics.

In this work, we present a novel approach for generating human actions from an input image. The proposed framework predicts human actions conditioned on action semantics and utilizes a generative mechanism which estimates latent action representation. The latent action representation is explicitly learned with the help of a similarity and adversarial loss formulation. This learned latent representation is then used to generate an action video which is optimized using multiple objectives, including a novel mix-adversarial loss.