Novel View Video Prediction using Dual Representation
Sarah Shiraz, Krishna Regmi, Shruti Vyas, Yogesh S. Rawat, Mubarak Shah, Novel View Video Prediction using Dual Representation, IEEE International Conference on Image Processing, 2021.
In this paper, we address the problem of novel view video prediction without explicit estimation of 3D structure or geometry-based transformations. Most video prediction works [6, 7, 8] in the literature are limited to a single viewpoint and aim at improved dynamics and sharper frames. Only recently has the task of novel view video prediction been tackled, using depth maps and human pose information of the novel view as priors. However, assuming that these priors are available from the query viewpoint is a strong assumption: it may not always hold, and obtaining the priors requires extra computation.
Considering this limitation, we propose a novel learning-based method that generates human action videos from a novel view without any priors and with better video quality. The proposed method takes video clips from multiple viewpoints and uses a two-stream approach to learn dual representations: global and view-dependent. The global representation focuses on general features, such as scene structure, whereas the view-dependent representation focuses on finer details specific to the query view. We evaluate our approach with extensive ablations and report state-of-the-art results on two real-world datasets, CMU Panoptic and NTU-RGB+D. Contributions: (1) We propose a novel-view video prediction framework that can generate videos from unseen query viewpoints. (2) The proposed network integrates both global and view-dependent representations for effective novel-view video prediction. (3) We provide an extensive evaluation of our approach on two real-world datasets, CMU Panoptic and NTU-RGB+D, achieving significant improvement over existing methods without using any prior information about the query views. We report a 26.1% improvement in SSIM, a 13.6% improvement in PSNR and a 60% improvement in FVD scores over the state-of-the-art viewLSTM on the CMU Panoptic dataset.
The overall architecture of the proposed network is shown above. The Video Encoder, VE, takes multiple input clips ci, i ∈ {1, 2, ..., N}, from different viewpoints vi and learns the encoded features ei. The Global Representation Block, VG, uses the absolute view parameters of the inputs, vi. The features ei are aggregated with view embeddings from the view embedder, VE, using a 3D convLSTM block. The Query Network, Q, retrieves features r’g according to the query view vq. The View-Dependent Block, VT, on the other hand, uses relative view information, drawing on both the partial input view vpi and the partial query view vpq, which adds fine view-dependent details to the video. The dual representation r, obtained by aggregating the retrieved global representation r’g and the view-dependent representation rt, is used to synthesize the query view video c’q.
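The data flow described above can be traced with a toy, shape-level sketch. Everything in it is an assumption made for illustration: plain Python lists stand in for feature tensors, an element-wise mean stands in for the 3D convLSTM aggregation, list concatenation stands in for the learned Query Network and for the final aggregation, and the relative view information is taken to be a simple offset between partial view vectors. None of the function names come from the paper.

```python
from typing import List

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Element-wise mean over views (stand-in for the 3D convLSTM block)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def global_representation(features: List[Vector], views: List[Vector]) -> Vector:
    """Pair each e_i with its absolute view v_i, then aggregate over views."""
    # `e + v` is list concatenation, not arithmetic addition.
    return mean_pool([e + v for e, v in zip(features, views)])


def query_global(r_g: Vector, v_q: Vector) -> Vector:
    """Stand-in for the Query Network Q: condition r_g on the query view v_q."""
    return r_g + v_q  # concatenation (assumed)


def view_dependent(features: List[Vector], partial_views: List[Vector],
                   partial_query: Vector) -> Vector:
    """Use relative view information: here, the offset from each input to the query."""
    relative = [[q - p for p, q in zip(vp, partial_query)] for vp in partial_views]
    return mean_pool([e + r for e, r in zip(features, relative)])


def dual_representation(r_g_query: Vector, r_t: Vector) -> Vector:
    """Combine the retrieved global and the view-dependent representations."""
    return r_g_query + r_t  # concatenation (assumed)


# Placeholder inputs: N = 2 views, 3-dim features, 2-dim views, 1-dim partial views.
e = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
v = [[0.0, 1.0], [1.0, 0.0]]
v_q = [0.5, 0.5]
vp, vp_q = [[0.0], [90.0]], [45.0]

r_g_q = query_global(global_representation(e, v), v_q)
r_t = view_dependent(e, vp, vp_q)
r = dual_representation(r_g_q, r_t)
```

In the actual network each of these steps is a learned module; the sketch only shows which inputs each block consumes and where the two streams meet.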
- Large-scale Multiview Experiment
- Novel View Video Prediction: NTU-RGB+D
We evaluated our network on the following two datasets. (1) The CMU Panoptic dataset is a large-scale, multi-activity, real-world dataset with a total of 521 viewpoints. The cameras are installed in 20 hexagonal panels, with 24 cameras in each panel. The provided viewpoint information vector, vi, consists of camera location coordinates (3D), principal point offset (2D), focal length (2D), distortion coefficients (5D), horizontal pan (1D) and vertical pan (1D); therefore d = 14. (2) The NTU-RGB+D dataset contains 60 action classes recorded from 3 viewpoints, at horizontal angles of -45 degrees, 0 degrees and 45 degrees. We use the cross-subject evaluation split of 40,320 and 16,560 clips for training and testing, respectively. The viewpoint information vector, vi, for this dataset consists of camera height (1D), camera distance (1D), horizontal pan (1D), vertical pan (1D) and viewpoint angle (1D); therefore d = 5.
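To make the two viewpoint vectors concrete, the following sketch lays out their components and checks the stated dimensions (d = 14 and d = 5). The class names, field names, field order and all numeric values are hypothetical placeholders, not the datasets' actual calibration format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PanopticView:
    """Hypothetical container for a CMU Panoptic viewpoint vector (d = 14)."""
    location: Tuple[float, float, float]                   # camera location, 3D
    principal_point: Tuple[float, float]                   # principal point offset, 2D
    focal_length: Tuple[float, float]                      # focal length, 2D
    distortion: Tuple[float, float, float, float, float]   # distortion coefficients, 5D
    horizontal_pan: float                                  # 1D
    vertical_pan: float                                    # 1D

    def to_vector(self) -> List[float]:
        return [*self.location, *self.principal_point, *self.focal_length,
                *self.distortion, self.horizontal_pan, self.vertical_pan]


@dataclass
class NTUView:
    """Hypothetical container for an NTU-RGB+D viewpoint vector (d = 5)."""
    height: float            # camera height, 1D
    distance: float          # camera distance, 1D
    horizontal_pan: float    # 1D
    vertical_pan: float      # 1D
    viewpoint_angle: float   # -45, 0 or 45 degrees

    def to_vector(self) -> List[float]:
        return [self.height, self.distance, self.horizontal_pan,
                self.vertical_pan, self.viewpoint_angle]


# Placeholder values, chosen only to exercise the layout.
panoptic_vec = PanopticView(
    location=(0.0, 1.0, 2.0),
    principal_point=(320.0, 240.0),
    focal_length=(700.0, 700.0),
    distortion=(0.0, 0.0, 0.0, 0.0, 0.0),
    horizontal_pan=30.0,
    vertical_pan=-10.0,
).to_vector()

ntu_vec = NTUView(height=1.5, distance=3.0, horizontal_pan=0.0,
                  vertical_pan=0.0, viewpoint_angle=45.0).to_vector()
```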
CMU Panoptic has a large number of views, which makes it an attractive dataset for cross-view video prediction. Considering this, we experiment in a large-scale setting, using a maximum of 72 views for training and testing. We examine the impact of the number of training videos on the quality of the synthesized videos and compare with the state of the art.
In the first setup, we used 72 camera viewpoints. Specifically, we select 3 panels (numbers 4, 5 and 17) and use all 24 views from each panel, totalling 72 views. We use 56 viewpoints for training and 16 viewpoints for testing, with a split of 6,244 training and 1,132 test samples. Training with so many viewpoints can be challenging due to the computation cost, so for each iteration we randomly select six views from one of the panels: five input views and a query view. Testing follows a similar setup. For each panel, we fix the input viewpoints to views 1 through 5, i.e. (v1, v2, v3, v4, v5), and test on the remaining 19 viewpoints, i.e. the query viewpoints are v6 to v24. We use SSIM and PSNR to evaluate frame quality and Frechet Video Distance (FVD) to measure video quality. Table 4 shows that the proposed method is successful at synthesizing high-quality video clips from multiple query views. Owing to the weight-sharing nature of the proposed approach, we are able to change the number of input clips at test time irrespective of the number used in training. Figure 1 shows qualitative results with two fixed input views and different query views. The predicted video frames capture the motion and viewpoints correctly.
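The per-iteration view sampling and the fixed test protocol described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, views are assumed to be indexed 1 to 24 within a panel, and the sketch ignores the 56/16 train/test viewpoint split because the text does not say which views belong to which side.

```python
import random
from typing import List, Tuple

PANELS = (4, 5, 17)        # panel numbers named in the text
VIEWS_PER_PANEL = 24
NUM_INPUT_VIEWS = 5


def sample_training_views(rng: random.Random) -> Tuple[int, List[int], int]:
    """Pick one panel, then six distinct views: five inputs and one query."""
    panel = rng.choice(PANELS)
    views = rng.sample(range(1, VIEWS_PER_PANEL + 1), NUM_INPUT_VIEWS + 1)
    return panel, views[:NUM_INPUT_VIEWS], views[NUM_INPUT_VIEWS]


def evaluation_cases(panel: int) -> List[Tuple[int, List[int], int]]:
    """Fix inputs to v1..v5 and use each of v6..v24 as the query view."""
    inputs = list(range(1, NUM_INPUT_VIEWS + 1))
    queries = range(NUM_INPUT_VIEWS + 1, VIEWS_PER_PANEL + 1)
    return [(panel, inputs, query) for query in queries]


rng = random.Random(0)  # seeded only to make the sketch reproducible
panel, input_views, query_view = sample_training_views(rng)
cases = evaluation_cases(panel)
```

Sampling six views per iteration keeps the cost of a training step independent of the total number of viewpoints, which is the motivation given in the text.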
Since the NTU-RGB+D dataset consists of three camera viewpoints, our experiment uses two views as input and the third view as the query. Qualitative results, along with the ground truth, are visualized in the figure below. Consistent with the SSIM and PSNR values in Table 2, the network is able to correctly synthesize the persons in the frames as well as the background details of the query viewpoint. We show frames at three different time steps and observe that the synthesized frames capture motion similar to the ground truth, which is also supported by low FVD scores. As ablation studies, we evaluate the roles of the global and view-dependent blocks separately. First, we test the fully trained network using one block at a time and call these variants VD(tested) and GR(tested). Second, we train each block separately and call these variants VD(trained) and GR(trained). The results, shown in Table 2, confirm that the two blocks together produce better output than either block alone.
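For reference, PSNR, one of the frame-quality metrics reported here, has a standard definition: 10 · log10(MAX² / MSE), where MSE is the mean squared error between the predicted and ground-truth pixels. The sketch below computes it for a single frame flattened into a list. The 8-bit pixel range (MAX = 255) and the per-frame granularity are assumptions; the text does not state how scores are averaged over frames or videos.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], target: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("frames must be non-empty and of equal size")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


# Placeholder pixel values, not data from the paper.
score = psnr([10.0, 20.0], [12.0, 18.0])
```

Higher PSNR means the synthesized frame is closer to the ground truth, whereas for FVD lower is better.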
Table 2 (Rows 3-4) shows detailed quantitative results for the network ablations in single and multiple input-view settings. The quantitative results are consistent across both settings and across all input and output view combinations. They show that the View-Dependent and Global Representation streams learn complementary features, which the full model (Table 2, Row 5) aggregates to generate better results.
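With only three NTU-RGB+D viewpoints, the input and output view combinations mentioned above are easy to enumerate. The sketch below lists them for the two-input and single-input settings. It assumes that every view serves as the query in turn; the function names are hypothetical and the exact combinations evaluated are those reported in Table 2.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

NTU_VIEW_ANGLES = (-45, 0, 45)  # horizontal angles in degrees


def two_input_setups(views: Sequence[int]) -> List[Tuple[Tuple[int, ...], int]]:
    """Each view is the query once; the remaining two views are the inputs."""
    return [(tuple(v for v in views if v != query), query) for query in views]


def single_input_setups(views: Sequence[int]) -> List[Tuple[int, int]]:
    """All ordered (input view, query view) pairs with distinct views."""
    return list(permutations(views, 2))


multi_view = two_input_setups(NTU_VIEW_ANGLES)
single_view = single_input_setups(NTU_VIEW_ANGLES)
```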
Mohamed Ilyes Lakhal, Oswald Lanz, and Andrea Cavallaro, “View-LSTM: Novel-view video synthesis through view decomposition”, in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 7577–7587.
Olivia Wiles, Georgia Gkioxari, Richard Szeliski, and Justin Johnson, “SynSin: End-to-end view synthesis from a single image”, in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
Shruti Vyas, Yogesh S. Rawat, and Mubarak Shah, “Multi-view action recognition using cross-view video prediction”, in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVII 16. Springer, 2020, pp. 427–444.
Note: References displayed under the quantitative analysis tables can be accessed using the paper.