
Cross-View Image Synthesis



Krishna Regmi and Ali Borji. “Cross-View Image Synthesis Using Geometry-Guided Conditional GANs.” Computer Vision and Image Understanding (CVIU), 2019. [BibTeX]

Krishna Regmi and Ali Borji. “Cross-View Image Synthesis Using Conditional GANs.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, June 18-22, 2018. [BibTeX]


  • The first work to synthesize outdoor natural-scene images between aerial and street views, conditioned on an image in one view.
  • Semantic segmentation maps of the target-view images are exploited to regularize training; they are not required during testing.
  • Using a homography as a preprocessing step to guide the conditional GANs substantially improves cross-view synthesis (CVIU version).
  • Extensive experiments against a range of methods provide a thorough analysis of cross-view synthesis approaches.


Learning to generate natural scenes has always been a challenging task in computer vision. It is even more painstaking when the generation is conditioned on images with drastically different views. This is mainly because understanding, corresponding, and transforming appearance and semantic information across the views is not trivial. In this paper, we attempt to solve the novel problem of cross-view image synthesis, aerial to street view and vice versa, using conditional generative adversarial networks (cGAN). Two new architectures, called Crossview Fork (X-Fork) and Crossview Sequential (X-Seq), are proposed to generate scenes at resolutions of 64×64 and 256×256 pixels. The X-Fork architecture has a single discriminator and a single generator; the generator hallucinates both the image and its semantic segmentation in the target view. The X-Seq architecture utilizes two cGANs: the first generates the target image, which is subsequently fed to the second cGAN to generate its corresponding semantic segmentation map. The feedback from the second cGAN helps the first cGAN generate sharper images. Both of our proposed architectures learn to generate natural images as well as their semantic segmentation maps. The proposed methods capture and maintain the true semantics of objects in the source and target views better than the traditional image-to-image translation method, which considers only the visual appearance of the scene. Extensive qualitative and quantitative evaluations support the effectiveness of our frameworks, compared to two state-of-the-art methods, for natural scene generation across drastically different views.
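The data flow through the two architectures can be illustrated with a toy sketch. The functions below are numpy stand-ins for the actual conv-deconv cGAN generators (their arithmetic is purely illustrative); what matters is the wiring: X-Fork shares an encoder and forks into two decoder branches, while X-Seq chains two generators so the segmentation stage back-propagates feedback into the image stage during training.

```python
import numpy as np

rng = np.random.default_rng(0)

def g_fork(src):
    """X-Fork: one generator, two outputs (target image + segmentation).

    The shared tensor stands in for the shared encoder features; the two
    additive branches stand in for the two decoder branches.
    """
    shared = src * 0.5          # shared encoder (toy placeholder)
    image = shared + 0.1        # image decoder branch
    seg = shared - 0.1          # segmentation decoder branch
    return image, seg

def g_seq_stage1(src):
    """X-Seq stage 1: source-view image -> target-view image (toy)."""
    return src * 0.5 + 0.1

def g_seq_stage2(image):
    """X-Seq stage 2: target-view image -> its segmentation map (toy)."""
    return image - 0.2

src = rng.random((4, 4))                 # a tiny "source-view image"
img_fork, seg_fork = g_fork(src)         # X-Fork: one forward pass, two heads
img_seq = g_seq_stage1(src)              # X-Seq: image first,
seg_seq = g_seq_stage2(img_seq)          # ...then segmentation from the image
```

In the real models, each generator is trained adversarially against a discriminator, and the stage-2 loss in X-Seq flows back through stage 1, which is the "feedback" the abstract refers to.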

Figure 1: Qualitative Results showing the images generated by baseline (Pix2pix) and the proposed X-Fork and X-Seq methods in a2g (aerial to ground) and g2a (ground to aerial) directions on Dayton Dataset.


We propose novel methods for the task of cross-view image synthesis using GANs: given a ground-level image, we synthesize an image of the same place from an aerial viewpoint, and vice versa. The proposed X-Fork and X-Seq architectures (details in the papers) generate realistic cross-view images by regularizing training with semantic segmentation maps, which are used only during training. In the CVIU paper, we additionally propose using a homography to transform aerial images toward the ground view, so that pixels from the input-view image are preserved in the target-view image. The network architectures are shown below.
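The homography step amounts to warping aerial pixels with a 3×3 projective transform before feeding them to the cGAN. A minimal sketch of how such a transform acts on pixel coordinates is below; the matrix values here are purely hypothetical (in the paper the transform is derived from the viewing geometry), and a real pipeline would use an image-warping routine such as OpenCV's `warpPerspective` rather than warping points by hand.

```python
import numpy as np

# Hypothetical 3x3 homography; a real one is derived from camera geometry.
H = np.array([[1.0, 0.2, 5.0],
              [0.0, 1.1, 3.0],
              [0.0, 0.002, 1.0]])

def warp_points(H, pts):
    """Apply a homography to an (N, 2) array of pixel coordinates."""
    ones = np.ones((pts.shape[0], 1))
    homog = np.hstack([pts, ones])           # lift to homogeneous coords
    mapped = homog @ H.T                     # projective transform
    return mapped[:, :2] / mapped[:, 2:3]    # divide out the scale

# Warp the corners of a 256x256 aerial image.
corners = np.array([[0, 0], [255, 0], [0, 255], [255, 255]], dtype=float)
warped = warp_points(H, corners)
```

Because the transform is projective (the last row of `H` is not `[0, 0, 1]`), parallel lines in the aerial image converge after warping, which is what makes the warped image resemble a ground-level perspective.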

Figure 2: Block diagram showing the X-Fork architecture.

Figure 3: Generator of X-Fork architecture.

Figure 4: Block diagram showing the X-Seq architecture.


The following datasets were used in the papers:

  • Dayton
  • SVA


  • Quantitative Evaluation:

    1. Inception Score: We used an AlexNet model pretrained on the Places dataset (365 categories) to obtain classification probabilities, and computed the Inception Score from those probabilities.
    2. Top-k prediction accuracy: We predicted the classes of the generated images using the same Places-pretrained AlexNet model. Top-1 and Top-5 accuracies are reported.
    3. KL(model || data): The KL divergence between the class distribution of the generated images and that of the real data is used for quantitative analysis. A lower value means the synthesized images are closer to the real data distribution.
    4. SSIM, PSNR and Sharpness Difference: SSIM measures the similarity between the images based on their luminance, contrast and structural aspects. PSNR measures the peak signal-to-noise ratio between two images to assess the quality of a transformed (generated) image compared to its original version. Sharpness difference measures the loss of sharpness during image generation.
  • Qualitative Evaluation:

    1. Dayton Dataset: The qualitative results on the Dayton Dataset are shown in Figure 1 above.
    2. CVUSA Dataset:

      Figure 5: Qualitative Results showing the images generated by baselines (Pix2pix, X-SO: Stacked Output) and the proposed X-Fork and X-Seq methods in a2g (aerial to ground) direction on CVUSA Dataset.

    3. SVA Dataset:

      Figure 6: Qualitative Results showing the images generated by baselines and the proposed Fork and Seq methods in a2g (aerial to ground) direction with and without Homography on SVA Dataset. [H – Methods utilizing homography, X – methods not using homography].
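Two of the quantitative metrics above are straightforward to sketch. The snippet below shows PSNR between a real and a generated image, and KL(model || data) between two class-probability distributions such as those produced by the Places-pretrained classifier. This is a minimal illustration, not the papers' exact evaluation code; in the actual protocol the distributions come from AlexNet predictions over the full image sets.

```python
import numpy as np

def psnr(real, fake, max_val=1.0):
    """Peak signal-to-noise ratio (dB) between two images in [0, max_val]."""
    mse = np.mean((real - fake) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def kl_model_data(p_model, p_data, eps=1e-12):
    """KL(model || data) between two class-probability vectors.

    eps guards against log(0); lower is better (distributions closer).
    """
    p, q = p_model + eps, p_data + eps
    return float(np.sum(p * np.log(p / q)))

# Toy example: a uniform "real" image vs. a generated one offset by 0.1
real = np.zeros((8, 8))
fake = np.full((8, 8), 0.1)
score = psnr(real, fake)        # MSE ~ 0.01, so roughly 20 dB
```

SSIM and sharpness difference follow the same pattern but involve local luminance/contrast statistics and gradient magnitudes respectively; library implementations (e.g. `skimage.metrics.structural_similarity`) are the practical choice there.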