
Bridging the Domain Gap for Ground-to-Aerial Image Matching


Publication

Krishna Regmi and Mubarak Shah. “Bridging the Domain Gap for Ground-to-Aerial Image Matching.” International Conference on Computer Vision (ICCV 2019), Seoul, South Korea, Oct 27-Nov 2, 2019.

Overview

The visual entities in cross-view (e.g., ground and aerial) images exhibit drastic domain changes due to the different viewpoints from which each set of images is captured. Existing state-of-the-art methods address the problem by learning view-invariant image descriptors. We propose a novel method that exploits the generative power of conditional GANs to synthesize an aerial representation of a ground-level panorama query and uses it to minimize the domain gap between the two views. Because the synthesized image comes from the same view as the reference (target) image, it helps the network preserve important cues in aerial images under our Joint Feature Learning approach. We fuse the complementary features from the synthesized aerial image with the original ground-level panorama features to obtain a robust query representation. In addition, we employ multi-scale feature aggregation to preserve image representations at the different scales useful for this complex task. Experimental results show that our proposed approach performs significantly better than state-of-the-art methods on the challenging CVUSA dataset in terms of top-1 and top-1% retrieval accuracy. Furthermore, we evaluate the generalization of the proposed method to urban landscapes on our newly collected cross-view localization dataset with geo-reference information.

Figure 1: Overall Pipeline: First we synthesize the cross-view images (aerial from ground), as shown in the upper panel. Then we use the synthesized aerial image to learn features for image matching.
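To make the two-stage pipeline concrete, below is a minimal PyTorch-style sketch; it is our illustration, not the authors' released code. A stand-in conditional GAN generator maps a ground-level panorama to a synthesized aerial image, which then feeds the matching stage. `GeneratorStub` and all tensor sizes are hypothetical placeholders.

```python
# Illustrative sketch only; GeneratorStub and all sizes are hypothetical
# stand-ins for the paper's conditional GAN (cGAN) generator.
import torch
import torch.nn as nn

class GeneratorStub(nn.Module):
    """Placeholder for an encoder-decoder cGAN generator."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, ground_panorama):
        return self.net(ground_panorama)

generator = GeneratorStub()
ground = torch.randn(1, 3, 128, 512)   # toy ground-level panorama
synth_aerial = generator(ground)       # stage 1: aerial-from-ground synthesis
# stage 2: synth_aerial joins the ground panorama as input to feature matching
```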

Problem and Motivation

The goal of this work is to develop a novel cross-view image matching network for geo-localization that first synthesizes cross-view images and then uses them to aid the matching pipeline. Most existing methods learn discriminative features directly from the aerial and ground (widely varying) views; in contrast, this work first synthesizes cross-view images that depict the scene from the target view.

Experimental Setup

Our pipeline has two stages. First, a conditional GAN synthesizes an aerial representation of the ground-level panorama query, minimizing the domain gap between the two views. Second, following our Joint Feature Learning approach, we fuse the complementary features of the synthesized aerial image with those of the original ground-level panorama to obtain a robust query representation, applying multi-scale feature aggregation to preserve image representations at the different scales useful for this complex task.
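Below is a hedged sketch of the matching stage under our reading of this setup: two CNN branches (one per view) tap feature maps at several depths, pool each map to a vector for multi-scale aggregation, and the ground and synthesized-aerial descriptors are concatenated into the fused query representation. The VGG-16 split points, pooling, and embedding size are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative sketch only: split points, pooling, and embedding size
# are our assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn
from torchvision import models

class MultiScaleEncoder(nn.Module):
    """Taps feature maps at three depths, pools each to a vector,
    and concatenates them into a multi-scale descriptor."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=None).features
        self.stages = nn.ModuleList([vgg[:10], vgg[10:17], vgg[17:24]])
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(self.pool(x).flatten(1))  # one vector per scale
        return torch.cat(feats, dim=1)             # 128 + 256 + 512 = 896 dims

ground_enc, aerial_enc = MultiScaleEncoder(), MultiScaleEncoder()
embed_query = nn.Linear(2 * 896, 512)   # fused two-branch query descriptor
embed_ref = nn.Linear(896, 512)         # single-branch reference descriptor

ground = torch.randn(2, 3, 224, 224)    # toy batches
synth_aerial = torch.randn(2, 3, 224, 224)
true_aerial = torch.randn(2, 3, 224, 224)

# fuse complementary ground + synthesized-aerial features into the query
query = embed_query(torch.cat([ground_enc(ground), aerial_enc(synth_aerial)], 1))
reference = embed_ref(aerial_enc(true_aerial))
```

In training, query and reference descriptors of matching locations would be pulled together under a metric-learning loss (e.g., a triplet-style objective), so that retrieval reduces to nearest-neighbor search in Euclidean distance.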

Evaluations

  • CVUSA Dataset
  • A match for a query street-view image is considered successful if the correct aerial image lies within the set of reference images closest in Euclidean distance of the learned feature representations (top-1 and top-1% in Figure 2; see the first sketch below this list).

    Figure 2: Comparison of different versions of our method with CVM-Net I and CVM-Net II [18] on the CVUSA dataset [45]. For references [18] and [45], see the paper.

  • UCF-OP Dataset
  • Here, we measure localization success using an error threshold, as shown in Figure 3: a query image is correctly geo-localized if its predicted location lies within a threshold distance in meters of its ground-truth position (see the second sketch below this list).

    Figure 3: Geo-localization results on the UCF-OP dataset with different error thresholds.
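The top-k retrieval criterion used for CVUSA can be made precise with a short NumPy sketch (ours, with random stand-in descriptors): a query counts as correct if its true aerial match ranks among the k references nearest in Euclidean distance; k = 1 gives top-1 and k = num_references / 100 gives top-1%.

```python
# Sketch of the retrieval metric; descriptors here are random stand-ins.
import numpy as np

def recall_at_k(query, ref, k):
    """Fraction of queries whose true match (same row index) is among
    the k references closest in Euclidean distance."""
    # squared pairwise distances via ||a||^2 + ||b||^2 - 2 a.b
    d2 = ((query**2).sum(1)[:, None] + (ref**2).sum(1)[None, :]
          - 2.0 * query @ ref.T)
    topk = np.argsort(d2, axis=1)[:, :k]            # k nearest refs per query
    hits = (topk == np.arange(len(query))[:, None]).any(axis=1)
    return hits.mean()

rng = np.random.default_rng(0)
refs = rng.normal(size=(1000, 128))                 # reference descriptors
queries = refs + 0.5 * rng.normal(size=refs.shape)  # noisy matching queries
print("top-1  :", recall_at_k(queries, refs, 1))
print("top-1% :", recall_at_k(queries, refs, max(1, len(refs) // 100)))
```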
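For UCF-OP, the error-threshold metric can be sketched the same way (again our illustration; the dataset's actual geo-tag format may differ): take the geo-tag of the best-retrieved aerial image as the predicted position, compute its great-circle distance to the ground truth with the standard haversine formula, and count the query as localized if the error falls below the threshold.

```python
# Illustrative metric code; array layouts, names, coordinates, and
# threshold values are our assumptions, not the dataset's API.
import numpy as np

EARTH_RADIUS_M = 6_371_000.0

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2.0 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

def localization_accuracy(pred, gt, threshold_m):
    """Fraction of queries whose predicted (lat, lon) lies within
    threshold_m meters of the ground-truth (lat, lon)."""
    err = haversine_m(pred[:, 0], pred[:, 1], gt[:, 0], gt[:, 1])
    return (err <= threshold_m).mean()

# toy usage with made-up coordinates
gt = np.array([[28.60, -81.20], [28.61, -81.21]])
pred = gt + np.array([[1e-4, 0.0], [5e-3, 0.0]])   # ~11 m and ~555 m errors
for t in (50, 100, 1000):                          # hypothetical thresholds
    print(f"acc @ {t} m:", localization_accuracy(pred, gt, t))
```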

Related Publications

[1] Yicong Tian, Chen Chen, and Mubarak Shah. “Cross-View Image Matching for Geo-localization in Urban Environments.”

[2] Amir Roshan Zamir. “Geo-Spatial Localization Using Google Street View.”