Cross-View Image Matching for Geo-localization in Urban Environments
In this paper, we address the problem of cross-view image geo-localization. Specifically, we aim to estimate the GPS location of a query street view image by finding the matching images in a reference database of geotagged bird’s eye view images, or vice versa. To this end, we present a new framework for cross-view image geo-localization by taking advantage of the tremendous success of deep convolutional neural networks (CNNs) in image classification and object detection. First, we employ the Faster R-CNN [1] to detect buildings in the query and reference images. Next, for each building in the query image, we retrieve the k nearest neighbors from the reference buildings using a Siamese network trained on both positive matching image pairs and negative pairs. To find the correct NN for each query building, we develop an efficient multiple nearest neighbors matching method based on dominant sets. We evaluate the proposed framework on a new dataset that consists of pairs of street view and bird’s eye view images. Experimental results show that the proposed method achieves better geo-localization accuracy than other approaches and is able to generalize to images at unseen locations.
The goal of this effort is to develop a novel method which automatically finds the geo-location of an image with an accuracy comparable to GPS devices. In most image matching based geo-localization methods, the geo-location of a query image is obtained by finding its matching reference images from the same view (e.g., street view images), assuming that a reference dataset consisting of geo-tagged images is available. However, since only a small number of cities in the world are covered by ground-level imagery, it has not been feasible to scale up ground-level image-to-image matching approaches to a global level.
On the other hand, more complete coverage is available for overhead reference data such as satellite/aerial imagery and digital elevation models (DEMs). Therefore, an alternative is to predict the geo-location of a query image by finding its matching reference images from a different view. For example, the geo-location of a query street view image can be predicted using a reference database of bird’s eye view images, or vice versa.
Figure 1. An example of geo-localization by cross-view image matching. The GPS location of a street view image is predicted by finding its match in a database of geo-tagged bird’s eye view images.
We present a new framework for cross-view image geo-localization. First, we employ the Faster R-CNN [1] to detect buildings in the query and reference images. Next, for each building in the query image, we retrieve the k nearest neighbors from the reference buildings using a Siamese network trained on both positive matching image pairs and negative pairs. To find the correct NN for each query building, we develop an efficient multiple nearest neighbors matching method based on dominant sets. The final geo-localization result is obtained by taking the mean GPS location of the selected reference buildings in the dominant set.
Figure 2. The pipeline of the proposed cross-view geo-localization method.
To find the matching image or images in the reference database for a query image, we resort to matching buildings between cross-view images, since the semantic information of images is more robust to viewpoint variations than appearance features. Therefore, the first step is to detect buildings in images. We employ the Faster R-CNN [1] to achieve this goal due to its state-of-the-art performance for object detection and real-time execution. In our application, the detected buildings in a query image serve as query buildings for retrieving the matching buildings in the reference images. Figure 3 shows examples of building detection results in both street view and bird’s eye view images. Each detected bounding box is assigned a score.
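As an illustration of this detection step, the following is a minimal sketch using the off-the-shelf Faster R-CNN implementation in torchvision. The paper fine-tunes the detector on annotated building boxes, which is not reproduced here; the generic COCO-pretrained weights and the score threshold are illustrative assumptions.

```python
# Minimal sketch: detect objects with a Faster R-CNN detector (torchvision).
# The paper's building-specific fine-tuning is not reproduced here.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

def detect_buildings(image_path, score_threshold=0.7):
    """Return bounding boxes and scores for detections above a threshold."""
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    model.eval()
    image = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        outputs = model([image])[0]  # dict with "boxes", "labels", "scores"
    keep = outputs["scores"] > score_threshold
    return outputs["boxes"][keep], outputs["scores"][keep]

# Example usage: boxes, scores = detect_buildings("street_view.jpg")
```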
Figure 3. Building detection examples using Faster R-CNN.
For a query building detected in the previous building detection phase, the next step is to search for its matches in the reference images with known geo-locations. We adopt the Siamese network [2] to learn deep representations that distinguish matched from unmatched building pairs in cross-view images. Let $X$ and $Y$ denote the street view and bird’s eye view image training sets, respectively. A pair of building images $x \in X$ and $y \in Y$ is used as input to the Siamese network, which consists of two deep CNNs sharing the same architecture. $x$ and $y$ can be a matched pair or an unmatched pair. The objective is to automatically learn a feature representation, $f(\cdot)$, that effectively maps $x$ and $y$ from two different views to a feature space in which matched image pairs are close to each other and unmatched image pairs are far apart. In order to train the network towards this goal, the Euclidean distance of matched pairs in the feature space should be small (close to 0) while the distance of unmatched pairs should be large. We employ the contrastive loss: $$ L(x,y,l) = \frac{1}{2} l D^2 + \frac{1}{2}(1-l) \left\{ \max (0, m-D) \right\}^2, $$ where $l \in \{0,1\}$ indicates whether $x$ and $y$ are a matched pair, $D$ is the Euclidean distance between the two feature vectors $f(x)$ and $f(y)$, and $m$ is the margin parameter.
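The following is a minimal sketch of this contrastive loss, assuming `fx` and `fy` are the embeddings $f(x)$ and $f(y)$ produced by the two weight-sharing branches; the margin value and variable names are illustrative, not taken from the paper.

```python
# Sketch of the contrastive loss L(x, y, l) above.
import torch
import torch.nn.functional as F

def contrastive_loss(fx, fy, label, margin=1.0):
    """label = 1 for matched street/bird's-eye building pairs, 0 otherwise."""
    d = F.pairwise_distance(fx, fy)                             # Euclidean distance D
    matched = 0.5 * label * d.pow(2)                            # pull matched pairs together
    unmatched = 0.5 * (1 - label) * F.relu(margin - d).pow(2)   # push unmatched pairs apart up to margin m
    return (matched + unmatched).mean()
```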
2.3 Geo-localization Using Dominant Sets
A simple approach to geo-localization would be, for each detected building in the query image, to take the GPS location of its nearest neighbor (NN) among the reference buildings according to the building matching. However, this is not optimal; in fact, in most cases the nearest neighbor does not correspond to the correct match. Therefore, besides local matching (matching individual buildings), we introduce a global constraint to help make better geo-localization decisions: a query image typically contains multiple buildings whose GPS locations are close to each other, so the GPS locations of their matched reference buildings should be close as well.
- For each building in a query image, select its k NNs from the reference images.
- Build a graph $G = (V, E, w)$ over the retrieved reference buildings, where
  - $w_{ij}$: weight reflecting the similarity between two reference buildings $i$ and $j$,
  - $d_{ij}$: distance between the GPS locations of $i$ and $j$ in Cartesian coordinates,
  - $s_{i}$: similarity between the query building and reference building $i$.
- Select the dominant set [3] of $G$ (a sketch of this selection step follows the list).
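The dominant set in the last step can be extracted with the replicator dynamics of Pavan and Pelillo [3]. Below is a minimal sketch, assuming the affinity matrix `W` (symmetric, non-negative, zero diagonal) has already been built from the edge weights $w_{ij}$ above; how the paper combines $d_{ij}$ and $s_i$ into $w_{ij}$ is not reproduced here, and the tolerance values are illustrative.

```python
# Sketch of dominant-set selection via replicator dynamics (Pavan & Pelillo [3]).
import numpy as np

def dominant_set(W, max_iters=1000, tol=1e-6, support_threshold=1e-4):
    """Return indices of nodes in a dominant set of affinity matrix W."""
    n = W.shape[0]
    x = np.full(n, 1.0 / n)                 # start from the barycenter of the simplex
    for _ in range(max_iters):
        x_new = x * (W @ x)                 # replicator dynamics update
        x_new /= x_new.sum()
        if np.linalg.norm(x_new - x, 1) < tol:
            x = x_new
            break
        x = x_new
    return np.flatnonzero(x > support_threshold)  # support = selected reference buildings

# Example usage: members = dominant_set(W); the final estimate is the mean
# GPS location of the selected reference buildings.
```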
To explore the geo-localization task using cross-view image matching, we have collected a new dataset of street view and bird’s eye view image pairs around downtown Pittsburgh, Orlando and part of Manhattan. For this dataset we use the list of GPS coordinates from the Google Street View Dataset [4]. There are $1,586$, $1,324$ and $5,941$ GPS locations in Pittsburgh, Orlando and Manhattan, respectively. We utilize DualMaps to generate side-by-side street view and bird’s eye view images at each GPS location with the same heading direction. The street view images are from Google and the overhead $45^{\circ}$ bird’s eye view images are from Bing. For each GPS location, four image pairs are generated with camera heading directions of $0^\circ$, $90^\circ$, $180^\circ$ and $270^\circ$. In order to learn the deep network for building matching, we annotate corresponding buildings in every street view and bird’s eye view image pair.
Figure 4. Sampled GPS locations in Pittsburgh, Orlando and part of Manhattan.
To evaluate how the proposed approach generalizes to unseen cities, we hold out all images from Manhattan exclusively for testing. A portion of the images from Pittsburgh and Orlando is used for training (Figure 4).
- Comparison of the Geo-localization Results
To demonstrate the advantage of using building matching for cross-view image geo-localization, we conduct an experiment in which a Siamese network is trained to match full images directly, as done in existing methods such as [5-7]. No building detection is applied to the images. Pairs of images taken at the same GPS location with the same camera heading direction are used as positive training pairs for the Siamese network, and negative training pairs are randomly sampled. The network structure and setup are the same as for the Siamese network used in building matching. During testing, the GPS location of a query image is determined by its best match, so no multiple nearest neighbors matching is necessary. We perform experiments using either 1 image or all 4 views at a location as the query; the results are illustrated in Figure 5. Geo-localization by building matching, which leverages the power of deep learning, clearly outperforms matching with hand-crafted local features (i.e., SIFT), and our proposed approach outperforms random selection by a large margin. Geo-localization by full image matching performs worse than building matching with 4-view queries. Moreover, querying with 4 images of the four directions at one location improves the geo-localization accuracy by a large margin compared to using only 1 image as a query.
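The curves in Figure 5 report, for each error threshold, the fraction of queries whose predicted location lies within that distance of the ground truth. Below is a minimal sketch of this kind of evaluation, assuming predicted and ground-truth coordinates are given as (latitude, longitude) pairs in degrees; the function names are illustrative, not from the paper.

```python
# Sketch: fraction of queries localized within each distance threshold (meters).
import numpy as np

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between lat/lon points given in degrees."""
    R = 6371000.0
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * R * np.arcsin(np.sqrt(a))

def accuracy_at_thresholds(pred, gt, thresholds):
    """pred, gt: (N, 2) arrays of (lat, lon); returns accuracy per threshold."""
    errors = haversine_m(pred[:, 0], pred[:, 1], gt[:, 0], gt[:, 1])
    return [(errors <= t).mean() for t in thresholds]
```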
Figure 5. Geo-localization results with different error thresholds. (Left) Results of using street view images as query and bird’s eye view images as reference. (Right) Results of using bird’s eye view images as query and street view images as reference.
- Evaluation on Unseen Locations
We also verify whether the proposed method generalizes to unseen cities. Specifically, we use images from the cities of Pittsburgh and Orlando to train the model (building detection and building matching) and test it on images of the Manhattan area in New York City. As can be seen from the GPS locations of the Manhattan area in Figure 4, this geo-localization experiment operates at city scale. In addition, tall and densely packed buildings are common in the Manhattan images, making the geo-localization task very challenging. The geo-localization results for the Manhattan area are shown in Figure 6. The curves for Manhattan images are lower than those in Figure 5 because the test area in this experiment is much larger. The fact that our geo-localization results are still much better than the baseline method (SIFT matching) demonstrates that the proposed approach generalizes to unseen cities.
Figure 6. Geo-localization results on Manhattan images with different error thresholds. (Left) Results of using street view images as query and bird’s eye view images as reference. (Right) Results of using bird’s eye view images as query and street view images as reference.
Y. Tian, C. Chen, and M. Shah, “Cross-View Image Matching for Geo-localization in Urban Environments,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, July 22-25, 2017.
PDF
Code: Download code here.
Presentation: The poster is available here.
[1] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, pages 91-99, 2015.
[2] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In CVPR, volume 1, pages 539-546, 2005.
[3] M. Pavan and M. Pelillo. Dominant sets and pairwise clustering. IEEE TPAMI, 29(1):167-172, 2007.
[4] A. R. Zamir and M. Shah. Image geo-localization based on multiple nearest neighbor feature matching using generalized graphs. IEEE TPAMI, 36(8):1546-1558, 2014.
[5] T. Y. Lin, Y. Cui, S. Belongie, and J. Hays. Learning deep representations for ground-to-aerial geolocalization. In CVPR, pages 5007-5015, June 2015.
[6] N. N. Vo and J. Hays. Localizing and orienting street views using overhead imagery. In ECCV, pages 494-509, 2016.
[7] S. Workman, R. Souvenir, and N. Jacobs. Wide-area image geolocalization with aerial reference imagery. In ICCV, pages 3961-3969, 2015.