Cross-View Image Matching for Geo-localization in Urban Environments
Yicong Tian
Chen Chen
Mubarak Shah
Center for Research in Computer Vision (CRCV), University of Central Florida (UCF)
tyc.cong@gmail.com
chenchen870713@gmail.com
shah@crcv.ucf.edu
Paper
Y. Tian, C. Chen, and M. Shah, "Cross-View Image Matching for Geo-localization in Urban Environments,"
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, July 22-25, 2017.
PDF
DATASET: Download the dataset here. Dataset description.
CODE: Download the code here.
PRESENTATION: The poster is available here.
Abstract
In this paper, we address the problem of cross-view
image geo-localization. Specifically, we aim to estimate
the GPS location of a query street view image by finding
the matching images in a reference database of geo-tagged
bird's eye view images, or vice versa. To this end,
we present a new framework for cross-view image geo-localization
by taking advantage of the tremendous success
of deep convolutional neural networks (CNNs) in image
classification and object detection. First, we employ
the Faster R-CNN [1] to detect buildings in the query and
reference images. Next, for each building in the query image,
we retrieve the k nearest neighbors from the reference
buildings using a Siamese network trained on both positive
matching image pairs and negative pairs. To find the correct
NN for each query building, we develop an efficient multiple
nearest neighbors matching method based on dominant
sets. We evaluate the proposed framework on a new dataset
that consists of pairs of street view and bird's eye view images.
Experimental results show that the proposed method
achieves better geo-localization accuracy than other approaches
and is able to generalize to images at unseen locations.
1. Problem & Motivation
The goal of this effort is to develop a novel method that automatically finds the geo-location of an image with an accuracy comparable to GPS devices.
In most image matching based geo-localization methods, the geo-location of a query image
is obtained by finding its matching reference images from the same view (e.g., street view images),
assuming that a reference dataset consisting of geo-tagged images is available. However, since only a
small number of cities in the world are covered by ground-level imagery, it has not been feasible to
scale ground-level image-to-image matching approaches up to a global level.
On the other hand, more complete coverage is available for overhead reference data such as satellite/aerial imagery
and digital elevation models (DEM). Therefore, an alternative is to predict the geo-location of a
query image by finding its matching reference images from some other view. For example, the geo-location
of a query street view image can be predicted based on a reference database of bird's eye view images, or vice versa.

Figure 1. An example of geo-localization by cross-view image
matching. The GPS location of a street view image is predicted
by finding its match in a database of geo-tagged bird's eye view
images.
2. Method
We present a new framework for cross-view image geo-localization. First, we employ
the Faster R-CNN [1] to detect buildings in the query and
reference images. Next, for each building in the query image,
we retrieve the k nearest neighbors from the reference
buildings using a Siamese network trained on both positive
matching image pairs and negative pairs. To find the correct
NN for each query building, we develop an efficient multiple
nearest neighbors matching method based on dominant sets. The final
geo-localization result is obtained by taking the mean GPS
location of selected reference buildings in the dominant set.

Figure 2. The pipeline of the proposed cross-view geo-localization method.
2.1 Building Detection
To find the matching image or images in the reference
database for a query image, we resort to matching buildings
between cross-view images since the semantic information
of images is more robust to viewpoint variations than appearance
features. Therefore, the first step is to detect
buildings in images. We employ the Faster R-CNN [1] to
achieve this goal due to its state-of-the-art performance for
object detection and real-time execution. In our application,
the detected buildings in a query image serve as query
buildings for retrieving the matching buildings in the reference
images. Figure 3 shows examples of the
building detection results in both street view and bird's
eye view images. Each detected bounding box is assigned
a score.

Figure 3. Building detection examples using Faster R-CNN.
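As an illustration of this detection step, the following is a minimal sketch that runs an off-the-shelf Faster R-CNN from torchvision on a single image. Note that the paper fine-tunes Faster R-CNN on the dataset's building annotations, whereas this stand-in uses COCO-pretrained weights, and the image file name is hypothetical.

```python
# Minimal sketch: detecting candidate objects with Faster R-CNN (torchvision).
# Assumption: COCO-pretrained weights stand in for the building detector that
# the paper obtains by fine-tuning on annotated buildings.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.open("street_view.jpg").convert("RGB")  # hypothetical file name
with torch.no_grad():
    pred = model([to_tensor(image)])[0]

# Keep confidently detected boxes; each detection carries a score, as in Figure 3.
keep = pred["scores"] > 0.7
query_boxes = pred["boxes"][keep]    # (N, 4) boxes in (x1, y1, x2, y2) format
query_scores = pred["scores"][keep]
```

The detected boxes are then cropped from the image and passed to the building matching network described next.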
2.2 Building Matching
For a query building detected from the previous building
detection phase, the next step is to search for its matches in
the reference images with known geo-locations. We adopt the Siamese network [2] to learn deep representations in
order to distinguish matched and unmatched building pairs
in cross-view images.
Let $X$ and $Y$ denote the street view and bird's eye view image training sets respectively.
A pair of building images $x \in X$ and $y \in Y$ are used as input to the Siamese network
which consists of two deep CNNs sharing the same architecture. $x$ and $y$ can be a matched
pair or an unmatched pair. The objective is to automatically learn a feature
representation, $f(\cdot)$, that effectively maps $x$ and $y$ from two different
views to a feature space, in which matched image pairs are close to each other
and unmatched image pairs are far apart.
In order to train the network towards this goal, the Euclidean distance of the
matched pairs in the feature space should be small (close to 0) while the distance
of the unmatched pairs should be large. We employ the contrastive loss:
$$
L(x,y,l) = \frac{1}{2} lD^2 + \frac{1}{2}(1-l) \left\{ \operatorname{max} (0, m-D) \right\}^2,
$$
where $l \in \{0,1\}$ indicates whether $x$ and $y$ are a matched pair, $D$ is the Euclidean distance between the two feature vectors $f(x)$ and $f(y)$, and $m$ is the margin parameter.
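As a concrete reference, the following is a minimal PyTorch sketch of this contrastive loss, assuming the two branches of the Siamese network share a single embedding network $f(\cdot)$ (the embedding network itself and the training loop are omitted).

```python
# Minimal sketch of the contrastive loss defined above (assumes shared-weight branches).
import torch
import torch.nn.functional as F

def contrastive_loss(fx, fy, label, margin=1.0):
    """fx, fy: (B, d) embeddings f(x), f(y); label: (B,) with 1 = matched, 0 = unmatched."""
    D = F.pairwise_distance(fx, fy)                               # Euclidean distance per pair
    loss_match = 0.5 * label * D.pow(2)                           # pull matched pairs together
    loss_unmatch = 0.5 * (1 - label) * F.relu(margin - D).pow(2)  # push unmatched pairs apart up to margin m
    return (loss_match + loss_unmatch).mean()

# Toy usage with random embeddings standing in for f(x) and f(y).
fx, fy = torch.randn(8, 128), torch.randn(8, 128)
label = torch.randint(0, 2, (8,)).float()
print(contrastive_loss(fx, fy, label))
```

Minimizing this loss drives $D$ toward 0 for matched pairs and above the margin $m$ for unmatched pairs, which is the geometry the nearest neighbor retrieval step relies on.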
2.3 Geo-localization Using Dominant Sets
A simple approach to geo-localization would be, for each detected building in the query image,
to take the GPS location of its nearest neighbor (NN) among the reference buildings, according to the building matching.
However, this is not optimal: in most cases the nearest neighbor does not correspond
to the correct match. Therefore, besides local matching (matching individual buildings),
we introduce a global constraint to help make a better geo-localization decision.
A query image typically contains multiple buildings whose GPS locations are close to one another;
therefore, the GPS locations of their matched reference buildings should be close as well.
The matching procedure proceeds in three steps (a minimal sketch is given after the list):

For each building in the query image, select its k NNs among the reference buildings.

Build a graph $G = (V, E, w)$ over the retrieved reference buildings, where
$w_{ij}$: edge weight reflecting the similarity between two reference buildings $i$ and $j$;
$d_{ij}$: distance between the GPS locations of $i$ and $j$ in Cartesian coordinates;
$s_{i}$: matching similarity between the query building and reference building $i$.

Select a dominant set [3] from the graph; the reference buildings it contains are taken as the correct matches.
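The following is a minimal sketch of this step, using replicator dynamics to extract a dominant set [3] from the graph and averaging the GPS locations of the selected reference buildings. The edge-weight formula combining $d_{ij}$ and $s_i$ used here is an illustrative assumption, not the paper's exact definition.

```python
# Minimal sketch: dominant-set selection via replicator dynamics [3].
import numpy as np

def dominant_set(A, n_iter=1000, tol=1e-6):
    """A: (n, n) symmetric non-negative affinity matrix with zero diagonal.
    Returns the converged participation weights x (support of a dominant set)."""
    n = A.shape[0]
    x = np.full(n, 1.0 / n)              # start from the barycenter of the simplex
    for _ in range(n_iter):
        x_new = x * (A @ x)              # replicator dynamics update
        s = x_new.sum()                  # equals x^T A x
        if s == 0:
            break
        x_new /= s
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x

# Toy example: 6 candidate reference buildings (k NNs pooled over query buildings).
gps = np.random.rand(6, 2)                                  # stand-in GPS locations
sim = np.random.rand(6)                                     # s_i: similarity to the query buildings
d = np.linalg.norm(gps[:, None] - gps[None, :], axis=-1)    # d_ij: pairwise location distances
W = np.exp(-d) * np.sqrt(sim[:, None] * sim[None, :])       # w_ij: illustrative choice (assumption)
np.fill_diagonal(W, 0.0)

x = dominant_set(W)
selected = x > 1.0 / (2 * len(x))           # heuristic threshold on participation weights
predicted_gps = gps[selected].mean(axis=0)  # mean GPS of the dominant set (final estimate)
```

Reference buildings with non-negligible participation weight form the dominant set; their mean GPS location gives the predicted geo-location of the query image.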

3. Dataset
To explore the geo-localization task using cross-view image matching,
we have collected a new dataset of street view and bird's eye view image pairs around downtown Pittsburgh,
Orlando and part of Manhattan. For this dataset we use the list of GPS coordinates from the Google Street View Dataset [4].
There are $1,586$, $1,324$ and $5,941$ GPS locations in Pittsburgh, Orlando and Manhattan, respectively.
We utilize DualMaps to generate side-by-side street view and bird's eye view images at each GPS location with the same heading direction.
The street view images are from Google and the overhead $45^{\circ}$ bird's eye view images are from Bing.
For each GPS location, four image pairs are generated with camera heading directions of $0^\circ$, $90^\circ$, $180^\circ$ and $270^\circ$.
In order to learn the deep network for building matching, we annotate corresponding buildings in every street view and bird's eye view image pair.

Figure 4. Sampled GPS locations in Pittsburgh, Orlando and part of Manhattan.
4. Results
To evaluate how the proposed approach generalizes to
an unseen city, we hold out all images from Manhattan exclusively
for testing. Part of the images from Pittsburgh and Orlando
are used for training (Figure 4).
Comparison of the Geo-localization Results
To demonstrate the advantage of using building matching for
cross-view image geo-localization, we conduct an experiment
in which a Siamese network is trained to match full images
directly, as was done in existing methods such as
[5-7]. No building detection is applied to the images.
Pairs of images taken at the same GPS location with the
same camera heading direction are used as positive training
pairs for the Siamese network, and negative training image pairs are
randomly sampled. The network structure and setup are the
same as for the Siamese network used for building matching. During
testing, the GPS location of a query image is determined by
its best match, and no multiple nearest neighbors matching
process is needed. Experiments using 1 image as the query
and using all 4 views at a location as queries are performed, and the results
are illustrated in Figure 5. It is clear that geo-localization
by building matching, which leverages the power of deep
learning, outperforms geo-localization by matching a hand-crafted local
feature, i.e., SIFT. Our proposed approach also outperforms
random selection by a large margin.
Geo-localization by full image
matching performs worse than building matching
when 4 views are used as query images.
Moreover, querying with 4
images of four directions at one location improves the geo-localization
accuracy by a large margin compared to using
only 1 image as the query.

Figure 5. Geo-localization results with different error thresholds.
(Left) Results of using street view images as query and bird's eye
view images as reference. (Right) Results of using bird's eye view
images as query and street view images as reference.
Evaluation on Unseen Locations
We also verify whether the proposed method can generalize
to unseen cities. Specifically, we use images from
the cities of Pittsburgh and Orlando to train the model (building
detection and building matching) and test it on images
of the Manhattan area in New York City.
As can be seen from the GPS locations in the Manhattan area
in Figure 4, this geo-localization experiment works at city
scale. In addition, tall, densely packed buildings are common
in Manhattan images, making the geo-localization task very
challenging. The geo-localization results for the Manhattan
area are shown in Figure 6. The curves for Manhattan images
are lower than those in Figure 5 because the test area
in this experiment is much larger. The fact that our geo-localization
results are still much better than the baseline
method (SIFT matching) demonstrates the ability
of our proposed approach to generalize to unseen cities.

Figure 6. Geo-localization results on Manhattan images with different
error thresholds. (Left) Results of using street view images
as query and bird's eye view images as reference. (Right) Results of
using bird's eye view images as query and street view images as
reference.
References
[1] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards
real-time object detection with region proposal networks. In
NIPS, pages 91-99, 2015.
[2] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity
metric discriminatively, with application to face verification.
In CVPR, volume 1, pages 539-546, 2005.
[3] M. Pavan and M. Pelillo. Dominant sets and pairwise clustering.
IEEE TPAMI, 29(1):167-172, 2007.
[4] A. R. Zamir and M. Shah. Image geo-localization based on
multiple nearest neighbor feature matching using generalized
graphs. IEEE TPAMI, 36(8):1546-1558, 2014.
[5] T. Y. Lin, Y. Cui, S. Belongie, and J. Hays. Learning
deep representations for ground-to-aerial geolocalization. In
CVPR, pages 5007-5015, June 2015.
[6] N. N. Vo and J. Hays. Localizing and orienting street views
using overhead imagery. In ECCV, pages 494-509, 2016.
[7] S. Workman, R. Souvenir, and N. Jacobs. Wide-area image
geolocalization with aerial reference imagery. In ICCV,
pages 3961-3969, 2015.