GAMa: Cross-view Video Geo-localization



Shruti Vyas, Chen Chen, Mubarak Shah, GAMa: Cross-view Video Geo-localization, European Conference on Computer Vision, 2022.


Existing work in cross-view geo-localization follows an image-based approach in which a ground image is matched with an aerial image. In such an approach, the contextual information available in a video is lost. In this paper, we focus on the geo-localization of ground videos instead of images in order to utilize this context, i.e. how the view in one frame is located relative to another frame. To the best of our knowledge, there is no publicly available dataset that can be used for this problem. We have therefore collected a new dataset, named GAMa (Ground-video to Aerial-image Matching), which contains ground videos with GPS labels and corresponding aerial images.

We propose GAMa-Net as a benchmark method to solve this problem at the clip level, where we match each 0.5-second clip from a long video with the corresponding aerial view. Next, we propose a hierarchical approach that improves clip-level geo-localization performance while providing video-level geo-localization from the clip-level predictions. It takes the set of aerial images corresponding to the clips of a long video and matches them against a larger geographical area. It thereby makes use of the contextual information available in the sequence of clips that make up a longer video.


  • A novel problem formulation, i.e. cross-view video geo-localization, and a large-scale dataset, GAMa, with ground videos and corresponding aerial images. To the best of our knowledge, this is the first video dataset for this problem.
  • We propose GAMa-Net, which performs cross-view video geo-localization at clip-level by matching a ground video with aerial images using an image-video contrastive loss.
  • We also propose a novel hierarchical approach which provides video-level geo-localization and utilizes aerial images at different scales to improve the clip-level geo-localization.


The proposed GAMa (Ground-video to Aerial-image Matching) dataset comprises selected videos from BDD100k and aerial images from Apple Maps. For each video of around 40 sec., the dataset contains one large aerial image (1792×1792) and 49 uncentered small aerial images (256×256) covering that large aerial region.
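The image sizes above are consistent with a 7×7 grid, since 7 × 256 = 1792 and 7 × 7 = 49. The following is a minimal sketch of that relationship, assuming the small uncentered images are non-overlapping tiles of the large image; the source does not state how the tiles were actually extracted, and the helper names `tile_large_aerial` and `tile_index_for_pixel` are hypothetical.

```python
import numpy as np

LARGE = 1792  # side of the large aerial image (from the text)
SMALL = 256   # side of each small aerial image (from the text)


def tile_large_aerial(large):
    """Split a (1792, 1792, C) array into 49 tiles of shape (256, 256, C).

    Assumption: tiles are non-overlapping and laid out row by row.
    """
    assert large.shape[:2] == (LARGE, LARGE)
    n = LARGE // SMALL  # 7 tiles per side
    tiles = [
        large[r * SMALL:(r + 1) * SMALL, c * SMALL:(c + 1) * SMALL]
        for r in range(n)
        for c in range(n)
    ]
    return np.stack(tiles)


def tile_index_for_pixel(x, y):
    """Row-major index of the tile that contains pixel (x, y)."""
    return (y // SMALL) * (LARGE // SMALL) + (x // SMALL)
```

Under this assumption, a GPS label projected to a pixel position in the large image falls inside exactly one tile, but generally not at its center, which is one way to read "uncentered".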

Table 1 summarizes the dataset statistics. Since most of the videos have a GPS label every second, we divide the videos into smaller clips of 1 sec. each, and for each clip we have an uncentered small aerial image. We also have a centered set of 1.68M small aerial images, where each image is centered around the GPS label.


Figure 1: An outline of the proposed approach. From a given ground video, clips of 0.5 sec are input to GAMa-Net one at a time, and each clip is matched to an aerial image. The sequence of aerial images thus obtained for a video is input to the Screening network to retrieve the large aerial region for video-level geo-localization. Top predictions of larger aerial regions provide the updated gallery for GAMa-Net.

Figure 1 outlines the proposed approach, which has four steps. In Step-1, we use GAMa-Net (Figure 2), which takes one clip (0.5 sec) at a time and matches it with a small aerial image. Using multiple clips of a video, we get a sequence of aerial images for the whole video, i.e. around 40 images. In Step-2, we take the predicted aerial images and match them to the corresponding larger aerial region. We use a screening network for this, which matches features within the same view, i.e. the aerial view. In Step-3, we use the screening network predictions to reduce the gallery size (i.e. the search space) by selecting the top-ranked large aerial regions for a video. These large aerial regions define the new search area for that video. In Step-4, we use GAMa-Net again, i.e. the same network as in Step-1, but localize using the updated gallery.
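The four steps can be sketched as a retrieval pipeline over precomputed feature vectors. This is an illustrative sketch only: it assumes cosine similarity for matching, and the names `hierarchical_localize`, `screen_fn` (a stand-in for the screening network), `small_to_region` and `keep_frac` are hypothetical, not taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def retrieve_top1(query_feats, gallery_feats):
    """Index of the most similar gallery item for each query."""
    sims = l2_normalize(query_feats) @ l2_normalize(gallery_feats).T
    return sims.argmax(axis=1)


def hierarchical_localize(clip_feats, small_feats, small_to_region,
                          region_feats, screen_fn, keep_frac=0.01):
    """clip_feats: (T, D) clip features of one video.
    small_feats: (G, D) small aerial image features (the gallery).
    small_to_region: (G,) index of the large region each small image lies in.
    region_feats: (R, D) large aerial region features.
    screen_fn: maps a (T, D) sequence of aerial features to one (D,) vector.
    """
    # Step 1: match every clip against the full gallery.
    first_pass = retrieve_top1(clip_feats, small_feats)

    # Step 2: describe the predicted aerial sequence and compare it with
    # the large regions (aerial-to-aerial matching).
    seq_desc = screen_fn(small_feats[first_pass])
    region_sims = l2_normalize(region_feats) @ l2_normalize(seq_desc)

    # Step 3: keep only the top-ranked large regions.
    n_keep = max(1, int(round(keep_frac * len(region_feats))))
    top_regions = np.argsort(-region_sims)[:n_keep]
    keep = np.flatnonzero(np.isin(small_to_region, top_regions))

    # Step 4: match the clips again, now against the reduced gallery.
    second_pass = keep[retrieve_top1(clip_feats, small_feats[keep])]
    return top_regions, second_pass
```

Here `top_regions` plays the role of the video-level prediction and `second_pass` holds the refined clip-level predictions as indices into the original gallery.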

Figure 2: Network diagram of GAMa-Net, proposed for clip-level geo-localization. We use a 3D-CNN as our base network for learning features from a ground clip (around 0.5 sec). Similarly, for aerial image features, we use a 2D-CNN backbone. Since only some parts of the aerial images are covered by the video feed, using a transformer encoder improves the learning. Number of frames k=8, with skip rate=1.

An overview of the proposed GAMa-Net, used for clip-level geo-localization, is shown in Figure 2. GAMa-Net has a video encoder, the Ground Video Encoder (GVE), to extract features from ground video frames, and an image encoder, the Aerial Image Encoder (AIE), for aerial image features. Not all visual features in the ground video are equally important for matching with the aerial view. We train the network with a contrastive loss formulation based on NT-Xent. This is an image-video contrastive loss applied to features from two different visual modalities, i.e. ground videos and aerial images. In the hierarchical approach, we introduce video-level geo-localization in addition to clip-level geo-localization, which also helps in reducing the search space for GAMa-Net.
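To make the loss concrete, here is a minimal NumPy sketch of an NT-Xent-style contrastive loss between a batch of ground-clip features and their paired aerial-image features. The source only states that the loss is based on NT-Xent; the symmetric two-direction form, the use of cross-modal negatives only, the temperature value, and the function name `nt_xent_image_video` are assumptions made for illustration.

```python
import numpy as np


def nt_xent_image_video(video_feats, aerial_feats, temperature=0.1):
    """Cross-modal NT-Xent-style loss for N paired (clip, aerial) features.

    video_feats, aerial_feats: arrays of shape (N, D); row i of each
    forms a positive pair, and all other rows act as negatives.
    """
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    a = aerial_feats / np.linalg.norm(aerial_feats, axis=1, keepdims=True)

    # Temperature-scaled cosine similarities; the diagonal holds positives.
    logits = (v @ a.T) / temperature

    def cross_entropy_diag(z):
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the video-to-aerial and aerial-to-video directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

The loss is small when each clip is most similar to its own aerial image and grows when other aerial images in the batch score as high or higher.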


Figure 3: Geo-localization results for two query clips, using different models. The top row shows frames of the query clips. The second row shows Top-5 predictions by the combined model, and the third row shows predictions by GAMa-Net. The bottom row shows predictions by GAMa-Net with the hierarchical approach (gallery reduced to 1% of larger aerial regions). Correct predictions have a green outline. Owing to close GPS labels, there are multiple correct aerial images.

Figure 3 shows sample Top-5 predictions from different models, where the leftmost image is the Top-1 prediction and the rightmost is the 5th. The combined model, i.e. GAMa-Net without the transformer encoder, makes visually meaningful predictions. The left example is of a road without any crossings or red lights in sight, and the right example is of a city street with crossing markings on the road. The predictions by the combined model match these characteristics. However, in these samples, the ground truth is in the top-1% but not in the top-5 images. In GAMa-Net, multi-headed self-attention improves the network performance, and the correct prediction moves up into the top-5. The last row shows that the GAMa-Net results improve further with the hierarchical approach.
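The distinction between top-5 and top-1% above can be made precise with a small recall computation. This is a sketch under assumptions: it counts a query as correct if any of its correct aerial images appears within the cutoff (the text notes that a clip can have several), and the function names and the ceiling rule used to turn a percentage into a cutoff are illustrative choices, not taken from the paper.

```python
import numpy as np


def recall_at_k(sims, correct_mask, k):
    """Fraction of queries with at least one correct item in the top k.

    sims: (Q, G) query-to-gallery similarity scores.
    correct_mask: (Q, G) boolean, True where a gallery item is a correct
    match for the query (several per query are allowed).
    """
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = np.take_along_axis(correct_mask, topk, axis=1).any(axis=1)
    return hits.mean()


def recall_at_top_percent(sims, correct_mask, percent=1.0):
    """Recall with the cutoff set to a percentage of the gallery size."""
    k = max(1, int(np.ceil(sims.shape[1] * percent / 100.0)))
    return recall_at_k(sims, correct_mask, k)
```

With a large gallery, the top-1% cutoff is far more permissive than a top-5 cutoff, which is why a ground truth can sit inside the top-1% and still be missing from the five displayed predictions.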