No matter how many times you’ve flown, sitting in the window seat and watching the world shrink away as the plane takes off never seems to grow old. Towering trees and skyscrapers become mere pixels, roads and rivers narrow into thin winding ribbons, and vast tracts of land appear as tiny thumbnails below.
The familiar can become unrecognisable as we’re transported from the ground up into the air. People sometimes struggle with this change in perspective, and it turns out machines do too — especially those tasked with helping to make navigation easier.
Striving to create more accurate geolocation systems, researchers have in recent years been making use of satellite imagery. The underlying idea is simple: take the image in question and compare it with those from a database of geotagged satellite images. Find a match and you’ll be able to pinpoint your location. The snag, however, is that such ground-to-aerial matching — with its potential for use in navigation, autonomous vehicles, augmented reality and other applications — is incredibly challenging.
“It’s difficult because of a drastic change in viewpoints,” says Assistant Professor Gim Hee Lee, who studies computer vision and robotic perception at the National University of Singapore’s (NUS) School of Computing. “When you compare two images from satellite and street views, they’re hardly recognisable.”
Cross-view matching, as it’s formally called, has gained increasing attention in recent years. Traditionally, geo-localisation involves comparing two images, a query against a reference, both taken from the ground view. This approach is relatively easy to implement but suffers from two main drawbacks. “Your reference map needs to be well covered,” says Lee. “But it’s impossible to access every part of the world no matter how much money or manpower you have.”
Furthermore, reference images, often crowdsourced from sites such as Flickr, tend to be heavily biased. Images of popular places are abundantly available, while those of more isolated areas are lacking. “For example, in Singapore, you see a lot of images that are focused on Gardens by the Bay, Marina Bay Sands, or the Merlion,” says Lee. “But if you want to navigate to NUS, then there will be very few images. Not to mention the heartlands like Clementi or Ang Mo Kio.”
Employing satellite images can help overcome these issues. “We can easily access them, and they have worldwide coverage,” says Lee. This explains why ground-to-aerial matching systems have become increasingly popular for geo-localisation in recent years.
Still, one big hurdle remains: how to overcome the drastic change in viewpoint when comparing an image taken on the ground to one taken up above.
Aggregating features
Spurred on by this challenge, Lee and his PhD student Sixing Hu began working on a possible solution in early 2017. What they came up with was the Cross-View Matching Network, or CVM-Net, a machine-learning-based algorithm that makes ground-to-aerial geo-localisation possible.
“We exploit the very popular deep learning approach because it can extract features from images in a very powerful way,” says Lee. Feature extraction, the identification of distinctive visual patterns in a given image, is the first step of CVM-Net.
The second stage involves aggregating these features to form a unique signature for each image. “Just like how your thumbprint is unique to you, the signature is unique to the image,” he explains. The signature generated, recorded as a string of numbers, can then be compared against pre-computed, geotagged ones in the database of satellite images to determine the location in question.
Crucially, it’s the creation of this distinctive thumbprint that has made ground-to-aerial localisation possible. “This particular step actually makes the whole process more robust and rotationally invariant,” says Lee. In other words, the signature formed by aggregating features within an image can be used to pinpoint its location, regardless of the illumination or orientation of the picture.
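The two-stage idea can be pictured in code. The sketch below is only a minimal illustration, not the actual CVM-Net architecture: it borrows a pretrained CNN as a stand-in feature extractor and uses simple average pooling and cosine similarity in place of the learned aggregation and matching that CVM-Net trains on ground–satellite image pairs; all function names and parameter choices here are illustrative assumptions.

```python
# Illustrative sketch (not CVM-Net itself): extract local features with a CNN,
# aggregate them into a single fixed-length "signature", then match that
# signature against pre-computed, geotagged satellite signatures.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Step 1: a pretrained CNN backbone, truncated before the classifier,
# serves as the feature extractor in this sketch.
backbone = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def signature(image_path: str) -> torch.Tensor:
    """Compute a global descriptor (the image's "signature")."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feats = backbone(img)                    # local feature maps, shape (1, C, H, W)
    desc = feats.mean(dim=(2, 3)).squeeze(0)     # step 2: aggregate into one vector
    return desc / desc.norm()                    # L2-normalise for cosine similarity

def localise(query_path: str, db_signatures: torch.Tensor, db_coords: list):
    """Match a ground-view query against pre-computed satellite signatures.

    db_signatures: (N, D) tensor of normalised signatures of geotagged tiles.
    db_coords: list of N (latitude, longitude) geotags, one per tile.
    """
    q = signature(query_path)
    scores = db_signatures @ q                   # cosine similarity to every reference
    best = int(scores.argmax())
    return db_coords[best], float(scores[best])  # geotag of the closest match
```

In the real system, the aggregation step is trained so that a ground-view photo and the satellite tile of the same spot end up with nearby signatures, which is what makes the nearest-neighbour lookup meaningful despite the drastic change in viewpoint.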
A moonshot
After training the CVM-Net model, the researchers tested its effectiveness using two large datasets: one involved nearly 9,000 image pairs, the other close to a million. In both instances, CVM-Net outperformed all other geo-localisation approaches in identification accuracy.
The researchers then proceeded to do real-world testing. Using a car fitted with 12 infrared cameras offering views in four directions, the team drove around two test sites (one urban and the other rural) in Singapore. The tests demonstrated, for the first time, that by simply providing images or videos of your surroundings from a moving vehicle, CVM-Net can tell you where you are in real time.
The impact of Lee and Hu’s work has been far-reaching. “All the subsequent research has followed what we are doing,” says Lee. “We became a benchmark that everybody has to follow in order to reach this kind of performance in ground-to-aerial geo-localisation.”
Work in the field is, however, far from over. “I don’t claim that we have solved the problem,” says Lee. “There are still a lot of other problems that remain.”
One thing he and other researchers are looking into is how to do semantic labelling. “Let’s say I show you a map, can you show me where all the road networks are? Or which ones are buildings?” he says.
Generalisation is another big issue in the field. “If you train your network on a dataset from one geographic location, will it also work when you bring your car to another part of the world?” says Lee.
Despite the challenges that remain, Lee is proud of how far his team has come. “When we first began, I was quite sceptical. This was like a moonshot thing because it sounded almost impossible to do in reality,” he recalls. “But then we showed a proof-of-concept and CVM-Net actually worked on a real vehicle.”