Summary
A fast, offline reverse geocoder in Python. This implementation uses a parallelised K-D tree and the details of this implementation will be presented. The key feature is speed; 10 million coordinates can be geocoded in less than 30 seconds. The library is released under the LGPL license and is available at https://github.com/thampiman/reverse-geocoder.
Description
Introduction
Reverse geocoding using online web services such as Google Maps is incredibly slow and is also restrictive in terms of the number of requests that can be made per day. Offline reverse geocoders have been built for PostGIS databases and also Python but are either complicated or slow. In this talk, I will be presenting a fast, offline reverse geocoder in Python. The basic outline of the talk is presented below.
The Library
The library improves on an existing one built by Richard Penman in the following ways:
- It supports Python 2 and 3.
- It geocodes a lot more location information. Besides the place name, city and country, the library returns the administrative regions (1 & 2) and the nearest latitude and longitude.
- But the key enhancement is performance. The library extends the K-D tree class in the scipy package and implements a parallelised version of it.
This reverse geocoder is released under the LGPL license and is available here.
Implementation
The first time the library is called, information on places with a population greater than 1000 is downloaded from the Geonames database, and it is stored locally. The GPS coordinates of these places are populated in a K-D tree and the nearest neighbour (NN) algorithm is then used to find the place closest to the input GPS coordinate. The scipy package provides a K-D tree class and this is extended to implement a multi-process version. In this talk, I will be presenting details of this implementation. A basic background in Python, numpy, multi-processing and shared memory is assumed. The K-D tree class in the scipy package supports only the Minkowski p-norm distance for the NN algorithm. Although this has not been released publicly, I will also be presenting a version of the library using the haversine formula for much more accurate geocoding.
Performance Study
The library supports two modes:
- Single-process mode (Mode 1)
- Multi-process mode (Mode 2): The default mode
A performance comparison of the two modes on a quad-core Macbook Pro is shown below.
Mode 2 runs 2x faster especially for large inputs, i.e. 10M coordinates.
Applications
In this part of the talk, I will discuss how the library is being used at OpenSignal, where I work as a data scientist. The main purpose for building the library was to be able to geocode terabytes of data (approx. 500M coordinates). Speed was therefore crucial. I will discuss methods on geocoding at this scale in real-time and also offline. I will also talk about how this open-source library is being used by the community.
Contributions by the Community
Since its release on Github on 27-Mar-2015, the open-source community has also been instrumental in testing, fixing bugs and implementing additional features. In this part of the talk, I will given an overview of the following two major changes made by other developers:
- Python 3 support, and
- C++ wrapper for the Python library.