Saturday 6 March 2010

Open Source Geocoders

One of the more interesting projects we have going on here at Refractions is to build a geocoder for use in a crime-mapping application we are developing for a client. We do have an existing geocoder codebase developed for another project. But we're not 100% happy with its performance and customizability, so we decided to look into developing a new library specifically for this project.

Of course the first thing we did was carry out a technology review of all the open-source geocoders we could find. Here's a list of all the ones we looked at:
  • Geo-Coder-US - A Perl module developed by the ubiquitous Schuyler Erle. "For geocoding US addresses, that is, estimating the latitude and longitude of any street address or intersection in the United States, using the TIGER/Line data set". Probably no longer being developed, since it has been superseded by
  • GeoCommons Geocoder::US - a rewrite of Geo-Coder-US into Ruby (and also requiring C and SQLite). "Although it is primarily intended for use with the US Census Bureau’s free TIGER/Line dataset, it uses an abstract US address data model that can be employed with other sources of US street address range data"
  • JGeoCoder - A Java API loosely modelled after Geo::Coder::US. Works against a SQL database loaded with TIGER data (an H2 image is supplied). Last activity in 2008.
  • Explorer GeoCoder by SRC - A C++ library for "a data and country independent geocoding engine" which can "assign latitude and longitude coordinates to any United States street address or intersection". Has an active mailing list.
  • Frost Tiger Geocoder by Stephen Frost et al - a Postgres SQL library for geocoding against TIGER data
(This is summarized in tabular form on this Tsusiat page).

Some observations:
  • All of the engines implement parsing and matching logic purely in code. None of them provide a declarative description language to allow easy modification of parsing, standardization, and matching rules. (To be fair, this is a bit of a tall order, and it's not clear that it's even possible to provide an understandable declarative language for the fully general case. For example, the ArcMap geocoder (which appears to be the old MatchWare engine) provides a geocoding definition language (actually 5 different ones) - but the languages look scarily complex! Nonetheless, this is an important feature for ease of maintenance and customization.)
  • JGeoCoder uses a large number of complex regular expressions to perform parsing. This looks like it would be difficult to customize, due to the well-known opaqueness of large REs, and perhaps also to the relative inflexibility of the RE paradigm.
  • The Geocoder::US Ruby module seems to be the simplest code base. (I ended up almost understanding its parsing algorithm 8^) It uses REs, but in a saner amount. However, it's unclear how well it deals with erroneous input data, and how easy it would be to modify for a different address model.
  • The Explorer geocoder uses a large amount of fairly complex C++ code. It also looked quite challenging to understand and modify.
  • In all the projects the parser design appears to be fairly ad-hoc and poorly documented. This doesn't inspire confidence that it would be possible to modify the parser to support a different address model, or to handle particular kinds of input errors. (Geocoder::US is a possible exception to this - it has a relatively simple parsing algorithm with at least some documentation.)
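To make the "rules as data" idea from the first observation a bit more concrete, here is a minimal Python sketch. It is not taken from any of the projects above: the `RULES` table, the class names, and the `classify`/`standardize` functions are all hypothetical. The only point is that the standardization rules live in a table that could be edited (or loaded from a file) without touching the matching code.

```python
import re

# Hypothetical rule tables (an assumption for illustration). In a real system
# these might be loaded from a text file so they can be edited without
# changing any code.
RULES = {
    "DIR": {"NORTH": "N", "N": "N", "SOUTH": "S", "S": "S",
            "EAST": "E", "E": "E", "WEST": "W", "W": "W"},
    "SUFFIX": {"STREET": "ST", "ST": "ST", "AVENUE": "AVE", "AVE": "AVE",
               "ROAD": "RD", "RD": "RD"},
}

def classify(token):
    """Return (class, standardized value) for one upper-case token."""
    if token.isdigit():
        return ("NUMBER", token)
    for cls, table in RULES.items():
        if token in table:
            return (cls, table[token])
    return ("WORD", token)

def standardize(address):
    """Split an address into alphanumeric tokens and classify each one."""
    tokens = re.findall(r"[A-Z0-9]+", address.upper())
    return [classify(t) for t in tokens]

print(standardize("123 North Main Street."))
```

A context-free lookup like this is deliberately naive. It cannot tell the suffix "ST" from the "St" in "St Charles", for example, which is exactly the kind of case where the parsing and matching rules start to get complicated.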
In the end we decided not to use any of these projects. I'll talk about what we did do in another post.

11 comments:

v said...

Hi,
Did you have a look too at PAGC ( http://www.pagcgeo.org/ ) ?

Is your work related to the OpenGeocoder initiative ( http://wiki.osgeo.org/wiki/OpenGeocoder ) ?

Dr JTS said...

I saw the PAGC project just as I was posting this. I'll do another post mentioning it and another project as well.

The OpenGeoCoder project sounds great, but appears to be somewhat dead in the water. Any idea if there is any code attached to it, or if it is still active?

v said...

Apparently OpenGeocoder aims to refactor the PAGC geocoder. There was an intention to run some GSOC projects last year, but it doesn't seem that happened in the end.
PGGeocoder's developers are also involved in OpenGeocoder, or at least interested:
http://pggeocoder.postlbs.org/

I'd be glad to hear some more info on opensource geocoding in general, and your project in particular :)

vincent

Michael said...

We've faced similar problems getting the best from multiple geocoders, so we put together a C# library that allows our projects to use a combination of publicly available and custom geocoders. It allows geocode requests to be pre- and post-processed to meet project-specific requirements. It also allows us to switch to a new geocoder by adding a dll and changing config files, rather than rebuilding the project.
http://sourceforge.net/projects/omgeo/

Andrew Harvey said...

Know any open-source geocoders that work with OpenStreetMap data?

Firefishy said...

@Andrew Nominatim is currently the preferred OpenStreetMap Geocoder/reverse-geocoder.

Site:
http://nominatim.openstreetmap.org/

Details:
http://wiki.openstreetmap.org/wiki/Nominatim

Andrew Harvey said...

@Firefishy

great! just what I was looking for.

George Silva said...

This is very interesting. I did a thesis adapting an algorithm created by David Bitner to process Brazilian addresses.

I wonder if people imagine that there are tons and tons of users around the globe that _do not_ use TIGER lines.

A more generic algorithm is much more interesting. If there are any projects around that I can contribute to, please let me know :D

Stephen Woodbridge said...

The OpenGeocoder project will probably get done in the context of the PAGC project. We have already changed the licensing of PAGC to be MIT-X like and have started efforts to get the code refactored with support for SQLite and PostgreSQL. It is a slow process, as we currently have only one part-time developer, who needs to be funded for any major changes. I just added support for a single-line address parser. So things are moving along, even if somewhat slowly.

MetalMASK said...

My company faces a very similar situation to yours, and I've been using the CASS stage 1 test set (what USPS uses to test address normalization software) to test them. JGeocoder actually performs OK, even though the data it comes with covers only one state at the street level (PA). GeoCommons seems to be still in development, as quite a few crucial features (such as accuracy) are not working yet.


Although I agree with most of your observations, I still have to say that RegExes are the best solution so far (what alternative do we have?) for dealing with messy input. Their hard limitation is that they are not very easy to extend.

I am curious what your company did as a solution to the geocoding challenge.

Dr JTS said...

@Alex:

Thanks for sharing your practical experience.

Qualified agreement on the utility of RegExes. Our solution to address parsing used RegExes to identify and clean input errors, and then a regular grammar to parse the addresses. This worked well, and was actually quite extensible/easy to modify. The key concept was to create a parser generator for address "languages". Unfortunately I didn't get this quite to the point of releasable code, but the approach proved out in practical use.
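For illustration only, here is a minimal Python sketch of that two-stage idea. It is not the code described above: the cleanup rules, the token classes, and the tiny grammar syntax (`field:CLASS` with an optional `?`, `+` or `*`) are all assumptions made up for this example. Each token is reduced to a one-letter class code, and the grammar is compiled into a regular expression over those codes, so the address "language" is described by one line of text.

```python
import re

# Stage 1: regex cleanup rules (hypothetical examples of input-error fixes).
CLEANUP = [
    (re.compile(r"[.,#]"), " "),                   # drop stray punctuation
    (re.compile(r"\bSTREET\b"), "ST"),             # standardize suffixes
    (re.compile(r"\bAVENUE\b"), "AVE"),
    (re.compile(r"(\d)([A-Z]{3,})"), r"\1 \2"),    # "456OAK" -> "456 OAK"
    (re.compile(r"\s+"), " "),                     # collapse whitespace
]

# Token classes: (name, one-letter code, pattern). First match wins.
CLASSES = [
    ("NUMBER", "n", re.compile(r"\d+")),
    ("DIR", "d", re.compile(r"N|S|E|W|NE|NW|SE|SW")),
    ("SUFFIX", "s", re.compile(r"ST|AVE|RD|BLVD|DR")),
    ("WORD", "w", re.compile(r"[A-Z0-9]+")),
]
CODES = {name: code for name, code, _ in CLASSES}

def clean(address):
    text = address.upper()
    for pattern, replacement in CLEANUP:
        text = pattern.sub(replacement, text)
    return text.strip()

def code_for(token):
    for _, code, pattern in CLASSES:
        if pattern.fullmatch(token):
            return code
    return "x"  # unknown token; no grammar accepts this code

def compile_grammar(spec):
    """Turn 'field:CLASS?' items into a regex over one-letter class codes."""
    parts = []
    for item in spec.split():
        field, cls = item.split(":")
        quant = ""
        if cls[-1] in "?+*":
            cls, quant = cls[:-1], cls[-1]
        parts.append(f"(?P<{field}>{CODES[cls]}{quant})")
    return re.compile("".join(parts))

# Stage 2: a regular grammar for one hypothetical address "language".
GRAMMAR = compile_grammar(
    "number:NUMBER predir:DIR? name:WORD+ suffix:SUFFIX? postdir:DIR?")

def parse(address):
    tokens = clean(address).split()
    match = GRAMMAR.fullmatch("".join(code_for(t) for t in tokens))
    if match is None:
        return None
    # One code per token, so match spans index directly into the token list.
    return {field: " ".join(tokens[match.start(field):match.end(field)])
            for field in GRAMMAR.groupindex if match.group(field)}

print(parse("123  n. main st."))
print(parse("456Oak Avenue"))
```

Supporting a different address model in a sketch like this means writing a different grammar line and class table rather than new parsing code, which is the extensibility property described above.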