Tuesday, 9 March 2010

More Open-Source Geocoders

To continue my previous post on open-source geocoders, here are a few more geocoding projects we've reviewed here at Refractions:

  • PAGC Postal Address GeoCoder is "a library and a CGI based web service written in ANSI C that uses an address-ranged street network shapefile". It uses a rule-based parser based on the Aho-Corasick string searching algorithm. The parser rules are user-configurable, which is nice (although the rule format is not at all user-friendly, consisting of opaque lists of integers!). Exact match, Soundex and edit distance are used in the matching phase. Supported reference road networks include both the TIGER and the StatsCan networks. BerkeleyDB is used as the reference network data store.
  • The USC WebGIS Geocoder provides a free, size-limited geocoding service. It claims to be open source; however, links to the source code are not obviously provided. It is documented as using a "rule-based parser", but it's not clear how a user could actually customize this and run their own instance. Matching uses attribute relaxation, substring matching, and Soundex. The reference dataset appears to be TIGER, stored in an MS SQL Server database.
  • The FEBRL Geocoder is a well-researched, well-documented system implemented in Python. It targets Australian road network data. It specifically does not attempt to work with North American data (but suggests that the address models are close enough that this would be possible). The address parser is unique in using a trainable Hidden Markov Model, and also in being documented by a series of academic papers (e.g. [1]) describing the approach in detail. An address cleaning module is supplied. Matching uses exact or "approximate matching".
  • The OpenGeocoder initiative appears to be a worthy attempt to create a geocoder under the auspices of OpenGeo (possibly as a port of PAGC?). However, this project has not had much recent activity, and doesn't appear to provide any actual code.
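
To make the matching phase mentioned above (exact match, Soundex, edit distance) a bit more concrete, here is a minimal Python sketch of a tiered street-name matcher. It is not taken from PAGC or any of the other projects; the function names (soundex, edit_distance, match_street), the max_edits threshold and the order of the tiers are all my own assumptions for illustration.

```python
def soundex(name: str) -> str:
    """Standard American Soundex code (letter + 3 digits); "" if no letters."""
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def match_street(query, candidates, max_edits=2):
    """Hypothetical tiered matcher: exact, then Soundex, then edit distance.

    Returns (candidate, method), or None if nothing is close enough.
    """
    q = query.strip().upper()
    for name in candidates:
        if name.upper() == q:
            return name, "exact"
    q_code = soundex(q)
    for name in candidates:
        if q_code and soundex(name) == q_code:
            return name, "soundex"
    best = min(candidates, key=lambda n: edit_distance(q, n.upper()),
               default=None)
    if best is not None and edit_distance(q, best.upper()) <= max_edits:
        return best, "edit_distance"
    return None


streets = ["MAIN", "DOUGLAS", "GOVERNMENT", "QUADRA"]
print(match_street("Duglas", streets))   # falls through to the Soundex tier
print(match_street("Quabra", streets))   # falls through to edit distance
```

Trying the cheap, strict test first and relaxing only when it fails keeps false positives down; a real geocoder would also score candidates against the other address components rather than the street name alone.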
One salient aspect of these systems is that they provide address parsing algorithms which are based on well-understood parsing theory. This is of particular interest for our geocoder project - of which more later.
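
As a rough illustration of what Hidden Markov Model address parsing looks like, here is a toy Python sketch that tags address tokens using the Viterbi algorithm. To be clear about the assumptions: this is not FEBRL's code or its model. The three states, the token classes and every probability below are made up by hand for the example, whereas a trainable parser like FEBRL's would estimate them from tagged training data.

```python
# Toy HMM: all states, token classes and probabilities are invented for
# illustration; a real system would learn them from training data.
STATES = ("NUMBER", "NAME", "TYPE")
START = {"NUMBER": 0.70, "NAME": 0.25, "TYPE": 0.05}
TRANS = {
    "NUMBER": {"NUMBER": 0.05, "NAME": 0.90, "TYPE": 0.05},
    "NAME":   {"NUMBER": 0.05, "NAME": 0.35, "TYPE": 0.60},
    "TYPE":   {"NUMBER": 0.10, "NAME": 0.30, "TYPE": 0.60},
}
EMIT = {
    "NUMBER": {"digits": 0.90, "type_word": 0.01, "word": 0.09},
    "NAME":   {"digits": 0.05, "type_word": 0.15, "word": 0.80},
    "TYPE":   {"digits": 0.01, "type_word": 0.94, "word": 0.05},
}
TYPE_WORDS = {"ST", "STREET", "AVE", "AVENUE", "RD", "ROAD"}


def observe(token: str) -> str:
    """Map a raw token to one of the coarse observation classes."""
    if token.isdigit():
        return "digits"
    if token.upper() in TYPE_WORDS:
        return "type_word"
    return "word"


def viterbi(observations):
    """Return the most probable state sequence for the observations."""
    if not observations:
        return []
    # best[state] = (probability of best path ending in state, that path)
    best = {s: (START[s] * EMIT[s][observations[0]], [s]) for s in STATES}
    for obs in observations[1:]:
        step = {}
        for s in STATES:
            prob, path = max(
                ((best[p][0] * TRANS[p][s], best[p][1]) for p in STATES),
                key=lambda item: item[0],
            )
            step[s] = (prob * EMIT[s][obs], path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


def parse_address(text: str):
    """Tag each whitespace-separated token with its most likely state."""
    tokens = text.split()
    return list(zip(tokens, viterbi([observe(t) for t in tokens])))


print(parse_address("123 Main St"))
print(parse_address("12 Avenue Road"))
```

The second example is the interesting one: "Avenue" looks like a street type on its own, but with these toy numbers its position right after the house number makes NAME the more probable tag. That kind of context sensitivity is what a probabilistic parser offers over a simple lookup table.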


References

[1] Peter Christen, Alan Willmore, Tim Churches. A probabilistic geocoding system utilising a parcel based address file. In: Data Mining: Theory, Methodology, Techniques, and Applications, 2006.

4 comments:

samper.d said...

If you are looking for other standards such as OpenLS (OGC), note that PAGC has code to allow for the creation of an OpenLS service. This is not the case out of the box, but it is there in the code tree.

Also note that Python bindings for PAGC were recently added through SWIG. They should be in trunk, if you are interested in Python bindings.

Cheers

samper.d said...

Sorry, forgot the link:
http://wiki.osgeo.org/wiki/OpenGeocoder_2009_SOC_Ideas#PAGC_OpenLS_compatible_API

http://sourceforge.net/projects/pagc/files/

Regina Obe said...

Martin,

Just noticed that OpenStreetMap has one too.
http://wiki.openstreetmap.org/wiki/Nominatim

Seems to be based around the OSM data format. So you load the data using osm2pgsql (yes PostGIS based :) ) and run the scripts. Nice thing I guess is that it handles more than North America.

Damn, it's so hard to write a book when after we finish a chapter the world has changed again :).

Dr JTS said...

Good find, Regina.

Andrew, maybe that's your answer!