Of course the first thing we did was carry out a technology review of all the open-source geocoders we could find. Here's a list of all the ones we looked at:
- Geo-Coder-US - A Perl module developed by the ubiquitous Schuyler Erle. "For geocoding US addresses, that is, estimating the latitude and longitude of any street address or intersection in the United States, using the TIGER/Line data set". Probably no longer being developed, since it has been superseded by
- GeoCommons Geocoder::US - a rewrite of Geo-Coder-US into Ruby (and also requiring C and SQLite). "Although it is primarily intended for use with the US Census Bureau’s free TIGER/Line dataset, it uses an abstract US address data model that can be employed with other sources of US street address range data"
- JGeoCoder - A Java API loosely modelled after Geo::Coder::US. Works against a SQL database loaded with TIGER data (an H2 image is supplied). Last activity in 2008.
- Explorer GeoCoder by SRC - A C++ library for "a data and country independent geocoding engine" which can "assign latitude and longitude coordinates to any United States street address or intersection". Has an active mailing list.
- Frost Tiger Geocoder by Stephen Frost et al - a Postgres SQL library for geocoding against TIGER data
- All of the engines implement parsing and matching logic purely in code. None of them provide a declarative description language to allow easy modification of parsing, standardization, and matching rules. (To be fair, this is bit of a tall order. And it's not clear that it's even possible to provide an understandable declarative language for the fully general case. For example, the ArcMap geocoder (which appears to be the old MatchWare engine) provides a geocoding definition language (actually 5 different ones) - but the languages look scarily complex! Nonetheless, this is an important feature for easy of maintenance and customization.)
- JGeoCoder uses a large number of complex regular expressions to perform parsing. This looks like it would be difficult to customize, due to the well-known opaqueness of large REs, and perhaps also to the relative inflexibility of the RE paradigm
- The GeoCoder::US Ruby module seems to be the simplest code base. (I ended up almost understanding its parsing algorithm 8^) It uses REs, but in a saner amount. However, it's unclear how well it deals with erroneous input data, and how easy it would be to modify for a different address model.
- The Explorer geocoder uses a large amount of fairly complex C++ code. It also looked quite challenging to understand and modify.
- In all the projects the parser design appears to be fairly ad-hoc and poorly documented. This situation doesn't inspire confidence that it would be possible to modify the parser to support a different address model, or to handle particular kinds of input errors. (GeoCoder::US is a possible exception to this - it has a relatively simple parsing algorithm with at least some documentation).