Tuesday, 9 March 2010

More Open-Source Geocoders

To continue my previous post on open-source geocoders, here's a few more geocoding projects we've reviewed here at Refractions:

  • PAGC Postal Address GeoCoder ( ) is "a library and a CGI based web service written in ANSI C that uses an address-ranged street network shapefile". It uses a rule-based parser based on the Aho-Corasick string searching algorithm. The parser rules are user-configurable, which is nice (although the rule format is NON-user-friendly, consisting of opaque lists of integers!). Exact match, Soundex and Edit distance are used in the matching phase. Supported reference road networks include both the TIGER and the StatsCan networks. BerkeleyDB is used as the reference network data store.
  • The USC WebGIS Geocoder provides a free, size-limited geocoding service. It claims to be open source, however links to the source code are not obviously provided. It is documented as using a "rule-based parser", but it's not clear how a user could actually customize this and run their own instance. Matching uses attribute relaxation, substring matching, and Soundex. The reference dataset appears to be TIGER, stored in a MS SQLServer database.
  • The FEBRL Geocoder is a well-researched, well-documented system implemented in Python. It targets Australian road network data. It specifically does not attempt to work with North American data (but suggests that the address models are close enough that this would be possible.) The address parser is unique in using a trainable Hidden Markov Model, and also in being documented by a series of academic papers (e.g. [1] ) describing the approach in detail. An address cleaning module is supplied. Matching uses exact or "approximate matching".
  • The OpenGeocoder initiative appears to be a worthy attempt to create a geocoder under the auspices of OpenGeo (possibly as a port of PAGC?). However, this project has not had much recent activity, and doesn't appear to provide any actual code.
One salient aspect of these systems is that they provide address parsing algorithms which are based on well-understood parsing theory. This is of particular interest for our geocoder project - of which more later.


References

[1] A probabilistic geocoding system utilising a parcel based address file; CHRISTEN Peter, WILLMORE Alan, CHURCHES Tim; Data mining : ( theory, methodology, techniques, and applications ), 2006

Saturday, 6 March 2010

Open Source Geocoders

One of the more interesting projects we have going on here at Refractions is to build a geocoder for use in a crime-mapping application we are developing for a client. We do have an existing geocoder codebase developed for another project. But we're not 100% happy with its performance and customizability, so we decided to look into developing a new library specifically for this project.

Of course the first thing we did was carry out a technology review of all the open-source geocoders we could find. Here's a list of all the ones we looked at:
  • Geo-Coder-US - A Perl module developed by the ubiquitous Schuyler Erle. "For geocoding US addresses, that is, estimating the latitude and longitude of any street address or intersection in the United States, using the TIGER/Line data set". Probably no longer being developed, since it has been superseded by
  • GeoCommons Geocoder::US - a rewrite of Geo-Coder-US into Ruby (and also requiring C and SQLite). "Although it is primarily intended for use with the US Census Bureau’s free TIGER/Line dataset, it uses an abstract US address data model that can be employed with other sources of US street address range data"
  • JGeoCoder - A Java API loosely modelled after Geo::Coder::US. Works against a SQL database loaded with TIGER data (an H2 image is supplied). Last activity in 2008.
  • Explorer GeoCoder by SRC - A C++ library for "a data and country independent geocoding engine" which can "assign latitude and longitude coordinates to any United States street address or intersection". Has an active mailing list.
  • Frost Tiger Geocoder by Stephen Frost et al - a Postgres SQL library for geocoding against TIGER data
(This is summarized in tabular form on this Tsusiat page).

Some observations:
  • All of the engines implement parsing and matching logic purely in code. None of them provide a declarative description language to allow easy modification of parsing, standardization, and matching rules. (To be fair, this is bit of a tall order. And it's not clear that it's even possible to provide an understandable declarative language for the fully general case. For example, the ArcMap geocoder (which appears to be the old MatchWare engine) provides a geocoding definition language (actually 5 different ones) - but the languages look scarily complex! Nonetheless, this is an important feature for easy of maintenance and customization.)
  • JGeoCoder uses a large number of complex regular expressions to perform parsing. This looks like it would be difficult to customize, due to the well-known opaqueness of large REs, and perhaps also to the relative inflexibility of the RE paradigm
  • The GeoCoder::US Ruby module seems to be the simplest code base. (I ended up almost understanding its parsing algorithm 8^) It uses REs, but in a saner amount. However, it's unclear how well it deals with erroneous input data, and how easy it would be to modify for a different address model.
  • The Explorer geocoder uses a large amount of fairly complex C++ code. It also looked quite challenging to understand and modify.
  • In all the projects the parser design appears to be fairly ad-hoc and poorly documented. This situation doesn't inspire confidence that it would be possible to modify the parser to support a different address model, or to handle particular kinds of input errors. (GeoCoder::US is a possible exception to this - it has a relatively simple parsing algorithm with at least some documentation).
In the end we decided not to use any of these projects. I'll talk about what we did do in another post.

Monday, 1 March 2010

JTS Version 1.11 released!

JTS Version 1.11 is now available for download from SourceForge.


The version contains numerous enhancements, including

  • Delaunay triangulation and Voronoi diagrams
  • AWT Shape reading and writing
  • Geometry similarity metrics
  • support for Geometry densification
  • Numerous improvements to the JTS TestBuilder

Full release notes:

Functionality Improvements

  • Added CoordinateArrays.isRing
  • Added CGAlgorithms.signedArea(CoordinateSequence)
  • Added CoordinateArrays.copyDeep(...) method to copy sections of arrays
  • Added CoordinateList.add(Coordinate[], boolean, int, int) method to add sections of arrays
  • Added LineSegment.toGeometry(), LineSegment.lineIntersection()()
  • Added LineSegment.hashCode()
  • Added geometric similarity classes (HausdorffSimilarityMeasure, AreaSimilarityMeasure)
  • Added MinimumDiameter.getMinimumRectangle()
  • Added MinimumBoundingCircle class
  • Added Densifier class
  • Added triangulation API, including QuadEdgeSubdivision, IncrementalDelaunayTriangulator, ConformingDelaunayTriangulator and supporting classes
  • Added VoronoiDiagramBuilder to perform Voronoi diagram generation
  • Added scaleInstance(scaleX, scaleY, x, y) to AffineTransformation
  • Added AffineTransformationFactory to allow generating transformations from various kinds of control inputs
  • Added BinTree.remove() method
  • Fixed BinTree.query() to allow null interval arguments
  • Added ShapeReader API to convert Java2D Shapes into JTS Geometry
  • Added ShapeWriter API to convert JTS geometry into Java2D Shapes
  • Added FontGlyphReader API to render Java2D text font glyphs into geometry
  • Added SdeReader to jtsio library
  • Added Debug break methods
  • Added Memory utility for reporting memory statistics
  • Added ObjectCounter utility for counting objects
  • Added createSquircle and createSuperCircle to GeometricShapeFactory

Performance Improvements

  • Improved performance of Geometry.getArea() and Geometry.getLength() when used with custom CoordinateSequences

API Changes

  • Deprecated WKBWriter.bytesToHex in favour of WKBWriter.toHexto regularize and simplify method name

Bug Fixes

  • Fixed Point.isValid() to check for invalid coordinates (ie with Nan ordinates)
  • Fixed Geometry.distance() and DistanceOp to return 0.0 for empty inputs
  • Fixed Buffer to handle degenerate polygons with too few distinct points correctly
  • Added illegal state check in LineSegment.pointAlongOffset()
  • Fixed exception strategy in BufferSubgraph to handle certain robustness failures correctly
  • Fixed robustness problem in OffsetCurveBuilder in computing mitred joins for nearly parallel segments
  • Fixed minor bug in BufferInputLineSimplifier which prevented simplification of some situations
  • Fixed bug in BufferInputLineSimplifier which caused over-simplification for large tolerances
  • Fixed bug in Angle.normalizePositive to handle values > 2PI correctly
  • Fixed WKTWriter to emit correct syntax for MULTIPOINTs
  • Fixed WKTReader to accept correct syntax for MULTIPOINTs
  • CGAlgorithms.isCCW now checks for too few points in the ring and throws an IllegalArgumentException
  • Fixed bug in AffineTransformation#eqals (logic bug)
  • Fixed bug in CoordinateList#closeRing (cloning closing Coordinate)

JTS TestBuilder

Functionality Improvements

  • WKT input is cleaned automatically when loaded (illegal chars are removed)
  • Added WKT-Formatted option to Test Case View dialog
  • Many new geometry functions added
  • Geometry functions are displayed in tree
  • Geometry functions can be implemented as Java static class methods.
  • Geometry function classes can be loaded dynamically from command-line
  • Improved handling of very large geometry inputs and results
  • Threaded rendering allows display of very large geometries without limiting usability
  • Added Draw Rectangle tool
  • Added Drag-n-drop loading of .SHP files
  • Added Info tool to provide persistent display of geometry point/segment information
  • Added display of memory usage