Friday, August 21, 2015

Variable-Width Buffers in JTS

Inspiration:  this post on the JSTS group (with an image - good job on requirements!)

JTS implementation: in the lab

Code:   Geometry geom = VariableWidthBuffer.buffer( line, 10, 80 );


SliceGraphs in JEQL

In the interests of increasing blog output, I'm going to experiment with using a terser style in posts where the content is mostly self-explanatory.  Sort of an embedded microblog...

Inspiration: original Population Lines print , this plot from the Line Graphs in R blog post:

Data: SEDAC World Population grid, subsampled

JEQL script


Wednesday, July 15, 2015

Even-distribution Random Point and Polygons in JTS

Recently I fixed the JTS KD-Tree implementation so that it works as advertised with a distance tolerance to provide point snapping.  This gives a fast way to produce random point fields with even distribution (i.e. no points too close together).

First, generate a batch of random points using RandomPointsBuilder.  As is well known, this produces a very "lumpy" distribution of points:

Then, put them in a KD-Tree using a snapping distance tolerance.  Querying all points in the final tree produces a nice even distribution of points:

Using the Concave Hull algorithm available here with the same distance tolerance produces a random polygon with a very pleasing appearance:
I suspect that these kinds of polygons might be useful for generating stress tests for geometric algorithms.

UPDATE: Adding a bit of Bezier Smoothing produces an even cooler-looking polygon:

Friday, November 15, 2013

Maslow's Hierarchy in the 21st Century

Been a while since I posted, so posting some humour seems like a good start to getting back on track...

Wednesday, June 12, 2013

How to get OpenLayers WMSGetFeatureInfo to emit GeoServer CQL Filters for multiple layers

OpenLayers provides the useful WMSGetFeatureInfo control.  It's designed to work with the standard WMS GetFeatureInfo request.  As per the standard, the control supports querying multiple layers via setting the layers property.

It's often necessary to define client-side filters for WMS layers, to display only a subset of the layer data in the backing feature type. Usually the filters need to be defined dynamically, based on the application context.   When using GeoServer as the web mapping engine a convenient (but non-standard) way of doing this is to use the CQL_FILTER WMS parameter. (One might reasonably ask why there isn't an equally simple way to do this in the WMS standard itself, but that's another story).  In OpenLayers this parameter can be added dynamically to a layer via the mergeNewParams method:

lyr.mergeNewParams({'CQL_FILTER': "filter expression" });

Naturally it is necessary to have the GetFeatureInfo control respect the layer filters as well.  This is straightforward in the case of a single layer.  The GeoServer CQL_FILTER parameter can be supplied using the  vendorParams property on the WMSGetFeatureInfo control:

infoControl.vendorParams = { 'CQL_FILTER': 'filter expression'};

Since the CQL_FILTER parameter supports a list of filters, it's also straightforward to filter multiple layers as long as the list of layers queried is static:

infoControl.vendorParams = { 'CQL_FILTER': 'filt-1; filt-2; filt-3'};

But WMSGetFeatureInfo also provides the useful ability to query only visible layers (via the queryVisible property).  This makes things much trickier, since the list of filter expressions must match the list of layers provided in the QUERY_LAYERS parameter.  There's no built-in way of doing this in OpenLayers itself (not surprisingly, since the CQL_FILTER parameter syntax is specific to GeoServer only).

One way to do this is to build the CQL_FILTER parameter value dynamically uisng the CQL_FILTERs defined for the visible layers.  This can be done when the control is invoked, via hooking the beforegetfeatureinfo event.

Here's a code snippet to do this:

var infoControl;

function initInfoControl()
  infoControl = new OpenLayers.Control.WMSGetFeatureInfo({
		url: wms_url,
		title: 'Identify features by clicking',
		layers: [
		queryVisible: true,
		maxFeatures: 3,
		infoFormat: 'application/vnd.ogc.gml'
    "beforegetfeatureinfo", null, onBeforeGetFeatureInfo);
    ("getfeatureinfo", null, onGetFeatureInfo); 
function onBeforeGetFeatureInfo(event)
  // build CQL_FILTER param list from active info layer CQL_FILTER params
  var layers = infoControl.findLayers();
  var filter = "";
  for (var i = 0, len = layers.length; i < len; i++) {
    if (i > 0) 	filter += ";";
    var lyrCQL = layers[i].params.CQL_FILTER
    if (lyrCQL != null) {
      filter += lyrCQL;
  infoControl.vendorParams = { 'CQL_FILTER': filter	};

Although climbing up the OpenLayers learning curve often feels like a big struggle, it's important to recognize the very wide set of requirements that the library is trying to address.  Due to the nature of spatial data, user interfaces and protocols dealing with it are inherently complex.   The more I work with OpenLayers, the more appreciation I have for the fine balance between simplicity and flexibility the designers have achieved. (And if that sounds like I do not subscribe to the "Spatial is not special" canard, you're hearing me right!).

Wednesday, May 29, 2013

Flight Paths in JEQL Redux

The intertubes are buzzing about a flight path visualization done by Michael Markieta.  This is based on the same OpenFlights dataset that I used a couple of years ago as a demonstration of JEQL processing and visualization capabilities.

Markieta's blog post outlines his workflow using ArcGIS.  It's a bit cumbersome - apart from having to jump through hoops to read the data from the original DAT files, apparently the dataset has to be split into six parts to be able to process it.  (For a measly 58K rows?!)

No details are provided about styling, which is the key part of the exercise.   The images apparently use alpha blending to show flight density.  Also, the coordinate system seems to be more curvaceous than the squaresville Plate Carree I used (so much more haute couture than saying Lat/Long). Both of these are easy to do in JEQL.  Here's some samples of the improved output, using the alpha channel and a Mollwiede projection.

North America

And here's the entire image, in glorious hi-res suitable for framing:

Monday, May 13, 2013

Beautiful cartography using OpenJUMP

An OpenJUMP user just posted some really nice cartographic maps made using a combination of OpenJUMP, Inkscape, GRASS, and GIMP.

He gives OJ the following glowing endorsement:
I find Open JUMP to be the most vector-friendly open source GIS software. The preparation of the datasets (rivers, lakes, sea, roads, borders) was really [a] piece of cake...
It's great to see the small but dedicated OpenJUMP community steadily adding new features and improving the software quality.  10 years after it was launched, OpenJUMP continues to be the "Little Open-Source GIS that Can".

Thursday, March 14, 2013

10 Step Program for Developers

Andrew Oliver lays out the 10 Step Program for developers.  Here's his points:
  1. Blog
  2. Go open source
  3. Not six months, not 10 years
  4. Eye on the new stuff, hands on the practical
  5. Write your own documentation
  6. Brevity is the soul
  7. Wow the crowd
  8. Be realistic
  9. Solve the hard stuff, know the tools  (hmm.. isn't that two points?)
  10. Practice humility
This blog post is my practice of points 1 and 6.  And also one of my own:

11. Copy the work of other smart people

Tuesday, February 12, 2013

The subversiveness of Open Source

It's no longer novel to observe that Open Source is, if not the dominant software paradigm of the era, at least one of the most significant innovations in the history of software practice.  Recently it struck me how downright bizarre the Open Source paradigm really is.  I can't think of another field of human endeavour where the fundamental paradigm mandates giving away the product of one's labour.  Consider a few sweepingly-generalized examples:
  • Business - Fugedaboudit!  It's all about the money.  Apart from the Diggers of 60's Haight-Ashbury notoriety there aren't too many examples of businesses whose model consists of giving away their stock.
  • Arts - Hah!  Obviously the big media companies are doing everything they can to squeeze money out of artistic endeavour.  But even among the less mercantile stakeholders the main discussion is about how artists can be compensated for their creations.  No-one seriously advocates that artists give away all their work for free. 
  • Sport -  Don't get me started on the gross discrepancy between compensation and value in professional sport.  And at the amateur level, sponsorship and funding organizations are recognized to be essential to promoting the continued generation of sporting "product".  (Wouldn't it be great if there was a similar system of sponsorship for software developers?)
  • Science - You might think this would be the exception that proves the rule.  After all, sharing research results is a revered principle of scientific progress.  The domain relies on publishing information openly to an even greater extent than in software development.  But in my (admittedly limited) experience many scientists are actually quite protective of their intellectual property, since their livelihood depends in a direct way on amassing it and monetizing how it is dispensed.  And it's well known that academic institutions pay very close attention to licensing the IP generated by them (or their employees).
Just to be clear, I am not suggesting that the open source paradigm is flawed or wrong.  In fact, I spend the major part of my professional life living and breathing Open Source geospatial software (JTS/GEOS, JEQL, Proj4J, GeoServer, PostGIS, etc). As a means of increasing the velocity and quality of software development it's by far the best model. And it's much more democratic and self-actualizing than the semi-feudal alternatives.  But it really is a subversive concept.  Marxist, even.  It's no wonder that it's taking so long for the suits to wrap their heads around how to deal with it.

Long live the anarcho-syndicalist commune of Open Source Software craftsmen!

Wednesday, February 6, 2013

JTS Union VS ArcGIS Dissolve

Ragnvald Larsen has an interesting post on ways to mitigate the poor performance and stabilty of Dissolve computations in ArcGIS.  Dissolve is the Arc term for the geometric union of a collection of polygons (possibly grouped by attribute, although that capability was not used in this case).

Ragnvald's dataset consisted of a 15 MB shapefile containing about 7000 overlapping polygons.  Here's what the data looks like:

He found that using the ArcGIS Dissolve method took about 150 sec to process the dataset.  In an effort to reduce this time, he experimented with partitioning the dataset and doing the union in batches.  After a (presumably lengthy) series of experiments to find the optimal batch size, he was able to get the time down to 25 sec using a batch size of 110 features.

Improving union performance by partioning the input is the basic idea behind the Cascaded Union function in JTS (which I blogged about back in 2007).  Cascaded Union uses a spatial index to automatically optimize the partitioning.  Ragnvald doesn't mention whether he used a spatial index, but I suspect this might be quite time-consuming to code in ArcPy.

I thought it would be interesting to compare the performance of the JTS algorithm to the ArcGIS one.  To do this I used JEQL, which provides an easy high-level way to read the data and invoke the JTS Cascaded Union.  The entire process can be expressed as a very simple JEQL script:

ShapefileReader t file: "agder/agder_buffer.shp";
t = select geomUnionMem(GEOMETRY) g from t;
ShapefileWriter t file: "result.shp";

geomUnionMem is a JEQL spatial aggregate function which is implemented using the JTS Cascaded Union algorithm.  (Although not needed in this case, note that the more general Dissolve use case of unioning groups of features by their attributes can easily be achieved by using the standard SQL GROUP BY clause.)

Running this on a (late-model) PC workstation produced a timing of about 1.5 sec!

Here's the output union:

Thursday, January 3, 2013

Functional Programming Whinging

Tim Bray thinks Uncle Bob Martin's post on Functional Programming Basics is the cat's pyjamas.

Meh. "Basics" is the key word in that title - the article is pretty light and fluffy.  Fine if you don't know squat about FP, but it's also accompanied by a whole lot of starry-eyed razzle-dazzle which isn't really justified by the content (and note that I'm not saying it's wrong, just not substantiated).

To be fair, TB does have a few gripes.  Here's a few more:

  • The example used to show how FP wonderfully avoids variables and side-effects is that hoary old one of computing squares of integers.  (I mean really hoary - this was the first program I ever wrote, in WATFIV.  And I at least had cool line printer output!)  How about using something that's a bit more representative of an actual computational problem?  Like say, red-black trees - with deletion!  
  • As TB points out, the people who really need to make algorithms run fast across 64 cores are a small percentage of current coders.  For everyone else, scale-out is a more mundane but pressing problem.  And it's not clear to me whether FP will make that easier.
  • As someone who spends his leisure hours trying to make spatial algorithms more performant, I'm suspicious of anything that promises to automagically make code go faster across multiple cores.  In spatial most interesting problems are not "pleasantly parallel", and many of them are memory-bound as well as being compute-bound.  So advances in performance would seem depend on better algorithms, not a different choice of language.
Back in the day I was pretty keen on FP languages - but I realized after being exposed to Smalltalk and later Java, a lot of their appeal was due to their (necessary) provision of automatic memory management (which was painfully lacking in the "mainstream" languages such as FORTRAN, Pascal, C - oh, and even C++).

But I'm not trying to prove a negative here.  Certainly the FP features of no side-effects and lazy evaluation would seem to offer a lot of benefit for the right class of problems.  And FP or FP-ish languages are more mainstream than ever before.  So perhaps they really will become the mainstream language paradigm.  I just hope I don't have to be coding using layers of inconveniently situated parentheses.

Lead, the criminal element

I've heard before about the postulated link between atmospheric lead levels (courtesy of the leaded gasoline used through the middle decades of the 20th century) and crime levels.  This Mother Jones article America's Real Criminal Element: Lead is the best explanation I've seen so far (and has links to the original papers).  It really sounds like this hypothesis is fully confirmed - and the best thing about this story is that it has a happy ending.  (Unless you're trying to get elected as mayor - or Prime Minister - on a tough-on-crime platform).

There is a nice geospatial connection here.  As with many epidemiological issues, spatial locality is an important aspect of the analyses that lead (ahem) to the conclusion.  The article is chock-full of references to the spatial nature of the problem, such as:
We now have studies at the international level, the national level, the state level, the city level, and even the individual level
and my favourite:
a good rule of thumb for categorizing epidemics: If it spreads along lines of communication, he says, the cause is information. Think Bieber Fever. If it travels along major transportation routes, the cause is microbial. Think influenza. If it spreads out like a fan, the cause is an insect. Think malaria. But if it's everywhere, all at once—as both the rise of crime in the '60s and '70s and the fall of crime in the '90s seemed to be—the cause is a molecule.

Wednesday, January 2, 2013

2012 Year in Review - Blog Roundup

A look back at 2012 from a software technology perspective by some of my favourite blogs:
  • Inspired By Actual Events - a wide-reaching roundup. I found the Java and friends links especially interesting, since the Java/JVM world is so big now it's hard to keep up with and distill the really significant events.
  • Interoperability Happens (Ted Nedward) - As usual, opinionated and insightful commentary on enterprise software technology from a hard-core developer perspective.
  • Tim Anderson - A strong focus on Microsoft, but also fairly even-handed assessment of the rest of the "A"-team (Apple, Android/Google, and Amazon).  (And a not-very-optimistic mention of the "B" team - BB/RIM). I always appreciate Tim Anderson's reading of the internal and external tea-leaves of MS technology.  It's always fascinating to see the elephant trying to jump, in a schadenfreudal sort of way.
  • Tim Bray - Not really a roundup, and not all that tech-focussed, but always a good read.

Tuesday, December 25, 2012

JTS 1.13 Released

I'm to announce that JTS 1.13 has been released.  It's available for download from SourceForge.

There is a long list of new features, enhancements and bug fixes in this release:

Functionality Improvements

  • Changed GeometryFactory.createGeometry() to make a deep copy of the argument Geometry, using the CoordinateSequenceFactory of the factory
  • Added ability to specify a dimension in CoordinateArraySequence
  • Changed Geometry.getEnvelopeInternal() to return a copy of the cached envelope, to prevent modification
  • Added GeometryEditor.CoordinateSequenceOperation to allow easy editing of constituent CoordinateSequences
  • Added GeometryFactory.createPolygon convenience methods which do not require holes to be specified
  • Geometry overlay methods now return empty results as atomic types of appropriate dimension
  • Added RectangleLineIntersector to provide efficient rectangle-line intersection testing
  • Added getOrdinate and setOrdinate to Coordinate
  • Quadtree is Serializable
  • STRtree is Serializable
  • Added max, average and wrap functions to MathUtil
  • Improved WKTReader parse error reporting to report input line of error
  • Improved WKBReader to repair structurally-invalid input
  • Made TopologyPreservingSimplifier thread-safe
  • Added AbstractSTRtree.isEmpty() method
  • Added QuadTree.isEmpty() method
  • Added KdTree.isEmpty() method
  • Added decimation and duplicate point removal to ShapeWriter.
  • ScaledNoder now preserves Z values of input
  • Added instance methods for all Triangle static methods
  • Added CGAlgorithmsDD containing high-precision versions of some basic CG algorithms
  • Added IntersectionMatrix.isTrue() method for testing IM pattern matches
  • Added getRawCoordinates methods to PackedCoordinateSequence concrete classes
  • Modified Geometry.isSimple() to explicity check for simplicity for all types, and support GeometryCollections
  • Improved MCIndexSnapRounder to add nodes only where they are necessary
  • Added CoordinateArrays.removeNull() method
  • Enhanced GeometryEditor to handle null geometries returned from operation
  • Added WKBHExFileReader
  • Added Distance3D operation

Performance Improvements

  • Simplified & improved performance of RectangleIntersects by using new RectangleLineIntersector
  • In RandomPointsInGridBuilder eliminated redundant ArrayList usage
  • In PreparedPolygonIntersects and PreparedLineStringIntersects added check to avoid creating segment index if all test inputs are points
  • In AbstractSTRtree switched to using indexed list access for better performance than using iterators
  • In AbstractSTRtree freed inserted item array after index is built
  • Improved performance of Polygonizer for cases with many potential holes
  • Improved performance for some DD methods by making them final
  • Added fast filter for CGAlgorithmsDD.orientationIndex, and switched to self-operations for DD determinant
  • Changed STRtree.createNode() to use a static class for nodes
  • Changed QuadTree Node to use scalar x and y variables rather than a Coordinate to reduce memory allocation
  • Fixed PreparedGeometry concrete classes to be thread-safe.
  • Fixed SortedPackedIntervalRTree so that it is thread-safe.

Robustness Improvements

  • Switched to using DD extended-precision arithmetic to compute orientation predicate
  • CGAlgorithms.distanceLineLine() improved to be more robust and performant
  • Fixed robustness issue causing Empty Stack failure in ConvexHull for some nearly collinear inputs
  • CGAlgorithms.signedArea() uses a more accurate algorithm

Bug Fixes

  • Fixed Geometry.equalsExact() to avoid NPE when comparing empty and non-empty Points
  • Fixed CascadedPolygonUnion to discard non-polygonal components created during unioning, to avoid failures and provide more desirable behaviour
  • Fixed CentralEndpointIntersector to initialize result correctly
  • Fixed DelaunayTriangulationBuilder.extractUniqueCoordinates(Geometry) to avoid mutating the vertex order of the input Geometry
  • Fixed ConformingDelaunayTriangulationBuilder to allow non-disjoint site and constraint vertex sets
  • Fixed RandomPointsInGridBuilder point generation to use circle constraint correctly
  • Fixed Linear Referencing API to handle MultiLineStrings consistently, by always using the lowest possible index value, and by trimming zero-length components from results
  • Fixed bug in LocationIndexedLine and LengthIndexLine which was causing an assertion failure when the indexOfAfter() method was called with a constraint location which is at the end of the line
  • Fixed bug in STRtree.query(Envelope, ItemVisitor) causing an NPE when tree is empty
  • Fixed issue with creating zero-length edges during buffer topology building under fixed precision, by: adding filter to remove zero-length edges; using a better estimate of scale factor for reducing to fixed precision after initial failure.
  • Fixed TopologyPreservingSimplifier to return a valid result for closed LineStrings with large distance tolerances
  • Fixed TopologyPreservingSimplifier to return an empty result for an empty input
  • Fixed DouglasPeuckerSimplifier to return an empty result for an empty input
  • Fixed MinimumBoundingCircle to correctly compute circle for obtuse triangles.
  • Fixd GeometryPrecisionReducer to use input GeometryFactory when polygon topology is fixed
  • Fixed GeometryNoder bug that was failing to snap to end vertices of lines
  • Fixed Geometry.getCentroid() and Geometry.getInteriorPoint() to return POINT EMPTY for empty inputs
  • Fixed DelaunayTriangulationBuilder to correctly extract unique points
  • Fixed KdTree to correctly handle inserting duplicate points into an empty tree
  • Fixed LineSegment.projectionFactor() to handle zero-length lines (by returning Double.POSITIVE_INFINITY)
  • Fixed LocationIndexedLine to handle locations on zero-length lines
  • Fixed LengthIndexedLine and LocationIndexedLine to handle indexOfAfter() correctly
  • Fixed WKBReader to handle successive geometrys with different endianness
  • Fixed GeometricShapeFactory to correctly handle setting the centre point
  • Fixed GeometryFactory.createMultiPoint(CoordinateSequence) to handle sequences of dimension > 3

API Changes

  • Changed visibility of TaggedLineStringSimplifier back to public due to user demand


  • Added Performance Testing framework (PerformanceTestRunner and PerformanceTestCase)
  • Added named predicate tests to all Relate test cases

JTS TestBuilder

Functionality Improvements

  • Added segment index visualization styling
  • Improved Geometry Inspector
  • Added stream digitizing for Polygon and LineString tools
  • Added output of Test Case XML with WKB
  • Added Extract Component tool
  • Added Delete Vertices Or Components tool
  • Added Geometry Edit Panel pop-up menu, with operations
  • Added Halton sequence functions
  • Added sorting functions
  • Added function for selection of first N components
  • Added CGAlgorithms functions
  • Added ability to paste and load multiple WKBHex geometries

Performance Improvements

  • Using decimation substantially improves rendering time for large geometries.

Bug Fixes

  • Fixed bug in saving XML test files

Thursday, December 20, 2012

Convenience trumps all

The always-readable Stephen O'Grady has an insightful post titled Do Not Underestimate the Power of Convenience.  He proposes that an increasingly important factor driving the uptake of software is the developer's drive for convenience.  "Convenience" is an, er, convenient term for things like ease-of-use, power, portability and low barrier to entry.  In the case of software the lowest possible barrier is provided by free open source software, but the principle applies to infrastructure as well (reflected by the rapid uptake of things like cloud computing and BYOD).  As evidence he lists a set of technologies whose prevalence has been driven bottom-up by developers, rather than top-down by corporate fiat.  It reads like the bill-of-materials for IT infrastructure in a start-up: AWS, Linux, dynamic languages, Git, Eclipse, etc.

One reason he gives for this situation is the pleasing (to my ears) observation that "Developers are the new Kingmakers" (which he discusses in detail in another post).  I suspect that these are mutually-reinforcing phenomena.  Developer's drive for convenience has led to the rapid evolution of open source software (and not coincidentally the improvement of the tools which enable its development).  This has led to the current situation where in many cases OSS surpasses commercial offerings, or at least is "good enough" to be used in demanding production environments.  Since the developers are the ones making the software, or at least are the ones who understand it, they need to be involved in the decisions to adopt it (if they aren't spearheading the adoption themselves). (And especially since there are usually no marketing people in sight...)

In fact, I contend that the drive for convenience is the motivation for many of the advances in software and computation.  A classic example is the ongoing quest to increase the concision of computer languages, by making them more expressive and provide a richer computation model.   Another is the evolution of systems which can easily automate mundane tasks (with examples including the rich shell and command-line capabilities of *nix and the increasingly powerful array of build tools - the best of which of course are free and open source.)  Developer's preference for software which is portable and standard is also driven by convenience - everyone wants their current favourite set of software available wherever they happen to be are working (and as expected, open source software typically exhibits the highest degree of portability).

As Larry Wall said, laziness and impatience are virtues for a programmer.

Lazy programmers avoid yak shaving

Taken to the extreme, the drive for convenience is simply another way of stating the ultimate goal of all compute science - to free computation from any limitations of space, time, money and power.  Put this way, O'Grady's thesis is almost a tautology.  But it's a valuable reminder of the constant need to push against the strong opposing forces of commercial interest and bureaucratic inertia.

Wednesday, November 14, 2012

YAUSEM (Yet Another US Election Map)

The usual US election map is a starkly simplistic collection of red and blue blobs.  This does reflect the esoteric (to a Canadian) Electoral College first-past-the-post arrangement.  But after seeing how close the actual vote counts were in most states, it seemed to me like this doesn't really reflect the actual political reality of the US.  Really it's pretty much a purely purple country.  John Nelson has a nice map that elegantly visualizes this reality at a county level (and Brian Timoney explains why it's the only map that isn't a lie).

As another attempt at map truthiness, I used JEQL to produce the following map.  It shows actual vote numbers at a state level, color themed along two dimensions:
  • The hue shows the relative proportion of Democrat VS Republican votes (using the now-canonical blue and red).  For reference, Florida is almost exactly 50-50.  This nicely shows that really the US is just varying shades of purple.
  • The saturation is proportional to the relative population of the state.  California is fully saturated, since it's the most populous state.  The the inland Western and the far Northeastern states are pretty pale, since they have fairly low populations.  This is roughly proportional to the weight of the state's Electoral College votes, although there are amusing anomalies.

I make no claim that this map represents any valid statistics - it's just a fun exercise in using JEQL to do spatial visualization.  For reference, the script is:

CSVReader t colSep: "\t" file: "us_vote_raw.txt";

t = select String.trim(col1) name,
    Val.toInt(String.keepChars(col2, "0123456789")) ecVote,
    Val.toDouble(String.keepChars(col4, "0123456789")) demvote,
    Val.toDouble(String.keepChars(col5, "0123456789")) repvote from t;

Print t;

maxVote = val(select max(demvote+repvote) from t);

ShapefileReader tus file: "us_state.shp";

tvote = select name, demvote, repvote, GEOMETRY,
        demvote / (demvote+repvote) demfrac,
        demvote+repvote totvote,        

       (demvote+repvote)/maxVote density
    from t
    join tus on == String.trim(tus.STATE_NAME);

Mem tvote;

tplot = select GEOMETRY,
    #ffffff styleStroke,
    1 styleStrokeWidth
    with {
        clr = Color.interpolate("ff0000", "0000ff", demfrac);
        h = Color.getH(clr);
        s = density;
        v = 1;
        styleFill = Color.toRGBfromHSV(h,s,v);    }   
    from tvote;

extent = BOX(-128 20, -65 50);

Plot    data: tplot
    extent: val(extent)
    width: 800
    height: 400
    background: "0000aa"
    file: "us_vote.png";

The raw data came from Wikipedia via simple cut-and-paste to a text file.

Tuesday, November 13, 2012

The Great Geometry Clipping Contest

Don Meltz initiated a fascinating flurry of performance evaluation with his post on Is QGIS a Viable Alternative to ArcGIS? and its followup ArcGIS vs QGIS Clipping Contest Rematch.  He looked at a spatial processing task involving clipping a large dataset of contour lines against a fairly simple polygon.  His conclusion was that QGIS was a lot faster than ArcGIS at performing this task.  His final testing produced a time of 6 min 27 sec for QGIS, versus a time of 1 h 35 min for ArcGIS - and which failed with a topology error!  (Note:  subsequently ESRI reported that they have improved their algorithm to provide much better results for this case - and presumably to enable it to actually complete!).

This inspired a lot of other people to dive in and run the same test (since Don helpfully provided the test data here).  Systems tested include many commercial (ArcGIS, GlobalMapper, Manifold) and FOSS systems (QGIS, PostGIS, SpatialLite, GRASS, OGR, uDig, OpenJUMP, etc).  There's a summary of some of the timing results here (and the many comments to these posts provide lots of different timings on various software and hardware configurations).

On the Java side, Andrea Antonello of jGrass provides an in-depth description of his optimized implementation using uDig, GeoTools and JTS here.  His best result was about 80 sec, using 4 cores and including data I/O. (He later tested on Amazon AWS using 32 cores, giving a 25 sec time).

The SpatialLite wiki has a nice page showing how this problem is tackled in SQL here.  It also has an excellent analysis of what this contest actually demonstrates. The key conclusion is that almost all the open source systems are using JTS or GEOS, so what is really being measured is the effectiveness of the JTS overlay algorithm.

The one exception is GRASS, which uses a completely different topological algorithm.  From the results the performance of this seems similar to or only a bit slower than JTS/GEOS.  This is actually quite impressive since it sounds like the algorithm is computing full topology of the data.  But it's hard to know without a more detailed understanding of how it works.  And in any case, comparing the contest timings is difficult since many different hardware configurations were used.

One aspect of this task is that it is "pleasantly parallel", so implementations which can multi-thread over many cores should see linear performance improvement.  Only some of the systems tested were able to take advantage of this.  In particular, PostGIS and SpatialLite execute single queries in a sequential fashion, which is a shame. Perhaps this will be rectified in future releases?

It's great to see JTS and GEOS used effectively in so many different applications, and that they hold their own against the well-funded competition (and in fact often doing much better).

And here's the kicker - it could run even faster! The JTS overlay algorithm was designed for the single geometry/geometry case.  It has not been optimized for iterated operations against a fixed query geometry.  By using a caching approach similar to the existing JTS PreparedGeometry API, it will be possible to avoid  recomputing polygon topology and thus provide a significant speedup.  Also, the overlay algorithm was designed to handle all geometry types, and in particular the polygon/polygon case, which is the most demanding situation.  It could be optimized to handle simpler cases (such as the polygon/line overlay of this task) in a faster way.  All it will take is some time, money, and some hard thinking...