Tuesday, February 12, 2013

The subversiveness of Open Source

It's no longer novel to observe that Open Source is, if not the dominant software paradigm of the era, at least one of the most significant innovations in the history of software practice.  Recently it struck me how downright bizarre the Open Source paradigm really is.  I can't think of another field of human endeavour where the fundamental paradigm mandates giving away the product of one's labour.  Consider a few sweepingly-generalized examples:
  • Business - Fugedaboudit!  It's all about the money.  Apart from the Diggers of 60's Haight-Ashbury notoriety there aren't too many examples of businesses whose model consists of giving away their stock.
  • Arts - Hah!  Obviously the big media companies are doing everything they can to squeeze money out of artistic endeavour.  But even among the less mercantile stakeholders the main discussion is about how artists can be compensated for their creations.  No-one seriously advocates that artists give away all their work for free. 
  • Sport -  Don't get me started on the gross discrepancy between compensation and value in professional sport.  And at the amateur level, sponsorship and funding organizations are recognized to be essential to promoting the continued generation of sporting "product".  (Wouldn't it be great if there was a similar system of sponsorship for software developers?)
  • Science - You might think this would be the exception that proves the rule.  After all, sharing research results is a revered principle of scientific progress.  The domain relies on publishing information openly to an even greater extent than in software development.  But in my (admittedly limited) experience many scientists are actually quite protective of their intellectual property, since their livelihood depends in a direct way on amassing it and monetizing how it is dispensed.  And it's well known that academic institutions pay very close attention to licensing the IP generated by them (or their employees).
Just to be clear, I am not suggesting that the open source paradigm is flawed or wrong.  In fact, I spend the major part of my professional life living and breathing Open Source geospatial software (JTS/GEOS, JEQL, Proj4J, GeoServer, PostGIS, etc). As a means of increasing the velocity and quality of software development it's by far the best model. And it's much more democratic and self-actualizing than the semi-feudal alternatives.  But it really is a subversive concept.  Marxist, even.  It's no wonder that it's taking so long for the suits to wrap their heads around how to deal with it.

Long live the anarcho-syndicalist commune of Open Source Software craftsmen!

Wednesday, February 6, 2013

JTS Union VS ArcGIS Dissolve

Ragnvald Larsen has an interesting post on ways to mitigate the poor performance and stabilty of Dissolve computations in ArcGIS.  Dissolve is the Arc term for the geometric union of a collection of polygons (possibly grouped by attribute, although that capability was not used in this case).

Ragnvald's dataset consisted of a 15 MB shapefile containing about 7000 overlapping polygons.  Here's what the data looks like:

He found that using the ArcGIS Dissolve method took about 150 sec to process the dataset.  In an effort to reduce this time, he experimented with partitioning the dataset and doing the union in batches.  After a (presumably lengthy) series of experiments to find the optimal batch size, he was able to get the time down to 25 sec using a batch size of 110 features.

Improving union performance by partioning the input is the basic idea behind the Cascaded Union function in JTS (which I blogged about back in 2007).  Cascaded Union uses a spatial index to automatically optimize the partitioning.  Ragnvald doesn't mention whether he used a spatial index, but I suspect this might be quite time-consuming to code in ArcPy.

I thought it would be interesting to compare the performance of the JTS algorithm to the ArcGIS one.  To do this I used JEQL, which provides an easy high-level way to read the data and invoke the JTS Cascaded Union.  The entire process can be expressed as a very simple JEQL script:

ShapefileReader t file: "agder/agder_buffer.shp";
t = select geomUnionMem(GEOMETRY) g from t;
ShapefileWriter t file: "result.shp";

geomUnionMem is a JEQL spatial aggregate function which is implemented using the JTS Cascaded Union algorithm.  (Although not needed in this case, note that the more general Dissolve use case of unioning groups of features by their attributes can easily be achieved by using the standard SQL GROUP BY clause.)

Running this on a (late-model) PC workstation produced a timing of about 1.5 sec!

Here's the output union: