Wednesday 14 November 2012

YAUSEM (Yet Another US Election Map)

The usual US election map is a starkly simplistic collection of red and blue blobs.  This does reflect the esoteric (to a Canadian) Electoral College first-past-the-post arrangement.  But after seeing how close the actual vote counts were in most states, it seemed to me like this doesn't really reflect the actual political reality of the US.  Really it's pretty much a purely purple country.  John Nelson has a nice map that elegantly visualizes this reality at a county level (and Brian Timoney explains why it's the only map that isn't a lie).

As another attempt at map truthiness, I used JEQL to produce the following map.  It shows actual vote numbers at a state level, color themed along two dimensions:
  • The hue shows the relative proportion of Democrat VS Republican votes (using the now-canonical blue and red).  For reference, Florida is almost exactly 50-50.  This nicely shows that really the US is just varying shades of purple.
  • The saturation is proportional to the relative population of the state.  California is fully saturated, since it's the most populous state.  The the inland Western and the far Northeastern states are pretty pale, since they have fairly low populations.  This is roughly proportional to the weight of the state's Electoral College votes, although there are amusing anomalies.


I make no claim that this map represents any valid statistics - it's just a fun exercise in using JEQL to do spatial visualization.  For reference, the script is:

CSVReader t colSep: "\t" file: "us_vote_raw.txt";

t = select String.trim(col1) name,
    Val.toInt(String.keepChars(col2, "0123456789")) ecVote,
    Val.toDouble(String.keepChars(col4, "0123456789")) demvote,
    Val.toDouble(String.keepChars(col5, "0123456789")) repvote from t;

Print t;

maxVote = val(select max(demvote+repvote) from t);

ShapefileReader tus file: "us_state.shp";

tvote = select name, demvote, repvote, GEOMETRY,
        demvote / (demvote+repvote) demfrac,
        demvote+repvote totvote,        

       (demvote+repvote)/maxVote density
    from t
    join tus on t.name == String.trim(tus.STATE_NAME);

Mem tvote;
 

tplot = select GEOMETRY,
     styleFill,
    #ffffff styleStroke,
    1 styleStrokeWidth
    with {
        clr = Color.interpolate("ff0000", "0000ff", demfrac);
        h = Color.getH(clr);
        s = density;
        v = 1;
        styleFill = Color.toRGBfromHSV(h,s,v);    }   
    from tvote;

extent = BOX(-128 20, -65 50);

Plot    data: tplot
    extent: val(extent)
    width: 800
    height: 400
    background: "0000aa"
    file: "us_vote.png";


The raw data came from Wikipedia via simple cut-and-paste to a text file.


Tuesday 13 November 2012

The Great Geometry Clipping Contest

Don Meltz initiated a fascinating flurry of performance evaluation with his post on Is QGIS a Viable Alternative to ArcGIS? and its followup ArcGIS vs QGIS Clipping Contest Rematch.  He looked at a spatial processing task involving clipping a large dataset of contour lines against a fairly simple polygon.  His conclusion was that QGIS was a lot faster than ArcGIS at performing this task.  His final testing produced a time of 6 min 27 sec for QGIS, versus a time of 1 h 35 min for ArcGIS - and which failed with a topology error!  (Note:  subsequently ESRI reported that they have improved their algorithm to provide much better results for this case - and presumably to enable it to actually complete!).


This inspired a lot of other people to dive in and run the same test (since Don helpfully provided the test data here).  Systems tested include many commercial (ArcGIS, GlobalMapper, Manifold) and FOSS systems (QGIS, PostGIS, SpatialLite, GRASS, OGR, uDig, OpenJUMP, etc).  There's a summary of some of the timing results here (and the many comments to these posts provide lots of different timings on various software and hardware configurations).

On the Java side, Andrea Antonello of jGrass provides an in-depth description of his optimized implementation using uDig, GeoTools and JTS here.  His best result was about 80 sec, using 4 cores and including data I/O. (He later tested on Amazon AWS using 32 cores, giving a 25 sec time).

The SpatialLite wiki has a nice page showing how this problem is tackled in SQL here.  It also has an excellent analysis of what this contest actually demonstrates. The key conclusion is that almost all the open source systems are using JTS or GEOS, so what is really being measured is the effectiveness of the JTS overlay algorithm.

The one exception is GRASS, which uses a completely different topological algorithm.  From the results the performance of this seems similar to or only a bit slower than JTS/GEOS.  This is actually quite impressive since it sounds like the algorithm is computing full topology of the data.  But it's hard to know without a more detailed understanding of how it works.  And in any case, comparing the contest timings is difficult since many different hardware configurations were used.

One aspect of this task is that it is "pleasantly parallel", so implementations which can multi-thread over many cores should see linear performance improvement.  Only some of the systems tested were able to take advantage of this.  In particular, PostGIS and SpatialLite execute single queries in a sequential fashion, which is a shame. Perhaps this will be rectified in future releases?

It's great to see JTS and GEOS used effectively in so many different applications, and that they hold their own against the well-funded competition (and in fact often doing much better).

And here's the kicker - it could run even faster! The JTS overlay algorithm was designed for the single geometry/geometry case.  It has not been optimized for iterated operations against a fixed query geometry.  By using a caching approach similar to the existing JTS PreparedGeometry API, it will be possible to avoid  recomputing polygon topology and thus provide a significant speedup.  Also, the overlay algorithm was designed to handle all geometry types, and in particular the polygon/polygon case, which is the most demanding situation.  It could be optimized to handle simpler cases (such as the polygon/line overlay of this task) in a faster way.  All it will take is some time, money, and some hard thinking...