Tuesday, 12 February 2019

Better and Faster Interior Point for Polygons in JTS/GEOS

A humble workhorse of geospatial processing is the ability to compute a point which is guaranteed to lie in the interior of a polygon.  In the OGC Simple Features for SQL specification (and hence in PostGIS) this is known as ST_PointOnSurface.  In JTS/GEOS it is called getInteriorPoint, for historical reasons [1].

Interior points for country boundaries

There are some important use cases for this capability:
  • Constructing a "proxy point" for a polygon to use in drill-down spatial query.  This has all kinds of applications:
  • Cartographic rendering including:
    • Creating leader lines
    • Placing labels for polygons (for which it is a quick solution but not necessarily a quality one.  More on this in a later post)
There is a variety of ways that have been proposed to compute interior points: triangulation, sampling via random or grid points, medial axis transform, etc.  These all involve trade-offs between location quality and performance [2].  JTS uses an approach which optimizes performance, by using a simple scan-line algorithm:

JTS Scan-Line Interior Point Algorithm
  1. Determine a Y-ordinate which is distinct to every polygon vertex Y-ordinate and close to the centre of the vertical extent
  2. Draw a scan line across the polygon and determine the segments of intersection
  3. Choose the interior point as the midpoint of the widest intersection segment
Locating a polygon interior point along a scan-line

The current code has been in the JTS codebase since the release of the very first version back in 2001.  It is elegantly simple, but is quite non-optimal, since it uses the overlay intersection algorithm.  This is overkill for the computation of the scan-line intersection segments.  It also has a couple of serious drawbacks: slow performance, and the requirement that the input polygon be valid.  These are not just theoretical concerns.  They have been noticed in the user community, and have caused client projects to have to resort to awkward workarounds.  It's even documented as a known limitation in PostGIS.

Thanks to Crunchy Data recognizing the importance of geospatial, we've been able to look into fixing this.  It turns out that a relatively simple change makes a big improvement.  The scan-line intersections can be computed via a linear-time scan of the polygon edges.  This works even for invalid input (except for a few pathological situations).
Interior points of invalid polygons
(LH invalid polygon shows suboptimal point placement)

Best of all, it's much faster - providing performance comparable to the (less useful) centroid computation.  The performance results speak for themselves:

Dataset # polys # points Time Prev
World countries
366,951 25 ms 686 ms
x 27
Land Cover 64,090 366,951 78 ms 6.35 s
x 81

This has been committed to JTS.  It will be ported to GEOS soon, and from there should show up in PostGIS (and other downstream projects like QGIS, Shapely, GDAL, etc etc).

More Ideas

Some further improvements that could be investigated:
  • Use the centroid to provide the Y-ordinate.  This is probably better in some situations, and worse in others.  But perhaps there's a fast way to choose the best one?
  • Use multiple scan-lines (both vertical and horizontal)
  • Provide better handling of short/zero-width scan-line intersections
  • Support clipping the interior point to a rectangle.  This would provide better results for cartographic labelling

[1] JTS was based on the original OGC SFS specification (version 1.1).  The spec actually does include a method Surface.pointOnSurface.  The reason for the different choice of name is lost to time, but possibly the reasoning was that the JTS function is more general, since it handles all types of geometry.  (One of the design principles of JTS is Geometric Uniformity, AKA No Surprises.  Wherever possible spatial operations are generalized to apply to all geometric types. There is almost always a sensible generalization that can be defined, and it's often quite useful.)

[1a] Also, the acronym IPA is much better than the one for PointOnSurface.

[2] Apparently Oracle has decided to provide blazingly fast performance by simply returning a point on the boundary of the polygon.  Thus proving the maxim that for every problem there is an solution which is simple, fast, and dead wrong.

Saturday, 2 February 2019

Joining Crunchy Data to work on PostGIS

I'm happy to announce that I am taking a position with Crunchy Data as a Senior Geospatial Engineer.  I'm working alongside fellow Victorian geospatial maven extraordinaire Paul Ramsey, as the core of a proposed Geospatial Data Centre of Excellence.

Our mission statement is simple:  make PostGIS bigger, better, faster!

  • Bigger - more spatial algorithms and functions
  • Better - enhance existing functionality to make it easier to use, more powerful, and more robust
  • Faster - keep looking for algorithmic optimizations and ways to use the power of Postgres to make spatial processing faster

A lot of this work will involve enhancements to the core GEOS geometry library.  Part of the goal is to keep JTS and GEOS aligned, so this should produce a nice boost to JTS as well.

Having been lurking in the background for many years now, I'm stoked to be (finally) able to work directly on PostGIS.  And I'm excited to be part of the Crunchy team. They have some of the leading Postgres experts in-house, so I'm expecting that it will be a great learning experience.  And their client list promises to expose us to some fascinating large-scale use cases for spatial data processing, which can only be good for the power and robustness of PostGIS.

I'm also looking forward to re-engaging with the geospatial open source community, and learning more about the (even bigger) open source Postgres community.  Great things lie ahead!

Monday, 28 January 2019

Hilbert and Morton Curves in JTS

I just landed a JTS pull request for Hilbert and Morton (Z-order) codes and curves.
Hilbert Curve of level 3
Morton Curve of level 3

Apart from pretty pictures of fractals, the goal is to support experimenting with Packed Hilbert R-trees, as an alternative to the current Sort-Tile-Recursive packing strategy (implemented as STRtree in JTS).  STRtrees are heavily used to speed up spatial algorithms inside JTS (and externally as per recent report). So if Hilbert curve-based packing provides better performance that would be a big win.

Thursday, 17 January 2019

Fun wit JEQL: Hilbert Curves

Hilbert Curve of order 4:

Hilbert Curve of order 6:
Code is in the JEQL script repo.

import jeql.std.function.HashFunction;

hilbertOrder = 6;
side = Val.toInt( Math.pow(2, hilbertOrder) );
count = side * side;

radius = 1;

t = select * from Generate.sequence( 0, count-2 );

t = select i, geom: Geom.buffer(hilbertEdge, 0.4)
hilbertPt1 = HashFunction.hilbertPoint(hilbertOrder, i),
hilbertPt2 = HashFunction.hilbertPoint(hilbertOrder, i+1),
hilbertEdge = Geom.createLineFromPoints( hilbertPt1, hilbertPt2 )
from t;

t1 = select *,
styleFill: clr, styleStroke: clr, styleStrokeWidth: 1
clr = Color.toRGBfromHSV(Val.toDouble(i) / count, 1, 1)
from t;
Mem t1;

The function hilbertPoint uses the efficient algorithm from http://threadlocalmutex.com/.  Code is on Github.

Thursday, 12 April 2018

The world needs a new flavour of SOSS!

Yes, you don't what the acronym SOSS means, because I just made it up. SOSS stands for Standard Open Simple Spatial format.

It's crazy that in the 21st century the most common de facto standard spatial format is based on 30-year old technology, is proprietary, and has silly limitations such as 11-character uppercase attribute names.

I'm talking, of course, about shapefiles.

Surely we can do better than this?!

Now there are actually a few things that shapefiles get right. For instance, the shapefile's simplistic tabular data model gets two full marks for being - simple and tabular! Hierarchical data models are very cool and highly expressive, but overkill and too complex for 80% of the use cases out there.

Another useful feature of shapefiles is that they store floating point data with full precision - i.e. in binary. Representing binary floating point numbers as textual decimal values is inherently lossy, and causes all kinds of subtle and annoying problems. (I'm always surprised that this doesn't crop up more often as a serious limitation of GML.)

So what are the current leading contenders for a SOSS format? Here's an opinionated list, with pros and cons

Format Pro Con
Shapefile tabular, lossless numerics proprietary, antiquated, limited
GML complex to model and parse, lossy numerics, poor schema handling open, flexible
KML proprietary, lossy, limited attribute handling, designed for presentation relatively simple, well documented
GeoRSS not appropriate as a full-featured SOSS, lossy
GeoJSON too tied to Javascript, lossy, no schema standard
YAML needs a spatial profile

Conspicuous by its absence on this list is XML. In fact XML is a meta-format, not a format. To utilize XML would require defining an appropriate profile (which would need to be highly restricted to meet the criteria of simple). The major drawback of XML is that specifying the profile almost inevitably drags one into the mind-bending hell of XML Schema. (There are other schema languages, such as RelaxNG, but they involve similar complexity and have even less traction).

There's also more esoteric formats such as NetCDF, but it fails the simplicity test, and it's unclear how well it supports Geometry types.

Friday, 21 August 2015

Variable-Width Buffers in JTS

Inspiration:  this post on the JSTS group (with an image - good job on requirements!)

JTS implementation: in the lab

Code:   Geometry geom = VariableWidthBuffer.buffer( line, 10, 80 );


SliceGraphs in JEQL

In the interests of increasing blog output, I'm going to experiment with using a terser style in posts where the content is mostly self-explanatory.  Sort of an embedded microblog...

Inspiration: original Population Lines print , this plot from the Line Graphs in R blog post:

Data: SEDAC World Population grid, subsampled

JEQL script


Wednesday, 15 July 2015

Even-distribution Random Point and Polygons in JTS

Recently I fixed the JTS KD-Tree implementation so that it works as advertised with a distance tolerance to provide point snapping.  This gives a fast way to produce random point fields with even distribution (i.e. no points too close together).

First, generate a batch of random points using RandomPointsBuilder.  As is well known, this produces a very "lumpy" distribution of points:

Then, put them in a KD-Tree using a snapping distance tolerance.  Querying all points in the final tree produces a nice even distribution of points:

Using the Concave Hull algorithm available here with the same distance tolerance produces a random polygon with a very pleasing appearance:
I suspect that these kinds of polygons might be useful for generating stress tests for geometric algorithms.

UPDATE: Adding a bit of Bezier Smoothing produces an even cooler-looking polygon:

Friday, 15 November 2013

Maslow's Hierarchy in the 21st Century

Been a while since I posted, so posting some humour seems like a good start to getting back on track...

Wednesday, 12 June 2013

How to get OpenLayers WMSGetFeatureInfo to emit GeoServer CQL Filters for multiple layers

OpenLayers provides the useful WMSGetFeatureInfo control.  It's designed to work with the standard WMS GetFeatureInfo request.  As per the standard, the control supports querying multiple layers via setting the layers property.

It's often necessary to define client-side filters for WMS layers, to display only a subset of the layer data in the backing feature type. Usually the filters need to be defined dynamically, based on the application context.   When using GeoServer as the web mapping engine a convenient (but non-standard) way of doing this is to use the CQL_FILTER WMS parameter. (One might reasonably ask why there isn't an equally simple way to do this in the WMS standard itself, but that's another story).  In OpenLayers this parameter can be added dynamically to a layer via the mergeNewParams method:

lyr.mergeNewParams({'CQL_FILTER': "filter expression" });

Naturally it is necessary to have the GetFeatureInfo control respect the layer filters as well.  This is straightforward in the case of a single layer.  The GeoServer CQL_FILTER parameter can be supplied using the  vendorParams property on the WMSGetFeatureInfo control:

infoControl.vendorParams = { 'CQL_FILTER': 'filter expression'};

Since the CQL_FILTER parameter supports a list of filters, it's also straightforward to filter multiple layers as long as the list of layers queried is static:

infoControl.vendorParams = { 'CQL_FILTER': 'filt-1; filt-2; filt-3'};

But WMSGetFeatureInfo also provides the useful ability to query only visible layers (via the queryVisible property).  This makes things much trickier, since the list of filter expressions must match the list of layers provided in the QUERY_LAYERS parameter.  There's no built-in way of doing this in OpenLayers itself (not surprisingly, since the CQL_FILTER parameter syntax is specific to GeoServer only).

One way to do this is to build the CQL_FILTER parameter value dynamically uisng the CQL_FILTERs defined for the visible layers.  This can be done when the control is invoked, via hooking the beforegetfeatureinfo event.

Here's a code snippet to do this:

var infoControl;

function initInfoControl()
  infoControl = new OpenLayers.Control.WMSGetFeatureInfo({
		url: wms_url,
		title: 'Identify features by clicking',
		layers: [
		queryVisible: true,
		maxFeatures: 3,
		infoFormat: 'application/vnd.ogc.gml'
    "beforegetfeatureinfo", null, onBeforeGetFeatureInfo); 
    ("getfeatureinfo", null, onGetFeatureInfo); 
function onBeforeGetFeatureInfo(event)
  // build CQL_FILTER param list from active info layer CQL_FILTER params
  var layers = infoControl.findLayers();
  var filter = "";
  for (var i = 0, len = layers.length; i < len; i++) {
    if (i > 0) 	filter += ";";
    var lyrCQL = layers[i].params.CQL_FILTER
    if (lyrCQL != null) {
      filter += lyrCQL;
  infoControl.vendorParams = { 'CQL_FILTER': filter	};

Although climbing up the OpenLayers learning curve often feels like a big struggle, it's important to recognize the very wide set of requirements that the library is trying to address.  Due to the nature of spatial data, user interfaces and protocols dealing with it are inherently complex.   The more I work with OpenLayers, the more appreciation I have for the fine balance between simplicity and flexibility the designers have achieved. (And if that sounds like I do not subscribe to the "Spatial is not special" canard, you're hearing me right!).

Wednesday, 29 May 2013

Flight Paths in JEQL Redux

The intertubes are buzzing about a flight path visualization done by Michael Markieta.  This is based on the same OpenFlights dataset that I used a couple of years ago as a demonstration of JEQL processing and visualization capabilities.

Markieta's blog post outlines his workflow using ArcGIS.  It's a bit cumbersome - apart from having to jump through hoops to read the data from the original DAT files, apparently the dataset has to be split into six parts to be able to process it.  (For a measly 58K rows?!)

No details are provided about styling, which is the key part of the exercise.   The images apparently use alpha blending to show flight density.  Also, the coordinate system seems to be more curvaceous than the squaresville Plate Carree I used (so much more haute couture than saying Lat/Long). Both of these are easy to do in JEQL.  Here's some samples of the improved output, using the alpha channel and a Mollwiede projection.

North America

And here's the entire image, in glorious hi-res suitable for framing:

Monday, 13 May 2013

Beautiful cartography using OpenJUMP

An OpenJUMP user just posted some really nice cartographic maps made using a combination of OpenJUMP, Inkscape, GRASS, and GIMP.

He gives OJ the following glowing endorsement:
I find Open JUMP to be the most vector-friendly open source GIS software. The preparation of the datasets (rivers, lakes, sea, roads, borders) was really [a] piece of cake...
It's great to see the small but dedicated OpenJUMP community steadily adding new features and improving the software quality.  10 years after it was launched, OpenJUMP continues to be the "Little Open-Source GIS that Can".

Thursday, 14 March 2013

10 Step Program for Developers

Andrew Oliver lays out the 10 Step Program for developers.  Here's his points:
  1. Blog
  2. Go open source
  3. Not six months, not 10 years
  4. Eye on the new stuff, hands on the practical
  5. Write your own documentation
  6. Brevity is the soul
  7. Wow the crowd
  8. Be realistic
  9. Solve the hard stuff, know the tools  (hmm.. isn't that two points?)
  10. Practice humility
This blog post is my practice of points 1 and 6.  And also one of my own:

11. Copy the work of other smart people

Tuesday, 12 February 2013

The subversiveness of Open Source

It's no longer novel to observe that Open Source is, if not the dominant software paradigm of the era, at least one of the most significant innovations in the history of software practice.  Recently it struck me how downright bizarre the Open Source paradigm really is.  I can't think of another field of human endeavour where the fundamental paradigm mandates giving away the product of one's labour.  Consider a few sweepingly-generalized examples:
  • Business - Fugedaboudit!  It's all about the money.  Apart from the Diggers of 60's Haight-Ashbury notoriety there aren't too many examples of businesses whose model consists of giving away their stock.
  • Arts - Hah!  Obviously the big media companies are doing everything they can to squeeze money out of artistic endeavour.  But even among the less mercantile stakeholders the main discussion is about how artists can be compensated for their creations.  No-one seriously advocates that artists give away all their work for free. 
  • Sport -  Don't get me started on the gross discrepancy between compensation and value in professional sport.  And at the amateur level, sponsorship and funding organizations are recognized to be essential to promoting the continued generation of sporting "product".  (Wouldn't it be great if there was a similar system of sponsorship for software developers?)
  • Science - You might think this would be the exception that proves the rule.  After all, sharing research results is a revered principle of scientific progress.  The domain relies on publishing information openly to an even greater extent than in software development.  But in my (admittedly limited) experience many scientists are actually quite protective of their intellectual property, since their livelihood depends in a direct way on amassing it and monetizing how it is dispensed.  And it's well known that academic institutions pay very close attention to licensing the IP generated by them (or their employees).
Just to be clear, I am not suggesting that the open source paradigm is flawed or wrong.  In fact, I spend the major part of my professional life living and breathing Open Source geospatial software (JTS/GEOS, JEQL, Proj4J, GeoServer, PostGIS, etc). As a means of increasing the velocity and quality of software development it's by far the best model. And it's much more democratic and self-actualizing than the semi-feudal alternatives.  But it really is a subversive concept.  Marxist, even.  It's no wonder that it's taking so long for the suits to wrap their heads around how to deal with it.

Long live the anarcho-syndicalist commune of Open Source Software craftsmen!

Wednesday, 6 February 2013

JTS Union VS ArcGIS Dissolve

Ragnvald Larsen has an interesting post on ways to mitigate the poor performance and stabilty of Dissolve computations in ArcGIS.  Dissolve is the Arc term for the geometric union of a collection of polygons (possibly grouped by attribute, although that capability was not used in this case).

Ragnvald's dataset consisted of a 15 MB shapefile containing about 7000 overlapping polygons.  Here's what the data looks like:

He found that using the ArcGIS Dissolve method took about 150 sec to process the dataset.  In an effort to reduce this time, he experimented with partitioning the dataset and doing the union in batches.  After a (presumably lengthy) series of experiments to find the optimal batch size, he was able to get the time down to 25 sec using a batch size of 110 features.

Improving union performance by partioning the input is the basic idea behind the Cascaded Union function in JTS (which I blogged about back in 2007).  Cascaded Union uses a spatial index to automatically optimize the partitioning.  Ragnvald doesn't mention whether he used a spatial index, but I suspect this might be quite time-consuming to code in ArcPy.

I thought it would be interesting to compare the performance of the JTS algorithm to the ArcGIS one.  To do this I used JEQL, which provides an easy high-level way to read the data and invoke the JTS Cascaded Union.  The entire process can be expressed as a very simple JEQL script:

ShapefileReader t file: "agder/agder_buffer.shp";
t = select geomUnionMem(GEOMETRY) g from t;
ShapefileWriter t file: "result.shp";

geomUnionMem is a JEQL spatial aggregate function which is implemented using the JTS Cascaded Union algorithm.  (Although not needed in this case, note that the more general Dissolve use case of unioning groups of features by their attributes can easily be achieved by using the standard SQL GROUP BY clause.)

Running this on a (late-model) PC workstation produced a timing of about 1.5 sec!

Here's the output union:

Thursday, 3 January 2013

Functional Programming Whinging

Tim Bray thinks Uncle Bob Martin's post on Functional Programming Basics is the cat's pyjamas.

Meh. "Basics" is the key word in that title - the article is pretty light and fluffy.  Fine if you don't know squat about FP, but it's also accompanied by a whole lot of starry-eyed razzle-dazzle which isn't really justified by the content (and note that I'm not saying it's wrong, just not substantiated).

To be fair, TB does have a few gripes.  Here's a few more:

  • The example used to show how FP wonderfully avoids variables and side-effects is that hoary old one of computing squares of integers.  (I mean really hoary - this was the first program I ever wrote, in WATFIV.  And I at least had cool line printer output!)  How about using something that's a bit more representative of an actual computational problem?  Like say, red-black trees - with deletion!  
  • As TB points out, the people who really need to make algorithms run fast across 64 cores are a small percentage of current coders.  For everyone else, scale-out is a more mundane but pressing problem.  And it's not clear to me whether FP will make that easier.
  • As someone who spends his leisure hours trying to make spatial algorithms more performant, I'm suspicious of anything that promises to automagically make code go faster across multiple cores.  In spatial most interesting problems are not "pleasantly parallel", and many of them are memory-bound as well as being compute-bound.  So advances in performance would seem depend on better algorithms, not a different choice of language.
Back in the day I was pretty keen on FP languages - but I realized after being exposed to Smalltalk and later Java, a lot of their appeal was due to their (necessary) provision of automatic memory management (which was painfully lacking in the "mainstream" languages such as FORTRAN, Pascal, C - oh, and even C++).

But I'm not trying to prove a negative here.  Certainly the FP features of no side-effects and lazy evaluation would seem to offer a lot of benefit for the right class of problems.  And FP or FP-ish languages are more mainstream than ever before.  So perhaps they really will become the mainstream language paradigm.  I just hope I don't have to be coding using layers of inconveniently situated parentheses.

Lead, the criminal element

I've heard before about the postulated link between atmospheric lead levels (courtesy of the leaded gasoline used through the middle decades of the 20th century) and crime levels.  This Mother Jones article America's Real Criminal Element: Lead is the best explanation I've seen so far (and has links to the original papers).  It really sounds like this hypothesis is fully confirmed - and the best thing about this story is that it has a happy ending.  (Unless you're trying to get elected as mayor - or Prime Minister - on a tough-on-crime platform).

There is a nice geospatial connection here.  As with many epidemiological issues, spatial locality is an important aspect of the analyses that lead (ahem) to the conclusion.  The article is chock-full of references to the spatial nature of the problem, such as:
We now have studies at the international level, the national level, the state level, the city level, and even the individual level
and my favourite:
a good rule of thumb for categorizing epidemics: If it spreads along lines of communication, he says, the cause is information. Think Bieber Fever. If it travels along major transportation routes, the cause is microbial. Think influenza. If it spreads out like a fan, the cause is an insect. Think malaria. But if it's everywhere, all at once—as both the rise of crime in the '60s and '70s and the fall of crime in the '90s seemed to be—the cause is a molecule.

Wednesday, 2 January 2013

2012 Year in Review - Blog Roundup

A look back at 2012 from a software technology perspective by some of my favourite blogs:
  • Inspired By Actual Events - a wide-reaching roundup. I found the Java and friends links especially interesting, since the Java/JVM world is so big now it's hard to keep up with and distill the really significant events.
  • Interoperability Happens (Ted Nedward) - As usual, opinionated and insightful commentary on enterprise software technology from a hard-core developer perspective.
  • Tim Anderson - A strong focus on Microsoft, but also fairly even-handed assessment of the rest of the "A"-team (Apple, Android/Google, and Amazon).  (And a not-very-optimistic mention of the "B" team - BB/RIM). I always appreciate Tim Anderson's reading of the internal and external tea-leaves of MS technology.  It's always fascinating to see the elephant trying to jump, in a schadenfreudal sort of way.
  • Tim Bray - Not really a roundup, and not all that tech-focussed, but always a good read.