Monday, 25 March 2019

PostgreSQL's Linux moment

My perspicacious colleague Paul Ramsey says "Postgres is having its Linux moment". There is certainly a buzz around PostgreSQL, evidenced by datapoints such as:
Reasons for this include:
  • the shift from proprietary to open source as the software model of choice
  • the rise of cloud-based DB platforms, where the flexibility, power and cost (free!) of Postgres makes it an obvious choice.  All the major players now have Postgres cloud offerings, including Amazon, Microsoft and Google.
And happily riding along is PostGIS, bundled with most if not all major PostgreSQL distros.  (Note how the Google blog post announcing cloud Postgres highlights a geospatial use case).  So it's an exciting time to be able to work on PostGIS at Crunchy Data.


Monday, 4 March 2019

Fast Geometry Distance in JTS


The second-most important criteria for a spatial algorithm is that it be fast.  (The most important is that it's correct!)  Many spatial algorithms have a simple implementation available, but with performance of O(n2) (or worse).  This is unacceptably slow for production usage, since it results in long runtimes for data of any significant size. In JTS a lot of effort has gone into identifying O(n2) performance hotspots and engineering efficient replacements for them.

One long-standing hotspot is the algorithm for computing Euclidean distance between geometries.  The obvious distance algorithm is a brute-force O(MxN) comparison between the vertices and edges (facets) of the input geometries.  This is simple to implement, but very slow for large inputs.  Surprisingly, there seems to be little in the computational geometry literature about more efficient distance algorithms.  Perhaps because of this, many geometric libraries provide only the slow brute force algorithm - including JTS (until now). 

Happily, it turns out there is a faster approach to distance computation.  It uses data structures and algorithms which are already provided in JTS, so it's relatively easy to implement. The basic idea is to build a spatial index on each of the input geometries, and then use a Branch-and-Bound search algorithm to efficiently traverse the index trees to find for the minimum distance between geometry facets.  This is a generalization of the R-tree Nearest Neighbour algorithm described in the classic paper by Rousssopoulos et al.  [1]. 

JTS has the STRtree R-tree index implementation (a packed R-tree using the Sort-Tile-Recursive algorithm). This has recently been enhanced with several kinds of nearest-neighbour searches.  In particular, it now supports a method to find the nearest neighbours between two different trees.  The IndexedFacetDistance class uses this capability to implement fast distance searching on the facets of two geometries.

Another benefit of this approach is that it allows caching the index of one geometry.  This further increases performance in the common case of repeated distance calculations against a fixed geometry.

The performance improvement is impressive.  Here's the timings for computing the distance from Antarctica to other world countries:

Source
Data size
Target
Data size
Time
Indexed
Time
Brute-Force
Improvement
1 polygon
(19,487 vertices)
244 polygons
(366,951 vertices)
164 ms 136 s x 830

Branch-and-bound search also speeds up isWithinDistance queries.  Here's a within-distance selection query between another antipodean continent and a large set of small rectangles:

Source
Data size
Target
Data size
Time Time
Brute-Force
Improvement
1 polygon
(7,316 vertices)
100,000 polygons
(500,000 vertices)
53 ms 10.03 s x 19

A small fly in the algorithmic ointment is that Indexed Distance is not always better than the brute-force approach.  For small geometries (such as points or rectangles) a simple scan is actually faster, since it avoids the overhead of building indexes.  It may be possible to determine a tuning parameter that allows automatically choosing the fastest option.  Or the client can choose the faster approach, using knowledge of the use case.

Future Work

A few further ideas to build or investigate:
  • Implement a caching FastDistanceOp using IndexedFacetDistance and indexed Point-In-Polygon.  This can be used to add a fast distance() method to PreparedGeometry 
  • Investigate improving isWithinDistance by using the MINMAXDISTANCE metric for envelopes.  This allows earlier detection of index nodes satisfying the distance constraint.
  • Investigate alternative R-Tree packing algorithms (such as Hilbert packing or sequence packing) to see if they improve performance

[1] Roussopoulos, Nick, Stephen Kelley, and Frédéric Vincent. "Nearest neighbor queries."  ACM SIGMOD record. Vol. 24. No. 2. ACM, 1995.