The second-most important criteria for a spatial algorithm is that it be
fast. (The
most important is that it's
correct!) Many spatial algorithms have a simple implementation available, but with performance of O(n
2) (or worse). This is unacceptably slow for production usage, since it results in long runtimes for data of any significant size. In
JTS a lot of effort has gone into identifying O(n
2) performance hotspots and engineering efficient replacements for them.
One long-standing hotspot is the algorithm for computing Euclidean distance between geometries. The obvious distance algorithm is a brute-force O(MxN) comparison between the vertices and edges (facets) of the input geometries. This is simple to implement, but very slow for large inputs. Surprisingly, there seems to be little in the computational geometry literature about more efficient distance algorithms. Perhaps because of this, many geometric libraries provide only the slow brute force algorithm - including JTS (until now).
Happily, it turns out there is a faster approach to distance computation. It uses data structures and algorithms which are already provided in JTS, so it's relatively easy to implement. The basic idea is to build a spatial index on each of the input geometries, and then use a
Branch-and-Bound search algorithm to efficiently traverse the index trees to find for the minimum distance between geometry facets. This is a generalization of the
R-tree Nearest Neighbour algorithm described in the
classic paper by Rousssopoulos et al. [1].
JTS has the
STRtree R-tree index implementation (a packed R-tree using the
Sort-Tile-Recursive algorithm). This has recently been enhanced with several kinds of nearest-neighbour searches. In particular, it now supports a method to find the nearest neighbours between two different trees. The
IndexedFacetDistance class uses this capability to implement fast distance searching on the facets of two geometries.
Another benefit of this approach is that it allows caching the index of one geometry. This further increases performance in the common case of repeated distance calculations against a fixed geometry.
The performance improvement is impressive. Here's the timings for computing the distance from Antarctica to other world countries:
Source
Data size |
Target
Data size |
Time
Indexed |
Time
Brute-Force |
Improvement |
1 polygon
(19,487 vertices) |
244 polygons
(366,951 vertices) |
164 ms |
136 s |
x 830 |
Branch-and-bound search also speeds up i
sWithinDistance queries. Here's a within-distance selection query between another antipodean continent and a large set of small rectangles:
Source
Data size |
Target
Data size |
Time |
Time
Brute-Force |
Improvement |
1 polygon
(7,316 vertices) |
100,000 polygons
(500,000 vertices) |
53 ms |
10.03 s |
x 19 |
A small fly in the algorithmic ointment is that Indexed Distance is not always better than the brute-force approach. For small geometries (such as points or rectangles) a simple scan is actually faster, since it avoids the overhead of building indexes. It may be possible to determine a tuning parameter that allows automatically choosing the fastest option. Or the client can choose the faster approach, using knowledge of the use case.
Future Work
A few further ideas to build or investigate:
- Implement a caching FastDistanceOp using IndexedFacetDistance and indexed Point-In-Polygon. This can be used to add a fast distance() method to PreparedGeometry
- Investigate improving isWithinDistance by using the MINMAXDISTANCE metric for envelopes. This allows earlier detection of index nodes satisfying the distance constraint.
- Investigate alternative R-Tree packing algorithms (such as Hilbert packing or sequence packing) to see if they improve performance
[1] Roussopoulos, Nick, Stephen Kelley, and Frédéric Vincent. "Nearest neighbor queries." ACM SIGMOD record. Vol. 24. No. 2. ACM, 1995.