Monday, 27 May 2024

RelateNG Performance

A previous post introduced a new algorithm in the JTS Topology Suite called RelateNG.  It computes topological relationships between geometries using the Dimensionally-Extended 9 Intersection Model  (DE-9IM) model.  

This algorithm is fundamental to a large proportion of spatial queries executed in numerous geospatial environments.  It would not be surprising to learn that the Relate algorithm is executed billions of times per day across the world's data centres.  Given this, performance is a critical metric.  

The original JTS RelateOp algorithm was hamstrung by its design, which required computing a full topology graph for each relationship evaluation.  The subsequent development of PreparedGeometry provided a significant performance boost for common spatial predicates such as intersects and covers.  A useful side-effect is that it is less susceptible to geometric robustness problems.  However, it has some drawbacks:

  • it only implements a small set of predicates
  • the design does not easily extend to more complex predicates
  • it does not support GeometryCollection inputs
  • it is a completely separate codebase to the general-purpose RelateOp, which increases maintenance effort and increases the risk of bugs and behaviour discrepancies
RelateNG solves all of these problems.  In addition, it provides much better performance for non-prepared (stateless) calls.  This is important for systems which do not execute spatial queries in a stateful way (due to architectural limitations, or simply lack of attention to performance), and thus cannot use prepared mode.

Performance Tests

Here are some examples of performance metrics.  
Note that the data used in the tests is sometimes subsetted to avoid skewing the results with many false results due to non-interaction between the input geometries.  This reflects the typical deployment of these predicates in systems which perform a primary extent filter (i.e. via a spatial index) before executing the full predicate.
Some of datasets are in geodetic coordinates, which strictly speaking JTS does not handle. But this does not matter for the purposes of performance testing.

Point / Polygon

This test queries a MultiPolygon of the outline of Europe (including larger islands) against a synthetic dataset of random points.  The test evaluates the intersects predicate, which is really the only one of interest in the Point / Polygon case.  

The test data metrics are:
  • Query MultiPolygon: 461 polygons, 22,675 vertices
  • Target Points: 10,000 points

Timings

OpIntersectsIntersects Prep
Relate61.7 s21 ms
RelateNG0.1 s19 ms

The result clearly shows the enormous improvement in the stateless case, since RelateNG does not build topology on the input polygon.  The results are very similar in the prepared case.  This is as expected, since both codebases run a simple Point-in-Polygon test using a cached spatial index.

Line / Line

This test uses a dataset of major rivers of the continental United States.  A single river mainstem is queried against a subset of river branches using the intersects and touches relationships, in stateless and prepared modes.

The test data metrics are:
  • Query Line: 6,975 vertices
  • Target Lines: 407 lines with 47,328 vertices
 

Timings (per 100 executions of the operation)

OpIntersectsIntersects PrepTouchesTouches Prep
Relate38.2 s133 ms36 sN/A
RelateNG1.18 s142 ms2.05 s2.03 s

This shows the much better performance of RelateNG.  RelateNG can evaluate touches in prepared mode, but the performance is similar to stateless mode, since currently indexes are not cached for Line/Line cases.  This could be improved in a future release.

Polygon / Polygon

This test uses data from two polygonal datasets:
The tests query an administrative unit polygon against a subset of Bedrock polygons which intersect its extent, using intersects and covers in stateless and prepared modes. 

The test data metrics are:
  • GADM unit: 4,017 vertices
  • Bedrock polygons: 4,318 polygon with 337,650 vertices

Timings (per 100 executions of the operation)

OpIntersectsIntersects PrepCoversCovers Prep
Relate61.7 s534 ms54.9 s842 ms
RelateNG5.8 s595 ms6.4 s943 ms

Polygon / Polygon - Custom Relate Pattern

The tests shows the ability of RelateNG to efficiently evaluate arbitrary Relate Intersection Matrix patterns.  In this case, the pattern used is F***0****, which corresponds to a relationship which could be called "point-touches": two geometries have boundaries which intersect in only one (or more) points (dimension 0), but whose interiors do not intersect.

This test uses data from The GADM Level 1 boundaries for Canada.  This has the advantage that Canada contains a rare example of 4 boundaries meeting at a single point (the jurisdictions of Saskatchewan, Manitoba, Northwest Territories, and Nunavut).

The test data metrics are:
  • GADM Canada Level 1: 13 polygons, 4,005,926 vertices

Timings

OpF***0****F***0**** Prep
Relate504 sN/A
RelateNG9.8 s6.6 s

As expected, for this test RelateNG is far more performant than Relate.  This is due to its ability to analyze Intersection Matrix patterns and perform only topology tests that are necessary, as well as not building a full topological structure of the inputs.  The test shows the effect of caching spatial indexes in prepared mode, although stateless mode is very efficient as well.

Analysis of Results

Clearly RelateNG is far more performant than Relate when running in non-prepared mode.  The PreparedGeometry implementation is slightly faster (which confirms the efficiency of its original design), but not by very much.  The difference may be a consequence of the general-purpose and thus more complex code and data structures in RelateNG. Closing this gap may be an area for future research.
One thing is certain: the RelateNG algorithm design provides much more scope for adding case-specific optimizations.  If you have a significant use case which could be improved by further targeted optimization improvements, let me know in the comments!
It will be interesting to re-evaluate these tests once RelateNG is ported to GEOS.  Sometimes (but not always) a C++ implementation can be significantly faster than Java, due to more opportunities for code and compiler optimization.

Thursday, 16 May 2024

JTS Topological Relationships - the Next Generation

The most fundamental and widely-used operations in the JTS Topology Suite are the ones that evaluate topological relationships between geometries.  JTS implements the Dimensionally-Extended 9 Intersection Model (DE-9IM), as defined in the OGC Simple Features specification, in the RelateOp API.  

DE-9IM matrix for overlapping polygons

The RelateOp algorithm was the very first one implemented during the initial JTS development, over 20 years ago.  At that time it was an appealing idea to implement a general-purpose topology framework (the GeometryGraph package), and use it to support topological predicates, overlay, and buffering.  However, some disadvantages of this approach have become evident over time:

  • the need to create a topological graph structure limits the ability to improve performance. This has led to the implementation of PreparedGeometry - but that adds further complexity to the codebase, and supports only a limited set of predicates.
  • a large number of code dependencies make it hard to fix problems and improve semantics
  • constructing a full topology graph increases exposure to geometric robustness errors
  • GeometryCollections are not supported (initially because the OGC did not define the semantics for this, and now because adding this capability is difficult)
There's an extensive list of issues relating (ha!) to RelateOp here.

The importance of this functionality is especially significant since the same algorithm is implemented in GEOS.  That codebase is used to evaluate spatial queries in popular spatial APIs such as Shapely and R-sf, and numerous systems such as PostGISDuckDBSpatialLiteQGIS, and GDAL (to name just a few).  It would not be surprising to learn that the RelateOp algorithm is executed billions of times per day across the world's CPUs.

During the subsequent years of working on JTS I realized that there was a better way to evaluate topological relationships.  It would required a ground-up rewrite, but would avoid the shortcomings of RelateOp and provide better performance and a more tractable codebase.  Thanks to my employer Crunchy Data I have finally been able to make this idea a reality.  Soon JTS will provide a new algorithm for topological relationships called RelateNG.

Key Features of RelateNG

The RelateNG algorithm incorporates a broad spectrum of improvements over RelateOp in the areas of functionality, robustness, and performance.  It provides the following features:

  • Efficient short-circuited evaluation of topological predicates (including matching custom DE-9IM matrix patterns)
  • Optimized repeated evaluation of predicates against a single geometry via cached spatial indexes (AKA "prepared mode")
  • Robust computation (only point-local geometry topology is computed, so invalid topology does not cause failures)
  • GeometryCollection inputs containing mixed types and overlapping polygons are supported, using union semantics.
  • Zero-length LineStrings are treated as being topologically identical to Points.
  • Support for BoundaryNodeRules.

Using the RelateNG API

The main entry point is the RelateNG class.  It supports evaluating topological relationships in three different ways:

  • Evaluating a standard OGC named boolean binary predicate, specified via a TopologyPredicate instance.  Standard predicates are obtained from the RelatePredicate factory functions intersects, contains, overlaps, etc.
  • Testing an arbitrary DE-9IM relationship by matching an intersection matrix pattern (e.g. "T**FF*FF*", which is the pattern for a relation called Contains Properly).
  • Computing the full value of a DE-9IM IntersectionMatrix.
RelateNG operations can be executed in two modes: stateless and prepared.  Stateless mode is provided by the relate static functions, which accept two input geometries and evaluate a predicate.  For prepared mode, a single geometry is provided to the prepare functions to create a RelateNG instance.  Then the evaluate instance methods can be used to evaluate spatial predicates on a sequence of geometries. The instance creates spatial indexes (lazily) to make topology evaluation much more efficient.  Note that even stateless mode can be significantly faster than the current implementation, due to short-circuiting and other internal heuristics.

Here is an example of matching an intersection matrix pattern, in stateless mode:
boolean isMatched = RelateNG.relate(geomA, geomB, "T**FF*FF*");
Here is an example of setting up a geometry in prepared mode, and evaluating a named predicate on it:
RelateNG rng = RelateNG.prepare(geomA);
for (Geometry geomB : geomSet) {
    boolean predValue = rng.evaluate(geomB, RelatePredicate.intersects());
}

Rolling It Out

It's exciting to launch a major improvement on such a core piece of spatial functionality.  The Crunchy spatial team will get busy on porting this algorithm to GEOS.  From there it should get extensive usage in downstream projects.  We're looking forward to hearing feedback from our own PostGIS clients as well as other users.  We're always happy to be able to reduce query times and equally importantly, carbon footprints.  

In further blog posts I'll describe the RelateNG algorithm design and provide some examples of performance metrics.

Future Ideas

The RelateNG implementation provides an excellent foundation to build out some interesting extensions to the fundamental DE-9IM concept.

Extended Patterns

The current DE-9IM pattern language is quite limited.  In fact, it's not even powerful enough to express the standard named predicates.  It could be improved by adding features like:

  • disjunctive combinations of patterns.  For example, touches is defined by  "FT******* | F**T***** | F***T****"
  • dimension guards to specify which dimensions a pattern applies to.  For example, overlaps is defined by "[0,2] T*T***T** | [1] 1*T***T**"
  • while we're at it, might as well support dotted notation and spaces for readability; e.g. "FT*.***.***"

Approximate Spatial Relationships

A requirement that comes up occasionally is to compute approximate spatial relationships between imprecise data using a distance tolerance.  This is useful in datasets where features don't match exactly due to data inaccuracy. Because RelateNG creates topology only in the neighbourhood of specified points, it should be possible to specify the neighbourhood size using a distance tolerance.  Essentially, vertices and line intersections will "snap together" to determine the effective topology.  Currently within-distance queries are often used to compute "approximate intersects".  Adding a distance tolerance capability to RelateNG will allow other kinds of approximate relationships to be evaluated as well.

Further Performance Optimizations?

A challenge with implementing algorithms over a wide variety of spatial types and use cases is how to provide general-purpose code which matches (or exceeds) the efficiency of more targeted implementations.  RelateNG analyzes the input geometries and the predicate under evaluation to tune strategies to reduce the amount of work needed to evaluate the DE-9IM.  It may be that profiling specific use cases reveals further hotspots in the code which can be improved by additional optimizations.  

Curve Support

GEOS has recently added support for representing geometries with curves.  The RelateNG design is modular enough that it should be possible to extend it to allow evaluating relationships for geometries with curves.