Tuesday, 3 March 2026

Fast Hausdorff Distance and isFullyWithinDistance in JTS

My previous blog post reviewed the concept of the Hausdorff distance (which could more descriptively be called the farthest distance). Despite its usefulness in matching geometric data, there are surprisingly few open-source implementations, and seemingly no efficient ones for linear and polygonal data.  This even includes CGAL and GRASS, which are usually reliable for providing a wide spectrum of geospatial operations.

This lack of high-quality Hausdorff distance implementations extends to the JTS Topology Suite.  JTS has provided the DiscreteHausdorffDistance class for many years.  That code is used in many geospatial systems (via JTS, and also its GEOS port), including widely used ones such as PostGIS, Shapely, and QGIS.  However, that implementation has significant performance problems, as well as other usability flaws (detailed here).

So I'm excited to announce the release of a new fast, general-purpose algorithm for Hausdorff distance in JTS.  It's a class called DirectedHausdorffDistance.  It has the following capabilities:

  • handles all geometry types: points, lines and polygons
  • supports a distance tolerance parameter, to allow computing the Hausdorff distance to any desired accuracy
  • can compute distance tolerance automatically, providing a "magic-number-free" API
  • has very fast performance due to lazy densification and indexed distance computation
  • can compute the pair of points on the input geometries at which the distance is attained
  • provides prepared mode execution (caching computed indexes)
  • allows determining farthest points which lie in the interior of polygons
  • handles equal or nearly identical geometries efficiently
  • provides the isFullyWithinDistance predicate, with short-circuiting for maximum performance
Hausdorff distance VS shortest distance between linestrings
The choice of name for the new class is deliberate.  The core of the algorithm evaluates the directed Hausdorff distance from one geometry to another.  Computing the symmetric Hausdorff distance then simply involves choosing the larger of the two directed distances DHD(A, B) and DHD(B, A).  This is provided as the function DirectedHausdorffDistance.hausdorffDistance(a, b).

Indexed Shortest Distance

The Hausdorff distance depends on the standard (Euclidean) shortest distance function, as is evident from the mathematical definition:
    DHD(A,B) = max a ∈ A dist(a,B)

A key performance improvement is to evaluate shortest distance using the IndexedFacetDistance class.  For point sets this optimization alone produces a significant boost.
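To make the definition concrete, here is a naive version for point inputs.  All names here are invented for illustration; the JTS code evaluates the inner minimum with IndexedFacetDistance rather than an O(n·m) scan:

```java
import java.util.List;

public class NaiveHausdorff {
    record Pt(double x, double y) {}

    static double dist(Pt a, Pt b) {
        return Math.hypot(a.x() - b.x(), a.y() - b.y());
    }

    // Directed Hausdorff distance: max over a in A of dist(a, B),
    // where dist(a, B) is the min over b in B of dist(a, b)
    static double directed(List<Pt> a, List<Pt> b) {
        double max = 0;
        for (Pt p : a) {
            double min = Double.POSITIVE_INFINITY;
            for (Pt q : b)
                min = Math.min(min, dist(p, q));
            max = Math.max(max, min);
        }
        return max;
    }

    // Symmetric Hausdorff distance: the larger of the two directed distances
    static double hausdorff(List<Pt> a, List<Pt> b) {
        return Math.max(directed(a, b), directed(b, a));
    }
}
```

Replacing the inner loop with an indexed nearest-neighbour query is what turns this O(n·m) scan into the fast point-set case reported below.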

For example, take a case of two random sets of 10,000 points.  DiscreteHausdorffDistance takes 495 ms; DirectedHausdorffDistance takes only 22 ms - 22x faster.  (It's also worth noting that this is similar performance to finding the shortest distance points using IndexedFacetDistance.nearestPoints.)

Lazy Densification

The biggest challenge in computing the Hausdorff distance is that it can be attained at geometry locations which are not vertices. This means that linework edges must be densified to add points at which the distance can be evaluated.  The key to making this efficient is to make the computation "adaptive" by performing "lazy densification".  This avoids densifying edges where there is no chance of the farthest distance occurring.  

Densification is done recursively by bisecting segments.  To optimize finding the location with maximum distance the algorithm uses the branch-and-bound pattern.  The edge segments are stored in a priority queue, sorted by a bounding function giving the maximum possible distance for each segment.  The segment maximum distance is the larger of the distances at the endpoints.  The segment maximum distance bound is the segment maximum distance plus one-half the segment length.  (This is a tight bound.  To see why, consider a segment S of length L, at a distance of D from the target at one end and distance D + e at the other.  The farthest point on S is at a distance of (L + 2D + e)/2 = L/2 + D + e/2.  This is always less than L/2 + D + e, but approaches it in the limit.)
Proof of Maximum Distance Bound

The algorithm loops over the segments in the priority queue. The first segment in the queue always has the maximum distance bound. If this is less than the current maximum distance, the loop terminates since no greater distance will be found. If the segment distance to the target geometry is greater than the current maximum distance, it is saved as the new farthest segment.  Otherwise, the segment is bisected, subsegment endpoint distances are computed, and both are inserted back into the queue. 
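The search loop can be sketched in self-contained Java.  This is a simplified illustration, not the JTS code: the target is reduced to a bare point set, all names are invented, and the coincident-segment and polygon-interior handling are omitted:

```java
import java.util.PriorityQueue;

public class BisectDHD {
    // Stand-in for indexed distance to the target geometry:
    // here the target is just a set of points
    static double distToTarget(double x, double y, double[][] target) {
        double min = Double.POSITIVE_INFINITY;
        for (double[] t : target)
            min = Math.min(min, Math.hypot(x - t[0], y - t[1]));
        return min;
    }

    // A candidate segment, carrying the distances at its endpoints
    record Seg(double x0, double y0, double d0,
               double x1, double y1, double d1) {
        double len() { return Math.hypot(x1 - x0, y1 - y0); }
        // Upper bound on the distance attainable anywhere on the segment:
        // the larger endpoint distance plus half the segment length
        double bound() { return Math.max(d0, d1) + len() / 2; }
    }

    /** Directed Hausdorff distance from a polyline to the target, to within tol. */
    static double directed(double[][] line, double[][] target, double tol) {
        PriorityQueue<Seg> queue =
            new PriorityQueue<>((p, q) -> Double.compare(q.bound(), p.bound()));
        double maxDist = 0;
        for (int i = 0; i < line.length - 1; i++) {
            double d0 = distToTarget(line[i][0], line[i][1], target);
            double d1 = distToTarget(line[i + 1][0], line[i + 1][1], target);
            maxDist = Math.max(maxDist, Math.max(d0, d1));
            queue.add(new Seg(line[i][0], line[i][1], d0,
                              line[i + 1][0], line[i + 1][1], d1));
        }
        while (!queue.isEmpty()) {
            Seg s = queue.poll();            // segment with the largest bound
            if (s.bound() <= maxDist + tol)
                break;                       // nothing farther remains (within tol)
            // bisect: evaluate the midpoint and requeue both halves
            double mx = (s.x0() + s.x1()) / 2, my = (s.y0() + s.y1()) / 2;
            double dm = distToTarget(mx, my, target);
            maxDist = Math.max(maxDist, dm);
            queue.add(new Seg(s.x0(), s.y0(), s.d0(), mx, my, dm));
            queue.add(new Seg(mx, my, dm, s.x1(), s.y1(), s.d1()));
        }
        return maxDist;
    }
}
```

Because the priority queue always yields the segment with the largest bound, the first time the top bound falls within tolerance of the current maximum, no remaining segment can improve the result.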

Search for DHD using line segment bisection

By densifying until bisected segments drop below a given length, the directed Hausdorff distance can be determined to any desired accuracy.  The accuracy distance tolerance can be user-specified, or it can be determined automatically.  This provides a "magic-number-free" API, which significantly improves ease of use.

Performance comparison

Comparing the performance of DirectedHausdorffDistance to DiscreteHausdorffDistance is unfair, since the latter implementation is so inefficient.  However, it's the one currently in use, so the comparison is relevant.   

There are two possible situations.  The first is when the directed Hausdorff distance is attained at vertices (which is often the case when the geometry vertices are already dense; i.e. segment lengths are short relative to the distance).  As an example we will use two polygons of 6,426 and 19,645 vertices.  

DiscreteHausdorffDistance with no densification (a factor of 1) takes 1233 ms. DirectedHausdorffDistance takes 25 ms - 49x faster.  (In practice the performance difference is likely to be more extreme. There is no way to decide a priori how much densification is required for DiscreteHausdorffDistance to produce an accurate answer.  So usually a higher amount of densification will be specified.  This can severely decrease performance.)

The second situation has the Hausdorff distance attained in the middle of a segment, so densification is required.  The query polygon has 468 vertices, and the target has 65.
DirectedHausdorffDistance is run with a tolerance of 0.001, and takes 19 ms.  If DiscreteHausdorffDistance is run with a densification factor of 0.0001 to produce equivalent accuracy, it takes 1292 ms.  If the densification factor is 0.001, the time improves to 155 ms - still 8x slower, with a less accurate answer. 

Handling (mostly) equal geometries

The old Hausdorff distance algorithm had an issue reported in this post on GIS Stack Exchange.  It asks about the slow performance of a case of two nearly-identical geometries which have a very small discrepancy.  In the end the actual problem seemed to be due to the overhead of handling large geometries in PostGIS.  However, testing it with the new algorithm revealed a significant issue.  

Two nearly-identical geometries, showing discrepancy location

It turned out that the new bisection algorithm exhibited very poor performance for this case, and in general for geometries which have many coincident segments. In particular, this applies to computing the Hausdorff distance between two identical geometries.  This situation can easily happen when querying a dataset against itself, so it was essential to solve this problem.  Even worse, detecting the very small discrepancy required a very small accuracy tolerance, which also leads to bad performance.

The problem is that the maximum distance bounding function depends on both the segment distance and the segment length. When the segment distance is very small (or zero), the distance bound is dominated by the segment length, so subdivision will continue until all segments are shorter than the accuracy tolerance.  This leads to a large number of subsegments being generated during the search, particularly when the tolerance is small (as required in the above case).

The solution is to check subsegments with zero distance to see if they are coincident with a segment of the target geometry. If so, there is no need to bisect the segment further, since subsegments must also have distance zero. With this check in place, identical (and nearly identical) cases execute as fast as more general cases of the same size.  Equally importantly, this detects very small discrepancies regardless of the accuracy tolerance.
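The coincidence test itself is cheap.  Here is a sketch with hypothetical helper names (the real code tests against the target geometry's indexed segments): a candidate segment whose endpoints both lie on the same straight target segment must lie on it entirely.

```java
public class CoincidentCheck {
    // distance from point p to segment (a, b)
    static double distPtSeg(double px, double py,
                            double ax, double ay, double bx, double by) {
        double dx = bx - ax, dy = by - ay;
        double len2 = dx * dx + dy * dy;
        double t = len2 == 0 ? 0 : ((px - ax) * dx + (py - ay) * dy) / len2;
        t = Math.max(0, Math.min(1, t));
        return Math.hypot(px - (ax + t * dx), py - (ay + t * dy));
    }

    /**
     * True if the candidate segment (p0, p1) lies along a single target
     * segment (both endpoints at effectively zero distance from it).
     * Such a segment need not be bisected further, since every
     * subsegment also has distance zero.
     */
    static boolean isCoincident(double[] p0, double[] p1,
                                double[][] targetLine, double eps) {
        for (int i = 0; i < targetLine.length - 1; i++) {
            double[] a = targetLine[i], b = targetLine[i + 1];
            if (distPtSeg(p0[0], p0[1], a[0], a[1], b[0], b[1]) < eps
             && distPtSeg(p1[0], p1[1], a[0], a[1], b[0], b[1]) < eps)
                return true;
        }
        return false;
    }
}
```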

For the record, the GIS-SE case now executes in about 45 ms, and detects the tiny discrepancy, which is 9 orders of magnitude smaller than the input geometry.

The Hausdorff distance of ~0.00099

Handling Polygonal Input

If the Hausdorff distance is attained at a point lying on an edge then densifying the linework is sufficient.  But for polygonal query geometries the farthest point can occur in the interior of the area: 
The Directed Hausdorff distance is attained at an interior point of the query polygon

To find the farthest interior point the adaptive branch-and-bound approach can be used in the area domain.  Conveniently, JTS already implements this in the MaximumInscribedCircle and LargestEmptyCircle classes (see this blog post).  In particular, LargestEmptyCircle supports constraining the result to lie inside an area, which is exactly what is needed for the Hausdorff distance.  The target geometry is treated as obstacles, and the polygonal element(s) of the query geometry are the constraints on the location of the empty circle centre.
Directed Hausdorff Distance with multiple area constraints and heterogeneous obstacles

The LargestEmptyCircle algorithm is complex, so it might seem that it could significantly decrease performance.  In fact, it only adds an overhead of about 30%, and for many inputs it's not even noticeable.  Also, if there is no need to determine farthest points in the interior of polygons, this overhead can be avoided by using only polygon linework (i.e. the boundary) as input.  

Currently most Hausdorff distance algorithms operate on point sets, with very few supporting linear geometry.  There seem to be none which compute the Hausdorff distance for polygonal geometries.  While this might seem an uncommon use case, in fact it's essential to support another new capability of the algorithm: computing the isFullyWithinDistance predicate for polygonal geometries.

isFullyWithinDistance

Distance-based queries often require determining only whether the distance is less than a given value, not the actual distance value itself.  This boolean predicate can be evaluated much faster than the full distance determination, since the computation can short-circuit as soon as any point is found which confirms being over the distance limit.  It also allows using other geometric properties (such as envelopes) for a quick initial check. For shortest distance, this approach is provided by Geometry.isWithinDistance (and supporting methods in DistanceOp and other classes).

The equivalent predicate for Hausdorff distance is called isFullyWithinDistance.  It tests whether all points of a geometry are within a specified distance of another geometry.  This is defined in terms of the directed Hausdorff distance (and is thus an asymmetric relationship):

  isFullyWithinDistance(A,B,d) = DHD(A,B) <= d 

The DirectedHausdorffDistance class provides this predicate via the isFullyWithinDistance(A,B,dist) function.  Because the new class supports all types of input geometry (including polygons), the predicate is fully general.  For even faster performance in batch queries it can be executed in prepared mode via the isFullyWithinDistance(A,dist) method.  This mode caches the spatial indexes built on the target geometry so they can be reused.
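In a discrete setting the short-circuit is just an early exit from the scan over query points.  Here is a minimal sketch with invented names (not the JTS API, which also short-circuits the segment-bisection search in the same way):

```java
public class FullyWithin {
    // naive shortest distance from a point to a target point set
    static double distToTarget(double x, double y, double[][] target) {
        double min = Double.POSITIVE_INFINITY;
        for (double[] t : target)
            min = Math.min(min, Math.hypot(x - t[0], y - t[1]));
        return min;
    }

    /**
     * Short-circuiting predicate: returns false as soon as any query
     * point is found beyond the limit, without ever computing the
     * actual Hausdorff distance.
     */
    static boolean isFullyWithinDistance(double[][] query, double[][] target,
                                         double limit) {
        for (double[] q : query) {
            if (distToTarget(q[0], q[1], target) > limit)
                return false;   // one far point disproves the predicate
        }
        return true;
    }
}
```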

For a performance example, consider a dataset of European boundaries (countries and islands) containing about 28K vertices.  The boundary of Germany is used as the target geometry.


If isFullyWithinDistance is run with a distance limit of 20, it takes about 60 ms.  

There's no direct comparison for DiscreteHausdorffDistance, but if that class is used to compute the directed Hausdorff distance with a conservative densification factor of 0.1, the time is about 1100 ms.  Another point of comparison is to run a shortest distance query.  This takes only 21 ms - but it's doing much less work.

A better implementation for ST_DFullyWithin

Another way to implement isFullyWithinDistance is to compute the buffer(d) of geometry B and test whether it covers A:

  isFullyWithinDistance(A,B,d) = B.buffer(d).covers(A) 

This is how the ST_DFullyWithin function in PostGIS works now.  It's a reasonable design choice given the current lack of a performant Hausdorff distance implementation.  However, there are a few problems with using buffer:
  • Buffers of complex geometry can be slow to compute, especially for large distances
  • There's a chance of robustness bugs affecting the computed buffer
  • Buffers are linearized approximations, so there is a likelihood of false negatives for query geometries which lie close to the buffer boundary
Buffer quantization causing failures

Now, the DirectedHausdorffDistance implementation of isFullyWithinDistance can make this function faster, more accurate, more robust and cacheable.  (And of course, the ST_HausdorffDistance function can benefit as well.)

Summary

The JTS DirectedHausdorffDistance class provides fast, cacheable, easy-to-use computation of Hausdorff distances and the isFullyWithinDistance predicate for all JTS geometry types.  This is a major improvement over the old JTS DiscreteHausdorffDistance class, and essentially fully replaces it.  More generally, it fills a notable gap in open-source geospatial functionality.  It will allow many systems to provide a high-quality implementation for Hausdorff distance.


Monday, 23 February 2026

The Hausdorff Distance Challenge

The Hausdorff Distance is a useful spatial function which can appear slightly mysterious. Partly this is due to the name.  It honours Felix Hausdorff, one of the founding fathers of topology, and a polymath who was creative in music and literature as well as mathematics.    

Felix Hausdorff  (1868-1942)

But the name conveys nothing about why this function is useful, or how it is different to the more familiar shortest distance.  The key difference is: the shortest distance tells you how close things are, but the Hausdorff distance tells you how far apart they are. So a more descriptive name might be "farthest distance" or "maximum distance".  With due respect to Dr. Hausdorff, this is one of those historical artifacts of nomenclature that deserves a refresh.  (Especially since it's becoming recognized that the core concept was actually first published by the Romanian mathematician Dimitrie Pompeiu.  Users of the future will be grateful to be spared invoking the ST_PompeiuHausdorffDistance function.)

Definition

The formal definition of the Hausdorff distance (HD) is 

     HD(A,B) = max( DHD(A,B), DHD(B,A) )

where DHD is the directed Hausdorff Distance (DHD):

     DHD(A,B) = max a ∈ A dist(a,B)

with dist(a,B) being the usual shortest distance between point a and geometry B:

     dist(a,B) =  min b ∈ B dist(a,b) 

The Hausdorff distance is symmetric and is a true distance metric.  The directed Hausdorff distance is asymmetric.  Both can be useful in different contexts.  The directed version is arguably more fundamental.  (It's certainly where the bulk of the implementation effort lies.)
Directed Hausdorff Distance is asymmetric
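A small example makes the asymmetry concrete.  Let A = { (0,0) } and B = { (0,0), (10,0) }.  Every point of A lies on B, but not vice versa:

     DHD(A,B) = dist((0,0), B) = 0
     DHD(B,A) = max( dist((0,0), A), dist((10,0), A) ) = 10
     HD(A,B)  = max( 0, 10 ) = 10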

The main application of the Hausdorff distance is in determining how well two datasets match, by providing a measure of their similarity.  In spatial applications these are typically geometries such as lines or polygons, but they can also be point clouds or raster images.  The Hausdorff distance is much more useful than shortest distance as a similarity measure because it gives information about all the points in a shape, not just a single closest point.  While shortest distance puts a bound on how far a single point is from the target, the Hausdorff distance is a bound on every point in the query shape. So in the figure below the two lines have a small shortest distance, but the Hausdorff distance reveals that they are actually far apart at some points.
Hausdorff Distance VS Shortest Distance

The Implementation Challenge

A key difference between shortest distance and Hausdorff distance is that the pair of points defining the shortest distance always includes at least one vertex, whereas the Hausdorff distance can occur at non-vertex points. For lines, the Hausdorff distance can occur anywhere on the edges: 

For polygons it can occur on edges or in the interior of the query area:

This makes the Hausdorff distance substantially harder to implement for general 2D geometries.  While the shortest distance can be determined simply by evaluating the distance at the finite set of vertices on each geometry, the Hausdorff distance requires a way to evaluate a finite set of points out of the infinite number of non-vertex points. 

Perhaps this is why it's so hard to find an implementation of Hausdorff distance for general 2D geometry.  (Or is there just no need for a fast, accurate, general-purpose Hausdorff distance?  Surely not...)  There are some implementations for point sets, and at least one for the specific case of convex polygons.  There are a couple which may support lines (here and here), but in a seemingly crude way.  And I haven't found a single one for general polygons.  Excellent - it's good to have a challenge!

Discrete Hausdorff Distance

A simple approach is to discretize the input by densifying the linework.  The Hausdorff distance is then evaluated over the original and added vertices. The JTS Topology Suite class DiscreteHausdorffDistance implements this approach.  This algorithm was developed many years ago (2008) for use in the RoadMatcher linear network conflation tool.  It worked well enough for that use case, since inputs were typically small and the accuracy was "good enough".  But it has some serious problems:
  • achieving accuracy requires a high degree of densification of every edge, which means slow performance
  • if the Hausdorff distance occurs at a vertex, then densification is not needed, but this is impossible to determine a priori
  • the user generally has no idea what level of densification (if any) is required to determine a result of required accuracy (this is particularly problematic in automated batch processing, where geometries may require varying amounts of densification)
  • the use of a densification factor rather than a maximum segment length was a mistake.  It is hard to determine the factor needed for a desired distance accuracy, and it causes over-densification of short edges
  • it is very slow when the inputs are equal or very similar (as shown in this issue)
  • polygonal inputs are not supported
  • the internal shortest distance computation is inefficient, since it does not use an indexed algorithm
Some of these flaws could be fixed.  For instance, shortest distance computation can be improved by using IndexedFacetDistance (which was not available at the time of development). And densification could be controlled by a maximum segment length instead of a factor.  But addressing all these issues requires a fundamental rethinking of the algorithm. 

Given the wide deployment of JTS and its C++ port GEOS, any improvement stands to benefit a huge number of users.  And after 18 years it's high time this clunky old code was replaced.  So I'm happy to announce that I'm working on an entirely new implementation for Hausdorff distance which solves all the issues above.  Expect a blog post soon!

Tuesday, 8 July 2025

Fast detection of narrow polygons with JTS

Some spatial use cases require identifying "narrow" or "skinny" polygons.  A classic example is the process of cleaning polygonal coverages. Coverages (even clean ones) may contain narrow gaps between polygons which are unwanted.  But they may also contain larger gaps which are valid.  A geometric test is required to distinguish narrow polygons which are gaps (that can be merged) from wider ones (which should be retained). 

Narrowness Measures

There are various criteria which can be used to determine if a polygon is "narrow".  One approach is to use an estimate of circularity (also called compactness or roundness) such as the Polsby-Popper measure.  This is a measure based on the ratio of polygon area to the square of its perimeter.  Polygons with a measure closer to 1 are more circular than ones with lower values (which are thus more likely to be narrow).  This is easy to implement and fast to compute, but it has significant limitations:

  • it is not obvious how to pick a measure value that determines "narrowness"
  • it is based on shape-wide metrics (area and perimeter), so it may not accurately reflect the local properties of a shape (i.e. a polygon may have a low circularity measure and still contain a wide "bulge" in one place.)
  • because it is a ratio it is size-independent, so very small gaps (which should be removed) may be compact according to the measure, and are thus retained. (Adding an area limit can mitigate this, but that introduces yet another difficult-to-choose parameter.)
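For reference, the Polsby-Popper measure is 4πA / P², which is 1 for a circle and approaches 0 for very narrow shapes.  A minimal sketch (ring supplied as a closed coordinate list; all names invented):

```java
public class Circularity {
    // shoelace area of a closed ring (first point repeated at the end)
    static double area(double[][] ring) {
        double s = 0;
        for (int i = 0; i < ring.length - 1; i++)
            s += ring[i][0] * ring[i + 1][1] - ring[i + 1][0] * ring[i][1];
        return Math.abs(s) / 2;
    }

    static double perimeter(double[][] ring) {
        double p = 0;
        for (int i = 0; i < ring.length - 1; i++)
            p += Math.hypot(ring[i + 1][0] - ring[i][0],
                            ring[i + 1][1] - ring[i][1]);
        return p;
    }

    // Polsby-Popper: 4 * pi * area / perimeter^2
    static double polsbyPopper(double[][] ring) {
        double p = perimeter(ring);
        return 4 * Math.PI * area(ring) / (p * p);
    }
}
```

A unit square scores π/4 ≈ 0.785, while a 100 x 1 rectangle scores about 0.03 - but note that, per the limitations above, the score alone says nothing about where a shape is narrow.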

Maximum Inscribed Circle isRadiusWithin

Really, the best way to determine the narrowness of a polygon is to compute its actual width.  An effective way to do this is to compute the diameter of the largest circle which lies within the polygon.  This is known as the Maximum Inscribed Circle (MIC).  The JTS Topology Suite provides the MaximumInscribedCircle class to compute the centre point and radius of the required circle (see the blog post). (The algorithm is also available in downstream projects including GEOS, PostGIS and Shapely.)  The algorithm uses an iterative approximation, which currently requires specifying an approximation tolerance distance.

However, the real requirement is only to test whether the MIC radius is less than a given length value.  The actual value of the radius is not required.  Generally, an algorithm that tests for a value can be faster and more robust than one that computes the value explicitly.  This is a perfect example of that principle.  By formulating the function as a test against a provided radius length, execution can be much faster, since computation can terminate as soon as it is known that the radius must be smaller or larger than the required length.  (Determining that the MIC radius is larger than the limit is easy: an internal point is found whose distance to the polygon boundary is greater than the limit.  The radius is known to be smaller than the limit once the grid size used to generate internal points is less than one-half the limit value.  If neither of those situations occurs the approximation tolerance is used to terminate the computation.)  This has recently been implemented as the MaximumInscribedCircle.isRadiusWithin predicate.
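The short-circuit idea can be illustrated with a much-simplified sketch: sample interior points on a grid whose spacing is tied to the radius limit, and stop as soon as one is found farther than the limit from the boundary.  (All names here are invented; unlike the real algorithm, this fixed-grid version does no refinement, so it can misclassify borderline cases where the true radius is close to the limit.)

```java
public class NarrowTest {
    static double distPtSeg(double px, double py,
                            double ax, double ay, double bx, double by) {
        double dx = bx - ax, dy = by - ay;
        double len2 = dx * dx + dy * dy;
        double t = len2 == 0 ? 0 : ((px - ax) * dx + (py - ay) * dy) / len2;
        t = Math.max(0, Math.min(1, t));
        return Math.hypot(px - (ax + t * dx), py - (ay + t * dy));
    }

    static double distToBoundary(double x, double y, double[][] ring) {
        double min = Double.POSITIVE_INFINITY;
        for (int i = 0; i < ring.length - 1; i++)
            min = Math.min(min, distPtSeg(x, y,
                ring[i][0], ring[i][1], ring[i + 1][0], ring[i + 1][1]));
        return min;
    }

    // ray-crossing point-in-polygon test (ring is closed)
    static boolean contains(double x, double y, double[][] ring) {
        boolean in = false;
        for (int i = 0, j = ring.length - 2; i < ring.length - 1; j = i++) {
            if ((ring[i][1] > y) != (ring[j][1] > y)
                && x < (ring[j][0] - ring[i][0]) * (y - ring[i][1])
                       / (ring[j][1] - ring[i][1]) + ring[i][0])
                in = !in;
        }
        return in;
    }

    /**
     * Any interior point farther than the limit from the boundary proves
     * the inscribed radius exceeds the limit, so the scan can stop
     * immediately - the short-circuit that makes the test form fast.
     */
    static boolean isRadiusWithin(double[][] ring, double limit) {
        double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
        for (double[] p : ring) {
            minX = Math.min(minX, p[0]); maxX = Math.max(maxX, p[0]);
            minY = Math.min(minY, p[1]); maxY = Math.max(maxY, p[1]);
        }
        double step = limit / 2;   // grid spacing tied to the limit, not the extent
        for (double x = minX + step / 2; x < maxX; x += step)
            for (double y = minY + step / 2; y < maxY; y += step)
                if (contains(x, y, ring) && distToBoundary(x, y, ring) > limit)
                    return false;  // found a spot wider than the limit
        return true;
    }
}
```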

A further benefit of the isRadiusWithin implementation is that it removes the need to specify the approximation tolerance.  The tolerance can be determined automatically, as a fraction of the specified radius limit.  This is significant, because selecting an appropriate distance tolerance for a Maximum Inscribed Circle computation is difficult, since it has no relationship to the size of the input geometry (i.e. a geometry may have a large extent, but still be very narrow).  This is particularly challenging when a single tolerance must be chosen for a set of inputs of varying shape - precisely the use case for which this is being built.

Demonstration

As expected, evaluating the isRadiusWithin predicate is much faster than computing the actual value of the Maximum Inscribed Circle radius.  For a demonstration we use this dataset of world country borders (exploded to treat each island as a separate polygon).   This dataset has 8,393 polygons containing a total of 366,951 vertices.  

Computing the Maximum Inscribed Circle radius for each polygon, with a tolerance of 0.00001 degrees, takes 2.68 seconds.  

Evaluating the isRadiusWithin predicate, using a distance tolerance of 1 degree, executes in 0.034 seconds - almost 80 times faster.  


Usage in Coverage Cleaning

This function is used in the CoverageCleaner class to provide a fast test for narrow polygons (as specified by the user using the gapWidth parameter).  To see how this works on a real use case,  consider a polygonal coverage dataset of communes in Departement 71 in France (Saône-et-Loire).  The dataset has 564 polygons, with 612,238 vertices. As shown in a previous post on coverage cleaning, this dataset has highly non-aligned linework.  This creates lots of gaps.  The gaps can be found by cleaning the coverage with a gap tolerance distance of 0, which reveals 28,427 gaps, with 349,042 vertices.

Departement 71 communes showing the many gaps present

Close-up of gaps (note scale)

Computing the Maximum Inscribed Circle radius explicitly for each gap takes 25.4 seconds.  In contrast, using isRadiusWithin with a distance of 0.5 takes only 1.35 seconds - 18 times faster.  Since the cleaning process itself takes only 4.9 seconds, the reduction in time for the gap merging phase is significant.  

Future Work

It would be very nice to extend the automatic tolerance setting to the Maximum Inscribed Circle computation.  More research is required to see whether this can be done.

Wednesday, 30 April 2025

Coverage Cleaning in JTS

The JTS Topology Suite has been rolling out the capability to manage polygonal coverages.  It supports modelling polygonal coverages as arrays of discrete polygonal geometries.  This is simple to work with and allows using all of the wide variety of JTS algorithms.  Coverage topology allows operations such as CoverageSimplifier and CoverageUnion to be highly performant and effective.

The key requirement for coverage operations is that the input is a valid polygonal coverage.  JTS provides the CoverageValidator to check a dataset to see if it is a valid coverage, and provide full information on the location of topology errors.  However, this leaves the question of how to convert a dataset with errors into a valid coverage.  

There are two options for fixing coverages: manual editing, and automated cleaning.  CoverageValidator provides information to identify the location of all errors. However, manually fixing them is time-consuming, and difficult in environments which lack convenient editing tools. And of course it simply isn't feasible in automated workflows.

What users really want is automated coverage cleaning.  This is a major pain point (witness the numerous questions about how to do this; for example, this and this and this).  Clearly, cleaning coverages is considered critical!

There are several existing tools to do this:

  • mapshaper is perhaps the most popular tool currently.  It supports snapping to improve dataset quality, along with overlap and gap merging.  It is written in Javascript, which provides surprisingly good performance, but makes it awkward to integrate in other tools.
  • pprepair is an academically-advanced algorithm which uses triangulation as the basis for cleaning.  It relies on CGAL, which limits appeal for integration
  • GRASS v.clean is a long-standing cleaning tool with an extensive set of options for carrying out various kinds of cleaning

But in the JTS/GEOS ecosystem it's been a gap that needs filling (so to speak).  So I'm excited to announce the arrival of the JTS CoverageCleaner.

JTS CoverageCleaner class

The JTS CoverageCleaner class provides the essential capabilities of cleaning polygonal data to form a valid coverage:

  • Snapping: snapping vertices and lines eliminates small discrepancies and narrow gaps, and improves the robustness of subsequent processing
  • Overlap Merging: Overlaps of two or more polygons are merged with a neighbouring polygon.  The merge target can be chosen to be the neighbour with the longest shared border, the largest or smallest area, or the one with minimum input index (which allows priority-based merging).
  • Gap Detection and Merging: Gaps are defined as enclosed empty areas with width less than a distance tolerance.  Width is computed as the diameter of the Maximum Inscribed Circle of the gap polygon. (This is an improvement over other tools, which only offer cruder area or roundness-metric tests to identify gaps to be merged.) Gaps can be filled and merged with an adjacent polygon. 
The output of the cleaner is a valid polygonal coverage that contains no overlaps and no gaps narrower than the given tolerance.  The merged overlaps and gaps are also available for visualization and QA purposes.

Other aspects of cleaning data are provided by separate JTS classes.  Invalid polygonal geometries can be preprocessed with GeometryFixer to repair them.  CoverageSimplifier can be used to reduce the vertex size of cleaned datasets.  And a forthcoming coverage precision reducer will allow controlling the amount of precision maintained (which can reduce the space required for some storage formats).

Algorithm Description

This is a good example of how the breadth of JTS capabilities makes it easier to develop new algorithms:

  • the SnappingNoder was developed for OverlayNG, and turned out to be ideal for eliminating minor errors and providing a robust basis for the rest of the process
  • the Polygonizer allows fast recreation of a coverage topology from the noded linework
  • Spatial indexes (STRtree and Quadtree) and the fast predicates of RelateNG provide performance for operating in the discrete geometric domain
  • the new fast MaximumInscribedCircle.isRadiusWithin predicate allows using polygon width for gap detection
The cleaning algorithm works as follows:

Input

The input is an array of valid polygonal geometries.  This is the same structure used in other coverage operations.  However, it may or may not have valid coverage topology.  The array position acts as an implicit identifier for the coverage elements.

Snapping and Noding

Input linework is extracted and noded using the SnappingNoder.  Snapping vertices and lines using a distance tolerance eliminates small discrepancies and narrow gaps, and improves the robustness of noding.  By default a very small snapping distance is used (based on the diameter of the dataset).  The user can specify a custom snapping distance. Larger distances can be more effective at cleaning the linework.  However, too large a distance can introduce unwanted alteration to the data.

Polygonizing and Resultant Categorization

The noded linework is polygonized using the Polygonizer. This forms a clean topology of resultant polygons.  Resultant polygons are associated with their parent input polygon(s) via their interior points and point-in-polygon tests. Resultants are categorized according to their number of parents:
  • Result area:  one parent (which identifies the associated input polygon)
  • Overlap: multiple parents
  • Gap: no parents
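The categorization is a simple function of the parent count (a sketch with invented names):

```java
public class Resultants {
    enum Category { RESULT_AREA, OVERLAP, GAP }

    // Categorize a resultant polygon by the number of input parent
    // polygons found to contain its interior point
    static Category categorize(int parentCount) {
        if (parentCount == 0) return Category.GAP;         // no parent: a gap
        if (parentCount == 1) return Category.RESULT_AREA; // one parent: a result area
        return Category.OVERLAP;                           // multiple parents: an overlap
    }
}
```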

Overlap Merging

Overlaps are always merged, since they violate coverage topology.  They are merged with an adjacent result area chosen via a specified merge strategy.  The strategy indicates whether the merge target is chosen to be the neighbour with the longest shared border, the largest or smallest area, or the one with minimum input index (which allows priority-based merging).

Gap Detection and Merging

Gaps in the created coverage do not invalidate coverage topology, but they are often an artifact of invalidities, or may be errors in the original linework.  For this reason a distance tolerance can be used to select gaps to be merged.  Gaps which are narrower than the width tolerance are merged with the neighbour result area with longest shared border.  The width of the gap polygon is determined by the diameter of the MaximumInscribedCircle.  (This is an improvement over other tools, which only provide cruder area or roundness-metric tests to identify gaps.)
Identifying narrow gaps to be merged via Maximum Inscribed Circle radius
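The gap-selection test then reduces to comparing the Maximum Inscribed Circle diameter against the tolerance.  A minimal sketch (`micRadius` stands in for a value JTS would compute via MaximumInscribedCircle; `isMergeableGap` is a hypothetical helper, not the JTS API):

```java
public class GapWidthSketch {
    // A gap is selected for merging when its width - the diameter of its
    // Maximum Inscribed Circle, i.e. twice the radius - is below the
    // gap width tolerance.
    static boolean isMergeableGap(double micRadius, double gapWidthTol) {
        return 2 * micRadius < gapWidthTol;
    }

    public static void main(String[] args) {
        assert isMergeableGap(4.0, 10.0);   // width 8: narrower, so merged
        assert !isMergeableGap(6.0, 10.0);  // width 12: wider, so kept
        System.out.println("ok");
    }
}
```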

Output

The output is an array of the same size as the input array.  Each element is the cleaned version of the corresponding input element, and the entire array forms a valid polygonal coverage.  Due to snapping, very small inputs may disappear.  The output may still contain gaps larger than the gap width tolerance used.  (One way to detect these is to union the coverage; the result will have holes where gaps exist.)  Cleaning can be re-run with a larger tolerance to remove more gaps.

Examples

Given the number of tools available to clean polygonal coverages, it's slightly surprising that you don't have to look hard to find datasets containing coverage topology errors.  But that's good for testing!  Here are some examples demonstrating the behaviour and performance of CoverageCleaner.

Example 1: Montana County Commissioner Districts

This is one of the jurisdiction coverages available here.  The dataset contains 177 areas with 470,229 vertices.  Running CoverageValidator finds errors at over 12,000 locations:
Zooming in on an area containing errors using the JTS TestBuilder Magnify Topology tool shows that errors are caused by small misalignments in the linework of adjacent polygons, creating both gaps and overlaps:

Running the Coverage Cleaner with the default snapping tolerance and a gap width of 10 produces a clean coverage in 0.88 seconds.  Snapping removes 2,043 overlaps and 1,793 gaps.  After topology building, merging removes a further 14 overlaps and 12 gaps.  (This is determined by running the cleaning process with both the default snapping tolerance and a tolerance of 0, and counting the merged overlap and gap polygons available from the cleaner instance.)

Example 2: Communes in France Department 71

This example is a dataset representing communes in the French department 71 (Saône-et-Loire).  It may have been assembled by aggregating cadastral parcels, which is a common cause of discrepancies due to different data origins.  The dataset contains 564 areas with 612,238 vertices.  Validating the coverage reveals over 362,000 error locations (and in fact this doesn't represent all of the gaps present).

Zooming shows some example overlaps and gaps. Magnify Topology is not needed since the discrepancies are on the order of metres in size.

Running the Coverage Cleaner with the default snapping tolerance and a gap width of 50 produces a clean coverage in 23 seconds.  The slower performance compared to Example 1 is due to the much larger number of gaps and overlaps: 29,307 overlaps and 28,409 gaps.  These are almost all too wide to be handled by snapping, so they must be merged, which requires more processing.

Here's a before-and-after view of an area of the cleaned coverage, showing the effect of merging gaps and overlaps:



Gap Merging Quality Issues

Given the size and number of issues in the France-71-communes dataset, it's not surprising that it exhibits two known problems with the current gap merging strategy: spikes created by merging gaps, and narrow gap portions left unmerged.

Here is an example of how merging a gap to an adjacent polygon can create a spike in the result polygon.  (Merging overlaps does not produce these kinds of errors, since the overlap was part of the merge target originally.)

This is an example of a wide gap which is not merged, but which has narrow portions which could be "clipped off" and merged:

Improving the quality of gap merging is an area for further research and development.

GEOS and PostGIS Ports

As usual, this work will be ported to GEOS soon, and then exposed in PostGIS as a new coverage function.  It should also show up in downstream projects such as Shapely and hopefully QGIS.

Further Enhancements

Overlap and Gap Handling

There are other possibilities for controlling overlap merging and gap filling.  Merge targets could be chosen using a priority value assigned to each input area.  Gaps could be selected by area or roundness-measure criteria, as provided in other systems (although these seem less precise than using polygon width).

Gap Partitioning and Spike Removal 

Gaps which are very long can extend along multiple polygons.  If a gap of this nature is merged to a single polygon it will create a spike in the merged polygon.  Instead, it would be better to partition the gap into multiple pieces at polygon nodes and merge the parts separately.

A more complex situation is when a gap contains both narrow and wide portions.  An example of this is a parcel coverage with blocks separated by narrow streets and wider intersections and squares.  Ideally the narrow parts could be partitioned off and merged, leaving the wide portions unmerged.  

Gore Removal

Gores are narrow "inlets" along the border of a coverage.  They have a similar shape to gaps, but since they are not closed they are not identified as gaps by the cleaning process.  They could be identified by a different process, closed off, and then merged (ideally with partitioning as discussed above).

Coverage precision reduction

A common requirement in data management is to reduce the precision of data coordinates.  For coverages this cannot be done piecewise, but must be performed as a global operation over the entire coverage, since rounding coordinates independently can create line intersections.  The SnapRoundingNoder provides an effective way of doing this.
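The hazard can be illustrated with naive per-coordinate grid rounding.  The sketch below shows only the rounding step (a hypothetical `roundToGrid` helper); the point of the SnapRoundingNoder is that it applies rounding globally across all the coverage linework, handling the line intersections that independent rounding would otherwise create:

```java
public class GridRoundSketch {
    // Round an ordinate onto a precision grid defined by a scale factor
    // (e.g. scale = 100 gives a grid of 0.01 units).  Applied to polygons
    // independently, this can shift shared boundaries in different ways
    // and create intersections between neighbours - hence the need for a
    // global operation over the whole coverage.
    static double roundToGrid(double ordinate, double scale) {
        return Math.round(ordinate * scale) / scale;
    }

    public static void main(String[] args) {
        assert roundToGrid(12.34567, 100.0) == 12.35;
        assert roundToGrid(-0.004, 100.0) == 0.0;
        System.out.println("ok");
    }
}
```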


Monday, 27 May 2024

RelateNG Performance

A previous post introduced a new algorithm in the JTS Topology Suite called RelateNG.  It computes topological relationships between geometries using the Dimensionally-Extended 9 Intersection Model (DE-9IM).  

This algorithm is fundamental to a large proportion of spatial queries executed in numerous geospatial environments.  It would not be surprising to learn that the Relate algorithm is executed billions of times per day across the world's data centres.  Given this, performance is a critical metric.  

The original JTS RelateOp algorithm was hamstrung by its design, which required computing a full topology graph for each relationship evaluation.  The subsequent development of PreparedGeometry provided a significant performance boost for common spatial predicates such as intersects and covers.  A useful side-effect is that it is less susceptible to geometric robustness problems.  However, it has some drawbacks:

  • it only implements a small set of predicates
  • the design does not easily extend to more complex predicates
  • it does not support GeometryCollection inputs
  • it is a completely separate codebase from the general-purpose RelateOp, which increases maintenance effort and the risk of bugs and behaviour discrepancies
RelateNG solves all of these problems.  In addition, it provides much better performance for non-prepared (stateless) calls.  This is important for systems which do not execute spatial queries in a stateful way (due to architectural limitations, or simply lack of attention to performance), and thus cannot use prepared mode.

Performance Tests

Here are some examples of performance metrics.  
Note that the data used in the tests is sometimes subsetted to avoid skewing the results with many false results due to non-interaction between the input geometries.  This reflects the typical deployment of these predicates in systems which perform a primary extent filter (i.e. via a spatial index) before executing the full predicate.
Some of the datasets are in geodetic coordinates, which strictly speaking JTS does not handle.  But this does not matter for the purposes of performance testing.

Point / Polygon

This test queries a MultiPolygon of the outline of Europe (including larger islands) against a synthetic dataset of random points.  The test evaluates the intersects predicate, which is really the only one of interest in the Point / Polygon case.  

The test data metrics are:
  • Query MultiPolygon: 461 polygons, 22,675 vertices
  • Target Points: 10,000 points

Timings

Op        Intersects   Intersects Prep
Relate    61.7 s       21 ms
RelateNG  0.1 s        19 ms

The result clearly shows the enormous improvement in the stateless case, since RelateNG does not build topology on the input polygon.  The results are very similar in the prepared case.  This is as expected, since both codebases run a simple Point-in-Polygon test using a cached spatial index.

Line / Line

This test uses a dataset of major rivers of the continental United States.  A single river mainstem is queried against a subset of river branches using the intersects and touches relationships, in stateless and prepared modes.

The test data metrics are:
  • Query Line: 6,975 vertices
  • Target Lines: 407 lines with 47,328 vertices
 

Timings (per 100 executions of the operation)

Op        Intersects   Intersects Prep   Touches   Touches Prep
Relate    38.2 s       133 ms            36 s      N/A
RelateNG  1.18 s       142 ms            2.05 s    2.03 s

This shows the much better performance of RelateNG.  RelateNG can evaluate touches in prepared mode, but the performance is similar to stateless mode, since currently indexes are not cached for Line/Line cases.  This could be improved in a future release.

Polygon / Polygon

This test uses data from two polygonal datasets: GADM administrative boundaries, and a bedrock geology layer.  The tests query an administrative unit polygon against the subset of Bedrock polygons which intersect its extent, using intersects and covers in stateless and prepared modes. 

The test data metrics are:
  • GADM unit: 4,017 vertices
  • Bedrock polygons: 4,318 polygons with 337,650 vertices

Timings (per 100 executions of the operation)

Op        Intersects   Intersects Prep   Covers   Covers Prep
Relate    61.7 s       534 ms            54.9 s   842 ms
RelateNG  5.8 s        595 ms            6.4 s    943 ms

Polygon / Polygon - Custom Relate Pattern

This test shows the ability of RelateNG to efficiently evaluate arbitrary Relate Intersection Matrix patterns.  In this case the pattern used is F***0****, which corresponds to a relationship which could be called "point-touches": the two geometries have boundaries which intersect in one or more points (dimension 0), but interiors which do not intersect.
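The pattern semantics can be sketched with a minimal matcher (an illustration, not the RelateNG implementation): 'T' requires any intersection, 'F' requires no intersection, '*' is a wildcard, and a digit requires that exact dimension.

```java
public class PatternMatchSketch {
    // matrix:  9 chars of dimension codes 'F', '0', '1', '2'
    // pattern: 9 chars of 'T', 'F', '*', '0', '1', '2'
    static boolean matches(String matrix, String pattern) {
        for (int i = 0; i < 9; i++) {
            char m = matrix.charAt(i);
            char p = pattern.charAt(i);
            if (p == '*') continue;                     // don't care
            if (p == 'T') { if (m == 'F') return false; }  // any intersection
            else if (m != p) return false;              // exact F/0/1/2 match
        }
        return true;
    }

    public static void main(String[] args) {
        // "point-touches": boundaries meet at a point, interiors disjoint
        // (e.g. the matrix for two polygons touching at a single point)
        assert matches("FF2F01212", "F***0****");
        assert !matches("212101212", "F***0****");  // interiors intersect
        System.out.println("ok");
    }
}
```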

This test uses the GADM Level 1 boundaries for Canada.  This dataset has the advantage that Canada contains a rare example of 4 boundaries meeting at a single point (the jurisdictions of Saskatchewan, Manitoba, Northwest Territories, and Nunavut).

The test data metrics are:
  • GADM Canada Level 1: 13 polygons, 4,005,926 vertices

Timings

Op        F***0****   F***0**** Prep
Relate    504 s       N/A
RelateNG  9.8 s       6.6 s

As expected, for this test RelateNG is far more performant than Relate.  This is due to its ability to analyze Intersection Matrix patterns and perform only topology tests that are necessary, as well as not building a full topological structure of the inputs.  The test shows the effect of caching spatial indexes in prepared mode, although stateless mode is very efficient as well.

Analysis of Results

Clearly RelateNG is far more performant than Relate when running in non-prepared mode.  The PreparedGeometry implementation is slightly faster (which confirms the efficiency of its original design), but not by very much.  The difference may be a consequence of the general-purpose and thus more complex code and data structures in RelateNG. Closing this gap may be an area for future research.
One thing is certain: the RelateNG algorithm design provides much more scope for adding case-specific optimizations.  If you have a significant use case which could be improved by further targeted optimization improvements, let me know in the comments!
It will be interesting to re-evaluate these tests once RelateNG is ported to GEOS.  Sometimes (but not always) a C++ implementation can be significantly faster than Java, due to more opportunities for code and compiler optimization.

Thursday, 16 May 2024

JTS Topological Relationships - the Next Generation

The most fundamental and widely-used operations in the JTS Topology Suite are the ones that evaluate topological relationships between geometries.  JTS implements the Dimensionally-Extended 9 Intersection Model (DE-9IM), as defined in the OGC Simple Features specification, in the RelateOp API.  

DE-9IM matrix for overlapping polygons

The RelateOp algorithm was the very first one implemented during the initial JTS development, over 20 years ago.  At that time it was an appealing idea to implement a general-purpose topology framework (the GeometryGraph package), and use it to support topological predicates, overlay, and buffering.  However, some disadvantages of this approach have become evident over time:

  • the need to create a topological graph structure limits the ability to improve performance. This has led to the implementation of PreparedGeometry - but that adds further complexity to the codebase, and supports only a limited set of predicates.
  • a large number of code dependencies make it hard to fix problems and improve semantics
  • constructing a full topology graph increases exposure to geometric robustness errors
  • GeometryCollections are not supported (initially because the OGC did not define the semantics for this, and now because adding this capability is difficult)
There's an extensive list of issues relating (ha!) to RelateOp here.

This functionality is especially significant since the same algorithm is implemented in GEOS.  That codebase is used to evaluate spatial queries in popular spatial APIs such as Shapely and R-sf, and in numerous systems such as PostGIS, DuckDB, SpatiaLite, QGIS, and GDAL (to name just a few).  It would not be surprising to learn that the RelateOp algorithm is executed billions of times per day across the world's CPUs.

During the subsequent years of working on JTS I realized that there was a better way to evaluate topological relationships.  It would require a ground-up rewrite, but would avoid the shortcomings of RelateOp and provide better performance and a more tractable codebase.  Thanks to my employer Crunchy Data I have finally been able to make this idea a reality.  Soon JTS will provide a new algorithm for topological relationships called RelateNG.

Key Features of RelateNG

The RelateNG algorithm incorporates a broad spectrum of improvements over RelateOp in the areas of functionality, robustness, and performance.  It provides the following features:

  • Efficient short-circuited evaluation of topological predicates (including matching custom DE-9IM matrix patterns)
  • Optimized repeated evaluation of predicates against a single geometry via cached spatial indexes (AKA "prepared mode")
  • Robust computation (only point-local geometry topology is computed, so invalid topology does not cause failures)
  • GeometryCollection inputs containing mixed types and overlapping polygons are supported, using union semantics.
  • Zero-length LineStrings are treated as being topologically identical to Points.
  • Support for BoundaryNodeRules.

Using the RelateNG API

The main entry point is the RelateNG class.  It supports evaluating topological relationships in three different ways:

  • Evaluating a standard OGC named boolean binary predicate, specified via a TopologyPredicate instance.  Standard predicates are obtained from the RelatePredicate factory functions intersects, contains, overlaps, etc.
  • Testing an arbitrary DE-9IM relationship by matching an intersection matrix pattern (e.g. "T**FF*FF*", which is the pattern for a relation called Contains Properly).
  • Computing the full value of a DE-9IM IntersectionMatrix.
RelateNG operations can be executed in two modes: stateless and prepared.  Stateless mode is provided by the relate static functions, which accept two input geometries and evaluate a predicate.  For prepared mode, a single geometry is provided to the prepare functions to create a RelateNG instance.  Then the evaluate instance methods can be used to evaluate spatial predicates on a sequence of geometries. The instance creates spatial indexes (lazily) to make topology evaluation much more efficient.  Note that even stateless mode can be significantly faster than the current implementation, due to short-circuiting and other internal heuristics.

Here is an example of matching an intersection matrix pattern, in stateless mode:
boolean isMatched = RelateNG.relate(geomA, geomB, "T**FF*FF*");
Here is an example of setting up a geometry in prepared mode, and evaluating a named predicate on it:
RelateNG rng = RelateNG.prepare(geomA);
for (Geometry geomB : geomSet) {
    boolean predValue = rng.evaluate(geomB, RelatePredicate.intersects());
}

Rolling It Out

It's exciting to launch a major improvement on such a core piece of spatial functionality.  The Crunchy spatial team will get busy on porting this algorithm to GEOS.  From there it should get extensive usage in downstream projects.  We're looking forward to hearing feedback from our own PostGIS clients as well as other users.  We're always happy to be able to reduce query times and equally importantly, carbon footprints.  

In further blog posts I'll describe the RelateNG algorithm design and provide some examples of performance metrics.

Future Ideas

The RelateNG implementation provides an excellent foundation to build out some interesting extensions to the fundamental DE-9IM concept.

Extended Patterns

The current DE-9IM pattern language is quite limited.  In fact, it's not even powerful enough to express the standard named predicates.  It could be improved by adding features like:

  • disjunctive combinations of patterns.  For example, touches is defined by  "FT******* | F**T***** | F***T****"
  • dimension guards to specify which dimensions a pattern applies to.  For example, overlaps is defined by "[0,2] T*T***T** | [1] 1*T***T**"
  • while we're at it, might as well support dotted notation and spaces for readability; e.g. "FT*.***.***"
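Disjunctive combination is straightforward to sketch on top of a per-pattern matcher: the composite pattern matches if any of its '|'-separated alternatives does.  (Hypothetical syntax and helper functions; this is not currently supported by JTS.)

```java
public class DisjunctivePatternSketch {
    // Minimal single-pattern matcher: 'T' = any intersection, 'F' = none,
    // '*' = don't care, digits = exact dimension.
    static boolean matchesOne(String matrix, String pattern) {
        for (int i = 0; i < 9; i++) {
            char m = matrix.charAt(i), p = pattern.charAt(i);
            if (p == '*') continue;
            if (p == 'T') { if (m == 'F') return false; }
            else if (m != p) return false;
        }
        return true;
    }

    // Disjunction: the composite matches if any alternative matches.
    static boolean matchesAny(String matrix, String composite) {
        for (String alt : composite.split("\\|"))
            if (matchesOne(matrix, alt.trim())) return true;
        return false;
    }

    public static void main(String[] args) {
        // touches, expressed as a disjunction of three patterns
        String touches = "FT******* | F**T***** | F***T****";
        assert matchesAny("FF2F01212", touches);   // polygons touching at a point
        assert !matchesAny("212101212", touches);  // overlapping polygons
        System.out.println("ok");
    }
}
```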

Approximate Spatial Relationships

A requirement that comes up occasionally is to compute approximate spatial relationships between imprecise data using a distance tolerance.  This is useful in datasets where features don't match exactly due to data inaccuracy. Because RelateNG creates topology only in the neighbourhood of specified points, it should be possible to specify the neighbourhood size using a distance tolerance.  Essentially, vertices and line intersections will "snap together" to determine the effective topology.  Currently within-distance queries are often used to compute "approximate intersects".  Adding a distance tolerance capability to RelateNG will allow other kinds of approximate relationships to be evaluated as well.

Further Performance Optimizations?

A challenge with implementing algorithms over a wide variety of spatial types and use cases is how to provide general-purpose code which matches (or exceeds) the efficiency of more targeted implementations.  RelateNG analyzes the input geometries and the predicate under evaluation to tune strategies to reduce the amount of work needed to evaluate the DE-9IM.  It may be that profiling specific use cases reveals further hotspots in the code which can be improved by additional optimizations.  

Curve Support

GEOS has recently added support for representing geometries with curves.  The RelateNG design is modular enough that it should be possible to extend it to allow evaluating relationships for geometries with curves.