Tuesday, 3 March 2026

Fast Hausdorff Distance and isFullyWithinDistance in JTS

My previous blog post reviewed the concept of the Hausdorff distance (which more descriptively could be called farthest distance.) Despite its usefulness in matching geometric data, there are surprisingly few open-source implementations, and seemingly no efficient ones for linear and polygonal data.  This even includes CGAL and GRASS, which are usually reliable for provising a wide spectrum of geospatial operations.

This lack of high-quality Hausdorff extends to the JTS Topology Suite.  JTS has provided  the DiscreteHausdorffDistance class for many years,  That code is used in many geospatial systems (via JTS, and also its GEOS port), including widely used ones such as PostGIS, Shapely, and QGIS.  However, that imlementation has significant performance problems, as well as other usability flaws (detailed here).

So I'm excited to announce the release of a new fast, general-purpose algorithm for Hausdorff distance in JTS.  It's a class called DirectedHausdorffDistance.  It has the following capabilities:

  • handles all linear geometry types: points, lines and polygons
  • supports a distance tolerance parameter, to allow computing the Hausdorff distance to any desired accuracy
  • can compute distance tolerance automatically, providing a "magic-number-free" API
  • has very fast performance due to lazy densification and indexed distance computation
  • can compute the pair of points on the input geometries at which the distance is attained
  • provides prepared mode execution (caching computed indexes)
  • allows determining farthest points which lie in the interior of polygons
  • handles equal or nearly identical geometries efficiently
  • provides the isFullyWithinDistance predicate, with short-circuiting for maximum performance
Hausdorff distance VS shortest distance between linestrings
The choice of name for the new class is deliberate.  The core of the algorithm evaluates the directed Hausdorff distance from one geometry to another.  To compute the symmetric Hausdorff distance  simply involves choosing the largest of the two directed distances DHD(A, B) and DHD(B, A)  This is provided as the function DirectedHausdorffDistance.hausdorffDistance(a,b).

Indexed Shortest Distance

The Hausdorff distance depends on the standard (Euclidean) shortest distance function (as evident from the mathematical definition:
    DHD(A,B) = max a ∈ A dist(a,B)

A key performance improvement is to evaluate shortest distance using the IndexedFacetDistance class.  For point sets this optimization alone produces a significant boost.

For example, take a case of two random sets of 10,000 points.  DiscreteHausdorffDistance takes 495 msDirectedHausdorffDistance takes only 22 ms - 22x faster.   (It's also worth noting that this is similar performance to finding the shortest distance points using IndexedFacetDistance.nearestPoints).

Lazy Densification

The biggest challenge in computing the Hausdorff distance is that it can be attained at geometry locations which are not vertices. This means that linework edges must be densified to add points at which the distance can be evaluated.  The key to making this efficient is to make the computation "adaptive" by performing "lazy densification".  This avoids densifying edges where there is no chance of the farthest distance occurring.  

Densification is done recursively by bisecting segments.  To optimize finding the location with maximum distance the algorithm uses the branch-and-bound pattern.  The edge segments are stored in a priority queue, sorted by a bounding function giving the maximum possible distance for each segment.  The segment maximum distance is the largest of the distances at the endpoints.  The segment maximum distance bound is the segment maximum distance plus one-half the segment length.  (This is a tight bound.  To see why, consider a segment S of length L, at a distance of D from the target at one end and distance D + e at the other.  The farthest point on S is at a distance of L + 2D + e = L/2 + D + e/2.  This is always less than L/2 + D + e, but approaches it in the limit.)  
Proof of Maximum Distance Bound

The algorithm loops over the segments in the priority queue. The first segment in the queue always has the maximum distance bound. If this is less than the current maximum distance, the loop terminates since no greater distance will be found. If the segment distance to the target geometry is greater than the current maximum distance, it is saved as the new farthest segment.  Otherwise, the segment is bisected, subsegment endpoint distances are computed, and both are inserted back into the queue. 

Search for DHD using line segment bisection

By densifying until bisected segments drop below a given length the directed Hausdorff distance can be determined to any desired accuracy,  The accuracy distance tolerance can be user-specified, or it can be determined automatically.  This provides a "magic-number-free" API, which is significantly improves ease of use.

Performance comparison

Comparing the performance of DirectedHausdorffDistance to DiscreteHausdorffDistance is unfair, since the latter implementation is so inefficient.  However, it's the one currently in use, so the comparison is relevant.   

There are two possible situations.  The first is when the directed Hausdorff distance is attained at vertices (which is often the case when the geometry vertices are already dense; i.e. segment lengths are short relative to the distance).  As an example we will use two polygons of 6,426 and 19,645 vertices.  

DiscreteHausdorffDistance with no densification (a factor of 1) takes 1233 ms. DirectedHausdorffDistance takes 25 ms - 49x faster.  (In practice the performance difference is likely to be more extreme. There is no way to decide a priori how much densification is required for DiscreteHausdorffDistance to produce an accurate answer.  So usually a higher amount of densification will be specified.  This can severely decrease performance.)

The second situation has the Hausdorff distance attained in the middle of a segment, so densification is required.  The query polygon has 468 vertices, and the target has 65.
DirectedHausdorffDistance is run with a tolerance of 0.001, and takes 19 ms.  If DiscreteHausdorffDistance is run with a densification factor of 0.0001 to produce equivalent accuracy, it takes 1292 ms.  If the densification factor is 0.001, the time improves to 155 ms - still 8x slower, with a less accurate answer. 

Handling (mostly) equal geometries

The old Hausdorff distance algorithm had an issue reported in this post on GIS Stack Exchange.  It asks about the slow performance of a case of two nearly-identical geometries which have a very small discrepancy.  In the end the actual problem seemed to be due to the overhead of handling large geometries in PostGIS.  However, testing it with the new algorithm revealed a significant issue.  

Two nearly-identical geometries, showing discrepancy location

It turned out that the new bisection algorithm exhibited very poor performance for this case, and in general for geometries which have many coincident segments. In particular, this applies to computing the Hausdorff distance between two identical geometries.  This situation can easily happen when querying a dataset against itself. So it was essential to solve this problem.  Even worse, detecting the very small discrepancy required an accuracy tolerance of small size, which also leads to bad performance.

The problem is that the maximum distance bounding function depends on both the segment distance and the segment length. When the segment distance is very small (or zero), the distance bound is dominated by the segment length, so subdivision will continue until all segments are shorter than the accuracy tolerance.  This lead to a large number of subsegments being generated during search, particularly when the tolerance is small (as required in the above case).

The solution is to check subsegments with zero distance to see if they are coincident with a segment of the target geometry. If so, there is no need to bisect the segment further, since subsegments must also have distance zero. With this check in place, identical (and nearly so) cases executes as fast as more general cases of the same size.  Equally importantly, this detects very small discrepancies regardless of the accuracy tolerance.

For the record, the GIS-SE case now executes in about 45 ms, and detects the tiny discrepancy of 9 orders of magnitude smaller than the input geometry.

The Hausdorff distance of ~0.00099

Handling Polygonal Input

If the Hausdorff distance is attained at a point lying on an edge then densifying the linework is sufficient.  But for polygonal query geometries the farthest point can occur in the interior of the area: 
The Directed Hausdorff distance is attained at an interior point of the query polygon

To find the farthest interior point the adaptive branch-and-bound approach can be used in the area domain.  Conveniently, JTS already implements this in the MaximumInscribedCircle and LargestEmptyCircle classes (see this blog post.) In particular, LargestEmptyCircle  supports constraining the result to lie inside an area, which is exactly what is needed for the Hausdorff distance.  The target geometry is treated as obstacles, and the polygonal element(s) of the query geometry are the constraints on the location of the empty circle centre.
Directed Hausdorff Distance with multiple area constraints and heterogeneous obstacles

The LargestEmptyCircle algorithm is complex, so it might seem that it could significantly decrease performance.  In fact, it only adds an overhead of about 30%, and for many inputs it's not even noticeable.  Also, if there is no need to determine farthest points in the interior of polygons, this overhead can be avoided by using only polygon linework (i.e. the boundary) as input.  

Currently most Hausdorff distance algorithms operate on point sets, with a very few supporting linear geometry.  There seem to be none which compute the Hausdorff distance for polygonal geometries.  While this might seem an uncommon use case, in fact it's essential to support another new capability of the algorithm: computing the isFullyWithinDistance predicate for polygonal geometries.  

isFullyWithinDistance

Distance-based queries often require determining only whether the distance is less than a given value, not the actual distance value itself.  This boolean predicate can be evaluated much faster than the full distance determination, since the computation can short-circuit as soon as any point is found which confirms being over the distance limit.  It also allows using other geometric properties (such as envelopes) for a quick initial check. For shortest distance, this approach is provided by Geometry.isWithinDistance (and supporting methods in DistanceOp and other classes.). 

The equivalent predicate for Hausdorff distance is called isFullyWithinDistance.  It tests whether all points of a geometry are within a specified distance of another geometry.  This is defined in terms of the directed Hausdorff distance (and is thus an asymmetric relationship):

  isFullyWithinDistance(A,B,d) = DHD(A,B) <= d 

The DirectedHausdorffDistance class provides this predicate via the isFullyWithinDistance(A,B,dist) function.  Because the new class supports all types of input geometry (including polygons), the predicate is fully general.  For even faster performance in batch queries it can be executed in prepared mode via the isFullyWithinDistance(A,dist) method.  This mode caches the spatial indexes built on the target geometry so they can be reused.

For a performance example, consider a dataset of European boundaries (countries and islands) containing about 28K vertices.  The boundary of Germany is used as the target geometry.


If isFullyWithinDistance is run with a distance limit of 20, it takes about 60 ms.  

There's no direct comparison for DiscreteHausdorffDistance, but if that class is used to compute the directed Hausdorff distance with a conservative densification factor of 0.1, the time is about 1100 ms.  Another point of comparison is to run a shortest distance query.  This takes only 21 ms - but it's doing much less work.

A better implementation for ST_DFullyWithin

Another way to implement isFullyWithinDistance is to compute the buffer(d) of geometry B and test whether it covers A:

  isFullyWithinDistance(A,B,d) = B.buffer(d).covers(A) 

This is how the ST_DFullyWithin function in PostGIS works now.  It's a reasonable design choice given the current lack of a performant Hausdorff distance implementation.  However, there are a few problems with using buffer:
  • Buffers of complex geometry can be slow to compute, especially for large distances
  • There's a chance of robustness bugs affecting the computed buffer
  • Buffers are linearized approximations, so there is a likelihood of false negatives for query geometries which lie close to the buffer boundary
- image of buffer quantization causing failures

Now, the DirectedHausdorffDistance implementation of isFullyWithinDistance can make this function faster, more accurate, more robust and cacheable.  (And of course, the ST_HausdorffDistance function can benefit as well.)

Summary

The JTS DirectedHausdorffDistance class provides fast, cacheable, easy-to-use computation of Hausdorff distances and the isFullyWithinDistance predicate for all JTS geometry types.  This is a major improvement over the old JTS DiscreteHausdorffDistance class, and essentially fully replaces it.  More generally, it fills a notable gap in open-source geospatial functionality.  It will allow many systems to provide a high-quality implementation for Hausdorff distance.


Monday, 23 February 2026

The Hausdorff Distance Challenge

The Hausdorff Distance is a useful spatial function which can appear slightly mysterious. Partly this is due to the name.  It honours Felix Hausdorff, one of the founding fathers of topology, and a polymath who was creative in music and literature as well as mathematics.    

Felix Hausdorff  (1868-1942)

But the name conveys nothing about why this function is useful, or how it is different to the more familiar shortest distance.  The key difference is: the shortest distance tells you how close things are, but the Hausdorff distance tells you how far apart they are. So a more descriptive name might be "farthest distance" or "maximum distance".  With due respect to Dr. Hausdorff, this is one of those historical artifacts of nomenclature that deserves a refresh.  (Especially since it's becoming recognized that the core concept was actually first published by the Romanian mathematician Dimitrie Pompeiu.  Users of the future will be grateful to be spared invoking the ST_PompeiuHausdorffDistance function.)

Definition

The formal definition of the Hausdorff distance (HD) is 

     HD(A,B) = max( DHD(A,B), DHD(B,A) )

where DHD is the directed Hausdorff Distance (DHD):

     DHD(A,B) = max a ∈ A dist(a,B)

with dist(a,B) being the usual shortest distance between point a and geometry B:

     dist(a,B) =  min b ∈ B dist(a,b) 

The Hausdorff distance is symmetric and is a true distance metric.  The directed Hausdorff distance is asymmetric.  Both can be useful in different contexts.  The directed version is arguably more fundamental.  (It's certainly where the bulk of the implementation effort lies.)
Directed Hausdorff Distance is asymmetric

The main application of the Hausdorff distance is in determining how well two datasets match, by providing a measure of their similarity.  In spatial applications these are typically geometries such as lines or polygons, but they can also be point clouds or raster images.  The Hausdorff distance is much more useful than shortest distance as a similarity measure because it gives information about all the points in a shape, not just a single closest point.  While shortest distance puts a bound on how far a single point is from the target, the Hausdorff distance is a bound on every point in the query shape. So in the figure below the two lines have a small shortest distance, but the Hausdorff distance reveals that they are actually far apart at some points.
Hausdorff Distance VS Shortest Distance

The Implementation Challenge

A key difference between shortest distance and Hausdorff distance is that the pair of points defining the shortest distance always includes at least one vertex, whereas the Hausdorff distance can occur at non-vertex points. For lines, the Hausdorff distance can occur anywhere on the edges: 

For polygons it can occur on edges or in the interior of the query area:

This makes the Hausdorff distance substantially harder to implement for general 2D geometries.  While the shortest distance can be determined simply by evaluating the distance at the finite set of vertices on each geometry, the Hausdorff distance requires a way to evaluate a finite set of points out of the infinite number of non-vertex points. 

Perhaps this is why it's so hard to find an implementation of Hausdorff distance for general 2D geometry.  (Or is there just no need for a fast accurate general-purpose Hausdorff distance?  Surely not...)  There's some implementations for point sets, and at least one for the specific case of convex polygons.  There's a couple which may support lines (here and here), but in a seemingly crude way.  And I haven't found a single one for general polygons.  Excellent - it's good to have a challenge!

Discrete Hausdorff Distance

A simple approach is to discretize the input linework by densifying the linework.  The Hausdorff distance is then evaluated over the original and added vertices. The JTS Topology Suite class DiscreteHausdorffDistance implements this approach.  This algorithm was developed many years ago (2008) for use in the RoadMatcher linear network conflation tool.  It worked well enough for that use case, since inputs were typically small and the accuracy was "good enough".  But it has some serious problems:
  • achieving accuracy requires a high degree of densification of every edge, which means slow performance
  • if the Hausdorff distance occurs at a vertex, then densification is not needed, but this is impossible to determine a priori
  • the user generally has no idea what level of densification (if any) is required to determine a result of required accuracy (this is particularly problematic in automated batch processing, where geometries may require varying amounts of densification)
  • the use of a densification factor rather than a maximum segment length was a mistake.  It is hard to determine the factor needed for a desired distance accuracy, and it causes over-densification of short edges
  • it is very slow when the inputs are equal or very similar (as shown in this issue)
  • polygonal inputs are not supported
  • the internal shortest distance computation is inefficient, since it does not use an indexed algorithm
Some of these flaws could be fixed.  For instance, shortest distance computation can be improved by using IndexedFacetDistance (which was not available at the time of development). And densification could be controlled by a maximum segment length instead of a factor.  But addressing all these issues requires a fundamental rethinking of the algorithm. 

Given the wide deployment of JTS and its C++ port GEOS, any improvement stands to benefit a huge number of users.  And after 18 years it's high time this clunky old code was replaced.  So I'm happy to announce that I'm working on an entirely new implementation for Hausdorff distance which solves all the issues above.  Expect a blog post soon!