Wednesday, 30 April 2025

Coverage Cleaning in JTS

The JTS Topology Suite has been rolling out the capability to manage polygonal coverages.  It supports modelling polygonal coverages as arrays of discrete polygonal geometries.  This is simple to work with, and allows using the wide variety of JTS algorithms.  Coverage topology allows operations such as CoverageSimplifier and CoverageUnion to be highly performant and effective.

The key requirement for coverage operations is that the input is a valid polygonal coverage.  JTS provides the CoverageValidator to check whether a dataset is a valid coverage, and to provide full information on the location of topology errors.  However, this leaves the question of how to convert a dataset with errors into a valid coverage.

There are two options for fixing coverages: manual editing, and automated cleaning.  CoverageValidator provides information to identify the location of all errors.  However, manually fixing them is time-consuming, and difficult in environments which lack convenient editing tools.  And of course it simply isn't feasible in automated workflows.

What users really want is automated coverage cleaning.  This is a major pain point (witness the numerous questions about how to do this; for example, this and this and this).  Clearly, cleaning coverages is considered critical!

There are several existing tools to do this:

  • mapshaper is perhaps the most popular tool currently.  It supports snapping to improve dataset quality, along with overlap and gap merging.  It is written in JavaScript, which provides surprisingly good performance, but makes it awkward to integrate into other tools.
  • pprepair implements an academically-advanced algorithm which uses triangulation as the basis for cleaning.  It relies on CGAL, which limits its appeal for integration.
  • GRASS v.clean is a long-standing cleaning tool with an extensive set of options for carrying out various kinds of cleaning.

But in the JTS/GEOS ecosystem it's been a gap that needs filling (so to speak).  So I'm excited to announce the arrival of the JTS CoverageCleaner.

JTS CoverageCleaner class

The JTS CoverageCleaner class provides the essential capabilities of cleaning polygonal data to form a valid coverage:

  • Snapping: snapping vertices and lines eliminates small discrepancies and narrow gaps, and improves the robustness of subsequent processing
  • Overlap Merging: overlaps of two or more polygons are merged with a neighbouring polygon.  The merge target can be chosen to be the neighbour with the longest shared border, the largest or smallest area, or the one with minimum input index (which allows priority-based merging).
  • Gap Detection and Merging: Gaps are defined as enclosed empty areas with width less than a distance tolerance.  Width is computed as the diameter of the Maximum Inscribed Circle of the gap polygon. (This is an improvement over other tools, which only offer cruder area or roundness-metric tests to identify gaps to be merged.) Gaps can be filled and merged with an adjacent polygon. 

The output of the cleaner is a valid polygonal coverage that contains no overlaps and no gaps narrower than the given tolerance.  The merged overlaps and gaps are also available for visualization and QA purposes.
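Based on the description above, typical usage might look like the following sketch.  Note that the method names here are assumptions inferred from the capabilities described in this post, not a definitive API reference.

import java.util.List;
import org.locationtech.jts.coverage.CoverageCleaner;
import org.locationtech.jts.geom.Geometry;

// A minimal sketch of cleaning a coverage; method names are assumed
static Geometry[] cleanCoverage(Geometry[] coverage, double gapWidth) {
  CoverageCleaner cleaner = new CoverageCleaner(coverage);
  // 0 selects the default snapping distance (based on the dataset extent)
  cleaner.setSnappingDistance(0);
  // gaps narrower than this width are merged with a neighbour
  cleaner.setGapMaximumWidth(gapWidth);
  cleaner.clean();
  // merged overlaps and gaps are retained for visualization and QA
  List<Geometry> overlaps = cleaner.getOverlaps();
  List<Geometry> gaps = cleaner.getMergedGaps();
  System.out.println("Merged " + overlaps.size() + " overlaps and " + gaps.size() + " gaps");
  return cleaner.getResult();
}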

Other aspects of cleaning data are provided by separate JTS classes.  Invalid polygonal geometries can be preprocessed with GeometryFixer to repair them.  CoverageSimplifier can be used to reduce the vertex count of cleaned datasets.  And a forthcoming coverage precision reducer will allow controlling the amount of precision maintained (which can reduce the space required for some storage formats).

Algorithm Description

This is a good example of how the breadth of JTS capabilities makes it easier to develop new algorithms:

  • the SnappingNoder was developed for OverlayNG, and turned out to be ideal for eliminating minor errors and providing a robust basis for the rest of the process
  • the Polygonizer allows fast recreation of a coverage topology from the noded linework
  • Spatial indexes (STRtree and Quadtree) and the fast predicates of RelateNG provide performance for operating in the discrete geometric domain
  • the new fast MaximumInscribedCircle.isRadiusWithin predicate allows using polygon width for gap detection

The cleaning algorithm works as follows:

Input

The input is an array of valid polygonal geometries.  This is the same structure used in other coverage operations.  However, it may or may not have valid coverage topology.  The array position acts as an implicit identifier for the coverage elements.

Snapping and Noding

Input linework is extracted and noded using the SnappingNoder.  Snapping vertices and lines using a distance tolerance eliminates small discrepancies and narrow gaps, and improves the robustness of noding.  By default a very small snapping distance is used (based on the diameter of the dataset).  The user can specify a custom snapping distance.  Larger distances can be more effective at cleaning the linework.  However, too large a distance can introduce unwanted alteration to the data.
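To illustrate this step, here is a simplified sketch of snap-noding the extracted linework using public JTS classes (the actual cleaner internals are more involved):

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.geom.LineString;
import org.locationtech.jts.geom.util.LinearComponentExtracter;
import org.locationtech.jts.noding.NodedSegmentString;
import org.locationtech.jts.noding.SegmentString;
import org.locationtech.jts.noding.snap.SnappingNoder;

// Extract polygon boundary linework and node it with snapping
static Collection<SegmentString> snapNode(Geometry[] coverage, double snapDistance) {
  List<SegmentString> segStrings = new ArrayList<>();
  for (Geometry geom : coverage) {
    for (Object line : LinearComponentExtracter.getLines(geom)) {
      segStrings.add(new NodedSegmentString(((LineString) line).getCoordinates(), null));
    }
  }
  SnappingNoder noder = new SnappingNoder(snapDistance);
  noder.computeNodes(segStrings);
  // the noded substrings can then be polygonized (see next step)
  return noder.getNodedSubstrings();
}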

Polygonizing and Resultant Categorization

The noded linework is polygonized using the Polygonizer. This forms a clean topology of resultant polygons.  Resultant polygons are associated with their parent input polygon(s) via their interior points and point-in-polygon tests. Resultants are categorized according to their number of parents:
  • Result area:  one parent (which identifies the associated input polygon)
  • Overlap: multiple parents
  • Gap: no parents

Overlap Merging

Overlaps are always merged, since they violate coverage topology.  They are merged with an adjacent result area chosen via a specified merge strategy.  The strategy indicates whether the merge target is chosen to be the neighbour with the longest shared border, the largest or smallest area, or the one with minimum input index (which allows priority-based merging).

Gap Detection and Merging

Gaps in the created coverage do not invalidate coverage topology, but they are often artifacts of invalid input, or errors in the original linework.  For this reason a distance tolerance can be used to select gaps to be merged.  Gaps which are narrower than the width tolerance are merged with the neighbouring result area with the longest shared border.  The width of the gap polygon is determined by the diameter of the Maximum Inscribed Circle.  (This is an improvement over other tools, which only provide cruder area or roundness-metric tests to identify gaps.)
Identifying narrow gaps to be merged via Maximum Inscribed Circle radius

Output

The output is an array of the same size as the input array.  Each element is the cleaned version of the corresponding input element, and the entire array forms a valid polygonal coverage.  Due to snapping, very small inputs may disappear.  The output may still contain gaps larger than the gap width tolerance used.  (One way to detect these is to union the coverage - the result will have holes where gaps exist.)  Cleaning can be re-run with a larger tolerance to remove more gaps.
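For example, remaining gaps could be located by unioning the cleaned coverage and inspecting the holes of the result.  A sketch, assuming the array-based CoverageUnion signature; note that intentional holes in the coverage will be reported too:

import org.locationtech.jts.coverage.CoverageUnion;
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.geom.Polygon;

// Report holes in the union of a valid coverage - these are gaps
// (or intentional holes) in the coverage
static void reportGaps(Geometry[] cleanCoverage) {
  Geometry union = CoverageUnion.union(cleanCoverage);
  for (int i = 0; i < union.getNumGeometries(); i++) {
    Polygon poly = (Polygon) union.getGeometryN(i);
    for (int h = 0; h < poly.getNumInteriorRing(); h++) {
      System.out.println("Gap near " + poly.getInteriorRingN(h).getCentroid());
    }
  }
}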

Examples

Given the number of tools available to clean polygonal coverages, it's slightly surprising that you don't have to look hard to find datasets containing coverage topology errors.  But that's good for testing!  Here are some examples demonstrating the behaviour and performance of CoverageCleaner.

Example 1: Montana County Commissioner Districts

This is one of the jurisdiction coverages available here.  The dataset contains 177 areas with 470,229 vertices.  Running CoverageValidator finds errors at over 12,000 locations:
Zooming in on an area containing errors using the JTS TestBuilder Magnify Topology tool shows that errors are caused by small misalignments in the linework of adjacent polygons, creating both gaps and overlaps:

Running the Coverage Cleaner with the default snapping tolerance and a gap width of 10 produces a clean coverage in 0.88 seconds.  Snapping removes 2,043 overlaps and 1,793 gaps.  After topology building, merging removes a further 14 overlaps and 12 gaps.  (This is determined by running the cleaning process with both the default snapping tolerance and a tolerance of 0, and counting the merged overlap and gap polygons available from the cleaner instance.)

Example 2: Communes in France Department 71

This example is a dataset representing communes in the French department 71 (Saône-et-Loire).  It may have been assembled by aggregating cadastral parcels, which is a common cause of discrepancies due to different data origins.  The dataset contains 564 areas with 612,238 vertices.  Validating the coverage reveals over 362,000 error locations (and in fact this doesn't represent all of the gaps present).

Zooming shows some example overlaps and gaps. Magnify Topology is not needed since the discrepancies are on the order of metres in size.

Running the Coverage Cleaner with the default snapping tolerance and a gap width of 50 produces a clean coverage in 23 seconds.  The slower performance relative to Example 1 is due to the much larger number of gaps and overlaps: 29,307 overlaps and 28,409 gaps.  They are almost all too wide to be handled by snapping, so they must be merged, which requires more processing.

Here's a before-and-after view of an area of the cleaned coverage, showing the effect of merging gaps and overlaps:



Gap Merging Quality Issues

Given the size and number of issues in the France-71-communes dataset, it's not surprising that it exhibits two known problems with the current gap merging strategy: spikes created by merged gaps, and unmerged narrow gap parts.

Here is an example of how merging a gap to an adjacent polygon can create a spike in the result polygon.  (Merging overlaps does not produce these kinds of errors, since the overlap was part of the merge target originally.)

This is an example of a wide gap which is not merged, but which has narrow portions which could be "clipped off" and merged:

Improving the quality of gap merging is an area for further research and development.

GEOS and PostGIS Ports

As usual, this work will be ported to GEOS soon, and then exposed in PostGIS as a new coverage function.  It should also show up in downstream projects such as Shapely and hopefully QGIS.

Further Enhancements

Overlap and Gap Handling

There are other possibilities for controlling overlap merging and gap filling.  Merge targets could be chosen based on a priority value on each input area.  Gaps to be merged could be selected by area or roundness criteria, as provided in other systems (although these seem less precise than using polygon width).

Gap Partitioning and Spike Removal 

Gaps which are very long can extend along multiple polygons.  If a gap of this nature is merged to a single polygon it will create a spike in the merged polygon.  Instead, it would be better to partition the gap into multiple pieces at polygon nodes and merge the parts separately.

A more complex situation is when a gap contains both narrow and wide portions.  An example of this is a parcel coverage with blocks separated by narrow streets and wider intersections and squares.  Ideally the narrow parts could be partitioned off and merged, leaving the wide portions unmerged.  

Gore Removal

Gores are narrow "inlets" along the border of a coverage.  They have a similar shape to gaps, but since they are not closed they are not identified as gaps by the cleaning process.  They could be identified by a different process, closed off, and then merged (ideally with partitioning as discussed above).

Coverage precision reduction

A common requirement in data management is to reduce the precision of data coordinates.  For coverages this cannot be done piecewise, but must be performed as a global operation over the entire coverage, since rounding coordinates independently can create line intersections.  The SnapRoundingNoder provides an effective way of doing this.


Monday, 27 May 2024

RelateNG Performance

A previous post introduced a new algorithm in the JTS Topology Suite called RelateNG.  It computes topological relationships between geometries using the Dimensionally-Extended 9-Intersection Model (DE-9IM).

This algorithm is fundamental to a large proportion of spatial queries executed in numerous geospatial environments.  It would not be surprising to learn that the Relate algorithm is executed billions of times per day across the world's data centres.  Given this, performance is a critical metric.  

The original JTS RelateOp algorithm was hamstrung by its design, which required computing a full topology graph for each relationship evaluation.  The subsequent development of PreparedGeometry provided a significant performance boost for common spatial predicates such as intersects and covers.  A useful side-effect is that it is less susceptible to geometric robustness problems.  However, it has some drawbacks:

  • it only implements a small set of predicates
  • the design does not easily extend to more complex predicates
  • it does not support GeometryCollection inputs
  • it is a completely separate codebase from the general-purpose RelateOp, which increases maintenance effort and increases the risk of bugs and behaviour discrepancies

RelateNG solves all of these problems.  In addition, it provides much better performance for non-prepared (stateless) calls.  This is important for systems which do not execute spatial queries in a stateful way (due to architectural limitations, or simply lack of attention to performance), and thus cannot use prepared mode.

Performance Tests

Here are some examples of performance metrics.

Note that the data used in the tests is sometimes subsetted to avoid skewing the results with many false results due to non-interaction between the input geometries.  This reflects the typical deployment of these predicates in systems which perform a primary extent filter (i.e. via a spatial index) before executing the full predicate.

Some of the datasets are in geodetic coordinates, which strictly speaking JTS does not handle.  But this does not matter for the purposes of performance testing.

Point / Polygon

This test queries a MultiPolygon of the outline of Europe (including larger islands) against a synthetic dataset of random points.  The test evaluates the intersects predicate, which is really the only one of interest in the Point / Polygon case.  

The test data metrics are:
  • Query MultiPolygon: 461 polygons, 22,675 vertices
  • Target Points: 10,000 points

Timings

Op         Intersects   Intersects Prep
Relate     61.7 s       21 ms
RelateNG   0.1 s        19 ms

The result clearly shows the enormous improvement in the stateless case, since RelateNG does not build topology on the input polygon.  The results are very similar in the prepared case.  This is as expected, since both codebases run a simple Point-in-Polygon test using a cached spatial index.
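For reference, the prepared-mode test loop is essentially the following sketch (using the RelateNG API described in the introductory post):

import java.util.List;
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.operation.relateng.RelateNG;
import org.locationtech.jts.operation.relateng.RelatePredicate;

// Prepared mode: the spatial index built on the polygon is cached
// and reused for every point query
static int countIntersecting(Geometry polygon, List<Geometry> points) {
  RelateNG prep = RelateNG.prepare(polygon);
  int count = 0;
  for (Geometry pt : points) {
    if (prep.evaluate(pt, RelatePredicate.intersects()))
      count++;
  }
  return count;
}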

Line / Line

This test uses a dataset of major rivers of the continental United States.  A single river mainstem is queried against a subset of river branches using the intersects and touches relationships, in stateless and prepared modes.

The test data metrics are:
  • Query Line: 6,975 vertices
  • Target Lines: 407 lines with 47,328 vertices
 

Timings (per 100 executions of the operation)

Op         Intersects   Intersects Prep   Touches   Touches Prep
Relate     38.2 s       133 ms            36 s      N/A
RelateNG   1.18 s       142 ms            2.05 s    2.03 s

This shows the much better performance of RelateNG.  RelateNG can evaluate touches in prepared mode, but the performance is similar to stateless mode, since currently indexes are not cached for Line/Line cases.  This could be improved in a future release.

Polygon / Polygon

This test uses data from two polygonal datasets: a GADM administrative unit, and a set of bedrock geology polygons.
The tests query the administrative unit polygon against a subset of the bedrock polygons which intersect its extent, using intersects and covers in stateless and prepared modes.

The test data metrics are:
  • GADM unit: 4,017 vertices
  • Bedrock polygons: 4,318 polygons with 337,650 vertices

Timings (per 100 executions of the operation)

Op         Intersects   Intersects Prep   Covers   Covers Prep
Relate     61.7 s       534 ms            54.9 s   842 ms
RelateNG   5.8 s        595 ms            6.4 s    943 ms

Polygon / Polygon - Custom Relate Pattern

This test shows the ability of RelateNG to efficiently evaluate arbitrary Relate Intersection Matrix patterns.  In this case, the pattern used is F***0****, which corresponds to a relationship which could be called "point-touches": two geometries have boundaries which intersect in one or more points (dimension 0), but whose interiors do not intersect.
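Evaluating a custom pattern uses the same API as the named predicates.  Here is a sketch of both modes (the stateless form matches the introductory post; the prepared pattern-matching form is assumed by analogy with the predicate form):

// stateless evaluation of the "point-touches" pattern
boolean pointTouches = RelateNG.relate(geomA, geomB, "F***0****");

// prepared mode: the index on geomA is built once and reused
RelateNG rng = RelateNG.prepare(geomA);
for (Geometry geomB2 : geomSet) {
  boolean value = rng.evaluate(geomB2, "F***0****");
}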

This test uses data from the GADM Level 1 boundaries for Canada.  This has the advantage that Canada contains a rare example of 4 boundaries meeting at a single point (the jurisdictions of Saskatchewan, Manitoba, Northwest Territories, and Nunavut).

The test data metrics are:
  • GADM Canada Level 1: 13 polygons, 4,005,926 vertices

Timings

Op         F***0****   F***0**** Prep
Relate     504 s       N/A
RelateNG   9.8 s       6.6 s

As expected, for this test RelateNG is far more performant than Relate.  This is due to its ability to analyze Intersection Matrix patterns and perform only topology tests that are necessary, as well as not building a full topological structure of the inputs.  The test shows the effect of caching spatial indexes in prepared mode, although stateless mode is very efficient as well.

Analysis of Results

Clearly RelateNG is far more performant than Relate when running in non-prepared mode.  The PreparedGeometry implementation is slightly faster (which confirms the efficiency of its original design), but not by very much.  The difference may be a consequence of the general-purpose and thus more complex code and data structures in RelateNG.  Closing this gap may be an area for future research.

One thing is certain: the RelateNG algorithm design provides much more scope for adding case-specific optimizations.  If you have a significant use case which could be improved by further targeted optimization improvements, let me know in the comments!

It will be interesting to re-evaluate these tests once RelateNG is ported to GEOS.  Sometimes (but not always) a C++ implementation can be significantly faster than Java, due to more opportunities for code and compiler optimization.

Thursday, 16 May 2024

JTS Topological Relationships - the Next Generation

The most fundamental and widely-used operations in the JTS Topology Suite are the ones that evaluate topological relationships between geometries.  JTS implements the Dimensionally-Extended 9 Intersection Model (DE-9IM), as defined in the OGC Simple Features specification, in the RelateOp API.  

DE-9IM matrix for overlapping polygons

The RelateOp algorithm was the very first one implemented during the initial JTS development, over 20 years ago.  At that time it was an appealing idea to implement a general-purpose topology framework (the GeometryGraph package), and use it to support topological predicates, overlay, and buffering.  However, some disadvantages of this approach have become evident over time:

  • the need to create a topological graph structure limits the ability to improve performance. This has led to the implementation of PreparedGeometry - but that adds further complexity to the codebase, and supports only a limited set of predicates.
  • a large number of code dependencies make it hard to fix problems and improve semantics
  • constructing a full topology graph increases exposure to geometric robustness errors
  • GeometryCollections are not supported (initially because the OGC did not define the semantics for this, and now because adding this capability is difficult)

There's an extensive list of issues relating (ha!) to RelateOp here.

The importance of this functionality is especially significant since the same algorithm is implemented in GEOS.  That codebase is used to evaluate spatial queries in popular spatial APIs such as Shapely and R-sf, and numerous systems such as PostGIS, DuckDB, SpatiaLite, QGIS, and GDAL (to name just a few).  It would not be surprising to learn that the RelateOp algorithm is executed billions of times per day across the world's CPUs.

During the subsequent years of working on JTS I realized that there was a better way to evaluate topological relationships.  It would require a ground-up rewrite, but would avoid the shortcomings of RelateOp and provide better performance and a more tractable codebase.  Thanks to my employer Crunchy Data I have finally been able to make this idea a reality.  Soon JTS will provide a new algorithm for topological relationships called RelateNG.

Key Features of RelateNG

The RelateNG algorithm incorporates a broad spectrum of improvements over RelateOp in the areas of functionality, robustness, and performance.  It provides the following features:

  • Efficient short-circuited evaluation of topological predicates (including matching custom DE-9IM matrix patterns)
  • Optimized repeated evaluation of predicates against a single geometry via cached spatial indexes (AKA "prepared mode")
  • Robust computation (only point-local geometry topology is computed, so invalid topology does not cause failures)
  • GeometryCollection inputs containing mixed types and overlapping polygons are supported, using union semantics.
  • Zero-length LineStrings are treated as being topologically identical to Points.
  • Support for BoundaryNodeRules.

Using the RelateNG API

The main entry point is the RelateNG class.  It supports evaluating topological relationships in three different ways:

  • Evaluating a standard OGC named boolean binary predicate, specified via a TopologyPredicate instance.  Standard predicates are obtained from the RelatePredicate factory functions intersects, contains, overlaps, etc.
  • Testing an arbitrary DE-9IM relationship by matching an intersection matrix pattern (e.g. "T**FF*FF*", which is the pattern for a relation called Contains Properly).
  • Computing the full value of a DE-9IM IntersectionMatrix.

RelateNG operations can be executed in two modes: stateless and prepared.  Stateless mode is provided by the relate static functions, which accept two input geometries and evaluate a predicate.  For prepared mode, a single geometry is provided to the prepare functions to create a RelateNG instance.  Then the evaluate instance methods can be used to evaluate spatial predicates on a sequence of geometries.  The instance creates spatial indexes (lazily) to make topology evaluation much more efficient.  Note that even stateless mode can be significantly faster than the current implementation, due to short-circuiting and other internal heuristics.

Here is an example of matching an intersection matrix pattern, in stateless mode:
boolean isMatched = RelateNG.relate(geomA, geomB, "T**FF*FF*");
Here is an example of setting up a geometry in prepared mode, and evaluating a named predicate on it:
RelateNG rng = RelateNG.prepare(geomA);
for (Geometry geomB : geomSet) {
    boolean predValue = rng.evaluate(geomB, RelatePredicate.intersects());
}

Rolling It Out

It's exciting to launch a major improvement on such a core piece of spatial functionality.  The Crunchy spatial team will get busy on porting this algorithm to GEOS.  From there it should get extensive usage in downstream projects.  We're looking forward to hearing feedback from our own PostGIS clients as well as other users.  We're always happy to be able to reduce query times and equally importantly, carbon footprints.  

In further blog posts I'll describe the RelateNG algorithm design and provide some examples of performance metrics.

Future Ideas

The RelateNG implementation provides an excellent foundation to build out some interesting extensions to the fundamental DE-9IM concept.

Extended Patterns

The current DE-9IM pattern language is quite limited.  In fact, it's not even powerful enough to express the standard named predicates.  It could be improved by adding features like:

  • disjunctive combinations of patterns.  For example, touches is defined by  "FT******* | F**T***** | F***T****"
  • dimension guards to specify which dimensions a pattern applies to.  For example, overlaps is defined by "[0,2] T*T***T** | [1] 1*T***T**"
  • while we're at it, might as well support dotted notation and spaces for readability; e.g. "FT*.***.***"

Approximate Spatial Relationships

A requirement that comes up occasionally is to compute approximate spatial relationships between imprecise data using a distance tolerance.  This is useful in datasets where features don't match exactly due to data inaccuracy. Because RelateNG creates topology only in the neighbourhood of specified points, it should be possible to specify the neighbourhood size using a distance tolerance.  Essentially, vertices and line intersections will "snap together" to determine the effective topology.  Currently within-distance queries are often used to compute "approximate intersects".  Adding a distance tolerance capability to RelateNG will allow other kinds of approximate relationships to be evaluated as well.

Further Performance Optimizations?

A challenge with implementing algorithms over a wide variety of spatial types and use cases is how to provide general-purpose code which matches (or exceeds) the efficiency of more targeted implementations.  RelateNG analyzes the input geometries and the predicate under evaluation to tune strategies to reduce the amount of work needed to evaluate the DE-9IM.  It may be that profiling specific use cases reveals further hotspots in the code which can be improved by additional optimizations.  

Curve Support

GEOS has recently added support for representing geometries with curves.  The RelateNG design is modular enough that it should be possible to extend it to allow evaluating relationships for geometries with curves.  


Monday, 13 March 2023

Simplifying Polygonal Coverages with JTS

A new capability for the JTS Topology Suite is operations to process Simple Polygonal Coverages.  A Simple Polygonal Coverage is a set of edge-matched, non-overlapping polygonal geometries (which may be non-contiguous, and have holes).  Typically this is used to model an area in which every point has a value from some domain.  A classic example of a polygonal coverage is a set of administrative boundaries, such as those available from GADM or Natural Earth.

GADM polygonal coverage for France Level 1   (198,350 vertices)

The first coverage operations provided are:

  • Coverage Validation, to check if a set of polygons forms a topologically-valid coverage
  • Coverage Union, which takes advantage of coverage topology to provide a very fast union operation

Another operation on polygonal coverages is simplification.  Simplifying a coverage reduces the number of vertices it contains, while preserving the coverage topology.  Preserving topology means that the simplified polygons still form a valid coverage, and that polygons which had a common edge in the input coverage (i.e. which were adjacent) still share an edge in the simplified result.  Reducing dataset size via simplification can provide more efficient storage and download, and faster visualization at smaller scales.

France coverage simplified with tolerance 0.01   (7,918 vertices)
The decrease in resolution is hardly noticeable at this scale

Closeup of simplified coverage (tolerance = 0.01)

Simplification is perhaps the most requested algorithm for polygonal coverages (for example, see GIS StackExchange questions here, here and here.)  My colleague Paul Ramsey sometimes calls it the "killer app" for polygonal coverages.  Earlier this century there was no easily-available software implementing this capability.  Users often had to resort to the complex approach of extracting the linework from the dataset, dissolving it, simplifying the lines (with a tool which would not cause further overlaps), re-polygonizing, and re-attaching feature attribution to the result geometries. 

More recently tooling has emerged to provide this functionality.  Simplification is the raison d'être of the widely-used and cited MapShaper tool.  GRASS has the v.generalize module.  And good old OpenJUMP added Simplify Polygon Coverage a while back.

JTS has provided the TopologyPreservingSimplifier algorithm for many years, but this only operates on Polygons and MultiPolygons, not on polygonal coverages.  (Attempting to use it on polygonal coverages can introduce gaps and overlaps, resulting in complaints like this.)  But coverage simplification has been lacking in JTS/GEOS - until now.

JTS Coverage Simplification

Recent work on Polygon Hulls provided ideas for an implementation of coverage simplification. The CoverageSimplifier class uses an area-based simplification approach similar to the well-known Visvalingam-Whyatt algorithm. This provides good results for simplifying areal features (as opposed to linear ones).  It's possible to use a Douglas-Peucker based approach as well, so this may be a future option.

The degree of simplification is determined by a tolerance value.  The value is roughly equivalent to the maximum distance by which a simplified edge can change (technically speaking, it is the square root of the area tolerance for the Visvalingam-Whyatt algorithm).

The algorithm progressively simplifies all coverage edges, while ensuring that no edge crosses or touches another edge (except at shared endpoints).  This provides the maximum amount of simplification (up to the tolerance) while still maintaining the coverage topology.

The coverage of course should be valid according to the JTS CoverageValidator class.  Invalid coverages can still be simplified, but only edges with valid topology will have it maintained.
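Usage is straightforward.  A sketch, assuming the static convenience functions simplify and simplifyInner; coverage is an array of polygonal geometries forming a valid coverage:

import org.locationtech.jts.coverage.CoverageSimplifier;
import org.locationtech.jts.geom.Geometry;

// Simplify a coverage with a distance tolerance of 0.01.
// Output elements correspond index-wise to the inputs, so feature
// attributes can be re-attached directly.
Geometry[] simplified = CoverageSimplifier.simplify(coverage, 0.01);

// Inner simplification: only shared (inner) edges are simplified
// (discussed below)
Geometry[] innerSimplified = CoverageSimplifier.simplifyInner(coverage, 0.01);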

France coverage simplified with tolerance = 0.1   (1,928 vertices)

France coverage simplified with tolerance = 0.5    (1,527 vertices)
The coverage topology (adjacency relationship) is always preserved.


Inner Simplification

The implementation also provides the interesting option of Inner Simplification.  This mode simplifies only inner, shared edges, leaving the outer boundary edges unchanged.  This allows portions of a coverage to be simplified, by ensuring that the simplified polygons will fit exactly into the original coverage.  (This GIS-SE question is an example of how this can be used.)

Inner Simplification of France coverage

It would also be possible to provide Outer Simplification, where only outer edges are simplified.  It's not clear what the use case would be for this - if you have ideas, leave a comment!

GEOS and PostGIS

As usual, this algorithm will be ported to GEOS, from where it will be available to downstream projects such as PostGIS and Shapely.  

For PostGIS, the intention is to expose this as a window function (perhaps ST_CoverageSimplify).  That will make it easy to process a set of features (records) and maintain the attributes for each feature.



Monday, 6 March 2023

Fast Coverage Union in JTS

The next operation delivered in the build-out of Simple Polygonal Coverages in the JTS Topology Suite is Coverage Union. This is simply the topological union of a set of polygons in a polygonal coverage, producing one or more polygons as the result.  (This is sometimes called "dissolve" in the context of polygonal coverages.)


Union of polygons has long been available in JTS, most recently (and robustly) as the UnaryUnion capability of OverlayNG.  This makes use of the Cascaded Union technique to provide good performance for large sets of polygons.  But the constrained structure of polygonal coverages means unions can be computed much faster than even Cascaded Union.  Essentially, the duplicated inner edges of adjacent polygons are identified and discarded, leaving only the outer boundary of the unioned area.  Because a valid polygonal coverage has edges which match exactly, identifying duplicate segments can be done with a fast equality test.  Also, there is no need for computationally-expensive techniques to ensure geometric robustness.

This is now available in JTS as the CoverageUnion class.
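Usage is a single call.  A sketch, assuming the static array-based signature; coverage is an array of polygons forming a valid coverage:

import org.locationtech.jts.coverage.CoverageUnion;
import org.locationtech.jts.geom.Geometry;

// Union a valid polygonal coverage.  Duplicated inner edges are
// discarded by exact matching, so no robustness heuristics are needed.
Geometry union = CoverageUnion.union(coverage);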

Performance

To test the performance of Coverage Union we need some large clean polygonal coverages. These are nicely provided by the GADM repository of worldwide administrative areas.  

Here are some metrics comparing the performance of Coverage Union against OverlayNG Unary Union (which uses the Cascaded Union technique).  Coverage Union is much faster for all sizes of dataset.


Dataset           Polygons   Vertices    Coverage Union   OverlayNG Union   Times Faster
France level 4    3,728      407,102     0.3 s            8.5 s             28 x
France level 5    36,612     729,573     1.07 s           13.9 s            13 x
Germany level 4   11,302     2,162,184   0.68 s           27.3 s            40 x

Union by Attribute

It's worth noting that unioning a set of polygons in a coverage leaves the boundary of the unioned area perfectly unchanged.  So subsets of a coverage can be unioned, and the result still forms a valid polygonal coverage.  This provides a fast Union by Attribute capability, which is a common spatial requirement.
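Here is a sketch of how Union by Attribute might be implemented on top of CoverageUnion.  Feature here is a hypothetical record pairing an attribute value with a coverage polygon, not a JTS class:

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.locationtech.jts.coverage.CoverageUnion;
import org.locationtech.jts.geom.Geometry;

// Feature is a hypothetical pairing of an attribute value
// with its coverage polygon
record Feature(String value, Geometry geometry) {}

// Group coverage polygons by attribute value, then union each subset.
// Each subset is itself a valid coverage, so each union is fast.
static Map<String, Geometry> unionByAttribute(List<Feature> features) {
  return features.stream()
      .collect(Collectors.groupingBy(Feature::value,
          Collectors.collectingAndThen(Collectors.toList(),
              list -> CoverageUnion.union(
                  list.stream().map(Feature::geometry).toArray(Geometry[]::new)))));
}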

GEOS and PostGIS

GEOS already supports a GEOSCoverageUnion operation.  At some point it would be nice to expose this in PostGIS, most likely as a new aggregate function (perhaps ST_UnionCoverage).


Friday, 27 January 2023

Alpha Shapes in JTS

Recently JTS gained the ability to compute Concave Hulls of point sets.  The algorithm used is based on the Chi-shapes approach described by Duckham et al.  It works by eroding border triangles from the Delaunay Triangulation of the input points, in order of longest triangle edge length, down to a threshold length provided as a parameter value.

Concave Hull of Ukraine (Edge-length = 10)

Alpha-Shapes

Another popular approach for computing concave hulls is the Alpha-shape construction originally proposed by Edelsbrunner et al.  Papers describing alpha-shapes tend to be lengthy and somewhat opaque theoretical expositions, but the actual algorithm is fairly simple.  It is also based on eroding Delaunay triangles, but uses the circumradius of the triangles as the criterion for removal.  Border triangles whose circumradius is greater than alpha are removed.

Another way of understanding this is to imagine triangles on the border of the triangulation being "scooped out" by a disc of radius alpha:

Construction of an Alpha-shape via "scooping" with an alpha-radius disc

Given the similar basis of both algorithms, it was easy to generalize the JTS ConcaveHull class to be able to compute alpha-shapes as well.  This is now available in JTS via the ConcaveHull.alphaShape function.
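Here is a sketch of computing both hull types on a point geometry (the alphaShape signature, including the holes-allowed flag, is an assumption):

import org.locationtech.jts.algorithm.hull.ConcaveHull;
import org.locationtech.jts.geom.Geometry;

// Edge-length-based concave hull: erode border triangles whose
// longest border edge exceeds the length parameter
Geometry hull = ConcaveHull.concaveHullByLength(points, 10.0);

// Alpha-shape: erode border triangles whose circumradius exceeds alpha
Geometry shape = ConcaveHull.alphaShape(points, 10.0, false);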

Seeking Alpha

An issue that became apparent was: what is the definition of alpha?  There are at least three alternatives in use:

  • The original Edelsbrunner 1983 paper defines alpha as the inverse of the radius of the eroding disc.
  • In a later 1994 paper Edelsbrunner defines alpha as the radius of the eroding disc.  The same definition is used in his 2010 survey of alpha shape research.
  • The CGAL geometry library implements alpha to be the square of the radius of the eroding disc.  (The derivation of this definition is not obvious to me; perhaps it is intended to make the parameter have areal units?)

The simplest option is to define alpha to be the radius of the eroding disc.  This has the additional benefit of being congruent with the edge-length parameter.  For both parameters, 0 produces a result with maximum concaveness, and for values larger than some (data-dependent) value the result is the convex hull of the input.

Alpha-Shape VS Concave Hull

An obvious question is: how do alpha-shapes differ from edge-length-based concave hulls?  Having both in JTS makes it easy to compare the hulls on various datasets.  For comparison purposes the alpha and edge-length parameters are chosen to produce hulls with the same area (as close as possible).

Here's an illustration of an Alpha-shape and a Concave Hull generated for the same dataset, with essentially identical area.

Overall the shapes are fairly similar.  Alpha-shapes tend to follow the data points more closely in "shallow bays", whereas the concave hull tends to include them.  Conversely, the concave hull can sometimes erode deeper bays.  The effect of keeping shallow bays is to make the Concave Hull slightly smoother than the Alpha-shape.

Avoiding Cavities

However, there is one way in which alpha-shapes can be less well-formed than concave hulls.  The measurement of a circumradius is independent of the orientation of the triangle.  This means it is not sensitive to whether the triangle has a long or short edge on the border of the triangulation.  This can result in what I am calling "cavitation". Cavitation occurs where narrow triangles reaching deep into the interior of the dataset are removed. This in turn exposes more interior triangles to being eroded.  This is visible in the example below, where the Alpha-shape contains a large cavity which is obviously not desirable. 


Alpha-shape with a "cavity"

This is shown in more detail below.  The highlighted triangle is removed due to its large circumradius, even though it has a relatively small border edge.  This in turn exposes more triangles to being removed, forming the undesirable cavity.
Alpha-shape with Delaunay triangle forming a "cavity"

The edge-length metric does not have this problem, since it considers only the edges along the (current) border of the triangulation.  (In fact, the presence of the cavity in the example above means that the equal-area comparison strategy causes more erosion artifacts in the Concave Hull than might otherwise be present.)

For an even more egregious example of this issue, here is an Alpha-shape of the Ukraine dataset:
Alpha-shape of Ukraine (alpha = 7)

The effect of cavitation is obvious.  Although it can be eliminated by increasing the alpha radius, note that the Crimean Peninsula is already less well-delineated than in the Concave Hull, and increasing alpha would make this worse.

Recommendation

Given the risk of undesired cavitation, the safest option for computing Concave Hulls is to use the Edge-Length approach, rather than Alpha-shapes.   

Future Work

  • Best of Both? - perhaps there is a strategy which combines aspects of both Alpha-shapes and Concave Hulls, to erode shallow bays but avoid the phenomenon of cavitation?
  • Disconnected Hulls - the "classic" Alpha-shape definition allows disconnected MultiPolygons as a result.  Currently, the JTS algorithm always produces a result which is a connected single polygon.  This seems like the most commonly desired behaviour, but there is a possibility to extend the implementation to optionally allow a disconnected result.  CGAL provides a parameter to control the number of polygons in the result, which could be implemented as well.

Thursday, 13 October 2022

Relational Properties of DE-9IM spatial predicates

There is an elegant mathematical theory of binary relations.  Homogeneous relations are an important subclass of binary relations in which both domains are the same.  A homogeneous relation R is a subset of all ordered pairs (x,y) with x and y elements of the domain.  This can be thought of as a boolean-valued function R(x,y), which is true if the pair has the relationship and false if not.

The restricted structure of homogeneous relations allows describing them by various properties, including:

Reflexive - R(x, x) is always true

Irreflexive - R(x, x) is never true

Symmetric - if R(x, y), then R(y, x)

Antisymmetric - if R(x, y) and R(y, x), then x = y

Transitive - if R(x, y) and R(y, z), then R(x, z)

The Dimensionally Extended 9-Intersection Model (DE-9IM) represents the topological relationship between two geometries.  

Various useful subsets of spatial relationships are specified by named spatial predicates.  These are the basis for spatial querying in many spatial systems including the JTS Topology Suite, GEOS and PostGIS.

The spatial predicates are homogeneous binary relations over the Geometry domain.  They can thus be categorized in terms of the relational properties above.  The following table shows the properties of the standard predicates:

Predicate    Reflexive/Irreflexive   Symmetric/Antisymmetric   Transitive
Equals       R                       S                         T
Intersects   R                       S                         -
Disjoint     R                       S                         -
Contains     R                       A                         -
Within       R                       A                         -
Covers       R                       A                         T
CoveredBy    R                       A                         T
Crosses      I                       S                         -
Overlaps     I                       S                         -
Touches      I                       S                         -

Notes

  • Contains and Within are not Transitive because of the quirk that "Polygons do not contain their Boundary" (explained in this post).  A counterexample is a Polygon that contains a LineString lying in its boundary and interior, with the LineString containing a Point that lies in the Polygon boundary.  The Polygon does not contain the Point.  So Contains(poly, line) = true and Contains(line, pt) = true, but Contains(poly, pt) = false.   The predicates Covers and CoveredBy do not have this idiosyncrasy, and thus are transitive.
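This counterexample is easy to verify directly.  A sketch with concrete geometries (any similar configuration works):

import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.io.WKTReader;

public static void main(String[] args) throws Exception {
  WKTReader reader = new WKTReader();
  Geometry poly = reader.read("POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))");
  // the line lies partly in the polygon boundary, partly in the interior
  Geometry line = reader.read("LINESTRING (0 2, 0 5, 5 5)");
  // the point lies on the line, in the polygon boundary
  Geometry pt = reader.read("POINT (0 3)");

  System.out.println(poly.contains(line));  // true
  System.out.println(line.contains(pt));    // true
  System.out.println(poly.contains(pt));    // false - Contains is not transitive
}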

Wednesday, 24 August 2022

Validating Polygonal Coverages in JTS

The previous post discussed polygonal coverages and outlined the plan to support them in the JTS Topology Suite.  This post presents the first step of the plan: algorithms to validate polygonal coverages.  This capability is essential, since coverage algorithms rely on valid input to provide correct results.  And as will be seen below, coverage validity is usually not obvious, and cannot be taken for granted.  

As described previously, a polygonal coverage is a set of polygons which fulfils a specific set of geometric conditions.  Specifically, a set of polygons is coverage-valid if and only if it satisfies the following conditions:

  • Non-Overlapping - polygon interiors do not intersect
  • Edge-Matched (also called Vector-Clean and Fully-Noded) - the shared boundaries of adjacent polygons have the same set of vertices in both polygons

The Non-Overlapping condition ensures that no point is covered by more than one polygon.  The Edge-Matched condition ensures that coverage topology is stable under transformations such as reprojection, simplification and precision reduction (since even if vertices are coincident with a line segment in the original dataset, this is very unlikely to be the case when the data is transformed).
An invalid coverage which violates both (L) Non-Overlapping and (R) Edge-Matched conditions (note the different vertices in the shared boundary of the right-hand pair)

Note that these rules allow a polygonal coverage to cover disjoint areas.  They also allow internal gaps to occur between polygons.  Gaps may be intentional holes, or unwanted narrow gaps caused by mismatched boundaries of otherwise adjacent polygons.  The difference is purely one of size.  In the same way, unwanted narrow "gores" may occur in valid coverages.  Detecting undesirable gaps and gores will be discussed further in a subsequent post. 

Computing Coverage Validation

Coverage validity is a global property of a set of polygons, but it can be evaluated in a local and piecewise fashion.  To confirm a coverage is valid, it is sufficient to check every polygon against each adjacent (intersecting) polygon to determine if any of the following invalid situations occur:
  • Interiors Overlap:
    • the polygon linework crosses the boundary of the adjacent polygon
    • a polygon vertex lies within the adjacent polygon
    • the polygon is a duplicate of the adjacent polygon
  • Edges do not Match:
    • two segments in the boundaries of the polygons intersect and are collinear, but are not equal 

If neither of these situations is present, then the target polygon is coverage-valid with respect to the adjacent polygon.  If all polygons are coverage-valid against every adjacent polygon then the coverage as a whole is valid.

For a given polygon it is more efficient to check all adjacent polygons together, since this allows faster checking of valid polygon boundary segments.  When validation is used on datasets which are already clean, or mostly so, this improves the overall performance of the algorithm.  

Evaluating coverage validity in a piecewise way allows the validation process to be parallelized easily, and executed incrementally if required.

JTS Coverage Validation

Validation of a single coverage polygon is provided by the JTS CoveragePolygonValidator class.  If a polygon is coverage-invalid due to one or more of the above situations, the class computes the portion(s) of the polygon boundary which cause the failure(s).  This allows the locations and number of invalidities to be determined and visualized.

The class CoverageValidator computes coverage-validity for an entire set of polygons.  It reports the invalid locations for all polygons which are not coverage-valid (if any). 
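A sketch of typical usage (assuming validate returns, for each input polygon, the invalid boundary linework, or null if that polygon is coverage-valid):

import org.locationtech.jts.coverage.CoverageValidator;
import org.locationtech.jts.geom.Geometry;

// Validate a coverage and report the locations of invalid linework
static void checkCoverage(Geometry[] coverage) {
  Geometry[] invalidLines = CoverageValidator.validate(coverage);
  for (int i = 0; i < invalidLines.length; i++) {
    if (invalidLines[i] != null)
      System.out.println("Polygon " + i + " invalid at: " + invalidLines[i]);
  }
}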

Using spatial indexing makes checking coverage validity quite performant.  For example, a coverage containing 91,384 polygons with 10,474,336 vertices took only 6.4 seconds to validate.  In this case the coverage is nearly valid, since only one invalid polygon was found. The invalid boundary linework returned by CoverageValidator allows easily visualizing the location of the issue.

A polygonal dataset of 91,384 polygons, containing a single coverage-invalid polygon

The invalid polygon is a tiny sliver, with a single vertex lying a very small distance inside an adjacent polygon.  The discrepancy is only visible using the JTS TestBuilder Reveal Topology mode.

The size of the discrepancy is very small.  The vertex causing the overlap is only 0.0000000001 units away from being valid:

[921]  POLYGON(632)
[922:4]  POLYGON(5)
Ring-CW  Vert[921:0 514]  POINT ( 960703.3910000008 884733.1892000008 )
Ring-CW  Vert[922:4:0 3]  POINT ( 960703.3910000008 884733.1893000007 )

This illustrates the importance of having fast, robust automated validity checking for polygonal coverages, and providing information about the exact location of errors.

Real-world testing

With coverage validation now available in JTS, it's interesting to apply it to publicly available datasets which (should) have coverage topology.  It is surprising how many contain validity errors.  Here are a few examples:

Source: City of Vancouver
Dataset: Zoning Districts


This dataset contains 1,498 polygons with 57,632 vertices. There are 379 errors identified, which mainly consist of very small discrepancies between vertices of adjacent polygons.
Example of a discrepant vertex in a polygon


Source: British Ordnance Survey OpenData
Dataset: Boundary-Line
File: unitary_electoral_division_region.shp



This dataset contains 1,178 polygons with 2,174,787 vertices. There are 51 errors identified, which mainly consist of slight discrepancies between vertices of adjacent polygons. (Note that this does not include gaps, which are not detected by CoverageValidator.  There are about 100 gaps in the dataset as well.)
An example of overlapping polygons in the Electoral Division dataset


Source: Hamburg Open Data Platform
Dataset: VerwaltungsEinheit (Administrative Units)


The dataset (slightly reduced) contains 7 polygons with 18,254 vertices.  Coverage validation produces 64 error locations.  The errors are generally small vertex discrepancies producing overlaps.  Gaps exist as well, but are not detected by the default CoverageValidator usage.
An example of overlapping polygons (and a gap) in the VerwaltungsEinheit dataset


As always, this code will be ported to GEOS.  A further goal is to provide this capability in PostGIS, since there are likely many datasets which could benefit from this checking.  The piecewise implementation of the algorithm should mesh well with the nature of SQL query execution.

And of course the next logical step is to provide the ability to fix errors detected by coverage validation.   This is a research project for the near term.

UPDATE: my colleague Paul Ramsey pointed out that he has already ported this code to GEOS.  Now for some performance testing!