Tuesday, 8 July 2025

Fast detection of narrow polygons with JTS

Some spatial use cases require identifying "narrow" or "skinny" polygons.  A classic example is the process of cleaning polygonal coverages. Coverages (even clean ones) may contain narrow gaps between polygons which are unwanted.  But they may also contain larger gaps which are valid.  A geometric test is required to distinguish narrow polygons which are gaps (that can be merged) from wider ones (which should be retained). 

Narrowness Measures

There are various criteria which can be used to determine if a polygon is "narrow".  One approach is to use an estimate of circularity (also called compactness or roundness) such as the Polsby-Popper measure.  This measure is based on the ratio of polygon area to the square of its perimeter (specifically, 4πA / P²).  Polygons with a measure closer to 1 are more circular; ones with lower values are less compact, and thus more likely to be narrow.  This is easy to implement and fast to compute, but it has significant limitations:

  • it is not obvious how to pick a measure value that determines "narrowness"
  • it is based on shape-wide metrics (area and perimeter), so it may not accurately reflect the local properties of a shape (i.e. a polygon may have a low circularity measure and still contain a wide "bulge" in one place.)
  • because it is a ratio it is size-independent, so very small gaps (which should be removed) may be compact according to the measure, and are thus retained. (Adding an area limit can mitigate this, but that introduces yet another difficult-to-choose parameter.)
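
The measure itself is a one-liner with the JTS Geometry API.  Here is a minimal sketch (note that for polygonal geometries getLength returns the perimeter):

    import org.locationtech.jts.geom.Geometry;

    // Polsby-Popper circularity: 4 * Pi * A / P^2.
    // Equals 1.0 for a perfect circle; lower values indicate less compact
    // (and thus potentially narrower) shapes.
    public static double polsbyPopper(Geometry polygon) {
        double area = polygon.getArea();
        double perimeter = polygon.getLength();
        return (4 * Math.PI * area) / (perimeter * perimeter);
    }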

Maximum Inscribed Circle isRadiusWithin

Really, the best way to determine the narrowness of a polygon is to compute its actual width.  An effective way to do this is to compute the diameter of the largest circle which lies within the polygon.  This is known as the Maximum Inscribed Circle (MIC).  The JTS Topology Suite provides the MaximumInscribedCircle class to compute the centre point and radius of the required circle (see the blog post).  (The algorithm is also available in downstream projects including GEOS, PostGIS and Shapely.)  The algorithm uses an iterative approximation, which currently requires specifying an approximation tolerance distance.
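
Computing the MIC explicitly looks like the following sketch, using the standard JTS API (polygon and tolerance are assumed to be defined):

    import org.locationtech.jts.algorithm.construct.MaximumInscribedCircle;
    import org.locationtech.jts.geom.Point;

    // Compute the MIC to within the given approximation tolerance.
    MaximumInscribedCircle mic = new MaximumInscribedCircle(polygon, tolerance);
    Point center = mic.getCenter();
    // The radius line runs from the centre to the nearest boundary point,
    // so its length is the MIC radius.
    double radius = mic.getRadiusLine().getLength();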

However, the real requirement is only to test whether the MIC radius is less than a given length value; the actual value of the radius is not required.  Generally, an algorithm that tests for a value can be faster and more robust than one that computes the value explicitly, and this is a perfect example of that principle.  By formulating the function as a test against a provided radius length, execution can be much faster, since computation can terminate as soon as it is known that the radius must be smaller or larger than the required length.  (Determining that the MIC radius is larger than the limit is easy: an internal point is found whose distance to the polygon boundary is greater than the limit.  The radius is known to be smaller than the limit once the grid size used to generate internal points is less than one-half the limit value.  If neither of those situations occurs, the approximation tolerance is used to terminate the computation.)  This has recently been implemented as the MaximumInscribedCircle.isRadiusWithin predicate.
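
Using the predicate is then a single call.  This sketch assumes the static form of the method, taking the polygonal geometry and the radius length to test against:

    // Test whether the MIC radius is less than the limit, without
    // computing its actual value (and without specifying a tolerance).
    double radiusLimit = 1.0;
    boolean isNarrow = MaximumInscribedCircle.isRadiusWithin(polygon, radiusLimit);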

A further benefit of the isRadiusWithin implementation is that it removes the need to specify the approximation tolerance.  The tolerance can be determined automatically, as a fraction of the specified radius limit.  This is significant, because selecting an appropriate distance tolerance for a Maximum Inscribed Circle computation is difficult, since it has no relationship to the size of the input geometry (i.e. a geometry may have a large extent, but still be very narrow.)  This is particularly challenging when a single tolerance must be chosen for a set of inputs of varying shape - precisely the use case for which this is being built.

Demonstration

As expected, evaluating the isRadiusWithin predicate is much faster than computing the actual value of the Maximum Inscribed Circle radius.  For a demonstration we use this dataset of world country borders (exploded to treat each island as a separate polygon).   This dataset has 8,393 polygons containing a total of 366,951 vertices.  

Computing the Maximum Inscribed Circle radius for each polygon, with a tolerance of 0.00001 degrees, takes 2.68 seconds.  

Evaluating the isRadiusWithin predicate, using a radius limit of 1 degree, executes in 0.034 seconds - almost 80 times faster.  


Usage in Coverage Cleaning

This function is used in the CoverageCleaner class to provide a fast test for narrow polygons (as specified by the user via the gapWidth parameter).  To see how this works on a real use case, consider a polygonal coverage dataset of communes in Departement 71 in France (Saône-et-Loire).  The dataset has 564 polygons, with 612,238 vertices.  As shown in a previous post on coverage cleaning, this dataset has highly non-aligned linework, which creates many gaps.  The gaps can be found by cleaning the coverage with a gap tolerance distance of 0, which reveals 28,427 gaps, with a total of 349,042 vertices.
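
Since a gap's width is its MIC diameter, the narrowness test simply checks whether the MIC radius is within half the gap width.  A sketch of filtering gap polygons this way (gaps is assumed to hold the detected gap polygons, and gapWidth the user's width tolerance):

    import java.util.ArrayList;
    import java.util.List;
    import org.locationtech.jts.algorithm.construct.MaximumInscribedCircle;
    import org.locationtech.jts.geom.Geometry;

    // A gap is narrow if its width (the MIC diameter) is less than gapWidth,
    // i.e. if the MIC radius is within gapWidth / 2.
    List<Geometry> narrowGaps = new ArrayList<>();
    for (Geometry gap : gaps) {
        if (MaximumInscribedCircle.isRadiusWithin(gap, gapWidth / 2.0)) {
            narrowGaps.add(gap);
        }
    }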

Departement 71 communes showing the many gaps present

Close-up of gaps (note scale)

Computing the Maximum Inscribed Circle radius explicitly for each gap takes 25.4 seconds.  In contrast, using isRadiusWithin with a radius limit of 0.5 takes only 1.35 seconds - 18 times faster.  Since the cleaning process itself takes only 4.9 seconds, the reduction in time for the gap merging phase is significant.  

Future Work

It would be very nice to extend the automatic tolerance setting to the Maximum Inscribed Circle computation.  More research is required to see whether this can be done.

Wednesday, 30 April 2025

Coverage Cleaning in JTS

The JTS Topology Suite has been rolling out the capability to manage polygonal coverages.  It models polygonal coverages as arrays of discrete polygonal geometries.  This is simple to work with, and allows using the full range of JTS algorithms.  Coverage topology allows operations such as CoverageSimplifier and CoverageUnion to be highly performant and effective.

The key requirement for coverage operations is that the input is a valid polygonal coverage.  JTS provides the CoverageValidator to check whether a dataset is a valid coverage, and to provide full information on the location of topology errors.  However, this leaves the question of how to convert a dataset with errors into a valid coverage.
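
Validation is a single call.  A minimal sketch (loadCoverage is a hypothetical loader; validate returns an array of linework marking the error locations for each element, with null entries where an element is valid):

    import org.locationtech.jts.coverage.CoverageValidator;
    import org.locationtech.jts.geom.Geometry;

    Geometry[] coverage = loadCoverage();  // hypothetical loader
    boolean isValid = CoverageValidator.isValid(coverage);
    // For an invalid coverage, obtain linework showing the error locations.
    Geometry[] invalidLines = CoverageValidator.validate(coverage);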

There are two options for fixing coverages: manual editing, and automated cleaning.  CoverageValidator provides information to identify the location of all errors.  However, manually fixing them is time-consuming, and difficult in environments which lack convenient editing tools.  And of course it simply isn't feasible in automated workflows.

What users really want is automated coverage cleaning.  This is a major pain point (witness the numerous questions about how to do this; for example, this and this and this).  Clearly, cleaning coverages is considered critical!

There are several existing tools to do this:

  • mapshaper is perhaps the most popular tool currently.  It supports snapping to improve dataset quality, along with overlap and gap merging.  It is written in JavaScript, which provides surprisingly good performance, but makes it awkward to integrate into other tools.
  • pprepair is an academically-advanced algorithm which uses triangulation as the basis for cleaning.  It relies on CGAL, which limits its appeal for integration.
  • GRASS v.clean is a long-standing cleaning tool with an extensive set of options for carrying out various kinds of cleaning.

But in the JTS/GEOS ecosystem it's been a gap that needs filling (so to speak).  So I'm excited to announce the arrival of the JTS CoverageCleaner.

JTS CoverageCleaner class

The JTS CoverageCleaner class provides the essential capabilities of cleaning polygonal data to form a valid coverage:

  • Snapping: snapping vertices and lines eliminates small discrepancies and narrow gaps, and improves the robustness of subsequent processing
  • Overlap Merging: areas covered by two or more polygons are merged with a neighbouring polygon.  The merge target can be chosen to be the neighbour with the longest shared border, the largest or smallest area, or the one with minimum input index (which allows priority-based merging).
  • Gap Detection and Merging: Gaps are defined as enclosed empty areas with width less than a distance tolerance.  Width is computed as the diameter of the Maximum Inscribed Circle of the gap polygon. (This is an improvement over other tools, which only offer cruder area or roundness-metric tests to identify gaps to be merged.) Gaps can be filled and merged with an adjacent polygon. 

The output of the cleaner is a valid polygonal coverage that contains no overlaps and no gaps narrower than the given tolerance.  The merged overlaps and gaps are also available for visualization and QA purposes.
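
Basic usage looks something like the sketch below.  (The method names here are assumptions based on the capabilities described above, not a definitive API reference - consult the JTS javadoc for the actual signatures.)

    import java.util.List;
    import org.locationtech.jts.coverage.CoverageCleaner;
    import org.locationtech.jts.geom.Geometry;

    // Clean a coverage using a snapping distance and a gap width tolerance.
    // (Hypothetical method names.)
    CoverageCleaner cleaner = new CoverageCleaner(coverage);
    cleaner.setSnappingDistance(snapDistance);
    cleaner.setGapWidth(gapWidth);
    cleaner.clean();
    Geometry[] cleanCoverage = cleaner.getResult();
    // The merged overlaps and gaps can be retrieved for visualization and QA.
    List<Geometry> overlaps = cleaner.getOverlaps();
    List<Geometry> mergedGaps = cleaner.getMergedGaps();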

Other aspects of cleaning data are provided by separate JTS classes.  Invalid polygonal geometries can be preprocessed with GeometryFixer to repair them.  CoverageSimplifier can be used to reduce the number of vertices in cleaned datasets.  And a forthcoming coverage precision reducer will allow controlling the amount of precision maintained (which can reduce the space required by some storage formats).

Algorithm Description

This is a good example of how the breadth of JTS capabilities makes it easier to develop new algorithms:

  • the SnappingNoder was developed for OverlayNG, and turned out to be ideal for eliminating minor errors and providing a robust basis for the rest of the process
  • the Polygonizer allows fast recreation of a coverage topology from the noded linework
  • Spatial indexes (STRtree and Quadtree) and the fast predicates of RelateNG provide performance for operating in the discrete geometric domain
  • the new fast MaximumInscribedCircle.isRadiusWithin predicate allows using polygon width for gap detection

The cleaning algorithm works as follows:

Input

The input is an array of valid polygonal geometries.  This is the same structure used in other coverage operations.  However, it may or may not have valid coverage topology.  The array position acts as an implicit identifier for the coverage elements.

Snapping and Noding

Input linework is extracted and noded using the SnappingNoder.  Snapping vertices and lines using a distance tolerance eliminates small discrepancies and narrow gaps, and improves the robustness of noding.  By default a very small snapping distance is used (based on the diameter of the dataset.)  The user can specify a custom snapping distance.  Larger distances can be more effective at cleaning the linework.  However, too large a distance can introduce unwanted alteration of the data.
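
A sketch of this step in isolation, using the SnappingNoder API (lines is assumed to hold the extracted linework, and snapDistance the snapping tolerance):

    import java.util.ArrayList;
    import java.util.Collection;
    import org.locationtech.jts.geom.Geometry;
    import org.locationtech.jts.noding.NodedSegmentString;
    import org.locationtech.jts.noding.SegmentString;
    import org.locationtech.jts.noding.snap.SnappingNoder;

    // Wrap each linework element as a segment string for noding.
    Collection<SegmentString> segStrings = new ArrayList<>();
    for (Geometry line : lines) {
        segStrings.add(new NodedSegmentString(line.getCoordinates(), null));
    }
    // Node the linework, snapping vertices and lines within the tolerance.
    SnappingNoder noder = new SnappingNoder(snapDistance);
    noder.computeNodes(segStrings);
    Collection<SegmentString> nodedLinework = noder.getNodedSubstrings();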

Polygonizing and Resultant Categorization

The noded linework is polygonized using the Polygonizer.  This forms a clean topology of resultant polygons.  Resultant polygons are associated with their parent input polygon(s) via their interior points and point-in-polygon tests.  Resultants are categorized according to their number of parents:

  • Result area: one parent (which identifies the associated input polygon)
  • Overlap: multiple parents
  • Gap: no parents
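
A simplified sketch of this step (nodedGeometry is assumed to hold the noded linework, and coverage the input array; the actual implementation uses spatial indexes and RelateNG rather than this brute-force point-in-polygon loop):

    import java.util.Collection;
    import org.locationtech.jts.geom.Geometry;
    import org.locationtech.jts.geom.Point;
    import org.locationtech.jts.geom.Polygon;
    import org.locationtech.jts.operation.polygonize.Polygonizer;

    // Build the resultant polygons from the noded linework.
    Polygonizer polygonizer = new Polygonizer();
    polygonizer.add(nodedGeometry);
    Collection<Polygon> resultants = polygonizer.getPolygons();

    for (Polygon resultant : resultants) {
        // Locate the resultant relative to the inputs via an interior point.
        Point pt = resultant.getInteriorPoint();
        int parentCount = 0;
        for (Geometry input : coverage) {
            if (input.contains(pt)) parentCount++;
        }
        // parentCount == 1: result area;  > 1: overlap;  == 0: gap
    }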

Overlap Merging

Overlaps are always merged, since they violate coverage topology.  They are merged with an adjacent result area chosen via a specified merge strategy.  The strategy indicates whether the merge target is chosen to be the neighbour with the longest shared border, the largest or smallest area, or the one with minimum input index (which allows priority-based merging).

Gap Detection and Merging

Gaps in the created coverage do not invalidate coverage topology, but they are often an artifact of invalidities, or may be errors in the original linework.  For this reason a distance tolerance can be used to select gaps to be merged.  Gaps which are narrower than the width tolerance are merged with the neighbour result area with longest shared border.  The width of the gap polygon is determined by the diameter of the MaximumInscribedCircle.  (This is an improvement over other tools, which only provide cruder area or roundness-metric tests to identify gaps.)

Identifying narrow gaps to be merged via Maximum Inscribed Circle radius

Output

The output is an array of the same size as the input array.  Each element is the cleaned version of the corresponding input element, and the entire array forms a valid polygonal coverage.  Due to snapping, very small inputs may disappear.  The output may still contain gaps larger than the gap width tolerance used.  (One way to detect these is to union the coverage - the result will have holes where gaps exist.)  Cleaning can be re-run with a larger tolerance to remove more gaps.
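
A sketch of the union check, assuming the coverage-array form of CoverageUnion (cleanCoverage is the cleaner output from the earlier sketch); the holes in the union are the remaining gaps:

    import org.locationtech.jts.coverage.CoverageUnion;
    import org.locationtech.jts.geom.Geometry;
    import org.locationtech.jts.geom.Polygon;

    // Union the cleaned coverage; interior rings mark the remaining gaps.
    Geometry union = CoverageUnion.union(cleanCoverage);
    for (int i = 0; i < union.getNumGeometries(); i++) {
        Polygon poly = (Polygon) union.getGeometryN(i);
        if (poly.getNumInteriorRing() > 0) {
            // gaps remain - re-run cleaning with a larger tolerance if desired
        }
    }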

Examples

Given the number of tools available to clean polygonal coverages, it's slightly surprising that you don't have to look hard to find datasets containing coverage topology errors.  But that's good for testing!  Here are some examples demonstrating the behaviour and performance of CoverageCleaner.

Example 1: Montana County Commissioner Districts

This is one of the jurisdiction coverages available here.  The dataset contains 177 areas with 470,229 vertices.  Running CoverageValidator finds errors at over 12,000 locations:

Zooming in on an area containing errors using the JTS TestBuilder Magnify Topology tool shows that errors are caused by small misalignments in the linework of adjacent polygons, creating both gaps and overlaps:

Running the Coverage Cleaner with the default snapping tolerance and a gap width of 10 produces a clean coverage in 0.88 seconds.  Snapping removes 2,043 overlaps and 1,793 gaps.  After topology building, merging removes a further 14 overlaps and 12 gaps.  (This is determined by running the cleaning process with both the default snapping tolerance and a tolerance of 0, and counting the merged overlap and gap polygons available from the cleaner instance.)

Example 2: Communes in France Department 71

This example is a dataset representing communes in the French department 71 (Saône-et-Loire).  It may have been assembled by aggregating cadastral parcels, which is a common cause of discrepancies due to different data origins.  The dataset contains 564 areas with 612,238 vertices.  Validating the coverage reveals over 362,000 error locations (and in fact this doesn't represent all of the gaps present).

Zooming in shows some example overlaps and gaps.  Magnify Topology is not needed, since the discrepancies are on the order of metres in size.

Running the Coverage Cleaner with the default snapping tolerance and a gap width of 50 produces a clean coverage in 23 seconds.  The slower performance compared to Example 1 is due to the much larger number of gaps and overlaps: 29,307 overlaps and 28,409 gaps.  Almost all of them are too wide to be handled by snapping, so they must be merged, which requires more processing.

Here's a before-and-after view of an area of the cleaned coverage, showing the effect of merging gaps and overlaps:



Gap Merging Quality Issues

Given the size and number of issues in the France-71-communes dataset, it's not surprising that it exhibits two known problems with the current gap merging strategy: spikes created by merged gaps, and unmerged narrow gap parts.

Here is an example of how merging a gap to an adjacent polygon can create a spike in the result polygon.  (Merging overlaps does not produce these kinds of errors, since the overlap was part of the merge target originally.)

This is an example of a wide gap which is not merged, but which has narrow portions which could be "clipped off" and merged:

Improving the quality of gap merging is an area for further research and development.

GEOS and PostGIS Ports

As usual, this work will be ported to GEOS soon, and then exposed in PostGIS as a new coverage function.  It should also show up in downstream projects such as Shapely and hopefully QGIS.

Further Enhancements

Overlap and Gap Handling

There are other possibilities for controlling overlap merging and gap filling.  Merge targets could be chosen based on a priority value assigned to each input area.  Gaps could be identified by area or roundness-measure criteria, as provided in other systems (although these seem less precise than using polygon width.)

Gap Partitioning and Spike Removal 

Gaps which are very long can extend along multiple polygons.  If a gap of this nature is merged with a single polygon it will create a spike in the merged polygon.  Instead, it would be better to partition the gap into multiple pieces at polygon nodes and merge the parts separately.

A more complex situation is when a gap contains both narrow and wide portions.  An example of this is a parcel coverage with blocks separated by narrow streets and wider intersections and squares.  Ideally the narrow parts could be partitioned off and merged, leaving the wide portions unmerged.  

Gore Removal

Gores are narrow "inlets" along the border of a coverage.  They have a similar shape to gaps, but since they are not closed they are not identified as gaps by the cleaning process.  They could be identified by a separate process, closed off, and then merged (ideally with partitioning, as discussed above).

Coverage Precision Reduction

A common requirement in data management is to reduce the precision of data coordinates.  For coverages this cannot be done piecewise, but must be performed as a global operation over the entire coverage, since rounding coordinates independently can create line intersections.  The SnapRoundingNoder provides an effective way of doing this.
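
A sketch of how this global step might look (segStrings is assumed to hold the coverage linework as segment strings, as in the noding step above):

    import java.util.Collection;
    import org.locationtech.jts.geom.PrecisionModel;
    import org.locationtech.jts.noding.SegmentString;
    import org.locationtech.jts.noding.snapround.SnapRoundingNoder;

    // Round all coordinates to a grid of size 0.001, inserting nodes at any
    // intersections that the rounding creates.
    PrecisionModel precModel = new PrecisionModel(1000);
    SnapRoundingNoder noder = new SnapRoundingNoder(precModel);
    noder.computeNodes(segStrings);
    Collection<SegmentString> roundedLinework = noder.getNodedSubstrings();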