Wednesday, 26 May 2021

JTS IsSimple gets simpler (and faster)

Hard to believe that the JTS Topology Suite is almost 20 years old.  That's 140 in dog years!  Despite what they say about old dogs, one of the benefits of longevity is that you have the opportunity to learn a trick or two along the way.  One of the key lessons learned after the initial release of JTS is that intersection (node) detection is a fundamental part of many spatial algorithms, and critical in terms of performance.  This resulted in the development of the noding package to provide an API supporting many different kinds of intersection detection and insertion.

Prior to this, intersection detection was performed as part of the GeometryGraph framework, which combined it with topology graph formation and analysis.  At the time this seemed like an elegant way to maximize code reuse across many JTS operations, including overlay, buffering, spatial predicates and validation.  But as is often the case, there are significant costs to such general-purpose code:

  • The overall codebase is substantially more complex
  • A performance penalty is imposed on algorithms which don't require topology construction
  • Algorithms are harder to read and understand
  • The code is brittle, and thus hard to modify
  • Porting the code is more difficult

Because of this, a focus of JTS development is to free operations from their dependency on GeometryGraph - with the ultimate goal of expunging it from the JTS codebase.  A major step along this road was the rewrite of the overlay operations.

Another operation that relies on GeometryGraph is the IsSimpleOp class, which implements the OGC Simple Features isSimple predicate.  The algorithm for isSimple essentially involves determining if the geometry linework contains a self-intersection.  GeometryGraph is unnecessarily complex for this particular task, since there is no need to compute the entire topology graph in order to find a single self-intersection.  Reworking the code to use the MCIndexNoder class in the noding API produces a much simpler and more performant implementation. I also took the opportunity to move the code to the operation.valid package, since the operations of isSimple and isValid are somewhat complementary.
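The core test can be sketched apart from JTS. The Python below is illustrative only: it uses the brute-force O(n²) pair scan that MCIndexNoder avoids (by indexing monotone chains), but the predicate evaluated for each segment pair is essentially the same one. Collinear overlaps are not handled here; they need the full noding treatment.

```python
# Sketch (not the JTS implementation): a brute-force version of the
# self-intersection test behind isSimple.

def _segments_cross(p1, p2, p3, p4):
    """True if segments p1-p2 and p3-p4 intersect at a point
    that is not a shared endpoint."""
    def orient(a, b, c):
        v = (b[0]-a[0])*(c[1]-a[1]) - (b[1]-a[1])*(c[0]-a[0])
        return (v > 0) - (v < 0)
    # Shared endpoints are allowed (adjacent segments, closed rings).
    if p1 in (p3, p4) or p2 in (p3, p4):
        return False
    d1, d2 = orient(p3, p4, p1), orient(p3, p4, p2)
    d3, d4 = orient(p1, p2, p3), orient(p1, p2, p4)
    return d1 != d2 and d3 != d4

def is_simple_line(coords):
    """True if the line through coords has no self-intersection."""
    segs = list(zip(coords, coords[1:]))
    for i in range(len(segs)):
        for j in range(i + 1, len(segs)):
            if _segments_cross(*segs[i], *segs[j]):
                return False
    return True
```

MCIndexNoder replaces the nested loop with an index lookup, which is where the 20-year-old lesson about intersection detection pays off.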

Now, isSimple is probably the least-used OGC operation.  Its only real use is to test for self-intersections in lines or collections of lines, and that is not a critical issue for many workflows.  However, there is one situation where it is quite useful: testing that linear network datasets are "vector-clean" - i.e. contain LineStrings which touch only at their endpoints.

A linear network containing non-simple intersections (isSimple == false)
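As a sketch of what "vector-clean" means, here is a partial check (a hypothetical helper, not a JTS API): it verifies that every vertex shared between two LineStrings is an endpoint of both. Crossings that don't fall on a vertex are not caught by this shortcut; those need the full isSimple test.

```python
# Partial "vector-clean" check: shared vertices must be endpoints of
# every line that uses them.
from collections import defaultdict

def vector_clean_at_vertices(lines):
    roles = defaultdict(set)   # point -> roles it plays ("end"/"interior")
    owners = defaultdict(set)  # point -> indices of lines using it
    for i, line in enumerate(lines):
        for j, p in enumerate(line):
            owners[p].add(i)
            roles[p].add("end" if j in (0, len(line) - 1) else "interior")
    # Clean iff every point shared by two or more lines is endpoint-only.
    return all("interior" not in roles[p]
               for p in owners if len(owners[p]) > 1)
```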

To demonstrate the performance improvement, I'll use data for Rivers of the US maintained by the US National Weather Service.  It is supplied as two datasets: a full dataset of all rivers, and a subset of major rivers only.  You might expect a hydrographic network to be "vector-clean", but in fact both of these datasets contain numerous instances of self-intersections and coincident linework.

Here are the results of running the isSimple predicate on the datasets.  On the larger dataset the new implementation provides a 15x performance boost!

Dataset                  New time   Old time
Subset  (909,865 pts)    0.25 s     1 s
Full   (5,212,102 pts)   2 s        30 s

Finding Non-Simple Locations

The new codebase made it easy to add a functionality enhancement that computes the locations of all places where lines self-intersect.  This can be used for visual confirmation that the operation is working as expected, and to indicate places where data quality needs to be improved.  Here are the non-simple intersection points found in the river network subset:

Closeups of some non-simple intersection locations:

IsSimpleOp is the easiest algorithm to convert over from using GeometryGraph.  As such it serves as a good proof-of-viability, and establishes useful code patterns for further conversions. 

Next up is to give IsValidOp the same treatment.  This should provide similar benefits of simplicity and performance.  And as always, porting the improved code to GEOS.

Monday, 3 May 2021

Fixing Invalid Geometry with JTS

TLDR: JTS can now fix invalid geometry!

The JTS Topology Suite implements the Geometry model defined in the OGC Simple Features specification.  An important part of the specification is the definition of what constitutes valid geometry.  These are defined by rules about the structural and geometric characteristics of geometry objects.  Some validity rules apply to all geometry; e.g. vertices must be defined by coordinates with finite numeric values (so that NaN and Inf ordinates are not valid).  In addition, each geometric subtype (Point, LineString, LinearRing, Polygon, and Multi-geometry collections) has its own specific rules for validity.
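The finite-coordinate rule is simple enough to state directly in code. This is an illustrative check, not JTS code:

```python
# A vertex with a NaN or Inf ordinate makes a geometry invalid.
import math

def has_valid_coordinates(coords):
    return all(math.isfinite(x) and math.isfinite(y) for x, y in coords)
```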

The rules for Polygons and MultiPolygons are by far the most restrictive.  They include the following constraints:

  1. Polygon rings must not self-intersect
  2. Rings may touch at only a finite number of points, and must not cross
  3. A Polygon interior must be connected (i.e. holes must not split a polygon into two parts)
  4. MultiPolygon elements may touch at only a finite number of points, and must not overlap

These rules guarantee that:

  • a given area is represented unambiguously by a polygonal geometry
  • algorithms operating on polygonal geometry can make assumptions which provide simpler implementation and more efficient processing

Valid polygonal geometry is well-behaved

Given the highly-constrained definition of polygonal validity, it is not uncommon that real-world datasets contain polygons which do not satisfy all the rules, and hence are invalid. This occurs for various reasons:

  • Data is captured using tools which do not check validity, or which use a looser or different definition than the OGC standard
  • Data is imported from systems with different polygonal models
  • Data is erroneous or inaccurate 

Because of this, JTS does not enforce validity on geometry creation, apart from a few simple structural constraints (such as rings having identical first and last points).  This allows invalid geometry to be represented as JTS geometry objects, and processed using JTS code.  Some kinds of spatial algorithms can execute correctly on invalid geometry (e.g. determining the convex hull).  But most algorithms require valid input in order to ensure correct results (e.g. the spatial predicates) or to avoid throwing exceptions (e.g. overlay operations).  So the main reason for representing invalid geometry is to allow validity to be tested, to take appropriate action on failure.

Often users would like "appropriate action" to be Just Make It Work.  This requires converting invalid geometry to be valid.  Many spatial systems provide a way to do this.

But this has been a conspicuous gap in the JTS API.  While it is possible to test for validity, there has never been a way to fix an invalid geometry.  To be fair, JTS has always had an unofficial way to make polygonal geometry valid.  This is the well-known trick of computing geometry.buffer(0), which creates a valid output which is often a good match to the input. This has worked as a stop-gap for years (in spite of an issue which caused some problems, now fixed - see the post Fixing Buffer for fixing Polygons). However, using buffer(0) on self-intersecting "figure-8" polygons produces a "lossy" result.  Specifically, it retains only the largest lobes of the input linework.  This is undesirable for some uses (although it is advantageous in other situations, such as trimming off small self-intersections after polygon simplification).

Buffer(0) of Figure-8 is lossy 

So, it's about time that JTS stepped up to provide a supported, guaranteed way of fixing invalid geometry. This should handle all geometry, although polygonal geometry repair is the most critical requirement.

This raises the question of what exactly the semantics of repairing polygons should be.  While validity is well-specified, there are no limits to the complexity of invalid polygons, and a variety of possible approaches to fixing them.  The most significant decision is how to determine the interior and exterior of a polygonal geometry with self-intersections or overlaps. (This is the classic "bow-tie" or "figure-8" - although self-intersecting polygons can be far more complex.) The question comes down to whether the geometry linework or structure is used to determine interior areas.  

If linework is used, validity is created by noding the constituent linework to form a topologically-valid coverage.  This coverage is then scanned with an alternating even-odd strategy to assign areas as interior or exterior.  This may result in adjacent areas with the same classification, in which case they are merged.
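The even-odd rule itself is easy to state in code. This sketch is illustrative only, not the repair algorithm: the linework-based repair applies the parity rule to whole faces of the noded arrangement, whereas this version classifies a single point as interior if a ray from it crosses the linework an odd number of times.

```python
# Even-odd (crossing-number) classification of a point against linework.

def even_odd_inside(pt, rings):
    x, y = pt
    crossings = 0
    for ring in rings:
        for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
            # count edges crossing the horizontal ray to the right of pt
            if (y1 > y) != (y2 > y):
                x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
                if x_cross > x:
                    crossings += 1
    return crossings % 2 == 1
```

With an outer ring and a hole ring, a point inside the hole picks up two crossings and so is classified exterior, which is exactly the alternating behaviour described above.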

Alternatively, the structure of the polygonal geometry can be taken as determinative. The shell and hole rings are assumed to accurately specify the nature of the area they enclose (interior or exterior).  Likewise, the (potentially overlapping or adjacent) elements of a MultiPolygon are assumed to enclose interior area.  The repair operation processes each ring and polygon separately.  Holes are subtracted from shells.  Finally, if required the repaired polygons are unioned to form the valid result.  

PostGIS MakeValid and the ESRI Java Geometry API makeSimple both use the linework approach.  However, for some relatively simple invalid geometries this produces results which seem overly complex.

Complex output from ST_MakeValid

For this reason I have implemented the structure-based approach in JTS.  It provides results that are closer to the existing buffer(0) technique (and conveniently allows using the existing buffer code).  This made it a relatively simple matter to implement the repair algorithm as the GeometryFixer class.

Here are some examples of how GeometryFixer works.  First, the example above, showing the (arguably) simpler result that arises from using the structure information:

Figure-8s are handled as desired (keeping all area):

Self-overlapping shells have all interior area preserved:

Of course, GeometryFixer also handles simple fixes for all geometry types, such as removing invalid coordinates.

One further design decision is how to handle geometries which are invalid due to collapse (e.g. a line with a single point, or a ring which has only two unique vertices).  GeometryFixer provides an option to either remove collapses, or to return them as equivalent lower dimensional geometry. 
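The two collapse policies can be illustrated with a toy helper (hypothetical, not GeometryFixer's actual API):

```python
# A line with fewer than 2 distinct points has collapsed.  The option
# chooses between returning the lower-dimensional equivalent (a point)
# and removing the collapse entirely.

def fix_collapsed_line(coords, keep_collapsed=True):
    distinct = list(dict.fromkeys(coords))  # preserve order, drop repeats
    if len(distinct) >= 2:
        return ("LINESTRING", coords)       # not collapsed
    if keep_collapsed and distinct:
        return ("POINT", distinct[0])       # collapse to lower dimension
    return None                             # remove the collapse
```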

To see the full range of effects of GeometryFixer, the JTS TestBuilder can be used to view and run GeometryFixer on the set of 75 test cases for invalid polygonal geometry in the file TestInvalidA.xml.

It's been a long time coming, but finally JTS can function as a full-service repair shop for geometry, no matter how mangled it might be.


Wednesday, 28 April 2021

JTS goes to Mars!

The other day this badge popped up on my GitHub profile:

Sure enough, there is JTS in the list of dependencies provided by the Jet Propulsion Laboratory (JPL). 

So take that, Elon - JTS got there first!  With a lot of help from NASA, of course. 

Oh how nice, the Earthlings are giving me some target practice!

Tuesday, 2 February 2021

OverlayNG and Invalid Geometry

A recent blog post by Elephant Tamer gives a critical appraisal of the improvements to overlay processing shipped in PostGIS 3.1 with GEOS 3.9.  The author is disappointed that PostGIS still reports errors when overlay is used on invalid geometry.  However, this is based on a misunderstanding of the technology. 

GEOS 3.9 includes OverlayNG, ported from the JTS Topology Suite. It brings a major advance in overlay robustness, along with other improvements (described here and here).  Previously, robustness limitations in the overlay algorithm could sometimes cause errors even for inputs which were topologically valid.  This was doubly problematic because there was no fully effective way to process the input geometries to avoid the errors.  Now, OverlayNG solves this problem completely.  Valid inputs will always produce a valid and essentially correct(*) output.

(*) "Essentially" correct, because in order to achieve full robustness a snapping heuristic may be applied to the input geometry.  However, this is done with a very fine tolerance, so should not appreciably alter the output from the theoretically correct value.

But for invalid inputs, OverlayNG will still report errors.  The reason is that there is a wide variety of gruesome ways in which geometry can be invalid.  Automated handling of invalidity would involve expensive extra processing, and also require making assumptions about what area a corrupt geometry is intended to represent.  Rather than silently repairing invalid geometry and returning potentially incorrect results, the design decision is to report this situation as an error.

In fact, OverlayNG is able to handle "mildly" invalid polygons, as described in this post. This covers situations which are technically invalid according to the OGC SFS specification, but which still have well-defined topology.  This includes self-touching rings (sometimes called "inverted polygons" or "exverted holes"), and zero-width gores and spikes.

Taking a detailed look at the data used in the blog post, we can see these improvements at work.  The dataset is the ecology polygons obtained from the GDOS WFS server.  This contains 7662 geometries, of which 10 are invalid.  Using the old overlay algorithm, 9 of these invalid polygons cause TopologyException errors.  Using OverlayNG, only 4 of them cause errors. 

The polygons that can now be processed successfully are typical "OGC-invalid" situations, which do not materially affect the polygonal topology.  These include self-touching rings with pinch points:

and zero-width gores:

Of the cases that still cause errors, two are classic small bow-tie errors:

And two are wildly invalid self-crossing rings:

The last two are good examples of geometry which is so invalid that it is impossible to unambiguously decide what area is represented (although ST_MakeValid will happily grind them into something that is technically valid).

Ultimately it is the user's responsibility to ensure that geometries to be processed by overlay (and many other PostGIS functions) have valid topology (as reported by ST_IsValid).  Ideally this is done by correcting the data at source.  But it can also be done a posteriori in the database itself, by either the ST_MakeValid function, or the well-known buffer(0) trick. (Which to use is a topic for another blog post...)

One improvement that could be made is to check for input validity when OverlayNG throws an error.  Then PostGIS can report definitively that an overlay error is caused by invalid input.  If there is an overlay error that is not caused by invalidity, the PostGIS team wants to hear about it! 

And perhaps there is a case to be made for repairing invalid geometry automatically, even if the repair is suspect.  Possibly this could be invoked via a flag parameter on the overlay functions. More research is required - feedback is welcome!

Wednesday, 13 January 2021

Using the GEOS geosop CLI

In a previous post I announced the new geosop command-line interface (CLI) for the GEOS geometry API.  This post provides examples of using geosop for various tasks. (Refer to the README for more information about the various options used.)

API Testing

geosop makes testing the GEOS library much easier.  Previously, testing the behaviour of the API usually required breaking out the C compiler (and updating autotools and cmake build files, and deciding whether to commit the new code for later use or throw it away, etc, etc).  Now testing is often just a matter of invoking a geosop operation on appropriate data, or at worst adding a few lines of code to the existing framework.

For example, there is a long-standing issue with how GEOS handles number formatting in WKT output.  There are recent bug reports about this in GeoSwift and the Julia LibGEOS.  geosop makes it easy to run the test cases and see the less-than-desirable output:

geosop -a "POINT (654321.12 0.12)" -f wkt
POINT (654321.1199999999953434 0.1200000000000000)

geosop -a "POINT (-0.4225977234 46.3406448)" -f wkt
POINT (-0.4225977234000000 46.3406447999999997)

There's also an issue with precision handling.  To test this we added a --precision parameter to geosop. (This is the kind of rapid development enabled by having the CLI codebase co-resident with the API.)

geosop -a "POINT (654321.126 0.126)" --precision 2 -f wkt
POINT (6.5e+05 0.13)

Again we see undesirable behaviour. Using scientific notation for small numbers is unnecessary and difficult to read. And the precision value determines the number of significant digits, not the number of decimal places as intended by the GEOS WKTWriter.setRoundingPrecision API.
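The distinction is easy to reproduce with plain format strings (Python here, but C's printf behaves the same way): "g"-style formatting counts significant digits and may switch to scientific notation, while "f"-style formatting counts decimal places, which is the behaviour setRoundingPrecision is intended to have.

```python
# "g" = significant digits (may use scientific notation);
# "f" = decimal places.
x, y = 654321.126, 0.126
print(f"{x:.2g} {y:.2g}")   # 2 significant digits -> 6.5e+05 0.13
print(f"{x:.2f} {y:.2f}")   # 2 decimal places     -> 654321.13 0.13
```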

These were all caused by using standard C/C++ numeric formatting, which is surprisingly limited and non-useful. After some fine work by Paul Ramsey to integrate the much better Ryu library, GEOS now has WKT output that is sensible and handles precision in a useful way.

By default, WKTWriter now nicely round-trips WKT text:

geosop -a "POINT (654321.12 0.12)" -f wkt
POINT (654321.12 0.12)

geosop -a "POINT (-0.4225977234 46.3406448)" -f wkt
POINT (-0.4225977234 46.3406448)

If WKTWriter.setRoundingPrecision or GEOSWKTWriter_setRoundingPrecision is called, the precision value applies to the decimal part of the number: 

geosop -a "POINT (654321.1234567 0.126)" --precision 0 -f wkt
POINT (654321 0)

geosop -a "POINT (654321.1234567 0.126)" --precision 2 -f wkt
POINT (654321.12 0.13)

geosop -a "POINT (654321.1234567 0.126)" --precision 4 -f wkt
POINT (654321.1235 0.126)

geosop -a "POINT (654321.1234567 0.126)" --precision 6 -f wkt
POINT (654321.123457 0.126)

Performance Testing

A key use case for geosop is to provide easy performance testing.  Performance of geometric operations is highly data-dependent.  It's useful to be able to run operations over different datasets and measure performance.  This allows detecting performance hotspots, and confirming the efficiency of algorithms.

As a simple example of performance testing, for many years GEOS has provided optimized spatial predicates using the concept of a prepared geometry.  Prepared geometry uses cached spatial indexes to dramatically improve performance for repeated spatial operations against a geometry.  Here is a performance comparison of the intersects spatial predicate in its basic and prepared form. 

geosop -a world.wkt -b world.wkt -t intersects
Ran 59,536 intersects ops ( 179,072,088 vertices)
  -- 16,726,188 usec    (GEOS 3.10.0dev)

geosop -a world.wkt -b world.wkt -t intersectsPrep
Ran 59,536 intersectsPrep ops ( 179,072,088 vertices)
  -- 1,278,348 usec    (GEOS 3.10.0dev)

The example of testing the world countries dataset against itself is artificial, but it shows off the dramatic 13x performance boost provided by using a prepared operation. 
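The prepared-geometry idea - pay an indexing cost once, then reuse it across many predicate calls - can be sketched in a few lines. This is illustrative Python, not GEOS's implementation: PreparedGeometry caches real edge indexes, whereas here the "index" is just a cached bounding box used to short-circuit an O(n) point-in-polygon test.

```python
class PreparedRing:
    """Caches the ring's bounding box so repeated contains() calls can
    reject most points without scanning every edge."""
    def __init__(self, ring):
        self.ring = ring
        xs = [p[0] for p in ring]
        ys = [p[1] for p in ring]
        self.bbox = (min(xs), min(ys), max(xs), max(ys))

    def contains(self, pt):
        x, y = pt
        x0, y0, x1, y1 = self.bbox
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            return False                  # fast rejection from the cache
        return self._ray_cross(pt)        # full test only when needed

    def _ray_cross(self, pt):
        x, y = pt
        inside = False
        for (ax, ay), (bx, by) in zip(self.ring, self.ring[1:] + self.ring[:1]):
            if (ay > y) != (by > y) and x < ax + (y - ay) * (bx - ax) / (by - ay):
                inside = not inside
        return inside
```

The payoff comes when one geometry is tested against thousands of others, exactly the cross-join workload geosop runs above.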

Another interesting use is longitudinal testing of GEOS performance across different versions of the library.  For instance, here's a comparison of the performance of interiorPoint between GEOS 3.7 and 3.8.  The interiorPoint algorithm in GEOS 3.7 relied on overlay operations, which made it slow and sensitive to geometry invalidity. GEOS 3.8 included a major improvement which greatly improved the performance, and made the algorithm more robust.  Note that the dataset being used contains some (mildly) invalid geometries towards its end, which produces an error in GEOS 3.7 if the entire dataset is run.  The  --alimit 3800 option limits the number of geometries processed to avoid this issue.

GEOS 3.7

geosop -a ne_10m_admin_1_states_provinces.wkb --alimit 3800 -t interiorPoint
Ran 3,800 operations ( 1,154,703 vertices)  -- 2,452,540 usec  (GEOS 3.7)

GEOS 3.8

geosop -a ne_10m_admin_1_states_provinces.wkb --alimit 3800 -t interiorPoint
Ran 3,800 operations ( 1,154,703 vertices)  -- 35,665 usec  (GEOS 3.8)

The dramatic improvement in interiorPoint performance is clearly visible.


Geoprocessing

The raison d'être of GEOS is to carry out geoprocessing.  Most users will likely do this using one of the numerous languages, applications and databases that include GEOS.  But it is still useful, convenient, and perhaps more performant to be able to process geometric data natively in GEOS.

The design of geosop enables more complex geoprocessing via chaining operations together using shell pipes.  For example, here is a process which creates a Voronoi diagram of some points located in the British Isles, and then clips the Voronoi polygons to the outline of the islands.  This also shows a few more capabilities of geosop:
  • input can be supplied as WKT (or WKB) geometry literals on the command-line
  • input can be read from the standard input (here as WKB)
  • the output data is sent to the standard output, so can be directed into a file

geosop -a "MULTIPOINT ((1342 1227.5), (1312 1246.5), (1330 1270), (1316.5 1306.5), (1301 1323), (1298.5 1356), (1247.5 1288.5), (1237 1260))" 
  -f wkb voronoi 
| geosop -a stdin.wkb -b uk.wkt -f wkt intersection 
> uk-vor.wkt

Tuesday, 12 January 2021

Introducing geosop - a CLI for GEOS

The GEOS geometry API is used by many, many projects to do their heavy geometric lifting.  But GEOS has always had a bit of a PR problem.  Most of those projects provide a more accessible interface to perform GEOS operations.  Some offer a high-level language like Python, R, or SQL (and these typically come with a REPL to make things even easier).  Or there are GUIs like QGIS, or a command-line interface (CLI) like GDAL/OGR.  

But you can't do much with GEOS on its own.  It is a C/C++ library, and to use it you need to break out the compiler and start cutting code. It's essentially "headless".  Even for GEOS developers, writing an entire C program just to try out a geometry operation on a dataset is painful, to say the least.  

There is the GEOS XMLTester utility, of course.  It processes carefully structured XML files, but that is hardly convenient.  (And in case this brings to mind a snide comment like "2001 called and wants its file format back", XML actually works very well in JTS and GEOS as a portable and readable format for geometry tests.  But I digress.)

JTS (on which GEOS is based) has the TestBuilder GUI, which works well for testing out and visualizing the results of JTS operations.  JTS also has a CLI called JtsOp. Writing a GUI for GEOS would be a tall order.  But a command-line interface (CLI) is much simpler to code, and has significant utility.  In fact there is an interesting project called geos-cli that provides a simple CLI for GEOS.  But it's ideal to have the CLI code as part of the GEOS project, since that keeps it up-to-date with the library code and makes it easy to add operations to test new functionality.

This need has led to the development of geosop.  It is a CLI for GEOS which performs a range of useful tasks:

  • Run GEOS operations to confirm their semantics
  • Test the behaviour of GEOS on specific geometric data
  • Time the performance of operation execution
  • Profile GEOS code to find hotspots
  • Check memory usage characteristics of GEOS code
  • Generate spatial data for use in visualization or testing
  • Convert datasets between WKT and WKB

geosop has the following capabilities:

  • Read WKT and WKB from files, standard input, or command-line literals
  • Execute GEOS operations on the list(s) of input geometries.  Binary operations are executed on every pair of input geometries (i.e. the cross join aka Cartesian product)
  • Output geometry results in WKT or WKB (or text, for non-geometric results)
  • Display the execution time of data input and operations
  • Display a full log of the command processing

Here's a look at how it works.

geosop -h gives a list of the options and operations available:

    geosop - GEOS v. 3.10.0dev
    Executes GEOS geometry operations
      geosop [OPTION...] opName opArg

      -a arg               source for A geometries (WKT, WKB, file, stdin,
      -b arg               source for B geometries (WKT, WKB, file, stdin,
          --alimit arg     Limit number of A geometries read
      -c, --collect        Collect input into single geometry
      -e, --explode        Explode result
      -f, --format arg     Output format
      -h, --help           Print help
      -p, --precision arg  Sets number of decimal places in WKT output
      -r, --repeat arg     Repeat operation N times
      -t, --time           Print execution time
      -v, --verbose        Verbose output

      area A - computes area for geometry A
      boundary A - computes boundary for geometry A
      buffer A N - computes the buffer of geometry A
      centroid A - computes centroid for geometry A
      contains A B - tests if geometry A contains geometry B
      containsPrep A B - tests if geometry A contains geometry B, using PreparedGeometry
      containsProperlyPrep A B - tests if geometry A properly contains geometry B using PreparedGeometry
      convexHull A - computes convexHull for geometry A
      copy A - computes copy for geometry A
      covers A B - tests if geometry A covers geometry B
      coversPrep A B - tests if geometry A covers geometry B using PreparedGeometry
      difference A B - computes difference of geometry A from B
      differenceSR A B - computes difference of geometry A from B rounding to a precision scale factor
      distance A B - computes distance between geometry A and B
      distancePrep A B - computes distance between geometry A and B using PreparedGeometry
      envelope A - computes envelope for geometry A
      interiorPoint A - computes interiorPoint for geometry A
      intersection A B - computes intersection of geometry A and B
      intersectionSR A B - computes intersection of geometry A and B
      intersects A B - tests if geometry A and B intersect
      intersectsPrep A B - tests if geometry A intersects B using PreparedGeometry
      isValid A - tests if geometry A is valid
      length A - computes length for geometry A
      makeValid A - computes makeValid for geometry A
      nearestPoints A B - computes nearest points of geometry A and B
      nearestPointsPrep A B - computes nearest points of geometry A and B using PreparedGeometry
      polygonize A - computes polygonize for geometry A
      reducePrecision A N - reduces precision of geometry to a precision scale factor
      relate A B - computes DE-9IM matrix for geometry A and B
      symDifference A B - computes symmetric difference of geometry A and B
      symDifferenceSR A B - computes symmetric difference of geometry A and B
      unaryUnion A - computes unaryUnion for geometry A
      union A B - computes union of geometry A and B
      unionSR A B - computes union of geometry A and B

Most GEOS operations are provided, and the list will be completed soon.

Some examples of using geosop are below.

  • Compute the interior point for each country in a world polygons dataset, and output them as WKT:

geosop -a world.wkt -f wkt interiorPoint

  • Determine the time required to compute buffers of distance 1 for each country in the world:

geosop -a world.wkt --time buffer 1

  • Compute the union of all countries in Europe:

geosop -a europe.wkb --collect -f wkb unaryUnion

The README gives many more examples of how to use the various command-line options.  In a subsequent post I'll give some demonstrations of using geosop for various tasks including GEOS testing, performance tuning, and geoprocessing.

Future Work

There's potential to make geosop even more useful:

  • GeoJSON is a popular format for use in spatial toolchains.  Adding GeoJSON reading and writing would allow geosop to be more widely used for geo-processing.
  • Adding SVG output would provide a way to visualize the results of GEOS operations.
  • Improve support for performance testing by adding operations to generate various kinds of standard test datasets (such as point grids, polygon grids, and random point fields).

And of course, work will be ongoing to keep geosop up-to-date as new operations and functionality are added to GEOS.

Wednesday, 23 December 2020

Fixing Buffer for fixing Polygons

The OGC Simple Features specification implemented by JTS has strict rules about what constitutes a valid polygonal geometry.  These include:

  • Polygon rings must be simple; i.e. they may not touch or cross themselves
  • MultiPolygon elements may not overlap or touch at more than a finite number of points (i.e. they may not intersect along an edge)

These rules were chosen for good reason. They ensure that OGC-valid polygonal geometry is the simplest possible representation of an enclosed area.  This greatly simplifies the evaluation of most operations on polygonal geometry, which leads to improved performance.  JTS operations generally require input which is valid according to OGC rules.  And they always (with some rare exceptions) emit result geometry which is OGC-valid.

But data in the wild is often not this well-behaved.  This creates the need to "clean" or "make valid" polygonal geometry in order to carry out operations on it.  Shortly after JTS was first released we discovered a useful trick: constructing a zero-width buffer via geom.buffer(0) converts an invalid polygonal geometry into a valid one.  It can also be used as a simple way of converting "inverted" polygon topology (ESRI-style) into valid OGC topology.  The reason this works is that the buffer algorithm inherently has to handle overlaps and self-intersections since they often occur during the generation of raw buffer offset curves.  The algorithm nodes self-intersections, merges overlaps, and creates new polygons or holes if necessary.

A polygon with many invalidities: overlap, self-touch at a point and along a line, and a "bow-tie".

The polygon fixed by using buffer(0)

Note that the bow-tie portion on the right is considered to lie in the exterior of the polygon due to ring orientation, and thus is removed.

In the 20 years since the release of JTS (and its derivative GEOS) this trick has passed into the lore of open-source spatial data processing. It has become a recommended technique for fixing invalid polygonal geometry in numerous projects, such as PostGIS, Shapely, RGeo, GeoTools, R-sf and QGIS. It's also used internally in JTS, in algorithms such as DouglasPeuckerSimplifier, VWSimplifier, and Densifier which might otherwise produce invalid polygonal results.

BUT - there's a nasty little surprise lying in wait for users of buffer(0)! It doesn't always work.  It turns out that the buffer algorithm has a serious flaw: in some situations involving invalid "bow-tie" topology it will discard a large part of the input geometry.  This has been reported in quite a few issues (here and here in JTS, and also in GEOS and Shapely).

Result of running DouglasPeuckerSimplifier on a polygon with a bow-tie. (See issue)

Close-up of the result - clearly undesirable.

The problem occurs because the buffer algorithm computes the orientation of rings in order to build the buffer offset curve (in the case of a zero-width buffer this is just the original ring linework). Currently the Orientation.isCCW test is used to do this. This uses an efficient algorithm that determines ring orientation by checking the line segments incident on the uppermost vertex of the ring (see Wikipedia for an explanation of why this works.) For a valid ring (where the linework does not cross itself) this works perfectly. However, in an invalid self-crossing ring (sometimes called a "bow-tie" or "figure-8") a choice must be made about which lobe is assigned to be the "interior". The upper-vertex approach always picks the top lobe.  If that happens to be very small, the larger part of the ring is considered "exterior" and hence is removed by buffering.

A bow-tie polygon where buffer makes the evidently wrong choice for interior.
The problem occurs for non-zero buffer distances as well.

    This problem has limped along for many years now, never being quite enough of a pain point to motivate the effort needed to find a fix (or to fund one!).  And to be honest, the buffer code is some of the most complicated and delicate in JTS, and I was concerned about wading into it to add what seemed poised to be a fiddly correction.

    But recently there has been renewed interest in providing a Make-Valid capability for JTS.  This inspired me to revisit the usage of buffer(0), and think more deeply about ring orientation and its role in determining valid polygonal topology. And this led to discovering a surprisingly simple solution for the buffer issue.

    The fix is to use an orientation test which takes into account the entire ring. This is provided by the Signed-Area Orientation test, implemented in Orientation.isCCWArea using the Shoelace Formula. This effectively determines orientation based on the largest area enclosed by the ring. This corresponds more closely to user expectation based on visual assessment. It also minimizes the change in area and (usually) extent.
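    The two orientation tests can be contrasted with a small sketch (Python for brevity, since JTS itself is Java; the function names and the simplified uppermost-vertex logic are illustrative, not the JTS API):

```python
def signed_area(ring):
    # Shoelace formula over a closed ring (first point == last point)
    s = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        s += x0 * y1 - x1 * y0
    return s / 2.0

def is_ccw_area(ring):
    # Orientation by signed area: dominated by the largest-area lobe
    return signed_area(ring) > 0

def is_ccw_top(ring):
    # Simplified sketch of the uppermost-vertex test: examine only
    # the turn at the highest vertex, ignoring the rest of the ring
    pts = ring[:-1]
    i = max(range(len(pts)), key=lambda k: pts[k][1])
    prev, v, nxt = pts[i - 1], pts[i], pts[(i + 1) % len(pts)]
    cross = (v[0] - prev[0]) * (nxt[1] - v[1]) - (v[1] - prev[1]) * (nxt[0] - v[0])
    return cross > 0

# Bow-tie ring: a large CCW lobe below, a small CW lobe on top,
# with the self-crossing at (5, 5)
bowtie = [(0, 0), (10, 0), (4, 6), (6, 6), (0, 0)]
# is_ccw_area(bowtie) -> True  (the big lobe determines orientation)
# is_ccw_top(bowtie)  -> False (the small top lobe determines orientation)
```

    Here the area-based test reports the orientation of the large lobe, while the top-vertex test reports that of the small one - exactly the discrepancy that causes buffer(0) to keep the wrong part.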

    And indeed it works:

    It fixes the simplification issue nicely:

    The fix consists of about 4 lines of actual code.  To paraphrase a great orator, never in the history of JTS has so much benefit been given to so many by so few lines of code.  Now buffer(0) can be recommended unreservedly as an effective, performant way to fix polygonal geometry.  And all those helpful documentation pages can drop any qualifications they might have.

    As usual, this fix will soon show up in GEOS, and from there in PostGIS and other downstream projects.  

    This isn't the end of the story.  There are times when the effect of buffer(0) is not what is desired for fixing polygon topology.  This is discussed nicely in this blog post.  The ongoing research into Make Valid will explore alternatives and how to provide an API for them.

    Friday, 18 December 2020

    Randomization to the Rescue!

    Now that OverlayNG has landed on JTS master, it is getting attention from downstream projects interested in porting it or using it.  One of the JTS ports is the Net Topology Suite (NTS) project, and it is very proactive about tracking the JTS codebase.  Soon after the OverlayNG commit an NTS developer noticed an issue while running the performance tests in JTS: the InteriorPointInAreaPerfTest was now causing a StackOverflowError.  This turned out to be caused by the change to GeometryPrecisionReducer to use OverlayNG with Snap-Rounding to perform robust precision reduction of polygons.  Further investigation revealed that failure was occurring while querying the KdTree used in the HotPixelIndex in the SnapRoundingNoder.  Moreover, in addition to the outright failure, even when queries did succeed they were taking an excessively long time for large inputs.

    The reason for using a K-D tree as the search structure for HotPixels is that it supports two kinds of queries with a single structure: 

    • Lookup queries to find the HotPixel for a coordinate.  These queries are performed while building the index incrementally. Hence a dynamic data structure like a K-D tree is required, rather than a static index such as STRtree.
    • Range queries to find all HotPixels within a given extent.  These queries are run after the index is loaded.

    The JTS KdTree supports both of these queries more efficiently than QuadTree, which is the other dynamic index in JTS.  However, K-D trees are somewhat notorious for becoming unbalanced if inserted points are coherent (i.e. contain monotonic runs in either ordinate dimension).  This is exactly the situation which occurs in large, relatively smooth polygons - which is the test data used in InteriorPointInAreaPerfTest.  An unbalanced tree is deeper than it would be if perfectly balanced.  In extreme cases this leads to trees of very large depth relative to their size.  This slows down query performance and, because the KdTree query implementation uses recursion, can also lead to stack overflow.  

    A perfectly balanced K-D tree, with depth = logN.  In the worst case an unbalanced tree has only one node at each level (depth = N).

    I considered a few alternatives to overcome this major blocker.  One was to use a QuadTree, but as mentioned this would reduce performance.  There are schemes to load a K-D tree in a balanced way, but they seemed complex and potentially non-performant.

    It's always great when people who file issues are also able to provide code to fix the problem.  And happily the NTS developer submitted a pull request with a very nice solution.  He observed that while the K-D tree was being built incrementally, in fact the inserted points were all available beforehand.  He proposed randomizing them before insertion, using the elegantly simple Fisher-Yates shuffle.  This worked extremely well, providing a huge performance boost and eliminating the stack overflow error.  Here are the relative performance times:

    Num Pts    Randomized    In Order
    10K        126 ms        341 ms
    20K        172 ms        1924 ms
    50K        417 ms        12.3 s
    100K       1030 ms       59 s
    200K       1729 ms       240 s
    500K       5354 ms       Overflow
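    The effect of insertion order on K-D tree depth is easy to reproduce with a toy implementation (a Python sketch, not the JTS KdTree; the Fisher-Yates shuffle is spelled out for clarity, though random.shuffle implements the same algorithm):

```python
import random

def kd_insert(node, pt, depth=0):
    # Node is [point, left, right]; alternate x/y as the split axis
    if node is None:
        return [pt, None, None]
    axis = depth % 2
    side = 1 if pt[axis] < node[0][axis] else 2
    node[side] = kd_insert(node[side], pt, depth + 1)
    return node

def depth_of(node):
    if node is None:
        return 0
    return 1 + max(depth_of(node[1]), depth_of(node[2]))

def build(points):
    root = None
    for p in points:
        root = kd_insert(root, p)
    return root

# A monotonic run of points, like vertices along a smooth boundary
pts = [(i, i) for i in range(500)]
deep = depth_of(build(pts))          # degenerates to a linked list: depth 500

random.seed(42)
shuffled = list(pts)
# Fisher-Yates shuffle
for i in range(len(shuffled) - 1, 0, -1):
    j = random.randint(0, i)
    shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
shallow = depth_of(build(shuffled))  # close to the balanced depth ~ 2*log2(N)
```

    With monotonic input every point goes to the same side, so the tree depth equals the point count; after shuffling, the expected depth is logarithmic.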

    Once the solution was merged into JTS, my colleague Paul Ramsey quickly ported it to GEOS, so that PostGIS and all the other downstream clients of GEOS would not encounter this issue.

    It's surprising to me that this performance issue hasn't shown up in the other two JTS uses of KdTree: SnappingNoder and ConstrainedDelaunayTriangulator. More investigation required!

    Thursday, 8 October 2020

    OverlayNG lands in JTS master

    I'm happy to say that OverlayNG has landed on the JTS master branch!  This is the culmination of over a year of work, and even more years of planning and design.  It will appear in the upcoming JTS 1.18 release this fall.

    As described in previous posts OverlayNG brings substantial improvements to overlay computation in JTS:  

    • A completely new codebase provides greater clarity, maintainability, and extensibility
    • Pluggable noding supports various kinds of noding strategies including Fast Full-Precision Noding, Snapping and Snap-Rounding.
    • Optimizations are built-in, including new ones such as Ring Clipping and Line Limiting.
    • Additional functionality including Precision Reduction and Fast Coverage Union

    All of these improvements are encapsulated in the new OverlayNGRobust class. It provides fully robust execution with excellent performance, via automated fallback through a series of increasingly robust noding strategies.  This should solve a host of overlay issues reported over the years in various downstream projects such as GEOS, PostGIS, Shapely, R-sf and QGIS. (Many of these cases have been captured as  XML tests for overlay robustness, to ensure that they are handled by the new code).

    Initially the idea was to use OverlayNG as an opportunity to simplify and improve the semantics of overlay output, including:

    • Sewing linear output node-to-node (to provide a full union)
    • Ensuring output geometry is homogeneous (to allow easy chaining of overlay operations)
    • Omitting lines created by topology collapse in polygonal inputs

    In the end we decided that this change would have too much impact on existing tests and downstream code, so the default semantics are the same as the previous overlay implementation.  However, the simplified semantics are available as a "strict" mode for overlay operations.  

    At the moment OverlayNG is not wired in to the JTS Geometry overlay operations, but is provided as a separate API.  The plan is to provide a runtime switch to allow choosing which overlay code is used.  This will allow testing in-place while avoiding potential impact on production systems.  GeometryPrecisionReducer has been changed to use OverlayNG with Snap-Rounding, to provide more effective precision reduction.

    GEOS has been tracking the OverlayNG codebase closely for a while now, which has been valuable to finalize the overlay semantics, and for finding and fixing issues.  Having the code in JTS master gives the green light for downstream projects to do their own testing as well. There have been a few issues reported:

    • A minor copy-and-paste variable name issue in HotPixel (which did not actually cause any failures, since it was masked by other logic)
    • A clarification about the new behaviour of GeometryPrecisionReducer, revealed by an NTS test
    • Most notably, a serious performance issue with Snap-Rounding of large geometries was identified and fixed by a member of the NTS project.  This is quite interesting, so I'll discuss it in detail in another post.

    After years of designing and developing improvements to overlay, it's great to see OverlayNG make its debut.  Hopefully this will be the end of issues involving the dreaded TopologyException. And the new design will make it easier to build other kinds of overlay operations, including things like fast line clipping, split by line, coverage overlay... so stay tuned! 

    Saturday, 20 June 2020

    JTS OverlayNG - Tolerant Topology Transformation

    This is another in a series of posts about the new OverlayNG algorithm being developed for the JTS Topology Suite. (Previous ones are here and here).  Overlay is a core spatial function which allows computing the set-theoretic boolean operations of intersection, union, difference, and symmetric difference over all geometry types. OverlayNG introduces significant improvements in performance, robustness and code design.

    JTS has always provided the ability to specify a fixed-precision model for computing geometry constructions (including overlay).  This ensures that output coordinates have a defined, limited precision.  This can reduce the size of data transfers and storage, and generally leads to cleaner, simpler geometric output.  The original overlay implementation had some issues with robustness, which were exacerbated by using fixed-precision.  One of the biggest improvements in OverlayNG is that fixed-precision overlay is now guaranteed to be fully robust.  This is achieved by using an implementation of the well-known snap-rounding noding paradigm. 

    Geometric algorithms which operate in a fixed-precision model can encounter situations called topology collapse.  This happens when line segments and points become coincident due to vertices or intersection points being rounded.  The OverlayNG algorithm detects occurrences of topology collapse and transforms them into valid topology in the overlay result.
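    Topology collapse is easy to demonstrate with rounding alone (a Python sketch; the coordinates and scale factor are invented for illustration):

```python
def reduce_precision(ring, scale):
    # Round each vertex onto a grid with cell size 1/scale
    return [(round(x * scale) / scale, round(y * scale) / scale) for x, y in ring]

def signed_area(ring):
    # Shoelace formula over a closed ring (first point == last point)
    return sum(x0 * y1 - x1 * y0
               for (x0, y0), (x1, y1) in zip(ring, ring[1:])) / 2.0

# A long, very thin triangle
sliver = [(0, 0), (10, 0), (5, 0.3), (0, 0)]
flat = reduce_precision(sliver, 1)
# The apex rounds down onto the base: the ring collapses to a zero-area line
```

    OverlayNG must detect such collapsed rings and either discard them or merge them into adjacent topology, rather than emitting invalid output.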

    Topology collapse during overlay with a fixed precision model

    As a bonus, handling topology collapse during the overlay process also allows it to be tolerated when present in the original input geometries.  This means that some kinds of "mildly" invalid geometry (according to the OGC model) are acceptable as input.  Invalid geometry is transformed to valid geometry during the overlay process.

    Specifically,  input geometry may contain the following situations, which are invalid in the OGC geometry model:
    • A ring which self-touches at discrete points (the so-called "inverted polygon" or "exverted hole")
    • A ring which self-touches in one or more line segments
    • Rings which touch other ones along one or more line segments 
    Note that this does not extend to handling polygons that overlap, rather than simply touch.  These are "strongly invalid", and will trigger a TopologyException during overlay.

    An interesting use for this capability is to process individual geometries.  Simply computing the union of a single geometry transforms it into an OGC-valid geometry.  In this way OverlayNG functions as a (partial) "MakeValid" operation.
    A polygon which self-touches in a line transforms to a valid polygon with a hole

    A polygon which self-touches in a point transforms to a valid polygon with a hole

    A collection of polygons which touch in lines transforms to a valid polygon with a hole

    Moreover, some spatial systems use geometry models which do not conform to the OGC semantics.  Some systems (such as ArcGIS) actually specify the use of inverted polygons and exverted holes in their topology model.  And in days of yore there were systems which were unable to model holes explicitly, and so used a "connected hole" topology hack (AKA "lollipop holes").  This represented holes as an inversion connected by a zero-width corridor to the polygon shell.  Both of these models are accepted by OverlayNG.  Thus it provides a convenient way to convert from these non-standard models into OGC-valid topology.

    This is one more reason why overlay is the real workhorse of spatial data processing!

    Thursday, 18 June 2020

    JTS OverlayNG - Noding Strategies

    In a previous post I unveiled the exciting new improvement in the JTS Topology Suite called OverlayNG.  This new implementation provides significant improvements to the core function of spatial overlay.  Overlay supports computing the set-theoretic boolean operations of intersection, union, difference, and symmetric difference over all geometric types.

    One of the design goals of JTS is to create modular, reusable data structures and processes to implement spatial algorithms.  This increases development velocity and testability, and makes algorithms easier to understand.  In spatial algorithms it is not always obvious how to identify appropriate abstractions for reuse, so this is an on-going effort of design and refactoring.

    After the implementation of spatial overlay in the very first release of JTS, it became clear that overlay can be split into the following phases:

    1. Noding, in which a set of possibly-intersecting linestrings is converted to an arrangement in which linestrings touch only at endpoints
    2. Topology Analysis, during which the topology graph of the noded arrangement is determined
    3. Result Extraction, in which the geometric components of the desired result are extracted from the topology graph
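    Conceptually, the Noding phase splits every linestring at its intersections with the others. A brute-force Python sketch of the idea (quadratic in the number of segments; the real noders such as MCIndexNoder use a spatial index to avoid the all-pairs test):

```python
def seg_intersection(p1, p2, q1, q2):
    # Intersection point of two segments, or None (parallel segments ignored)
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = p1, p2, q1, q2
    d = (x2 - x1) * (y4 - y3) - (y2 - y1) * (x4 - x3)
    if d == 0:
        return None
    t = ((x3 - x1) * (y4 - y3) - (y3 - y1) * (x4 - x3)) / d
    u = ((x3 - x1) * (y2 - y1) - (y3 - y1) * (x2 - x1)) / d
    if 0 <= t <= 1 and 0 <= u <= 1:
        return (x1 + t * (x2 - x1), y1 + t * (y2 - y1))
    return None

def node_segments(segs):
    # Split every segment at its intersections with all the others
    noded = []
    for i, (a, b) in enumerate(segs):
        cuts = [a, b]
        for j, (c, d) in enumerate(segs):
            if i != j:
                pt = seg_intersection(a, b, c, d)
                if pt is not None:
                    cuts.append(pt)
        # Order the split points along the segment, then emit the pieces
        cuts.sort(key=lambda p: (p[0] - a[0]) ** 2 + (p[1] - a[1]) ** 2)
        uniq = [cuts[0]]
        for p in cuts[1:]:
            if p != uniq[-1]:
                uniq.append(p)
        noded += [(uniq[k], uniq[k + 1]) for k in range(len(uniq) - 1)]
    return noded

# Two crossing segments become four, touching only at the node (2, 2)
noded = node_segments([((0, 0), (4, 4)), ((0, 4), (4, 0))])
```

    After noding, linestrings meet only at shared endpoints, which is what makes the subsequent topology graph construction possible.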

    It also became clear that the Noding phase is critical, since it determines the overall performance and robustness of the overlay computation.  Moreover, tradeoffs between these two qualities can be made by using different noding strategies.  For instance, the "classic" JTS noding approach is fast, but susceptible to robustness issues.  Alternatively, noding using the well-known snap-rounding paradigm is slower, but can be made fully robust. 

    To encapsulate this concept, JTS introduced the Noder API.  Since it post-dated the original overlay code, using it in overlay had to await a reworking of that codebase.  The OverlayNG project provided this opportunity.  OverlayNG allows supplying a specific Noder class to be used during overlay. 

    One of the main goals of the OverlayNG project was to develop a noder to provide fully robust noding.  This would eliminate the notorious TopologyException errors which bedevil the use of overlay.  The effort has paid off with the development of not one, but two new noders.  The Snapping Noder has very good performance and (with the addition of some heuristics, and so far as is known) provides robust full-precision evaluation.  And the Snap-Rounding noder provides guaranteed robustness as well as the ability to enforce a fixed-precision model for output.

    So now OverlayNG can be run with the following suite of noders, depending on use case.  The images show the result of intersection and union on the following geometries:

    Fast Full-Precision Noder

    The MCIndexNoder noding strategy has been available since the early days of JTS. It has very good performance due to the use of monotone chains and the STRtree spatial index.  However, it is a relatively simple algorithm which due to numerical robustness issues does not always produce a valid noding.  In overlay it is always used in conjunction with a noding validator, so that noding failure can be detected and an alternative strategy used to perform the operation successfully.
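    The monotone chain idea behind MCIndexNoder can be sketched simply: a line is split wherever the segment direction changes quadrant, so the segments within one chain cannot intersect each other and intersection testing reduces to cheap chain-envelope comparisons (a Python sketch with illustrative names, not the JTS MonotoneChain API):

```python
def sign(v):
    return (v > 0) - (v < 0)

def monotone_chains(coords):
    # Split a line wherever the segment direction changes quadrant.
    # Segments within a single chain cannot intersect each other.
    def quadrant(i):
        dx = coords[i + 1][0] - coords[i][0]
        dy = coords[i + 1][1] - coords[i][1]
        return (sign(dx), sign(dy))
    chains, start = [], 0
    for i in range(1, len(coords) - 1):
        if quadrant(i) != quadrant(start):
            chains.append(coords[start:i + 1])
            start = i
    chains.append(coords[start:])
    return chains

# A zigzag line splits into an ascending and a descending chain
line = [(0, 0), (1, 1), (2, 2), (3, 1), (4, 0)]
```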

    Intersection and Union with full-precision floating noding

    Snapping Noder

    The SnappingNoder is a refinement of the MCIndexNoder which snaps existing input and computed intersection vertices together, if they are closer than a snap distance tolerance.  This dramatically improves the robustness of the noding, with only minor impact on performance. 

    Noding robustness issues are generally caused by nearly coincident line segments, or by very short line segments.  Snapping mitigates both of these situations.  The choice of snap tolerance is a heuristic one.  Generally, a smaller snap distance has less chance of distorting the topology, but it needs to be large enough to resolve intersection computation imprecision.  In practice, excellent robustness is provided by using a very small snap distance (e.g. a factor of 10^12 smaller than the geometry magnitude).
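    The tolerance heuristic and the snapping step can be sketched as follows (illustrative Python, not the actual SnappingNoder code):

```python
def snap_tolerance(coords):
    # Heuristic: a snap distance ~1e12 times smaller than the geometry magnitude
    mag = max(max(abs(x), abs(y)) for x, y in coords)
    return mag / 1e12

def snap_point(pt, vertices, tol):
    # Snap a computed intersection point to a nearby input vertex, if any
    for v in vertices:
        if (pt[0] - v[0]) ** 2 + (pt[1] - v[1]) ** 2 <= tol * tol:
            return v
    return pt

verts = [(0.0, 0.0), (1000.0, 1000.0)]
tol = snap_tolerance(verts)                       # 1000 / 1e12 = 1e-9
# An intersection computed a hair away from an existing vertex snaps onto it
pt = snap_point((1000.0 - 1e-10, 1000.0), verts, tol)
```

    Snapping to existing vertices eliminates the near-coincident points and micro-segments that cause robustness failures in full-precision noding.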

    Snapping of course risks creating topology collapses, but OverlayNG is designed to handle these correctly.  However, there are occasional situations where the snapped arrangement is too invalid to be handled.  This can be detected, and with some simple heuristic adjustments (e.g. a more aggressive snap distance) the overlay can be rerun.  This strategy has proven to be fully robust in all cases tried so far.

    Intersection and Union with Snapping Noding (snap tolerance = 0.5)

    Snap-Rounding Noder

    The SnapRoundingNoder implements the well-known snap-rounding paradigm.  It provides fully robust noding by rounding and snapping linework to a fixed-precision grid.  This has the unavoidable effect of rounding every output vertex to the precision grid. This may or may not be desirable depending on the situation.  A useful side effect is that it provides an effective means of reducing the precision of geometries in a topologically valid way. 
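    At the core of snap-rounding is the "hot pixel": the grid cell centred on a rounded vertex, into which any passing segment gets noded. A much-simplified Python sketch (a point-in-pixel test only; the real HotPixel class tests segments against the pixel):

```python
def snap_to_grid(pt, scale):
    # Round a point onto the precision grid with cell size 1/scale
    return (round(pt[0] * scale) / scale, round(pt[1] * scale) / scale)

def in_hot_pixel(vertex, scale, pt):
    # The hot pixel is the grid cell of width 1/scale centred on a
    # snapped vertex; anything entering it gets noded at that vertex
    half = 0.5 / scale
    return abs(pt[0] - vertex[0]) <= half and abs(pt[1] - vertex[1]) <= half

v = snap_to_grid((3.3, 3.8), 1.0)    # -> (3.0, 4.0)
```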

    In the early stages of OverlayNG design and development I expected that snap-rounding would be required to ensure fully-robust overlay, in spite of the downside of fixed-precision output.  But the development of the SnappingNoder and accompanying heuristics means that this noder need only be used when control over overlay output precision is desired. 

    Using an appropriate precision model is a highly worthwhile goal in spatial data management, since it reduces the amount of memory needed to represent data, and improves robustness and portability.  This is unfortunately often neglected, mostly due to lack of tools available to enforce it.  Hopefully this capability will encourage users to maintain a precision model which is better matched to the true precision of their data.

    Intersection and Union with Snap-Rounding noding (precision scale = 1)

    Segment Extracting Noder

    This is a special-purpose noder which is really more of a "non-noder". It simply extracts every line segment in the input.  It is used on geometry collections which form valid, fully-noded, non-overlapping polygonal coverages.  When used with OverlayNG, this has the effect of dissolving the duplicate line segments and producing the union of the input coverage. By taking advantage of the structure inherent in the coverage model the SegmentExtractingNoder offers very fast performance.  It can also operate on fully-noded linear networks.
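    The dissolve effect can be sketched by counting undirected segments: in a valid coverage, interior (shared) edges appear exactly twice and boundary edges once (illustrative Python, not the SegmentExtractingNoder itself):

```python
from collections import Counter

def coverage_union_boundary(rings):
    # In a valid polygonal coverage, shared edges occur exactly twice;
    # the union boundary is the set of edges occurring only once
    counts = Counter()
    for ring in rings:
        for a, b in zip(ring, ring[1:]):
            counts[frozenset([a, b])] += 1
    return [tuple(sorted(seg)) for seg, n in counts.items() if n == 1]

# Two unit squares sharing the edge (1,0)-(1,1)
left  = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
right = [(1, 0), (2, 0), (2, 1), (1, 1), (1, 0)]
boundary = coverage_union_boundary([left, right])
# The shared edge is dissolved, leaving the 6 outer edges
```

    This only works because the coverage is already noded and non-overlapping, which is exactly the precondition the SegmentExtractingNoder relies on.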

    Union of a Polygonal Coverage with SegmentExtractingNoder

    The support for pluggable noding and the development of a suite of fast and/or robust noders constitutes the biggest advance of the OverlayNG code.  It finally allows JTS to provide fully robust noding and true support for a fixed-precision model!  This has been a dream of mine for more than a decade.  It's good to think that the end of the era of TopologyException issues is in sight!