Monday, 1 November 2021

JTS Polygon Triangulation, at last

A (long) while ago I posted about "soon-to-be-released" JTS code for polygon triangulation using Ear Clipping.  It turned out it was actually in the category of "never-to-be-released".  However, later  I worked with a student, Dan Tong, on a coding exercise sponsored by Facebook. We decided to tackle polygon triangulation.  By the end of the project he implemented a functional algorithm, including handling holes (via the hole-joining technique described by Eberly).  To make it release-ready the code needed performance and structural improvements.  This has finally happened and the code is out!  

Polygon Triangulation

The new codebase has been substantially rewritten to improve modularity and the API.  A new data structure to represent triangulations has been introduced, called a Tri  It is simpler, more memory efficient, and easier to work with than the QuadEdge data structure previously used in JTS. It provides an excellent basis for developing further algorithms based on triangulations (of which more below)

Most of the reworking involved implementing key performance optimizations, utilizing existing and new JTS spatial index structures: 

  • the Hole Joining phase now uses a spatial index to optimize the test for whether a join line is internal to a polygon
  • the Ear-Clipping phase has to check whether a candidate ear contains any vertices of the polygon shell being clipped.  This has been made performant by using a new spatial index structure called the VertexSequencePackedRtree

Performance must be benchmarked against an existing implementation.  Geometry maestro Vladimir Agafonkin has developed an Ear Clipping algorithm in Javascript called Earcut.  It's claimed to be the fastest implementation in Javascript.  Indeed the performance comparison shows Earcut to be over 4 times faster than some other Javascript triangulation implementations.  Earcut has been ported to Java as earcut4j, which allows comparing JTS to it.   

DatasetSizeJTS TimeEarcut4j Time
Lake Superior3,478 vertices / 28 holes18 ms13 ms
Canada10,522 vertices35 ms26 ms
Complex polygon44,051 vertices / 125 holes1.1 s0.64 s
North America with Lakes115,206 vertices / 1,660 holes13.3 s5.4 s

So JTS ear clipping is not quite as fast as Earcut4j.  But it's still plenty fast enough for production use!

The triangulation algorithm is exposed as the PolygonTriangulator class (deliberately named after the effect, rather than the algorithm producing it).  Here's an example of the output:

Ear-Clipping Triangulation of Lake Superior

Constrained Delaunay Triangulation

The output of PolygonTriangulator is guaranteed to be a valid triangulation, but in the interests of performance it does not attempt to produce a high-quality one.  Triangulation quality is measured in terms of minimizing the number of "skinny" triangles (or equivalently, maximizing the sum of the interior angles).  Optimal quality is provided by Constrained Delaunay Triangulation (CDT).  In addition to being optimal in terms of triangle shape, the CDT has the nice properties of being (essentially) unique, and of accurately representing the structure of the input polygon.  It also removes the influence that Hole Joining holes has on the raw triangulation provided by Ear-Clipping.

I originally proposed that a CDT can be achieved by a Delaunay Improvement algorithm based on iterated triangle flipping.  It turned out that this technique is not only effective, but also pleasingly performant.  This is implemented in the ConstrainedDelaunayTriangulator class.  Running it on the example above shows the improvement obtained.  Note how there are fewer narrow triangles, and their arrangement more closely represents the structure of the polygon.

Constrained Delaunay Triangulation of Lake Superior

This code will appear in the next JTS version (1.18.3 or 1.19), and has already been released in GEOS 3.10.

Future Work

It seems possible to adapt the machinery for Ear Clipping to provide two other useful constructions:

  • Polygon Concave Hull ("Outer Hull") - this requires a triangulation of the area between the polygon and its convex hull.  That can be obtained by running Ear Clipping outwards, rather than inwards. A Concave Hull can be constructed by limiting the number of triangles generated based on some criteria such as area, longest edge, or number of vertices.  The result is a simplified polygon with fewer vertices which is guaranteed to contain the original polygon. 
  • Polygon Inner Hull - the triangulation constructed by Ear Clipping can be "eroded" to provide a simpler polygon which is guaranteed to be contained in the original.
The property of the Constrained Delaunay Triangulation of representing the structure of the input polygon could provide a basis for computing the following constructions: 
  • Approximate Skeleton construction (also known as the Medial Axis) by creating edges through the CDT triangles
  • Equal-Area Polygon Subdivision, by partitioning the graph induced by the CDT
The Tri data structure provides a simple representation of a triangulation.  I expect that it will facilitate the development of further JTS triangulation algorithms, such as:
  • Concave Hull of Point Set
  • Constrained Delaunay Triangulation of Lines and Points


Saturday, 9 October 2021

Query KD-trees 100x faster with this one weird trick!

Recently a GEOS patch was contributed to change the KdTree query implementation to use an explicit stack rather than recursion.  This has been ported to JTS as PR #779 (along with some refactoring).

The change was motivated by a QGIS issue in which a union of some large polygons caused a stack overflow during a KdTree query.  The reason is that the poor vertex alignment of the input polygons causes the union overlay process to invoke the SnappingNoder.  (For background about why this occurs see the post OverlayNG - Noding Strategies).  The snapping noder snaps vertices using a KdTree with a tolerance.  The polygon boundaries contain runs of coherent (nearly-monotonic) vertices (which is typical for high-resolution polygons delineating natural areas).  When these are loaded directly into a KdTree the tree can become unbalanced.  

An unbalanced tree has a relatively large depth for its size.  The diagrams below shows the graph of the KdTree for a polygon containing 10,552 vertices.  The unbalanced nature of the tree is evident from the few subtrees extending much deeper than the rest of the tree. The tree depth is 282, but most of that occurs in a single subtree.

An unbalanced KD-tree (size = 10,552,  depth = 282)

Left: full graph.     Right: close-up of graph structure

For large inputs querying such a deep subtree causes the recursive tree traversal to exceed the available call stack. Switching to an iterative implementation eliminates the stack overflow error, since the memory-based stack can grow to whatever size is needed.

However, stack overflow is not the only problem!  Unbalanced KD-trees also exhibit poor query performance.  This is because deep tree traversals make the search runtime more linear than logarithmic.   Increasing the size of the query stack increases robustness, but does not improve the performance problem.  The best solution is to build the KD-tree in a balanced way in the first place. 

One way to do this is to to break up monotonic runs by randomizing the order of vertex insertion.  This is actually implemented in the OverlayNG SnapRoundingNoder, as discussed in the post Randomization to the Rescue.  However, using this approach in the SnappingNoder would effectively query the tree twice for each vertex.  And in fact it's not necessary to randomize all the inserted points.  A better-balanced tree can be produced by "seeding" it with a small set of well-chosen points.  

But how can the points be chosen?   They could be selected randomly, but true randomness can be quite "clumpy".  Also, a non-deterministic algorithm is undesirable for repeatability.  Both of these issues can be avoided by using a low-discrepancy quasi-random sequence.  This concept has a fascinating theory, with many approaches available.  The simplest to implement is an additive recurrence sequence with a constant α (the notation {...} means "fractional value of"):

R(α) :  tn = { t0 + n * α },  n = 1,2,3,...    

If α is irrational the sequence never repeats (within the limits of finite-precision floating point).  The excellent article The Unreasonable Effectiveness of Quasirandom Sequences suggests that the lowest possible discrepancy occurs when α is the reciprocal of the Golden Ratio φ. 

1 / φ = 0.6180339887498949...

Implementing this is straightforward, and it is remarkably effective.  For the example above, the resulting KD-tree graph is significantly better-balanced, with lower depth (85 versus 282):

Better-balanced KD-tree (size = 10,552,  depth = 85)

Adding KD-tree seeding to the SnappingNoder gives a huge performance boost to the QGIS test case.  For the 526,466 input vertices the tree depth is reduced from 17,776 to 183 and the snapping time is improved by about 100 times, from ~80 s to ~800 ms!  (In this case the overall time for the union operation is reduced by only about 50%, since the remainder of the overlay process is time-consuming due to the large number of holes in the output.)

In GEOS the performance difference is smaller, but still quite significant.  The snapping time improves 14x, from 12.2 s to 0.82 s, with the union operation time decreasing by 40%, from 31 to 18 secs.

This improvement will be merged into JTS and GEOS soon.



Friday, 16 July 2021

JTS IsValidOp Built Back Better

In a previous post I described how the JTS Topology Suite operation IsSimpleOp has been completely rewritten to reduce code dependencies, improve performance, and provide a simpler, more understandable implementation.  The post points out that the IsValidOp implementation would benefit from the same treatment.  This work has now been carried out, with similar benefits achieved.  

The original IsValidOp code used the GeometryGraph framework to represent the topology of geometries.  This code reuse reduced development effort, but it carried some serious drawbacks:

  • The GeometryGraph structure was used for many JTS operations, including overlay, buffer, and spatial predicate evaluation.  This made the code complex, hard to understand, and difficult to maintain and enhance
  • The GeometryGraph structure computes the full topology graph of the input geometry, in order to support constructive operations such as overlay and buffer.  This makes it slower and more subject to robustness problems.  Non-constructive operations such as IsValidOp and IsSimpleOp can compute topology "on-the-fly", which allows short-circuiting processing as soon as an invalidity is found.  
The new IsValidOp implementation is more self-contained, relying only on the noding framework and some basic spatial predicates.  Using the MCIndexNoder allows short-circuited detection of some kinds of invalidity, which improves performance.  And it will be easy to switch to a different noding strategy should a superior one arise.
However, dropping GeometryGraph means that the topology information required to confirm validity needs to be computed some other way.  The goal is to compute just enough topological structure to evaluate validity, and do that efficiently on an as-needed basis.  This required some deep thinking about validity to pare the topology information required down to a minimum.  This was assisted by the extensive set of unit tests for IsValidOp, which ensured that all possible situations were handled by the new logic (although there was an untested situation which uncovered a bug late in development.) The resulting algorithm uses some elegant logic and data structures, which are explained in detail below. 

OGC Validity

JTS implements the OGC Simple Features geometry model.  To understand the validity algorithm, it helps to review the rules of geometry validity in the OGC specification:
For non-polygonal geometry types, the validity rules are simple;
  1. Coordinates must contain valid numbers
  2. LineStrings must not be zero-length
  3. LinearRings must be simple (have no self-intersections)
For polygonal geometry, on the other hand, the validity rules are quite stringent (sometimes considered too much so!).  The rules are:
  1. Rings must be simple (have no self-intersections)
  2. Rings must not cross, and can touch only at discrete points
  3. Holes must lie inside their shell
  4. Holes must not lie inside another hole of their polygon
  5. The interiors of polygons must be connected
  6. In a MultiPolygon, the element interiors must not intersect (so polygons cannot overlap)
  7. In a MultiPolygon, the element boundaries may intersect only at a finite number of points

Validity Algorithm for Polygonal Geometry

The new IsValidOp uses an entirely new algorithm to validate polygonal geometry.  It has the following improvements:
  • It supports short-circuiting when an invalid situation is found, to improve performance
  • Spatial indexing is used in more places
  • The code is simpler and more modular.  This makes it easier to understand, and easier to adapt to new requirements (e.g. such as the ability to validate self-touching rings described below)
The algorithm consists of a sequence of checks for different kinds of invalid situations.  The order of the checks is important, since the logic for some checks depends on previous checks passing. 

1. Check Ring Intersections

The first check determines if the rings in the geometry have any invalid intersections.  This includes the cases of a ring crossing itself or another ring, a ring intersecting itself or another ring in a line segment (a collinear intersection), and a ring self-touching (which is invalid in the OGC polygon model).  This check can use the robust and performant machinery in the JTS noding package.  If an invalid intersection is found, the algorithm can return that result immediately.

Kinds of invalid intersections: 
(i) a self-touch; (ii) collinear; (iii) segment crossing; (iv) vertex crossing

This check is an essential precursor to all the remaining checks.  If a geometry has no invalid intersections this confirms that its rings do not self-cross or partially overlap.  This means that the results of orientation and point-in-polygon computations are reliable.  It also indicates that the rings are properly nested (although it remains to be determined if the nesting of shells and holes is valid).

Intersection search is the most computationally expensive phase of the algorithm.  Because of this, information about the locations where rings touch is saved for use in the final phase of checking connected interiors.  

2. Check Shell and Hole nesting

If no invalid intersections are found, then the rings are properly nested.  Now it is necessary to verify that shells and holes are correctly positioned relative to each other:
  • A hole must lie inside its shell
  • A hole may not lie inside another hole. 
  • In MultiPolygons, a shell may not lie inside another shell, unless the inner shell lies in a hole of the outer one.  
Ring position can often be tested using a simple Point-In-Polygon test.  But it can happen that the ring point tested lies on the boundary of the other ring.  In fact, it is possible that all points of a ring lie on another ring.  In this case the Point-In-Polygon test is not sufficient to provide information about the relative position of the rings.  Instead, the topology of the incident segments at the intersection is used to determine the relative position.  Since the rings are known to not cross, it is sufficient to test whether a single segment of one ring lies in the interior or exterior of the other ring.

MultiPolygon with an element where all vertices lie on the boundary
Checking nesting of holes and shells requires comparing all interacting pairs of rings.  A simple looping algorithm has quadratic complexity, which is too slow for large geometries.  Instead, spatial indexes (using the STRtree) are used to make this process performant. 

3. Check Connected Interior 

The final check is that the interior of each polygon is connected.  In a polygon with no self-touching rings, there are two ways that interior disconnection can happen:

  • a chain of one or more touching holes touches the shell at both ends, thus splitting the polygon in two
  • a chain of of two or more touching holes forms a loop, thus enclosing a portion of the polygon interior  
To check that these conditions do not occur, the only information needed is the set of locations where two rings touch.  These induce the structure of an undirected graph, with the rings being the graph vertices, and the touches forming the edges.  The situations above correspond to the touch graph containing a cycle (the splitting situation is a cycle because the shell ring is also a graph vertex).  (Equivalently, the touch graph of a valid polygon is a tree, or more accurately a forest, since there may be sets of holes forming disconnected subgraphs.) So a polygon can be verified to have a connected interior by checking that its touch graph has no cycles.  This is done using a simple graph traversal, detecting vertices (rings) which are visited twice.
Situations causing a disconnected interior: 
(i) a chain of holes; (ii) a cycle of holes
If no invalid situations are discovered during the above checks, the polygonal geometry is valid according to the OGC rules.

Validity with Self-touching Rings

Some spatial systems allow a polygon model in which "inverted shells" are valid.  These are polygon shells which contain a self-touch in a way that encloses some exterior area.  They also allow "exverted holes", which contain self-touches that disconnect the hole into two or more lobes.  A key point is that they do not allow "exverted shells" or "inverted holes"; self-touches may disconnect the polygon exterior, but not the interior.  Visually this looks like:
Left: Inverted shell and exverted hole (valid)
Right: Exverted shell and inverted hole (invalid)

This kind of topology is invalid under the OGC polygon model, which prohibits all ring self-intersections.  However, the more lenient model still provides unambiguous topology, and most spatial algorithms can handle geometry of this form perfectly well.  So there is a school of thought that considers the OGC model to be overly restrictive.
To support these systems JTS provides the setSelfTouchingRingFormingHoleValid option for IsValidOp to allow this topology model to be validated.  Re-implementing this functionality in the new codebase initially presented a dilemma. It appeared that building the touch graph to check connectivity required computing the geometry of the additional rings formed by the inversions and exversions.  This needed a significant amount of additional code and data structures, eliminating the simplicity of the graph-based approach.  
However, it turns out that the solution is much simpler.  It relies on the fact that valid self-touches can only disconnect the exterior, not the interior.  This means that self-touches of inverted shells and exverted holes can be validated by the local topology at the self-intersection node.  This condition can be tested by the same logic already used to test for nested touching shells.  Even better, if the self-touches are valid, then the touch-graph algorithm to check connectivity still works.   (This is the same phenomenon that allows most spatial algorithms to work correctly on the inverted/exverted ring model.)  

Performance

As expected, the new codebase provides better performance, due to simpler code and the ability to short-circuit when an invalidity is found.  Here's some performance comparisons for various datasets:

DataSizeNew (ms)Old (ms)Improvement
world 244 Polygons, 366K vertices 3401250   3.7 x
invalid-polys 640 Polygons, 455K vertices 126334   2.6 x
valid-polys 640 Polygons, 455K vertices 244487   2 x
australia 1 MultiPolygon, 1,222K vertices 132169026   52 x

  • world is the standard dataset of world country outlines
  • invalid-polys is a dataset of all invalid polygons.  valid-polys is the same dataset processed by GeometryFixer.  The timings show the effect of short-circuiting invalidity to improve performance. 
  • australia is a geometry of the Australian coastline and near-shore islands, with 6,697 polygon elements.  The dramatic difference in performance reflects the addition of a spatial index for checking nested shells.
Australia and 6,696 islands

Next Steps

The improved IsValidOp will appear in JTS version 1.19.  It has already been ported to GEOS, providing similar performance improvements.  And since GEOS backs the PostGIS ST_IsValid function, that will become faster as well.

There is one remaining use of GeometryGraph in JTS for a non-constructive operation:  the RelateOp class, used to compute all topological spatial predicates.  Converting it should provide the same benefits of simpler code and improved performance.  It should also improve robustness, and allow more lenient handling of invalid inputs.  Watch this space!

Wednesday, 26 May 2021

JTS IsSimple gets simpler (and faster)

Hard to believe that the JTS Topology Suite is almost 20 years old.  That's 140 in dog years!  Despite what they say about old dogs, one of the benefits of longevity is that you have the opportunity to learn a trick or two along the way.  One of the key lessons learned after the initial release of JTS is that intersection (node) detection is a fundamental part of many spatial algorithms, and critical in terms of performance.  This resulted in the development of the noding package to provide an API supporting many different kinds of intersection detection and insertion.

Prior to this, intersection detection was performed as part of the GeometryGraph framework, which combined it with topology graph formation and analysis.  At the time this seemed like an elegant way to maximize code reuse across many JTS operations, including overlay, buffering, spatial predicates and validation.  But as often the case, there are significant costs to such general-purpose code:

  • The overall codebase is substantially more complex
  • A performance penalty is imposed on algorithms which don't require topology construction
  • Algorithms are harder to read and understand.  
  • The code is brittle, and so hard to modify
  • Porting the code is more difficult  
Because of this, a focus of JTS development is to free operations from their dependency on GeometryGraph - with the ultimate goal of expunging it from the JTS codebase.  A major step along this road was the rewrite of the overlay operations.

Another operation that relies on GeometryGraph is the IsSimpleOp class, which implements the OGC Simple Features isSimple predicate.  The algorithm for isSimple essentially involves determining if the geometry linework contains a self-intersection.  GeometryGraph is unnecessarily complex for this particular task, since there is no need to compute the entire topology graph in order to find a single self-intersection.  Reworking the code to use the the MCIndexNoder class in the noding API produces a much simpler and more performant implementation. I also took the opportunity to move the code to the operation.valid package, since the operations of isSimple and isValid are somewhat complementary.

Now, isSimple is probably the least-used OGC operation.  Its only real use is to test for self-intersections in lines or collections of lines, and that is not a critical issue for many workflows.  However, there is one situation where it is quite useful: testing that linear network datasets are "vector-clean" - i.e. contain LineStrings which touch only at their endpoints.

A linear network containing non-simple intersections (isSimple == false)

To demonstrate the performance improvement, I'll use a dataset for Rivers of the US maintained by the US National Weather Service.  It supplies two datasets: a full dataset of all rivers, and a subset of major rivers only.  You might expect a hydrographic network to be "vector-clean", but in fact both of these datasets contain numerous instances of self-intersections and coincident linework.

Here's the results of running the isSimple predicate on the datasets.  On the larger dataset the new implementation provides a 20x performance boost!

 Dataset New time  Old time 
  Subset  (909,865 pts)   0.25 s   1 s
  Full   (5,212,102 pts)  2 s  30 s

Finding Non-Simple Locations

The new codebase made it easy to add a functionality enhancement that computes the locations of all places where lines self-intersect.  This can be used to for visual confirmation that the operation is working as expected, and to indicate places where data quality needs to be improved.  Here's the non-simple intersection points found in the river network subset:

Closeups of some non-simple intersection locations:


IsSimpleOp is the easiest algorithm to convert over from using GeometryGraph.  As such it serves as a good proof-of-viability, and establishes useful code patterns for further conversions. 

Next up is to give IsValidOp the same treatment.  This should provide similar benefits of simplicity and performance.  And as always, porting the improved code to GEOS.



Monday, 3 May 2021

Fixing Invalid Geometry with JTS

TLDR: JTS can now fix invalid geometry!

The JTS Topology Suite implements the Geometry model defined in the OGC Simple Features specification.  An important part of the specification is the definition of what constitutes valid geometry.  These are defined by rules about the structural and geometric characteristics of geometry objects.  Some validity rules apply to all geometry; e.g. vertices must be defined by coordinates with finite numeric values (so that NaN and Inf ordinates are not valid).  In addition, each geometric subtype (Point, LineString, LinearRing, Polygon, and Multi-geometry collections) has its own specific rules for validity.

The rules for Polygons and MultiPolygons are by far the most restrictive.  They include the following constraints:

  1. Polygons rings must not self-intersect
  2. Rings may touch at only a finite number of points, and must not cross
  3. A Polygon interior must be connected (i.e. holes must not split a polygon into two parts)
  4. MultiPolygon elements may touch at only a finite number of points, and must not overlap

These rules guarantee that:

  • a given area is represented unambiguously by a polygonal geometry
  • algorithms operating on polygonal geometry can make assumptions which provide simpler implementation and more efficient processing

Valid polygonal geometry is well-behaved

Given the highly-constrained definition of polygonal validity, it is not uncommon that real-world datasets contain polygons which do not satisfy all the rules, and hence are invalid. This occurs for various reasons:

  • Data is captured using tools which do not check validity, or which use a looser or different definition than the OGC standard
  • Data is imported from systems with different polygonal models
  • Data is erroneous or inaccurate 

Because of this, JTS does not enforce validity on geometry creation, apart from a few simple structural constraints (such as rings having identical first and last points).  This allows invalid geometry to be represented as JTS geometry objects, and processed using JTS code.  Some kinds of spatial algorithms can execute correctly on invalid geometry (e.g. determining the convex hull).  But most algorithms require valid input in order to ensure correct results (e.g. the spatial predicates) or to avoid throwing exceptions (e.g. overlay operations).  So the main reason for representing invalid geometry is to allow validity to be tested, to take appropriate action on failure.

Often users would like "appropriate action" to be Just Make It Work.  This requires converting invalid geometry to be valid.  Many spatial systems provide a way to do this: 

But this has a been a conspicuous gap in the JTS API.  While it is possible to test for validity, there has never been a way to fix an invalid geometry.  To be fair, JTS has always had an unofficial way to make polygonal geometry valid.  This is the well-known trick of computing geometry.buffer(0), which creates a valid output which often is a good match to the input. This has worked as a stop-gap for years (in spite of an issue which caused some problems, now fixed - see the post Fixing Buffer for fixing Polygons). However, using buffer(0) on self-intersecting "figure-8" polygons produces a "lossy" result.  Specifically, it retains only the largest lobes of the input linework.  This is undesirable for some uses (although it is advantageous in other situations, such as trimming off small self-intersections after polygon simplification).

Buffer(0) of Figure-8 is lossy 

So, it's about time that JTS stepped up to provide a supported, guaranteed way of fixing invalid geometry. This should handle all geometry, although polygonal geometry repair is the most critical requirement.

This raises the question of what exactly the semantics of repairing polygons should be.  While validity is well-specified, there are no limits to the complexity of invalid polygons, and a variety of possible approaches to fixing them.  The most significant decision is how to determine the interior and exterior of a polygonal geometry with self-intersections or overlaps. (This is the classic "bow-tie" or "figure-8" - although self-intersecting polygons can be far more complex.) The question comes down to whether the geometry linework or structure is used to determine interior areas.  

If linework is used to create validity, to node the constituent linework to form a topologically-valid coverage.  This coverage is then scanned with an alternating even-odd strategy to assign areas as interior or exterior.  This may result in adjacent interior or exterior areas, in which case these are merged.

Alternatively, the structure of the polygonal geometry can be taken as determinative. The shell and hole rings are assumed to accurately specify the nature of the area they enclose (interior or exterior).  Likewise, the (potentially overlapping or adjacent) elements of a MultiPolygon are assumed to enclose interior area.  The repair operation processes each ring and polygon separately.  Holes are subtracted from shells.  Finally, if required the repaired polygons are unioned to form the valid result.  

PostGIS MakeValid and the ESRI Java Geometry API makeSimple both use the linework approach.  However, for some relatively simple invalid geometries this produces results which seem overly complex.  
Complex output from ST_MakeValid

For this reason I have implemented the structure-based approach in JTS.  It provides results that are closer to the existing buffer(0) technique (and conveniently allows using the existing buffer code).  This made it a relatively simple matter to implement the repair algorithm as the GeometryFixer class.

Here's some examples of how GeometryFixer works.  First, the example above, showing the (arguably) simpler result that arises from using the structure information:
Figure-8s are handled as desired (keeping all area):

Self-overlapping shells have all interior area preserved:
Of course, the GeometryFixer also handles simple fixes for all geometry types, such as removing invalid coordinates.

One further design decision is how to handle geometries which are invalid due to collapse (e.g. a line with a single point, or a ring which has only two unique vertices).  GeometryFixer provides an option to either remove collapses, or to return them as equivalent lower dimensional geometry. 

To see the full range of effects of GeometryFixer, the JTS TestBuilder can be used to view and run the GeometryFixer on the set of 75 test cases for invalid polygonal geometry in the file TestInvalidA.xml

It's been a long time coming, but finally JTS can function as a full-service repair shop for geometry, no matter how mangled it might be.




  


Wednesday, 28 April 2021

JTS goes to Mars!

The other day this badge popped up on my GitHub profile:

Sure enough, there is JTS in the list of dependencies provided by the Jet Propulsion Laboratory (JPL). 

So take that, Elon - JTS got there first!  With a lot of help from NASA, of course. 

Oh how nice, the Earthlings are giving me some target practice!


Tuesday, 2 February 2021

OverlayNG and Invalid Geometry

A recent blog post by Elephant Tamer gives a critical appraisal of the improvements to overlay processing shipped in PostGIS 3.1 with GEOS 3.9.  The author is disappointed that PostGIS still reports errors when overlay is used on invalid geometry.  However, this is based on a misunderstanding of the technology. 

GEOS 3.9 includes OverlayNG, ported from the JTS Topology Suite). It brings a major advance in overlay robustness, along with other improvements (described here and here).  Previously, robustness limitations in the overlay algorithm could sometimes cause errors even for inputs which were topologically valid.  This was doubly problematic because there was no fully effective way to process the input geometries to avoid the errors.  Now, OverlayNG solves this problem completely.  Valid inputs will always produce a valid and essentially correct(*) output.

(*) "Essentially" correct, because in order to achieve full robustness a snapping heuristic may be applied to the input geometry.  However, this is done with a very fine tolerance, so should not appreciably alter the output from the theoretically correct value.

But for invalid inputs, OverlayNG will still report errors.  The reason is that there is a wide variety of gruesome ways in which geometry can be invalid.  Automated handling of invalidity would involve expensive extra processing, and also require making assumptions about what area a corrupt geometry is intended to represent.  Rather than silently repairing invalid geometry and returning potentially incorrect results, the design decision is to report this situation as an error.

In fact, OverlayNG is able to handle "mildly" invalid polygons, as described in this post. This covers situations which are technically invalid according to the OGC SFS specification, but which still have well-defined topology.  This includes self-touching rings (sometimes called "inverted polygons" or "exverted holes"), and zero-width gores and spikes.

Taking a detailed look at the data used in the blog post, we can see these improvements at work.  The dataset is the ecology polygons obtained from the GDOS WFS server.  This contains 7662 geometries, of which 10 are invalid.  Using the old overlay algorithm, 9 of these invalid polygons cause TopologyException errors.  Using OverlayNG, only 4 of them cause errors. 

The polygons that can now be processed successfully are typical "OGC-invalid" situations, which do not materially affect the polygonal topology.  These include self-touching rings with pinch points:

and zero-width gores:


Of the cases that still cause errors, two are classic small bow-tie errors:


And two are wildly invalid self-crossing rings:



The last two are good examples of geometry which is so invalid that it is impossible to unambiguously decide what area is represented (although ST_MakeValid will happily grind them into something that is technically valid).

Ultimately it is the user's responsibility to ensure that geometries to be processed by overlay (and many other PostGIS functions) have valid topology (as reported by ST_IsValid).  Ideally this is done by correcting the data at source.  But it can also be done a posteriori in the database itself, by either the ST_MakeValid function, or the well-known buffer(0) trick. (Which to use is a topic for another blog post...)

One improvement that could be made is to check for input validity when OverlayNG throws an error.  Then PostGIS can report definitively that an overlay error is caused by invalid input.  If there is an overlay error that is not caused by invalidity, the PostGIS team wants to hear about it! 

And perhaps there is a case to be made for repairing invalid geometry automatically, even if the repair is suspect.  Possibly this could be invoked via a flag parameter on the overlay functions. More research is required - feedback is welcome!

Wednesday, 13 January 2021

Using the GEOS geosop CLI

In a previous post I announced the new geosop command-line interface (CLI) for the GEOS geometry API.  This post provides examples of using geosop for various tasks. (Refer to the README for more information about the various options used.)

API Testing

geosop makes testing the GEOS library much easier.  Previously, testing the behaviour of the API usually required breaking out the C compiler (and updating autotools and cmake build files, and deciding whether to commit the new code for later use or throw it away, etc, etc).  Now testing is often just a matter of invoking a geosop operation on appropriate data, or at worst adding a few lines of code to the exiting framework.

For example, there is a long-standing issue with how GEOS handles number formatting in WKT output.  There are recent bug reports about this in GeoSwift and the Julia LibGEOS.  geosop makes it easy to run the test cases and see the less-than-desirable output:

geosop -a "POINT (654321.12 0.12)" -f wkt
POINT (654321.1199999999953434 0.1200000000000000)

geosop -a "POINT (-0.4225977234 46.3406448)" -f wkt
POINT (-0.4225977234000000 46.3406447999999997)

There's also an issue with precision handling.  To test this we added a --precision parameter to geosop. (This is the kind of rapid development enabled by having the CLI codebase co-resident with the API.)

geosop -a "POINT (654321.126 0.126)" --precision 2 -f wkt
POINT (6.5e+05 0.13)

Again we see undesirable behaviour. Using scientific notation for small numbers is unnecessary and difficult to read. And the precision value is determining the number of significant digits, not the number of decimal places as intended by the GEOS WKTWriter.setRoundingPrecision API.

These were all caused by using standard C/C++ numeric formatting, which is surprisingly limited and non-useful. After some fine work by Paul Ramsey to integrate the much better Ryu library, GEOS now has WKT output that is sensible and handles precision in a useful way.

By default, WKTWriter now nicely round-trips WKT text:

geosop -a "POINT (654321.12 0.12)" -f wkt
POINT (654321.12 0.12)

geosop -a "POINT (-0.4225977234 46.3406448)" -f wkt
POINT (-0.4225977234 46.3406448)

If WKTWriter.setRoundingPrecision or GEOSWKTWriter_setRoundingPrecision is called, the precision value applies to the decimal part of the number: 

geosop -a "POINT (654321.1234567 0.126)" --precision 0 -f wkt
POINT (654321 0)

geosop -a "POINT (654321.1234567 0.126)" --precision 2 -f wkt
POINT (654321.12 0.13)

geosop -a "POINT (654321.1234567 0.126)" --precision 4 -f wkt
POINT (654321.1235 0.126)

geosop -a "POINT (654321.1234567 0.126)" --precision 6 -f wkt
POINT (654321.123457 0.126)

Performance Testing

A key use case for geosop is to provide easy performance testing.  Performance of geometric operations is highly data-dependent.  It's useful to be able to run operations over different datasets and measure performance.  This allows detecting performance hotspots, and confirming the efficiency of algorithms.

As a simple example of performance testing, for many years GEOS has provided optimized spatial predicates using the concept of a prepared geometry.  Prepared geometry uses cached spatial indexes to dramatically improve performance for repeated spatial operations against a geometry.  Here is a performance comparison of the intersects spatial predicate in its basic and prepared form. 

geosop -a world.wkt -b world.wkt -t intersects
Ran 59,536 intersects ops ( 179,072,088 vertices)
  -- 16,726,188 usec    (GEOS 3.10.0dev)

geosop -a world.wkt -b world.wkt -t intersectsPrep
Ran 59,536 intersectsPrep ops ( 179,072,088 vertices)
  -- 1,278,348 usec    (GEOS 3.10.0dev)

The example of testing the world countries dataset against itself is artificial, but it shows off the dramatic 16x performance boost provided by using a prepared operation. 

Another interesting use is longitudinal testing of GEOS performance across different versions of the library.  For instance, here's a comparison of the performance of interiorPoint between GEOS 3.7 and 3.8.  The interiorPoint algorithm in GEOS 3.7 relied on overlay operations, which made it slow and sensitive to geometry invalidity. GEOS 3.8 included a major improvement which greatly improved the performance, and made the algorithm more robust.  Note that the dataset being used contains some (mildly) invalid geometries towards its end, which produces an error in GEOS 3.7 if the entire dataset is run.  The  --alimit 3800 option limits the number of geometries processed to avoid this issue.

GEOS 3.7

geosop -a ne_10m_admin_1_states_provinces.wkb --alimit 3800 -t interiorPoint
Ran 3,800 operations ( 1,154,703 vertices)  -- 2,452,540 usec  (GEOS 3.7)

GEOS 3.8

geosop -a ne_10m_admin_1_states_provinces.wkb --alimit 3800 -t interiorPoint
Ran 3,800 operations ( 1,154,703 vertices)  -- 35,665 usec  (GEOS 3.8)

The dramatic improvement in interiorPoint performance is clearly visible.

Geoprocessing

The raison d'etre of GEOS is to carry out geoprocessing.  Most users will likely do this using one of the numerous languages, applications and databases that include GEOS.  But it is still useful, convenient, and perhaps more performant to be able to process geometric data natively in GEOS.  

The design of geosop enables more complex geoprocessing via chaining operations together using shell pipes.  For example, here is a process which creates a Voronoi diagram of some points located in the British Isles, and then clips the Voronoi polygons to the outline of the islands.  This also shows a few more capabilities of geosop:
  • input can be supplied as WKT (or WKB) geometry literals on the command-line
  • input can be read from the standard input (here as WKB)
  • the output data is sent to the standard output, so can be directed into a file

geosop 
  -a "MULTIPOINT ((1342 1227.5), (1312 1246.5), (1330 1270), (1316.5 1306.5), (1301 1323), (1298.5 1356), (1247.5 1288.5), (1237 1260))" 
  -f wkb voronoi 
| geosop -a stdin.wkb -b uk.wkt -f wkt intersection 
> uk-vor.wkt





Tuesday, 12 January 2021

Introducing geosop - a CLI for GEOS

The GEOS geometry API is used by many, many projects to do their heavy geometric lifting.  But GEOS has always had a bit of a PR problem.  Most of those projects provide a more accessible interface to perform GEOS operations.  Some offer a high-level language like Python, R, or SQL (and these typically come with a REPL to make things even easier).  Or there are GUIs like QGIS, or a command-line interface (CLI) like GDAL/OGR.  

But you can't do much with GEOS on its own.  It is a C/C++ library, and to use it you need to break out the compiler and start cutting code. It's essentially "headless".  Even for GEOS developers, writing an entire C program just to try out a geometry operation on a dataset is painful, to say the least.  

There is the GEOS XMLTester utility, of course.  It processes carefully structured XML files, but that is hardly convenient.  (And in case this brings to mind a snide comment like "2001 called and wants its file format back", XML actually works very well in JTS and GEOS as a portable and readable format for geometry tests.  But I digress.)

JTS (on which GEOS is based) has the TestBuilder GUI, which works well for testing out and visualizing the results of JTS operations.  JTS also has a CLI called JtsOp. Writing a GUI for GEOS would be a tall order.  But a command-line interface (CLI) is much simpler to code, and has significant utility.  In fact there is an interesting project called geos-cli that provides a simple CLI for GEOS.  But it's ideal to have the CLI code as part of the GEOS project, since it ensures being up-to-date with the library code, and makes it easy to add operations to test new functionality.

This need has led to the development of geosop.  It is a CLI for GEOS which performs a range of useful tasks:

  • Run GEOS operations to confirm their semantics
  • Test the behaviour of GEOS on specific geometric data
  • Time the performance of operation execution
  • Profile GEOS code to find hotspots
  • Check memory usage characteristics of GEOS code
  • Generate spatial data for use in visualization or testing
  • Convert datasets between WKT and WKB

    geosop has the following capabilities:

    • Read WKT and WKB from files, standard input, or command-line literals
    • Execute GEOS operations on the list(s) of input geometries.  Binary operations are executed on every pair of input geometries (i.e. the cross join aka Cartesian product)
    • Output geometry results in WKT or WKB (or text, for non-geometric results)
    • Display the execution time of data input and operations
    • Display a full log of the command processing
    Here's a look at how it works.  

    geosop -h gives a list of the options and operations available:

    geosop - GEOS v. 3.10.0dev
    Executes GEOS geometry operations
    Usage:
      geosop [OPTION...] opName opArg

      -a arg               source for A geometries (WKT, WKB, file, stdin,
                           stdin.wkb)
      -b arg               source for B geometries (WKT, WKB, file, stdin,
                           stdin.wkb)
          --alimit arg     Limit number of A geometries read
      -c, --collect        Collect input into single geometry
      -e, --explode        Explode result
      -f, --format arg     Output format
      -h, --help           Print help
      -p, --precision arg  Sets number of decimal places in WKT output
      -r, --repeat arg     Repeat operation N times
      -t, --time           Print execution time
      -v, --verbose        Verbose output

    Operations:
      area A - computes area for geometry A
      boundary A - computes boundary for geometry A
      buffer A N - cmputes the buffer of geometry A
      centroid A - computes centroid for geometry A
      contains A B - tests if geometry A contains geometry B
      containsPrep A B - tests if geometry A contains geometry B, using PreparedGeometry
      containsProperlyPrep A B - tests if geometry A properly contains geometry B using PreparedGeometry
      convexHull A - computes convexHull for geometry A
      copy A - computes copy for geometry A
      covers A B - tests if geometry A covers geometry B
      coversPrep A B - tests if geometry A covers geometry B using PreparedGeometry
      difference A B - computes difference of geometry A from B
      differenceSR A B - computes difference of geometry A from B rounding to a precision scale factor
      distance A B - computes distance between geometry A and B
      distancePrep A B - computes distance between geometry A and B using PreparedGeometry
      envelope A - computes envelope for geometry A
      interiorPoint A - computes interiorPoint for geometry A
      intersection A B - computes intersection of geometry A and B
      intersectionSR A B - computes intersection of geometry A and B
      intersects A B - tests if geometry A and B intersect
      intersectsPrep A B - tests if geometry A intersects B using PreparedGeometry
      isValid A - tests if geometry A is valid
      length A - computes length for geometry A
      makeValid A - computes makeValid for geometry A
      nearestPoints A B - computes nearest points of geometry A and B
      nearestPointsPrep A B - computes nearest points of geometry A and B using PreparedGeometry
      polygonize A - computes polygonize for geometry A
      reducePrecision A N - reduces precision of geometry to a precision scale factor
      relate A B - computes DE-9IM matrix for geometry A and B
      symDifference A B - computes symmetric difference of geometry A and B
      symDifferenceSR A B - computes symmetric difference of geometry A and B
      unaryUnion A - computes unaryUnion for geometry A
      union A B - computes union of geometry A and B
      unionSR A B - computes union of geometry A and B

    Most GEOS operations are provided, and the list will be completed soon.

    Some examples of using geosop are below.
    • Compute the interior point for each country in a world polygons dataset, and output them as WKT:
    geosop -a world.wkt -f wkt interiorPoint



    • Determine the time required to compute buffers of distance 1 for each country in the world:
    geosop -a world.wkt --time buffer 1

    • Compute the union of all countries in Europe:
    geosop -a europe.wkb --collect -f wkb unaryUnion  


     The README gives many more examples of how to use the various command-line options.  In a subsequent post I'll give some demonstrations of using geosop for various tasks including GEOS testing, performance tuning, and geoprocessing.

    Future Work

    There's potential to make geosop even more useful: 
    • GeoJSON is a popular format for use in spatial toolchains.  Adding GeoJSON reading and writing would allow geosop to be more widely used for geo-processing.  
    • Adding SVG output would provide a way to visualize the results of GEOS operations.   
    • Improve support for performance testing by adding operations to generate various kinds of standard test datasets (such as point grids, polygon grids, and random point fields).  
    And of course, work will be ongoing to keep geosop up-to-date as new operations and functionality are added to GEOS.


    Wednesday, 23 December 2020

    Fixing Buffer for fixing Polygons

    The OGC Simple Features specification implemented by JTS has strict rules about what constitutes a valid polygonal geometry.  These include:

    • Polygon rings must be simple; i.e. they may not touch or cross themselves
    • MultiPolygon elements may not overlap or touch at more than a finite number of points (i.e they may not intersect along an edge)
    These rules were chosen for good reason. They ensure that OGC-valid polygonal geometry is the simplest possible representation of an enclosed area.  This greatly simplifies the evaluation of most operations on polygonal geometry, which leads to improved performance.  JTS operations generally require input which is valid according to OGC rules.  And they always (with some rare exceptions) emit result geometry which is OGC-valid.

    But data in the wild is often not this well-behaved.  This creates the need to "clean" or "make valid"  polygonal geometry in order to carry out operations on it.  Shortly after JTS was first released we discovered a useful trick: constructing a zero-width buffer via geom.buffer(0) converts an invalid polygonal geometry into a valid one.  It can also be used as a simple way of converting "inverted" polygon topology (ESRI-style) into valid OGC topology.  The reason this works is that the buffer algorithm inherently has to handle overlaps and self-intersections since they often occur during the generation of raw buffer offset curves.  The algorithm nodes self-intersections, merges overlaps, and creates new polygons or holes if necessary.

    A polygon with many invalidities: overlap, self-touch in point and line, and a "bow-tie".


    The polygon fixed by using buffer(0)
    Note that the bow-tie portion on the right is considered to lie in the exterior of the polygon due to ring orientation, and thus is removed.

    In the 20 years since the release of JTS (and its derivative GEOS) this trick has passed into the lore of open-source spatial data processing. It has become a recommended technique for fixing invalid polygonal geometry in numerous projects, such as PostGISShapelyRGeoGeoToolsR-sf and QGIS. It's also used internally in JTS, in algorithms such as DouglasPeuckerSimplifier, VWSimplifier, and Densifier which might otherwise produce invalid polygonal results.

    BUT - there's a nasty little surprise lying in wait for users of buffer(0)! It doesn't always work.  It turns out that the buffer algorithm has a serious flaw: in some situations involving invalid "bow-tie" topology it will discard a large part of the input geometry.  This has been reported in quite a few issues (here and here in JTS, and also in  GEOS and Shapely).   
    Result of running DouglasPeuckerSimplifier on a polygon with a bow-tie. (See issue)

    Close-up of the result -  clearly undesirable.

    The problem occurs because the buffer algorithm computes the orientation of rings in order to build the buffer offset curve (in the case of a zero-width buffer this is just the original ring linework). Currently the Orientation.isCCW test is used to do this. This uses an efficient algorithm that determines ring orientation by checking the line segments incident on the uppermost vertex of the ring (see Wikipedia for an explanation of why this works.) For a valid ring (where the linework does not cross itself) this works perfectly. However, in a invalid self-crossing ring (sometimes called a "bow-tie" or "figure-8") a choice must be made about which lobe is assigned to be the "interior". The upper-vertex approach always picks the top lobe.  If that happens to be very small, the larger part of the ring is considered "exterior" and hence is removed by buffering.  

    A bow-tie polygon where buffer makes the evidently wrong choice for interior.
    The problem occurs for non-zero buffer distances as well.

    This problem has limped along for many years now, never being quite enough of a pain point to motivate the effort needed to find a fix (or to fund one!).  And to be honest, the buffer code is some of the most complicated and delicate in JTS, and I was concerned about wading into it to add what seemed poised to be a fiddly correction.

    But recently there has been renewed interest in providing a Make-Valid capability for JTS.  This inspired me to revisit the usage of buffer(0), and think more deeply about ring orientation and its role in determining valid polygonal topology. And this led to discovering a surprisingly simple solution for the buffer issue.

    The fix is to use an orientation test which takes into account the entire ring. This is provided by the Signed-Area Orientation test, implemented in Orientation.isCCWArea using the Shoelace Formula. This effectively determines orientation based on the largest area enclosed by the ring. This corresponds more closely to user expectation based on visual assessment. It also minimizes the change in area and (usually) extent.

    And indeed it works:


    It fixes the simplification issue nicely:


    The fix consists of about 4 lines of actual code.  To paraphrase a great orator, never in the history of JTS has so much benefit been given to so many by so few lines of code.  Now buffer(0) can be recommended unreservedly as an effective, performant way to fix polygonal geometry.  And all those helpful documentation pages can drop any qualifications they might have.

    As usual, this fix will soon show up in GEOS, and from there in PostGIS and other downstream projects.  

    This isn't the end of the story.  There are times when the effect of buffer(0) is not what is desired for fixing polygon topology.  This is discussed nicely in this blog post.  The ongoing research into Make Valid will explore alternatives and how to provide an API for them.