Wednesday, 4 May 2022

Using Outer Hulls for Smoothing Vectorized Polygons

The electrons were hardly dry on the JTS Outer and Inner Polygon Hull post when another interesting use case popped up on GIS StackExchange.  The question was how to remove aliasing artifacts (AKA "jaggies") from polygons created by vectorizing raster data, with the condition that the result should contain the original polygon.  

A polygon for Vancouver Island vectorized from a coarse raster dataset.  Aliasing artifacts are obvious.

This problem is often handled by applying a simplification or smoothing process to the "jaggy" polygon boundary.  This works, as long as the process preserves polygonal topology (e.g. such as the JTS TopologyPreservingSimplifier). But generally this output of this process does not contain the input polygon, since the simplification/smoothing can alter the boundary inwards as well as outwards.  

Simplification using TopologyPreservingSimplifier with distance = 0.1.  Artifacts are removed, but the simplified polygon does not fully contain the original.

In contrast, the JTS Polygon Outer Hull algorithm is designed to do exactly what is required: it reduces the number of vertices, while guaranteeing that the input polygon is contained in the result.  It is essentially a simplification method which also preserves polygonal topology (using an area-based approach similar to the Visvalingham-Whyatt algorithm).

Outer Hull using vertex ratio = 0.25.  Artifacts are removed, and the original polygon is contained in the hull polygon.

Here's a real-world example, taken from the GADM dataset for administrative areas of Germany.  The coastline of the state of Mecklenburg-Vorpommern appears to have been derived from a raster, and thus exhibits aliasing artifacts.  Computing the outer hull with a fairly conservative parameter eliminates most of the artifacts, and ensures polygonal topology is preserved.

A portion of the coastline of Mecklenburg-Vorpommern showing aliasing artifacts.  The Outer Hull computed with vertex ratio = 0.25 eliminates most artifacts, and preserves the coastline topology.

Future Work

A potential issue for using Outer Hull as a smoothing technique is the choice of parameter value controlling the amount of change.  The algorithm provides two options: the ratio of reduction in the number of vertices, or the fraction of change in area allowed.  Both of these are scale-independent, and reflect natural goals for controlling simplification.  But neither relate directly to the goal of removing "stairstep" artifacts along the boundary.  This might be better specified via a distance-based parameter. The parameter value could then be determined based on the known artifact size (i.e. the resolution of the underlying grid).  Since the algorithm for Outer Hull is quite flexible, this should be feasible to implement. 

Monday, 25 April 2022

Outer and Inner Concave Polygon Hulls in JTS

The JTS Topology Suite recently gained the ability to compute concave hulls.  The Concave Hull algorithm computes a polygon enclosing a set of points using a parameter to determine the "tightness".  However, for polygonal inputs the computed concave hull does not always respect the polygon boundaries.  So the concave hull may not actually contain the input polygon.

It would be useful to be able to compute the "outer concave hull" of a polygon.  This is a valid polygon formed by a subset of the vertices of the input polygon which fully contains the input polygon.  Vertices can be eliminated as long as the resulting boundary does not self-intersect, and does not cross into the original polygon.

An outer concave hull of a polygon representing Switzerland

As with point-set concave hulls, the vertex reduction is controlled by a numeric parameter. This creates a sequence of hulls of increasingly larger area with smaller vertex counts.  At an extreme value of the parameter, the outer hull is the same as the convex hull of the input.
A sequence of outer hulls of New Zealand's North Island

The outer hull concept extends to handle holes and MultiPolygons.  In all cases the hull boundaries are constructed so that they do not cross each other, thus ensuring the validity of the result.

An outer hull of a MultiPolygon for the coast of Denmark.  The hull polygons do not cross.

It's also possible to construct inner hulls of polygons, where the constructed hull is fully within the original polygon.

An inner hull of Switzerland

Inner hulls also support holes and MultiPolygons.  At an extreme value of the control parameter, holes become convex hulls, and a polygon shell reduces to a triangle (unless blocked by the presence of holes).

An inner hull of a lake with islands.  The island holes become convex hulls, and prevent the outer shell from reducing fully to a triangle

A hull can provide a significant reduction in the vertex size of a polygon for a minimal change in area. This could allow faster evaluation of spatial predicates, by pre-filtering with smaller hulls of polygons.

An outer hull of Brazil provides a 10x reduction in vertex size, with only ~1% change in area.

This has been on the JTS To-Do list for a while (I first proposed it back in 2009).  At that time it was presented as a way of simplifying polygonal geometry. Of course JTS has had the TopologyPreservingSimplifier for many years.  But it doesn't compute a strictly outer hull.  Also, it's based on Douglas-Peucker simplification, which isn't ideal for polygons.  

It seems there's quite a need for this functionality, as shown in these GIS-StackExchange posts (1, 2, 3, 4).  There's even existing implementations on Github: rdp-expansion-only and simplipy (both in Python) - but both of these sound like they have some significant issues. 

Recent JTS R&D (on concave hulls and polygon triangulation) has provided the basis for an effective, performant polygonal concave hull algorithm.  This is now released as the PolygonHull class in JTS.

The PolygonHull API

Polygon hulls have the following characteristics:

  • Hulls can be constructed for Polygons and MultiPolygons, including holes.
  • Hull geometries have the same structure as the input.  There is a one-to-one correspondence for  elements, shells and holes.
  • Hulls are valid polygonal geometries.
  • The hull vertices are a subset of the input vertices.

The PolygonHull algorithm supports computing both outer and inner hulls. 

  • Outer hulls contain the input geometry.  Vertices forming concave corners (convex for holes) are removed.  The maximum outer hull is the convex hull(s) of the input polygon(s), with holes reduced to triangles.
  • Inner hulls are contained within the input geometry.  Vertices forming convex corners (concave for holes) are removed.   The minimum inner hull is a triangle contained in (each) polygon, with holes expanded to their convex hulls.  

The number of vertices removed is controlled by a numeric parameter.  Two different parameters are provided:

  • the Vertex Number Fraction specifies the desired result vertex count as a fraction of the number of input vertices.  The value 1 produces the original geometry.  Smaller values produce simpler hulls.  The value 0 produces the maximum outer or minimum inner hull.
  • the Area Delta Ratio specifies the desired maximum change in the ratio of the result area to the input area.  The value 0 produces the original geometry.  Larger values produce simpler hulls. 
Defining the parameters as ratios means they are independent of the size of the input geometry, and thus easier to specify for a range of inputs.  Both parameters are targets rather than absolutes; the validity constraint means the result hull may not attain the specified value in some cases. 

Algorithm Description

The algorithm removes vertices via "corner clipping".  Corners are triangles formed by three consecutive vertices in a (current) boundary ring of a polygon.  Corners are removed when they meet certain criteria.  For an outer hull, a corner can be removed if it is concave (for shell rings) or convex (for hole rings).  For an inner hull the removable corner orientations are reversed.  

In both variants, corners are removed only if the triangle they form does not contain other vertices of the (current) boundary rings.  This condition prevents self-intersections from occurring within or between rings. This ensures the resulting hull geometry is topologically valid.  Detecting triangle-vertex intersections is made performant by maintaining a spatial index on the vertices in the rings.  This is supported by an index structure called a VertexSequencePackedRtree.  This is a semi-static R-tree built on the list of vertices of each polygon boundary ring.  Vertex lists typically have a high degree of spatial coherency, so the constructed R-tree generally provides good space utilization.  It provides fast bounding-box search, and supports item removal (allowing the index to stay consistent as ring vertices are removed).

Corners that are candidates for removal are kept in a priority queue ordered by area.  Corners are removed in order of smallest area first.  This minimizes the amount of change for a given vertex count, and produces a better quality result.  Removing a corner may create new corners, which are inserted in the priority queue for processing.  Corners in the queue may be invalidated if one of the corner side vertices has previously been removed; invalid corners are discarded. 

This algorithm uses techniques originated for the Ear-Clipping approach used in the JTS PolgyonTriangulator implementation. It also has a similarity to the Visvalingham-Whyatt simplification algorithm.  But as far as I know using this approach for computing outer and inner hulls is novel. (After the fact I found a recent paper about a similar construction called a Shortcut Hull [Bonerath et al 2020], but it uses a different approach).

Further Work

It should be straightforward to use this same approach to implement a Topology-Preserving Simplifier variant using the corner-area-removal approach (as in Visvalingham-Whyatt simplification).  The result would be a simplified, topologically-valid polygonal geometry.  The simplification parameter  limits the number of result vertices directly, or the net change in area.  The resulting shape should be a good approximation of the input, but would not necessarily be either wholly inside or outside.

Thursday, 27 January 2022

Cubic Bezier Curves in JTS

As the title of this blog indicates, I'm a fan of linearity.  But sometimes a little non-linearity makes things more interesting.  A convenient way to generate non-linear curved lines is to use Bezier Curves.  Bezier Curves are curves defined by polynomials.  Bezier curves can be defined for polynomials of any degree, but a popular choice is to use cubic Bezier curves defined by polynomials of degree 3.  These are relatively easy to implement, visually pleasing, and versatile since they can model ogee or sigmoid ("S"-shaped) curves.

A single cubic Bezier curve is specified by four points: two endpoints forming the baseline, and two control points.  The curve shape lies within the quadrilateral convex hull of these points.

Note: the images in this post are created using the JTS TestBuilder.

Cubic Bezier Curves, showing endpoints and control points

A sequence of Bezier curves can be chained together to form a curved path of any required shape. There are several ways to join composite Bezier curves.  The simplest join constraint is C0-continuity: the curves touch at endpoints, but the join may be a sharp angle.  C1-continuity (differentiable) makes the join smooth. This requires the control vectors at a join point to be collinear and opposite.  If the control vectors are of different lengths there will be a different radius of curvature on either side.  The most visually appealing join is provided by C2-continuity (twice-differentiable), where the curvature is identical on both sides of the join.  To provide this the control vectors at a vertex must be collinear, opposite and have the same length.

Bezier Curve with C2-continuity

A recent addition to the JTS Topology Suite is the CubicBezierCurve class, which supports constructing Bezier Curves from LineStrings and Polygons.  JTS only supports representing linear geometries, so curves must be approximated by sequences of line segments. (The buffer algorithm uses this technique to approximate the circular arcs required by round joins.)   

Bezier Curve approximated by line segments

Bezier curves can be generated on both lines and polygons (including holes):
Bezier Curve on a polygon

There are two ways of specifying the control points needed to define the curve:  

Alpha (Curvedness) Parameter

The easiest way to define the shape of a curve is via the parameter alpha, which indicates the "curvedness".  This value is used to automatically generate the control points at each vertex of the baseline.  A value of 1 creates a roughly circular curve at right angles.  Higher values of alpha make the result more curved; lower values (down to 0) make the curve flatter.

Alpha is used to determine the length of the control vectors at each vertex.  The control vectors on either side of the vertex are collinear and of equal length, which provides C2-continuity.  The angle of the control vectors is perpendicular to the bisector of the vertex angle, to make the curve symmetrical.  

Bezier Curve for alpha = 1
Bezier Curve for alpha = 1.3

Bezier Curve for alpha = 0.3

Explicit Control Points

Alternatively, the Bezier curve control points can be provided explicitly.  This gives complete control over the shape of the generated curve.  Two control points are required for each line segment of the baseline geometry, in the same order.  A convenient way to provide these is as a LineString (or MultiLineString for composite geometries) containing the required number of vertices.

Bezier Curve defined by control points, with C2 continuity

When using this approach only C0-continuity is provided automatically.  The caller must enforce C1 or C2-continuity via suitable positioning of the control points.

Bezier Curve defined by control points showing C0 and C1 continuity

Further Ideas

  • Allow specifying the number of vertices used to approximate each curve
  • Add a function to return the constructed control vectors (e.g. for display and analysis purposes)
  • Make specifying explicit control points easier by generating C2-continuous control vectors from a single control point at each vertex

Tuesday, 18 January 2022

Concave Hulls in JTS

A common spatial need is to find a polygon that accurately represents a set of points.  The convex hull of the points often does not provide this, since it can enclose large areas which contain no points.  What is required is a non-convex hull, often termed the concave hull.  

The Convex Hull and a Concave Hull of a point set

A concave hull is generally considered to have some or all of the following properties:

  • The hull is a simply connected polygon
  • It contains all the input points
  • The vertices in the hull polygon boundary are all input points
  • The hull may or may not contain holes
For a typical point set there are many polygons which meet these criteria, with varying degrees of concaveness.  Concave Hull algorithms provide a numeric parameter which controls the amount of concaveness in the result.  The nature of this parameter is particularly important, since it affects the ease-of-use in practical scenarios.  Ideally it has the following characteristics:

  • Simple geometric basis: this allows the user to understand the effect of the parameter and aids in determining an effective value
  • Scale-free (dimensionless): this allows a single parameter value to be effective on varying sizes of geometry, which is essential for batch or automated processing
  • Local (as opposed to global):  A local property (such as edge length) gives the algorithm latitude to determine the concave shape of the points. A global property (such as area) over-constrains the possible result. 
  • Monotonic area:  larger (or smaller) values produce a sequence of more concave areas
  • Monotonic containment :the sequence of hulls produced are topologically nested
  • Convex-bounded: an extremal value produces the convex hull

This is a well-studied problem, and many different approaches have been proposed.  Some notable ones are:

Of these, Delaunay Erosion (Chi-shapes) offers the best set of features.  It is straightforward to code and is performant.  It uses the control parameter of Edge Length Ratio, a fraction of the difference between the longest and shortest edges in the underlying Delaunay triangulation.  This is easy to reason about, since it is scale-free and corresponds to a simple property of the point set (that of distance between vertices).  It can be extended to support holes.  And it has a track record of use, notably in Oracle Spatial.  

ConcaveHull generated by Delaunay Erosion with Edge Length Ratio = 0.3

Recently the Park-Oh algorithm has become popular, thanks to a high-quality implementation in Concaveman project (which has spawned numerous ports).  However, it has some drawbacks.  It can't support holes (and likely not disconnected regions and discarding outlier points).  As the paper points out and experiment confirms, it produces rougher outlines than the Delaunay-based algorithm.  Finally, the control parameter for Delaunay Erosion has a simpler geometrical basis which makes it easier to use.

Given these considerations, the new JTS ConcaveHull algorithm utilizes Delaunay Erosion. The algorithm ensures that the computed hull is simply connected, and contains all the input points.  The Edge Length Ratio is used as the control parameter. A value of 1 produces the convex hull; 0 produces a concave hull of minimal size.  Alternatively the maximum edge length can be specified directly. This allows alternative strategies to determine an appropriate length value; for instance, another possibility is to use a fraction of the longest edge in the Minimum Spanning Tree of the input points.  

The recently-added Tri data structure provides a convenient basis for the implementation,.  It operates as follows:

  1. The Delaunay Triangulation of the input points is computed and represented as a set of of Tris
  2. The Tris on the border of the triangulation are inserted in a priority queue, sorted by longest boundary edge
  3. While the queue is non-empty, the head Tri is popped from the queue.  It is removed from the triangulation if it does not disconnect the area.  Insert new border Tris into the queue if they have a boundary edge length than the target length
  4. The Tris left in the triangulation form the area of the Concave Hull 

Thanks to the efficiency of the JTS Delaunay Triangulation the implementation is quite performant, approaching the performance of a Java port of Concaveman.  

Concave Hull of Ukraine dataset; Edge Length Ratio = 0.1

Optionally holes can be allowed to be present in the hull polygon (while maintaining a simply connected result).  A classic demonstration of this is to generate hulls for text font glyphs:

This algorithm is in the process of being ported to GEOS.  The intention is to use it to enhance the PostGIS ST_ConcaveHull function, which has known issues and has proven difficult to use.

Further Ideas

  • Disconnected Result - It is straightforward to extend the algorithm to allow a disconnected result (i.e. a MultiPolygon).  This could be provided as an option.
  • Outlier Points - It is also straightforward to support discarding "outlier" points.
  • Polygon Concave Hull - computing a concave "outer hull" for a polygon can be used to simplify the polygon while guaranteeing the hull contains the original polygon.  Additionally, an "inner hull" can be computed which is fully contained in the original.  The implementation of a Polygon Concave Hull algorithm is well under way and will be released in JTS soon. 

Monday, 3 January 2022

JTS Offset Curves

Offset curves (also known as parallel curves) are an oft-requested feature in JTS.  They are a natural extension to the concept of buffering, and are useful for things like placing labels along rivers.  As far as I know there is no hard-and-fast definition for how an offset curve should be constructed, but a reasonable semantic would seem to be "a line lying on one side of another line at a given offset distance"

Here's an image of offset curves on both sides of a river reach from a US rivers dataset:

GEOS Implementation

GEOS acquired an implementation to produce offset curves a few years ago.  However, it has some known bugs (such as producing disconnected curves, and including extraneous segments).  It also has the feature (or quirk?) of potentially producing a result with multiple disjoint curves for a single input line. 

The GEOS implementation is based on the concept that an offset curve is just a portion of the corresponding buffer boundary.  So to generate an offset curve the algorithm extracts a portion of the buffer polygon boundary.  The trick is deciding which portion!  

The GEOS implementation generates a raw offset curve (potentially full of self-intersections and unwanted linework) and then determines the intersection of that curve with the buffer boundary.  However, the use of intersection on linework is always tricky.  The buffer geometry is necessarily only a (close) approximation, and the buffer algorithm takes advantage of this to use various heuristics to improve the quality and robustness of the generated buffer linework.  This can cause the buffer linework to diverge from the raw offset curve.  The divergence makes the intersection result susceptible to errors caused by slight differences between the generated curves. The two issues above are caused by this limitation. 

JTS Implementation

Instead of using intersection, an effective technique is to match geometry linework using a distance tolerance.  This is the approach taken in the new JTS Offset Curve algorithm. The high-level design is

  1. The buffer is generated at the offset distance
  2. The raw offset curve is generated at the same distance
  3. The raw curve segments are matched to the buffer boundary using a distance tolerance
  4. The offset curve is the section of the buffer boundary between the first and last matching points.
To make the matching efficient a spatial index is created on the buffer curve segments.  

This algorithm provides the following semantics:
  • The offset curve of a line is a single LineString
  • The offset curve lies on one side of the input line (to the left if the offset distance is positive, and to the right if negative)
  • The offset curve has the same direction as the input line
  • The distance between the input line and the offset curve equals the offset distance (up to the limits of curve quantization and numerical precision)
This algorithm does a fine job at generating offset curves for typical simple linework.  The image below shows a family of offset curves on both sides of a convoluted line.  
This resolves both of the GEOS code issues.  It also supports parameters for join style (round, bevel, and mitre), number of quadrant segments (for round joins) and mitre limit:
Join Style = MITRE
Join Style = ROUND; Quadrant Segments = 2

There are also a few nuances required to handle some tricky situations; in particular, cases when the input line string curves back on itself and when it self-intersects.  These do not have a clear definition for what the result should be.  Moreover, the algorithm itself imposes some constraints on how these cases can be handled.  The images below show how the algorithm behaves in these cases.

"Loopbacks"produce offset curves representing only the "exposed" sections of the input linework: 
Offset Curves of a Line with "loop-backs"

For a self-intersecting input line, the offset curve starts at the beginning of the input line on the specified side, and continues only up to where the line side changes due to the crossing.  The length of the offset curve is also reduced by the requirement that it be no closer than specified offset distance to the input line: 
Offset Curves of a Line with a self-intersection

This algorithm is now in JTS, and has been ported to GEOS.

Wednesday, 22 December 2021

Christmas Wrapping

 Every so often I produce an image in the JTS TestBuilder which strikes me as worthy of capture.  Here's one that seems pretty seasonal:

It is generated like this:
  1. Produce two sets of 1000 random points roughly aligned with a grid
  2. Compute their fully-eroded Convex Hulls
  3. Compute the intersection of the two hulls
  4. Theme the intersection with random fill

Monday, 1 November 2021

JTS Polygon Triangulation, at last

A (long) while ago I posted about "soon-to-be-released" JTS code for polygon triangulation using Ear Clipping.  It turned out it was actually in the category of "never-to-be-released".  However, later  I worked with a student, Dan Tong, on a coding exercise sponsored by Facebook. We decided to tackle polygon triangulation.  By the end of the project he implemented a functional algorithm, including handling holes (via the hole-joining technique described by Eberly).  To make it release-ready the code needed performance and structural improvements.  This has finally happened and the code is out!  

Polygon Triangulation

The new codebase has been substantially rewritten to improve modularity and the API.  A new data structure to represent triangulations has been introduced, called a Tri  It is simpler, more memory efficient, and easier to work with than the QuadEdge data structure previously used in JTS. It provides an excellent basis for developing further algorithms based on triangulations (of which more below)

Most of the reworking involved implementing key performance optimizations, utilizing existing and new JTS spatial index structures: 

  • the Hole Joining phase now uses a spatial index to optimize the test for whether a join line is internal to a polygon
  • the Ear-Clipping phase has to check whether a candidate ear contains any vertices of the polygon shell being clipped.  This has been made performant by using a new spatial index structure called the VertexSequencePackedRtree

Performance must be benchmarked against an existing implementation.  Geometry maestro Vladimir Agafonkin has developed an Ear Clipping algorithm in Javascript called Earcut.  It's claimed to be the fastest implementation in Javascript.  Indeed the performance comparison shows Earcut to be over 4 times faster than some other Javascript triangulation implementations.  Earcut has been ported to Java as earcut4j, which allows comparing JTS to it.   

DatasetSizeJTS TimeEarcut4j Time
Lake Superior3,478 vertices / 28 holes18 ms13 ms
Canada10,522 vertices35 ms26 ms
Complex polygon44,051 vertices / 125 holes1.1 s0.64 s
North America with Lakes115,206 vertices / 1,660 holes13.3 s5.4 s

So JTS ear clipping is not quite as fast as Earcut4j.  But it's still plenty fast enough for production use!

The triangulation algorithm is exposed as the PolygonTriangulator class (deliberately named after the effect, rather than the algorithm producing it).  Here's an example of the output:

Ear-Clipping Triangulation of Lake Superior

Constrained Delaunay Triangulation

The output of PolygonTriangulator is guaranteed to be a valid triangulation, but in the interests of performance it does not attempt to produce a high-quality one.  Triangulation quality is measured in terms of minimizing the number of "skinny" triangles (or equivalently, maximizing the sum of the interior angles).  Optimal quality is provided by Constrained Delaunay Triangulation (CDT).  In addition to being optimal in terms of triangle shape, the CDT has the nice properties of being (essentially) unique, and of accurately representing the structure of the input polygon.  It also removes the influence that Hole Joining holes has on the raw triangulation provided by Ear-Clipping.

I originally proposed that a CDT can be achieved by a Delaunay Improvement algorithm based on iterated triangle flipping.  It turned out that this technique is not only effective, but also pleasingly performant.  This is implemented in the ConstrainedDelaunayTriangulator class.  Running it on the example above shows the improvement obtained.  Note how there are fewer narrow triangles, and their arrangement more closely represents the structure of the polygon.

Constrained Delaunay Triangulation of Lake Superior

This code will appear in the next JTS version (1.18.3 or 1.19), and has already been released in GEOS 3.10.

Future Work

It seems possible to adapt the machinery for Ear Clipping to provide two other useful constructions:

  • Polygon Concave Hull ("Outer Hull") - this requires a triangulation of the area between the polygon and its convex hull.  That can be obtained by running Ear Clipping outwards, rather than inwards. A Concave Hull can be constructed by limiting the number of triangles generated based on some criteria such as area, longest edge, or number of vertices.  The result is a simplified polygon with fewer vertices which is guaranteed to contain the original polygon. 
  • Polygon Inner Hull - the triangulation constructed by Ear Clipping can be "eroded" to provide a simpler polygon which is guaranteed to be contained in the original.
The property of the Constrained Delaunay Triangulation of representing the structure of the input polygon could provide a basis for computing the following constructions: 
  • Approximate Skeleton construction (also known as the Medial Axis) by creating edges through the CDT triangles
  • Equal-Area Polygon Subdivision, by partitioning the graph induced by the CDT
The Tri data structure provides a simple representation of a triangulation.  I expect that it will facilitate the development of further JTS triangulation algorithms, such as:
  • Concave Hull of Point Set
  • Constrained Delaunay Triangulation of Lines and Points

Saturday, 9 October 2021

Query KD-trees 100x faster with this one weird trick!

Recently a GEOS patch was contributed to change the KdTree query implementation to use an explicit stack rather than recursion.  This has been ported to JTS as PR #779 (along with some refactoring).

The change was motivated by a QGIS issue in which a union of some large polygons caused a stack overflow during a KdTree query.  The reason is that the poor vertex alignment of the input polygons causes the union overlay process to invoke the SnappingNoder.  (For background about why this occurs see the post OverlayNG - Noding Strategies).  The snapping noder snaps vertices using a KdTree with a tolerance.  The polygon boundaries contain runs of coherent (nearly-monotonic) vertices (which is typical for high-resolution polygons delineating natural areas).  When these are loaded directly into a KdTree the tree can become unbalanced.  

An unbalanced tree has a relatively large depth for its size.  The diagrams below shows the graph of the KdTree for a polygon containing 10,552 vertices.  The unbalanced nature of the tree is evident from the few subtrees extending much deeper than the rest of the tree. The tree depth is 282, but most of that occurs in a single subtree.

An unbalanced KD-tree (size = 10,552,  depth = 282)

Left: full graph.     Right: close-up of graph structure

For large inputs querying such a deep subtree causes the recursive tree traversal to exceed the available call stack. Switching to an iterative implementation eliminates the stack overflow error, since the memory-based stack can grow to whatever size is needed.

However, stack overflow is not the only problem!  Unbalanced KD-trees also exhibit poor query performance.  This is because deep tree traversals make the search runtime more linear than logarithmic.   Increasing the size of the query stack increases robustness, but does not improve the performance problem.  The best solution is to build the KD-tree in a balanced way in the first place. 

One way to do this is to to break up monotonic runs by randomizing the order of vertex insertion.  This is actually implemented in the OverlayNG SnapRoundingNoder, as discussed in the post Randomization to the Rescue.  However, using this approach in the SnappingNoder would effectively query the tree twice for each vertex.  And in fact it's not necessary to randomize all the inserted points.  A better-balanced tree can be produced by "seeding" it with a small set of well-chosen points.  

But how can the points be chosen?   They could be selected randomly, but true randomness can be quite "clumpy".  Also, a non-deterministic algorithm is undesirable for repeatability.  Both of these issues can be avoided by using a low-discrepancy quasi-random sequence.  This concept has a fascinating theory, with many approaches available.  The simplest to implement is an additive recurrence sequence with a constant α (the notation {...} means "fractional value of"):

R(α) :  tn = { t0 + n * α },  n = 1,2,3,...    

If α is irrational the sequence never repeats (within the limits of finite-precision floating point).  The excellent article The Unreasonable Effectiveness of Quasirandom Sequences suggests that the lowest possible discrepancy occurs when α is the reciprocal of the Golden Ratio φ. 

1 / φ = 0.6180339887498949...

Implementing this is straightforward, and it is remarkably effective.  For the example above, the resulting KD-tree graph is significantly better-balanced, with lower depth (85 versus 282):

Better-balanced KD-tree (size = 10,552,  depth = 85)

Adding KD-tree seeding to the SnappingNoder gives a huge performance boost to the QGIS test case.  For the 526,466 input vertices the tree depth is reduced from 17,776 to 183 and the snapping time is improved by about 100 times, from ~80 s to ~800 ms!  (In this case the overall time for the union operation is reduced by only about 50%, since the remainder of the overlay process is time-consuming due to the large number of holes in the output.)

In GEOS the performance difference is smaller, but still quite significant.  The snapping time improves 14x, from 12.2 s to 0.82 s, with the union operation time decreasing by 40%, from 31 to 18 secs.

This improvement will be merged into JTS and GEOS soon.

Friday, 16 July 2021

JTS IsValidOp Built Back Better

In a previous post I described how the JTS Topology Suite operation IsSimpleOp has been completely rewritten to reduce code dependencies, improve performance, and provide a simpler, more understandable implementation.  The post points out that the IsValidOp implementation would benefit from the same treatment.  This work has now been carried out, with similar benefits achieved.  

The original IsValidOp code used the GeometryGraph framework to represent the topology of geometries.  This code reuse reduced development effort, but it carried some serious drawbacks:

  • The GeometryGraph structure was used for many JTS operations, including overlay, buffer, and spatial predicate evaluation.  This made the code complex, hard to understand, and difficult to maintain and enhance
  • The GeometryGraph structure computes the full topology graph of the input geometry, in order to support constructive operations such as overlay and buffer.  This makes it slower and more subject to robustness problems.  Non-constructive operations such as IsValidOp and IsSimpleOp can compute topology "on-the-fly", which allows short-circuiting processing as soon as an invalidity is found.  
The new IsValidOp implementation is more self-contained, relying only on the noding framework and some basic spatial predicates.  Using the MCIndexNoder allows short-circuited detection of some kinds of invalidity, which improves performance.  And it will be easy to switch to a different noding strategy should a superior one arise.
However, dropping GeometryGraph means that the topology information required to confirm validity needs to be computed some other way.  The goal is to compute just enough topological structure to evaluate validity, and do that efficiently on an as-needed basis.  This required some deep thinking about validity to pare the topology information required down to a minimum.  This was assisted by the extensive set of unit tests for IsValidOp, which ensured that all possible situations were handled by the new logic (although there was an untested situation which uncovered a bug late in development.) The resulting algorithm uses some elegant logic and data structures, which are explained in detail below. 

OGC Validity

JTS implements the OGC Simple Features geometry model.  To understand the validity algorithm, it helps to review the rules of geometry validity in the OGC specification:
For non-polygonal geometry types, the validity rules are simple;
  1. Coordinates must contain valid numbers
  2. LineStrings must not be zero-length
  3. LinearRings must be simple (have no self-intersections)
For polygonal geometry, on the other hand, the validity rules are quite stringent (sometimes considered too much so!).  The rules are:
  1. Rings must be simple (have no self-intersections)
  2. Rings must not cross, and can touch only at discrete points
  3. Holes must lie inside their shell
  4. Holes must not lie inside another hole of their polygon
  5. The interiors of polygons must be connected
  6. In a MultiPolygon, the element interiors must not intersect (so polygons cannot overlap)
  7. In a MultiPolygon, the element boundaries may intersect only at a finite number of points

Validity Algorithm for Polygonal Geometry

The new IsValidOp uses an entirely new algorithm to validate polygonal geometry.  It has the following improvements:
  • It supports short-circuiting when an invalid situation is found, to improve performance
  • Spatial indexing is used in more places
  • The code is simpler and more modular.  This makes it easier to understand, and easier to adapt to new requirements (e.g. such as the ability to validate self-touching rings described below)
The algorithm consists of a sequence of checks for different kinds of invalid situations.  The order of the checks is important, since the logic for some checks depends on previous checks passing. 

1. Check Ring Intersections

The first check determines if the rings in the geometry have any invalid intersections.  This includes the cases of a ring crossing itself or another ring, a ring intersecting itself or another ring in a line segment (a collinear intersection), and a ring self-touching (which is invalid in the OGC polygon model).  This check can use the robust and performant machinery in the JTS noding package.  If an invalid intersection is found, the algorithm can return that result immediately.

Kinds of invalid intersections: 
(i) a self-touch; (ii) collinear; (iii) segment crossing; (iv) vertex crossing

This check is an essential precursor to all the remaining checks.  If a geometry has no invalid intersections this confirms that its rings do not self-cross or partially overlap.  This means that the results of orientation and point-in-polygon computations are reliable.  It also indicates that the rings are properly nested (although it remains to be determined if the nesting of shells and holes is valid).

Intersection search is the most computationally expensive phase of the algorithm.  Because of this, information about the locations where rings touch is saved for use in the final phase of checking connected interiors.  

2. Check Shell and Hole nesting

If no invalid intersections are found, then the rings are properly nested.  Now it is necessary to verify that shells and holes are correctly positioned relative to each other:
  • A hole must lie inside its shell
  • A hole may not lie inside another hole. 
  • In MultiPolygons, a shell may not lie inside another shell, unless the inner shell lies in a hole of the outer one.  
Ring position can often be tested using a simple Point-In-Polygon test.  But it can happen that the ring point tested lies on the boundary of the other ring.  In fact, it is possible that all points of a ring lie on another ring.  In this case the Point-In-Polygon test is not sufficient to provide information about the relative position of the rings.  Instead, the topology of the incident segments at the intersection is used to determine the relative position.  Since the rings are known to not cross, it is sufficient to test whether a single segment of one ring lies in the interior or exterior of the other ring.

MultiPolygon with an element where all vertices lie on the boundary
Checking nesting of holes and shells requires comparing all interacting pairs of rings.  A simple looping algorithm has quadratic complexity, which is too slow for large geometries.  Instead, spatial indexes (using the STRtree) are used to make this process performant. 

3. Check Connected Interior 

The final check is that the interior of each polygon is connected.  In a polygon with no self-touching rings, there are two ways that interior disconnection can happen:

  • a chain of one or more touching holes touches the shell at both ends, thus splitting the polygon in two
  • a chain of of two or more touching holes forms a loop, thus enclosing a portion of the polygon interior  
To check that these conditions do not occur, the only information needed is the set of locations where two rings touch.  These induce the structure of an undirected graph, with the rings being the graph vertices, and the touches forming the edges.  The situations above correspond to the touch graph containing a cycle (the splitting situation is a cycle because the shell ring is also a graph vertex).  (Equivalently, the touch graph of a valid polygon is a tree, or more accurately a forest, since there may be sets of holes forming disconnected subgraphs.) So a polygon can be verified to have a connected interior by checking that its touch graph has no cycles.  This is done using a simple graph traversal, detecting vertices (rings) which are visited twice.
Situations causing a disconnected interior: 
(i) a chain of holes; (ii) a cycle of holes
If no invalid situations are discovered during the above checks, the polygonal geometry is valid according to the OGC rules.

Validity with Self-touching Rings

Some spatial systems allow a polygon model in which "inverted shells" are valid.  These are polygon shells which contain a self-touch in a way that encloses some exterior area.  They also allow "exverted holes", which contain self-touches that disconnect the hole into two or more lobes.  A key point is that they do not allow "exverted shells" or "inverted holes"; self-touches may disconnect the polygon exterior, but not the interior.  Visually this looks like:
Left: Inverted shell and exverted hole (valid)
Right: Exverted shell and inverted hole (invalid)

This kind of topology is invalid under the OGC polygon model, which prohibits all ring self-intersections.  However, the more lenient model still provides unambiguous topology, and most spatial algorithms can handle geometry of this form perfectly well.  So there is a school of thought that considers the OGC model to be overly restrictive.
To support these systems JTS provides the setSelfTouchingRingFormingHoleValid option for IsValidOp to allow this topology model to be validated.  Re-implementing this functionality in the new codebase initially presented a dilemma. It appeared that building the touch graph to check connectivity required computing the geometry of the additional rings formed by the inversions and exversions.  This needed a significant amount of additional code and data structures, eliminating the simplicity of the graph-based approach.  
However, it turns out that the solution is much simpler.  It relies on the fact that valid self-touches can only disconnect the exterior, not the interior.  This means that self-touches of inverted shells and exverted holes can be validated by the local topology at the self-intersection node.  This condition can be tested by the same logic already used to test for nested touching shells.  Even better, if the self-touches are valid, then the touch-graph algorithm to check connectivity still works.   (This is the same phenomenon that allows most spatial algorithms to work correctly on the inverted/exverted ring model.)  


As expected, the new codebase provides better performance, due to simpler code and the ability to short-circuit when an invalidity is found.  Here's some performance comparisons for various datasets:

DataSizeNew (ms)Old (ms)Improvement
world 244 Polygons, 366K vertices 3401250   3.7 x
invalid-polys 640 Polygons, 455K vertices 126334   2.6 x
valid-polys 640 Polygons, 455K vertices 244487   2 x
australia 1 MultiPolygon, 1,222K vertices 132169026   52 x

  • world is the standard dataset of world country outlines
  • invalid-polys is a dataset of all invalid polygons.  valid-polys is the same dataset processed by GeometryFixer.  The timings show the effect of short-circuiting invalidity to improve performance. 
  • australia is a geometry of the Australian coastline and near-shore islands, with 6,697 polygon elements.  The dramatic difference in performance reflects the addition of a spatial index for checking nested shells.
Australia and 6,696 islands

Next Steps

The improved IsValidOp will appear in JTS version 1.19.  It has already been ported to GEOS, providing similar performance improvements.  And since GEOS backs the PostGIS ST_IsValid function, that will become faster as well.

There is one remaining use of GeometryGraph in JTS for a non-constructive operation:  the RelateOp class, used to compute all topological spatial predicates.  Converting it should provide the same benefits of simpler code and improved performance.  It should also improve robustness, and allow more lenient handling of invalid inputs.  Watch this space!

Wednesday, 26 May 2021

JTS IsSimple gets simpler (and faster)

Hard to believe that the JTS Topology Suite is almost 20 years old.  That's 140 in dog years!  Despite what they say about old dogs, one of the benefits of longevity is that you have the opportunity to learn a trick or two along the way.  One of the key lessons learned after the initial release of JTS is that intersection (node) detection is a fundamental part of many spatial algorithms, and critical in terms of performance.  This resulted in the development of the noding package to provide an API supporting many different kinds of intersection detection and insertion.

Prior to this, intersection detection was performed as part of the GeometryGraph framework, which combined it with topology graph formation and analysis.  At the time this seemed like an elegant way to maximize code reuse across many JTS operations, including overlay, buffering, spatial predicates and validation.  But as often the case, there are significant costs to such general-purpose code:

  • The overall codebase is substantially more complex
  • A performance penalty is imposed on algorithms which don't require topology construction
  • Algorithms are harder to read and understand.  
  • The code is brittle, and so hard to modify
  • Porting the code is more difficult  
Because of this, a focus of JTS development is to free operations from their dependency on GeometryGraph - with the ultimate goal of expunging it from the JTS codebase.  A major step along this road was the rewrite of the overlay operations.

Another operation that relies on GeometryGraph is the IsSimpleOp class, which implements the OGC Simple Features isSimple predicate.  The algorithm for isSimple essentially involves determining if the geometry linework contains a self-intersection.  GeometryGraph is unnecessarily complex for this particular task, since there is no need to compute the entire topology graph in order to find a single self-intersection.  Reworking the code to use the the MCIndexNoder class in the noding API produces a much simpler and more performant implementation. I also took the opportunity to move the code to the operation.valid package, since the operations of isSimple and isValid are somewhat complementary.

Now, isSimple is probably the least-used OGC operation.  Its only real use is to test for self-intersections in lines or collections of lines, and that is not a critical issue for many workflows.  However, there is one situation where it is quite useful: testing that linear network datasets are "vector-clean" - i.e. contain LineStrings which touch only at their endpoints.

A linear network containing non-simple intersections (isSimple == false)

To demonstrate the performance improvement, I'll use a dataset for Rivers of the US maintained by the US National Weather Service.  It supplies two datasets: a full dataset of all rivers, and a subset of major rivers only.  You might expect a hydrographic network to be "vector-clean", but in fact both of these datasets contain numerous instances of self-intersections and coincident linework.

Here's the results of running the isSimple predicate on the datasets.  On the larger dataset the new implementation provides a 20x performance boost!

 Dataset New time  Old time 
  Subset  (909,865 pts)   0.25 s   1 s
  Full   (5,212,102 pts)  2 s  30 s

Finding Non-Simple Locations

The new codebase made it easy to add a functionality enhancement that computes the locations of all places where lines self-intersect.  This can be used to for visual confirmation that the operation is working as expected, and to indicate places where data quality needs to be improved.  Here's the non-simple intersection points found in the river network subset:

Closeups of some non-simple intersection locations:

IsSimpleOp is the easiest algorithm to convert over from using GeometryGraph.  As such it serves as a good proof-of-viability, and establishes useful code patterns for further conversions. 

Next up is to give IsValidOp the same treatment.  This should provide similar benefits of simplicity and performance.  And as always, porting the improved code to GEOS.