Showing posts with label computational geometry. Show all posts
Showing posts with label computational geometry. Show all posts

Monday, March 3, 2008

Branch-and-Bound algorithms for Nearest Neighbour queries

This paper by Roussopoulos et al is a fine expositions of how to use a Branch-and-Bound algorithm in conjunction with an R-tree index to efficiently perform Nearest-Neighbour queries on a spatial database.


Nearest-neighbour is a pretty standard query for spatial database. Oracle Spatial offers this capability natively. It would be great if PostGIS did too. If the GIST index API offers appropriate access methods, it seems like it might not be too hard to implement the Roussopoulos algorithm. (How about it, Paul, now that you're dedicating your life to becoming a PostGIS coding wizard...)

I'm also thinking that this approach might produce an efficient algorithm for computing distance between Geometrys for JTS. It's always bugged me that the JTS distance algorithm is just plain ol' O(N^2) brute-force. It seems like there has to be a better way, but as usual there seems to be precious little prior art out there in web-land. This should also work well in the "Prepared" paradigm, which means that it will provide an efficient implementation for PostGIS as well. Soon, hopefully...

Wednesday, December 12, 2007

New Buffer Styles in JTS 1.9

JTS allows specifying styling parameters to control the appearance of generated buffer polygons.

End Cap styles have been available for a while. End caps can be Round (the default), Flat, or Square. For Round caps, the Quadrant Segments parameter controls the degree of approximation to a true circular arc.

End Cap Styles: Round, Flat, Square

JTS 1.9 adds the ability to specify the Join Style. The Join style controls how the offset curves for consecutive line segments are joined together. The Join type can be Round (the default), Mitre, or Bevel. If Mitre is specified, a further Mitre Limit parameter is provided to control how long mitred joins can be before they are beveled off. The maximum length of a mitred corner is equal to the Mitre Limit multiplied by the buffer distance.

Join Styles: Mitre (limit = 10), Mitre (limit = 1), Bevel

The JTS TestBuilder has been enhanced to allow specifying all of these parameters, so it can be used to experiment with the visual effects of the various styles.

Thursday, November 15, 2007

Fast polygon merging in JTS using Cascaded Union

A common requirement in spatial processing is to union a set of polygons together. Depending on the use case, the polygons may either be a coverage or be overlapping. Up until recently, there were two techniques for accomplishing this in JTS:
  • Iterated Union: Iterate over the polygons and union them one-by-one into a result polygon
  • Buffer Union: Collect the polygons into a GeometryCollection and run buffer(0) on the collection
The first one is straightforward, but quite inefficient for large input sets. The second technique is more efficient in many cases, but still tends to fail for large or complex data.

Since this is such a common operation, I've been keen to find a faster and more robust approach. An alternative strategy is to use a concept I call Cascaded Union. The idea is to union small subsets of the input together, then union groups of the resulting polygons, and so on recursively until the final union of all areas is computed. This can be thought of as a post-order traversal of a tree, where the union is performed at each interior node. If the tree structure is determined by the spatial proximity of the input polygons, and there is overlap or adjacency in the input dataset, this algorithm can be quite efficient. This is because union operations higher in the tree are faster, since linework is "merged away" from the lower levels.

This picture shows the algorithm in action:


Handily, the spatial indexes already in JTS can be used to determine an appropriate tree structure. Currently the Sort-Tile-Recursive packed R-Tree index is used - this seems to produce an effective tree (although I suspect that the algorithm is not very senstive to the precise spatial tree used)

It turns out that Cascaded Union can be much more efficient than Iterated Union. On a test dataset with 30,000 small polygons with a high degree of overlap, the performance results were:

Cascaded Union: 20 sec
Iterated Union: 3 hours 40 min !

Not bad - well worth the effort of coding... Also, it's more robust than Buffer Union, and often much faster as well.

This will be provided via the Geometry.union() (Unary union) method in JTS 1.9.

Wednesday, November 14, 2007

Implementations of Polygon Overlay algorithms

Occasionally people ask about the algorithms used in JTS's overlay operations. I don't have a good answer for this, since I basically made it up (in an informed kind of way), and haven't had time to document it to the level I'd like (thank goodness for self-documenting code... 8^)

I recently came across this web site from back in 1998, which is happily still around. It has a good list of implementations of what it calls Polygon Boolean operations. (This is the same thing as the JTS overlay operations - I chose to use a different term to avoid confusion with the spatial predicates). Perhaps some of these sites have done a better job of documentation! There's also some useful references to academic papers covering aspects of the issue.

Friday, November 9, 2007

The bible of Spatial Indexing

The Bhagavad Gita of Multidimensional Access Methods...
The Koran of Quadtrees...

One of the people I met at ACM GIS 2007 was the conference chair, Hanan Samet. He's published a series of highly useful books on spatial indexing methods, which are certainly the most comprehensive and detailed in this field. His latest work is Foundations of Multidimensional and Metric Data Structures and is surely his magnum opus. It could also be called Everything you wanted to know about Spatial Indexes (but were afraid to code). Recommended reading for spatial geeks, especially if you're looking for some reading to tide you over during your next year off.

His first book on the subject,
The Design and Analysis of Spatial Data Structures, is also worth looking at, especially since it's a bit more bite-sized. However, it's out-of-print, and expensive when you do find a copy (local plug: I found mine through ABE) If you really can't find a copy you might try contacting Hanan directly.

Hanan has a clear preference for quadtree structures over R-trees, but both approaches are covered well (including pseudo-code - yay!)






Thursday, September 13, 2007

Geodetic data in PostGIS - Spherical indexing schemes

Here at Refractions we're starting to think about how we can provide better support in PostGIS for geodetic data. Geodetic data is data which is defined in a true spherical coordinate system (in particular, of course, the surface of the Earth).

Currently PostGIS provides some geodetic-aware functions (such as distance between two geodetic points). But it's current spatial data model is fundamentally planar, so there are definite limitations to modelling geodetic data (e.g. such as the notorious "line crossing the Date Line" problem). As PostGIS gets used for larger datasets and more ambitious projects, the utility of having a full-function geodetic data model is becoming increasingly obvious.

Handling geodetic data in a correct and efficient way presents quite a few challenges. A major one is: how can geodetic geometry be spatially indexed? Conventional spatial indexes (such as 2D R-trees) all rely on geometry being embedded in a planar space. They don't handle data which can "wrap around", as can occur in a spherical space.

There have been various clever proposals for spherical indexing strategies. Some prominent ones are listed below:
  • Hierarchical Triangular Mesh - this is essentially a "quad-tree for the sphere". It has a lot of appeal for use with point data, since it provides a hierarchical key which can be indexed using a conventional B-tree index. (It was co-developed by the late, great Jim Gray in order to index astronomical data). The mathematics to determine the index key for a non-point object would seem to be somewhat complicated. It also seems like HTM would suffer from the usual disadvantage of quadtrees of not being very self-tuning. Another disadvantage from the PostGIS point of view is that this would likely be a brand new index type (i.e. lots of difficult code to write)
  • Hipparchus Voronoi-based index. This index can be thought of as a fixed-grid index using a custom Voronoi cell coverage for the globe. IBM's DB/2 Geodetic extension uses this scheme. I must say that this concept, while ingenious, seems a bit baroque to me. This index has the usual disadvantage of fixed-grid indexes of not handling widely-varying data sizes well. And it also requires extremely complex cell coverage structures, which have to be selected specifically for the expected data composition. DB/2 supplies 13 different ones based on various data densities (G7 industrial output, anyone?). I'm not sure what you are supposed to do if your data has some other density distribution - it doesn't seem very feasible to make your own Voronoi grid. And what if you don't know your data distribution, or it changes over time?

  • 3D Bounding Box - this is the approach used by the pgSphere project. It's pretty straightforward. The key concept is to embed the sphere in 3-space, so that it is possible to compute 3D bounding boxes for geometries on the embedded sphere. The bounding boxes can then be indexed using a 3D R-tree (exactly analogous to a 2D R-tree spatial index). The GIST index supplied with PostgreSQL can be customized to provide the required 3D R-tree. One possible issue is that apparently R-trees become "less effective" in higher-dimensional spaces. It remains to be seen whether this is truly a serious problem.
Out of these options, it seems like the 3D Bounding Box approach is the most straightforward scheme. There's some challenges in developing the mathematics required (at least, for a planar/linear guy like me), but I'm hopeful that we can deal with these issues and arrive at a efficient, maintainable solution.

Tuesday, August 21, 2007

Streaming Delaunay Triangulation for Huge Datasets

Isenburg, Liu, Shewchuk and Snoeyink have an interesting paper on contructing Delaunay triangulations over huge datasets using a streaming technique.


I'm quite interested in this, because here at Refractions I've just finished working on a project which involved computing a TIN for the province of B.C. using 500 million masspoints. After briefly flirting with the idea of computing a single seamless TIN, I ended up computing it in a piecewise fashion. This probably turned out to be a better approach for other reasons (such as the fact that we could easily parallelize the process). However, it would be very cool to be able to compute a seamless TIN for the entire province. And the approach in this paper sounds very effective - they quote a billion triangle in 48 minutes on a laptop!

It's also exciting to see the emergence of a whole new class of algorithms focussed on dealing with VLSD's (very large spatial datasets). As the same authors point out in another paper, paradoxically, advances in computing technology can make geospatial processing more difficult
because the rate of growth of dataset size far outstrips memory capacity.

Incidentally, these guys are a bit of a dream team for computational geometry algorithms. Shewchuck is the developer of Triangle, one of the best known programs for computing constrained 2D Delaunay triangulations (which is also AFAIK one of the few available systems to seriously address robustness issues - JTS of course being another one!) Snoeyink has done research in all kinds of different areas of CG, including interesting work in computing watershed boundaries on TINs using B.C. data.

The paper also has a nice way of showing the spatial distribution of two related variables, using fill gradients:

Wednesday, August 8, 2007

PreparedGeometry - efficient batch spatial operations for JTS

A common usage pattern for JTS spatial operations (in particular, predicates such as intersects and contains, and overlay operations such as intersection and union) involves the operation being invoked between a single target geometry and a batch of test geometries. For instance, this occurs in queries in a spatial database such as PostGIS, where each geometry in the driving table interacts with a set of nearby geometries via the required operation. By taking advantage of the fact that the target geometry does not change, it is possible to significantly reduce the overall execution cost of the operation. Information such as an index on the line segments in the target geometry can be computed once and cached for reuse with all of the test geometries.

In order for this to work, the calling code must be able to indicate that the operation is being invoked in a batch setting. The standard JTS Geometry API doesn't allow differentiating this case, so the new concept of PreparedGeometry has been developed. (This name intentionally echoes the concept of prepared statement in SQL API's, which implements the same strategy.) Creating a PreparedGeometry on a target Geometry enables precomputation of various data structures, which then allows subsequent operations to be evaluated with maximum efficiency.

In addition to caching, it is also possible to optimize the evaluation of spatial predicates. Optimization takes the form of short-circuiting predicate evaluation when certain situations are detected. While not all OGC spatial predicates are amenable to optimization, fortunately the most commonly usd predicates, intersects and contains, can be optimized significantly. Examples of situations which can be optimized are:
  • For intersects, any intersection between line segments indicates a true result. If there is no intersection between any line segments of the operands, and the target geometry is polygonal, then one or more fast point-in-polygon tests can be performed to compute the result value.
  • Since it provides more information than intersects, the contains predicate is more complex to optimize. In the case where the target geometry is polygonal, evaluation can be short-circuited when the test geometry has no line segment intersection. In this case, containment can be tested by an appropriate set of point-in-polygon tests. It is also possible to short-circuit in some specific cases where a proper intersection is detected between two line segments. Both of these situations can be tested efficiently.
The performance gains from using PreparedGeometry in real-world situations can be dramatic - up to 40 times improvement in speed. This functionality will appear in the next release of JTS, and hopefully will make its way into GEOS and PostGIS as well.

Thursday, July 12, 2007

The ultimate Point-In-Polygon implementation

The classic stabbing line/ray-crossing point-in-polygon algorithm has been around for longer than computational geometry itself. (See William Franklin and SoftSurfer for background and description. ) The algorithm as usually published is short, effective and clever. But it does have some limitations:
  1. It is implemented as an iteration over a data structure representing a sequence of line segments. This ties the algorithm to a fixed data structure, which leads to code duplication
  2. It handles simple rings, but not fully general polygons (which may contain holes and/or multiple shells)
  3. It does not detect the case where the query point lies on one of the segments in the polygon
  4. It runs in O(n) time. This is reasonable for a single query point, but is not the most efficient way to perform many point queries against a single fixed polygon
  5. It uses conventional floating-point arithmetic. This makes it non-robust (i.e. potentially incorrect for some inputs)
This predicate forms a key component of many spatial processing algorithms. In JTS it's used all over the place. The more places it's used, the more the above limitations have become apparent. Over time I've been slowly chipping away at removing them.

Most recently, I've revisited P-I-P in the course of some work on improving the performance of spatial predicates. And I think I've finally arrived at the ultimate Point-In-Polygon implementation. It removes all of the above limitations with the following design features:
  1. It uses an incremental design, which frees the algorithm from being tied to any particular segment data structure
  2. It handles all polygonal geometry types defined in JTS. (Implementing this was made much cleaner by the pleasant realization that the PIP algorithm really doesn't care where in a polygonal geometry the segments come from. The only thing that matters is the parity of the segment crossings. This holds true for any number of holes and shells.)
  3. It detcts the point-on-segment case, and reports it via a trivalent return value (in the set {INTERIOR, BOUNDARY, EXTERIOR}).
  4. It is straightforward to use with a spatial index on the segments of the polygon (such as STRtree or BinTree), thus providing O(log n) performance
  5. Thanks to the RobustDeterminant class in JTS, the implementation is much more robust. (Due to some remaining floating-point arithmetic it may still not be 100% robust - but I hope to address that sometime as well.)
Even with all these changes, it's not much more complex than the original algorithm (at least, not given various supporting components which are already available in JTS.)

Coming soon to a JTS version near you...

Thursday, June 21, 2007

Packed 1-dimensional R-Tree for Geometric Algorithms





Some geometry algorithms depend on searching a fixed set of 1-dimensional intervals for the ones which contain a given value. An example is Point-In-Polygon testing using the stabbing line method. In cases where the algorithm needs to be run multiple times for different input values, the performance can be improved if a suitable data structure is used to increase the efficiency of search.

Here's an idea for an index which allows fast lookup of 1-dimensional intervals. It is basically a 1-dimensional R-tree, with a packing scheme which allows easy creation of the index for a fixed set of intervals. The index has the following charateristics:
  • It indexes 1-dimensional intervals over some ordered set of values (for geometry, usually floating-point numbers)
  • It indexes a static, pre-known set of items. Once built the index cannot be modified.
  • Queries can be by either range or single value (stabbing queries), and return the set of all items which intersect the query value
  • One optional parameter is the bucket size
The tree contains two types of nodes: leaf nodes, which contains the stored intervals (and any additional user-specific information), and interior nodes, which contain an interval and references to two child nodes which lie in that interval.

To build the tree:
  • sort all the intervals to be indexed by their midpoint. These form the leaf nodes of the tree.
  • Create new interior nodes for every adjacent pair of nodes, assigning the new node to have an interval which spans the intervals of its children. If there is only one child available, do not create a new node for it.
  • Recursively repeat the previous step, until a single node is created. This is the root node of the tree.
Searching the tree is simple:
  • traverse the tree in depth-first order, pruning branches which have intervals which do not intersect the query value.
A nice aspect of this data structure is that it is simple to build. It should be possible to implement the structure using two arrays, one for leaf nodes and one for interior nodes. This should provide a pretty efficient and straightforward implementation in C.

The interior node definition can easily be generalized to allow n intervals per node (the bucket size). Of course, there's a trade-off between increasing fan-out and decreasing selectivity. It's not obvious to me where the sweet spot is.

Really this is a 1-dimensional version of Leutenegger & Edgington's STR tree. JTS even contains an implementation of this, using the generalized STRtree classes already in JTS. The novelty here is the exploration of just how simple the implementation of this structure can be. A simple implementation should be faster to build and query, as well - this is a subject for some performance testing.

There are several (lots?) of other data structures for efficiently querying sets of intervals. The segment tree and interval tree are probably the most well-known. Both of these structures are more complex to understand and implement, I believe. In addition, I don't think that they are amenable to an implementation based solely on two simple arrays.

Saturday, June 16, 2007

Presentations at FOSS4G 2007

I'm giving two talks at FOSS4G 2007, which is taking place here in Victoria in September. The abstracts are:

The JTS Topology Suite: Tools, Tips & Techniques

The JTS Topology Suite is one of the most widely used geometry libraries for Java. This talk will review the standard geometry methods in JTS, with an emphasis on their finer details. Some of the additional algorithms and components which the library provides will be discussed. Tips for improving performance and techniques for accomplishing various kinds of geometric processing tasks will be presented. The talk will conclude with an preview of some potential future developments for JTS.

Automatic watershed delineation using open source Java

The B.C. Corporate Watershed Base is a large-scale database of hydrographic information built using open source products. This talk discusses the development of an open source system for automated delineation of watershed boundaries based on CWB hydrography and a terrain model.


If previous years are anything to go by, this conference should be very informative and energizing. See you there!

Sunday, June 10, 2007

Monotone Chains: the unknown technique

When I was originally developing JTS, I spent quite of bit of time thinking about and researching techniques for efficiently computing arrangements (intersections) of sets of linestrings. There's a relatively large body of literature on this subject, as you might expect since it is such a key aspect of many geometric algorithms. As usual, however, much of this literature is fairly academic in nature, and usually doesn't address some of the more mundane issues which an implementor has to deal with.

One issue that crops up with intersecting sets of linestrings is that there is a tension between increasing performance and decreasing memory use. The fastest way should be to indexing individual line segments. However, this implies creating a new memory structure for each line segment, which can be both time and memory intensive. The most memory-effective structure is of course to leave the original linestrings undivided, but this does not provide good performance.

It seemed to me that there must be an intermediate between these two extremes. After some thought I came up with the idea that I called "monotone chains". A monotone chain is a string of line segments which are monotonic in both X and Y directions. This structure has some nice properties:
  • Monotone chains cannot self-intersect
  • Two monotone chains intersect in at most one connected set of line segments.
  • The envelopes of any two subsequences of a monotone chain are interior-disjoint. This means that the envelopes of successive bisections of the chain form a "pyramid" of envelopes. In turn, this allows comparing two chains for intersection can be carried out using a binary search technique using the two envelope pyramids.
In addition, splitting linear geometry into monotone chains is a good heuristic for partitioning the geometry. Often digitized linework representing natural features contains quite long sequences of segments which are monotone (since the linework follows natural curved features.

I was pretty pleased with my "discovery", and used it to good effect in JTS. For quite a while I was unaware of any prior use of this technique, but recently I came across a reference to a paper by Warren Burton in 1977 which apparently discusses this technique. I was led to this by the paper on Whirlpool by Nick Chrisman et al., which used this technique under the name "monotone sections".

It doesn't come as too much of a surprise that such a simple and useful technique has been used before, but what I do find a bit surprising is how little this technique is discussed in computational geometry literature. I presume the reason for this is that the technique is too simple to be of continuing interest to academics, while it is too special-purpose to be worth presenting to students. There's a dearth of computational geometry literature directed at the true practitioner - which probably indicates how few practitioners there are!

References

Burton, Warren 1977: Representation of many-sided polygons and polygonal lines for rapid processing, Communications ACM, vol. 20, no.3.

Chrisman, N.R., Dougenik, J.A. and White, D. 1992: Lessons for the design of polygon overlay processing from the Odyssey Whirlpool algorithm, Proc. 5th Int. Symp. on Spatial Data Handling, 2, 401-410.

Saturday, June 9, 2007

Quirks of the "Contains" Spatial Predicate

One of the most useful OGC standards is their specification of the Dimensionally-Extended 9 Intersection Model for spatial relationships. (To give credit where is it due, this was originally developed by Egenhofer, Clementini and others). In its most general form this model is quite complex, so to make it usable for mortal programmers, a set of commonly-useful "named predicates" have been specified. These include obvious relationships such as intersects, disjoint, touches, equals, and contains.

But are they really so obvious? In fact not - several of them have subtle aspects to their definition which are contrary to intuition.

In particular, "contains" (and its converse "within") has an aspect of its definition which may produce unexpected behaviour. This quirk can be expressed as "Polygons do not contain their boundary". More precisely, the definition of contains is:

Geometry A contains Geometry B iff no points of B lie in the exterior of A, and at least one point of the interior of B lies in the interior of A

That last clause causes the trap - because of it, a LineString which is completely contained in the boundary of a Polygon is not considered to be contained in that Polygon!

This behaviour could easily trip up someone who is simply trying to find all LineStrings which have no points outside a given Polygon. In fact, this is probably the most common usage of contains. For this reason it's useful to define another predicate called covers, which has the intuitively expected semantics:

Geometry A covers Geometry B iff no points of B lie in the exterior of A

Its converse coveredBy is also useful as well. It's a bit of a mystery why OGC did not define this predicate. In any case, Oracle Spatial does provide this predicate. I have added it to JTS as well.

There's a further bonus to defining the covers predicate. It is much easier to optimize the common use case:

covers(Rectangle, Geometry)

All that is necessary to determine this condition is to perform a simple bounding box comparison. This is not possible with contains, because even if the bounding box of Geometry is covered by the Rectangle, a further expensive operation is required to test if the Geometry lies wholly in the boundary of the Rectangle (in which case the predicate fails).

Covers "simplifies" the defintion of contains by making it more general (inclusive). There's another possibility as well, which is to simplify in the direction of being less inclusive. This would produce a predicate which might be called containsProperly, with the definition:

Geometry A containsProperly Geometry B iff all points of B lie in the interior of A

Interestingly, some recent work I'm doing indicates that this predicate might play a very useful role - but that's a story for another post.


References

DE-9IM: Clementini, Eliseo, Di Felice, and van Osstrom; "A Small Set of Formal Topological Relationships Suitable for End-User Interaction," D. Abel and B.C. Ooi (Ed.), Advances in Spatial Database—Third International Symposium. SSD '93. LNCS 692. Pp. 277-295.

9 Intersection model: M. J. Egenhofer and J. Herring; "Categorizing binary topological relationships between regions, lines, and points in geographic databases," Tech. Report, Department of Surveying Engineering, University of Maine, Orono, ME 1991.

Monday, May 14, 2007

GeoTec 2007 presentation on Automated Watershed Boundary Generation

I'm giving a talk at GeoTec 2007 in Calgary entitled "WaterBuG: An Automated Generator for Watershed Boundaries". This covers a project I've been working on for the last year-and-a-half. It's a very cool application of complex large-scale geo-processing, using a whole assortment of nifty geometric structures and algorithms such as:
  • quad-edge subdivisions for modelling
  • Triangular Irregular Networks (Delaunay triangulations)Delaunay triangulation refinement
  • Medial axis refinement
  • Surface modelling with a TIN
  • Surface hydrological flow modelling over a TIN
  • Flood filling
  • Depth- and breadth-first Graph traversal
  • Topology-preserving Linestring smoothing
  • Discrepancy detection

Here's a couple of screenshots from the talk - I'll post the full presentation sometime soon, somewhere...