Tuesday 25 December 2012

JTS 1.13 Released

I'm to announce that JTS 1.13 has been released.  It's available for download from SourceForge.

There is a long list of new features, enhancements and bug fixes in this release:

Functionality Improvements

  • Changed GeometryFactory.createGeometry() to make a deep copy of the argument Geometry, using the CoordinateSequenceFactory of the factory
  • Added ability to specify a dimension in CoordinateArraySequence
  • Changed Geometry.getEnvelopeInternal() to return a copy of the cached envelope, to prevent modification
  • Added GeometryEditor.CoordinateSequenceOperation to allow easy editing of constituent CoordinateSequences
  • Added GeometryFactory.createPolygon convenience methods which do not require holes to be specified
  • Geometry overlay methods now return empty results as atomic types of appropriate dimension
  • Added RectangleLineIntersector to provide efficient rectangle-line intersection testing
  • Added getOrdinate and setOrdinate to Coordinate
  • Quadtree is Serializable
  • STRtree is Serializable
  • Added max, average and wrap functions to MathUtil
  • Improved WKTReader parse error reporting to report input line of error
  • Improved WKBReader to repair structurally-invalid input
  • Made TopologyPreservingSimplifier thread-safe
  • Added AbstractSTRtree.isEmpty() method
  • Added QuadTree.isEmpty() method
  • Added KdTree.isEmpty() method
  • Added decimation and duplicate point removal to ShapeWriter.
  • ScaledNoder now preserves Z values of input
  • Added instance methods for all Triangle static methods
  • Added CGAlgorithmsDD containing high-precision versions of some basic CG algorithms
  • Added IntersectionMatrix.isTrue() method for testing IM pattern matches
  • Added getRawCoordinates methods to PackedCoordinateSequence concrete classes
  • Modified Geometry.isSimple() to explicity check for simplicity for all types, and support GeometryCollections
  • Improved MCIndexSnapRounder to add nodes only where they are necessary
  • Added CoordinateArrays.removeNull() method
  • Enhanced GeometryEditor to handle null geometries returned from operation
  • Added WKBHExFileReader
  • Added Distance3D operation

Performance Improvements

  • Simplified & improved performance of RectangleIntersects by using new RectangleLineIntersector
  • In RandomPointsInGridBuilder eliminated redundant ArrayList usage
  • In PreparedPolygonIntersects and PreparedLineStringIntersects added check to avoid creating segment index if all test inputs are points
  • In AbstractSTRtree switched to using indexed list access for better performance than using iterators
  • In AbstractSTRtree freed inserted item array after index is built
  • Improved performance of Polygonizer for cases with many potential holes
  • Improved performance for some DD methods by making them final
  • Added fast filter for CGAlgorithmsDD.orientationIndex, and switched to self-operations for DD determinant
  • Changed STRtree.createNode() to use a static class for nodes
  • Changed QuadTree Node to use scalar x and y variables rather than a Coordinate to reduce memory allocation
  • Fixed PreparedGeometry concrete classes to be thread-safe.
  • Fixed SortedPackedIntervalRTree so that it is thread-safe.

Robustness Improvements

  • Switched to using DD extended-precision arithmetic to compute orientation predicate
  • CGAlgorithms.distanceLineLine() improved to be more robust and performant
  • Fixed robustness issue causing Empty Stack failure in ConvexHull for some nearly collinear inputs
  • CGAlgorithms.signedArea() uses a more accurate algorithm

Bug Fixes

  • Fixed Geometry.equalsExact() to avoid NPE when comparing empty and non-empty Points
  • Fixed CascadedPolygonUnion to discard non-polygonal components created during unioning, to avoid failures and provide more desirable behaviour
  • Fixed CentralEndpointIntersector to initialize result correctly
  • Fixed DelaunayTriangulationBuilder.extractUniqueCoordinates(Geometry) to avoid mutating the vertex order of the input Geometry
  • Fixed ConformingDelaunayTriangulationBuilder to allow non-disjoint site and constraint vertex sets
  • Fixed RandomPointsInGridBuilder point generation to use circle constraint correctly
  • Fixed Linear Referencing API to handle MultiLineStrings consistently, by always using the lowest possible index value, and by trimming zero-length components from results
  • Fixed bug in LocationIndexedLine and LengthIndexLine which was causing an assertion failure when the indexOfAfter() method was called with a constraint location which is at the end of the line
  • Fixed bug in STRtree.query(Envelope, ItemVisitor) causing an NPE when tree is empty
  • Fixed issue with creating zero-length edges during buffer topology building under fixed precision, by: adding filter to remove zero-length edges; using a better estimate of scale factor for reducing to fixed precision after initial failure.
  • Fixed TopologyPreservingSimplifier to return a valid result for closed LineStrings with large distance tolerances
  • Fixed TopologyPreservingSimplifier to return an empty result for an empty input
  • Fixed DouglasPeuckerSimplifier to return an empty result for an empty input
  • Fixed MinimumBoundingCircle to correctly compute circle for obtuse triangles.
  • Fixd GeometryPrecisionReducer to use input GeometryFactory when polygon topology is fixed
  • Fixed GeometryNoder bug that was failing to snap to end vertices of lines
  • Fixed Geometry.getCentroid() and Geometry.getInteriorPoint() to return POINT EMPTY for empty inputs
  • Fixed DelaunayTriangulationBuilder to correctly extract unique points
  • Fixed KdTree to correctly handle inserting duplicate points into an empty tree
  • Fixed LineSegment.projectionFactor() to handle zero-length lines (by returning Double.POSITIVE_INFINITY)
  • Fixed LocationIndexedLine to handle locations on zero-length lines
  • Fixed LengthIndexedLine and LocationIndexedLine to handle indexOfAfter() correctly
  • Fixed WKBReader to handle successive geometrys with different endianness
  • Fixed GeometricShapeFactory to correctly handle setting the centre point
  • Fixed GeometryFactory.createMultiPoint(CoordinateSequence) to handle sequences of dimension > 3

API Changes

  • Changed visibility of TaggedLineStringSimplifier back to public due to user demand


  • Added Performance Testing framework (PerformanceTestRunner and PerformanceTestCase)
  • Added named predicate tests to all Relate test cases

JTS TestBuilder

Functionality Improvements

  • Added segment index visualization styling
  • Improved Geometry Inspector
  • Added stream digitizing for Polygon and LineString tools
  • Added output of Test Case XML with WKB
  • Added Extract Component tool
  • Added Delete Vertices Or Components tool
  • Added Geometry Edit Panel pop-up menu, with operations
  • Added Halton sequence functions
  • Added sorting functions
  • Added function for selection of first N components
  • Added CGAlgorithms functions
  • Added ability to paste and load multiple WKBHex geometries

Performance Improvements

  • Using decimation substantially improves rendering time for large geometries.

Bug Fixes

  • Fixed bug in saving XML test files

Thursday 20 December 2012

Convenience trumps all

The always-readable Stephen O'Grady has an insightful post titled Do Not Underestimate the Power of Convenience.  He proposes that an increasingly important factor driving the uptake of software is the developer's drive for convenience.  "Convenience" is an, er, convenient term for things like ease-of-use, power, portability and low barrier to entry.  In the case of software the lowest possible barrier is provided by free open source software, but the principle applies to infrastructure as well (reflected by the rapid uptake of things like cloud computing and BYOD).  As evidence he lists a set of technologies whose prevalence has been driven bottom-up by developers, rather than top-down by corporate fiat.  It reads like the bill-of-materials for IT infrastructure in a start-up: AWS, Linux, dynamic languages, Git, Eclipse, etc.

One reason he gives for this situation is the pleasing (to my ears) observation that "Developers are the new Kingmakers" (which he discusses in detail in another post).  I suspect that these are mutually-reinforcing phenomena.  Developer's drive for convenience has led to the rapid evolution of open source software (and not coincidentally the improvement of the tools which enable its development).  This has led to the current situation where in many cases OSS surpasses commercial offerings, or at least is "good enough" to be used in demanding production environments.  Since the developers are the ones making the software, or at least are the ones who understand it, they need to be involved in the decisions to adopt it (if they aren't spearheading the adoption themselves). (And especially since there are usually no marketing people in sight...)

In fact, I contend that the drive for convenience is the motivation for many of the advances in software and computation.  A classic example is the ongoing quest to increase the concision of computer languages, by making them more expressive and provide a richer computation model.   Another is the evolution of systems which can easily automate mundane tasks (with examples including the rich shell and command-line capabilities of *nix and the increasingly powerful array of build tools - the best of which of course are free and open source.)  Developer's preference for software which is portable and standard is also driven by convenience - everyone wants their current favourite set of software available wherever they happen to be are working (and as expected, open source software typically exhibits the highest degree of portability).

As Larry Wall said, laziness and impatience are virtues for a programmer.

Lazy programmers avoid yak shaving

Taken to the extreme, the drive for convenience is simply another way of stating the ultimate goal of all compute science - to free computation from any limitations of space, time, money and power.  Put this way, O'Grady's thesis is almost a tautology.  But it's a valuable reminder of the constant need to push against the strong opposing forces of commercial interest and bureaucratic inertia.

Wednesday 14 November 2012

YAUSEM (Yet Another US Election Map)

The usual US election map is a starkly simplistic collection of red and blue blobs.  This does reflect the esoteric (to a Canadian) Electoral College first-past-the-post arrangement.  But after seeing how close the actual vote counts were in most states, it seemed to me like this doesn't really reflect the actual political reality of the US.  Really it's pretty much a purely purple country.  John Nelson has a nice map that elegantly visualizes this reality at a county level (and Brian Timoney explains why it's the only map that isn't a lie).

As another attempt at map truthiness, I used JEQL to produce the following map.  It shows actual vote numbers at a state level, color themed along two dimensions:
  • The hue shows the relative proportion of Democrat VS Republican votes (using the now-canonical blue and red).  For reference, Florida is almost exactly 50-50.  This nicely shows that really the US is just varying shades of purple.
  • The saturation is proportional to the relative population of the state.  California is fully saturated, since it's the most populous state.  The the inland Western and the far Northeastern states are pretty pale, since they have fairly low populations.  This is roughly proportional to the weight of the state's Electoral College votes, although there are amusing anomalies.

I make no claim that this map represents any valid statistics - it's just a fun exercise in using JEQL to do spatial visualization.  For reference, the script is:

CSVReader t colSep: "\t" file: "us_vote_raw.txt";

t = select String.trim(col1) name,
    Val.toInt(String.keepChars(col2, "0123456789")) ecVote,
    Val.toDouble(String.keepChars(col4, "0123456789")) demvote,
    Val.toDouble(String.keepChars(col5, "0123456789")) repvote from t;

Print t;

maxVote = val(select max(demvote+repvote) from t);

ShapefileReader tus file: "us_state.shp";

tvote = select name, demvote, repvote, GEOMETRY,
        demvote / (demvote+repvote) demfrac,
        demvote+repvote totvote,        

       (demvote+repvote)/maxVote density
    from t
    join tus on t.name == String.trim(tus.STATE_NAME);

Mem tvote;

tplot = select GEOMETRY,
    #ffffff styleStroke,
    1 styleStrokeWidth
    with {
        clr = Color.interpolate("ff0000", "0000ff", demfrac);
        h = Color.getH(clr);
        s = density;
        v = 1;
        styleFill = Color.toRGBfromHSV(h,s,v);    }   
    from tvote;

extent = BOX(-128 20, -65 50);

Plot    data: tplot
    extent: val(extent)
    width: 800
    height: 400
    background: "0000aa"
    file: "us_vote.png";

The raw data came from Wikipedia via simple cut-and-paste to a text file.

Tuesday 13 November 2012

The Great Geometry Clipping Contest

Don Meltz initiated a fascinating flurry of performance evaluation with his post on Is QGIS a Viable Alternative to ArcGIS? and its followup ArcGIS vs QGIS Clipping Contest Rematch.  He looked at a spatial processing task involving clipping a large dataset of contour lines against a fairly simple polygon.  His conclusion was that QGIS was a lot faster than ArcGIS at performing this task.  His final testing produced a time of 6 min 27 sec for QGIS, versus a time of 1 h 35 min for ArcGIS - and which failed with a topology error!  (Note:  subsequently ESRI reported that they have improved their algorithm to provide much better results for this case - and presumably to enable it to actually complete!).

This inspired a lot of other people to dive in and run the same test (since Don helpfully provided the test data here).  Systems tested include many commercial (ArcGIS, GlobalMapper, Manifold) and FOSS systems (QGIS, PostGIS, SpatialLite, GRASS, OGR, uDig, OpenJUMP, etc).  There's a summary of some of the timing results here (and the many comments to these posts provide lots of different timings on various software and hardware configurations).

On the Java side, Andrea Antonello of jGrass provides an in-depth description of his optimized implementation using uDig, GeoTools and JTS here.  His best result was about 80 sec, using 4 cores and including data I/O. (He later tested on Amazon AWS using 32 cores, giving a 25 sec time).

The SpatialLite wiki has a nice page showing how this problem is tackled in SQL here.  It also has an excellent analysis of what this contest actually demonstrates. The key conclusion is that almost all the open source systems are using JTS or GEOS, so what is really being measured is the effectiveness of the JTS overlay algorithm.

The one exception is GRASS, which uses a completely different topological algorithm.  From the results the performance of this seems similar to or only a bit slower than JTS/GEOS.  This is actually quite impressive since it sounds like the algorithm is computing full topology of the data.  But it's hard to know without a more detailed understanding of how it works.  And in any case, comparing the contest timings is difficult since many different hardware configurations were used.

One aspect of this task is that it is "pleasantly parallel", so implementations which can multi-thread over many cores should see linear performance improvement.  Only some of the systems tested were able to take advantage of this.  In particular, PostGIS and SpatialLite execute single queries in a sequential fashion, which is a shame. Perhaps this will be rectified in future releases?

It's great to see JTS and GEOS used effectively in so many different applications, and that they hold their own against the well-funded competition (and in fact often doing much better).

And here's the kicker - it could run even faster! The JTS overlay algorithm was designed for the single geometry/geometry case.  It has not been optimized for iterated operations against a fixed query geometry.  By using a caching approach similar to the existing JTS PreparedGeometry API, it will be possible to avoid  recomputing polygon topology and thus provide a significant speedup.  Also, the overlay algorithm was designed to handle all geometry types, and in particular the polygon/polygon case, which is the most demanding situation.  It could be optimized to handle simpler cases (such as the polygon/line overlay of this task) in a faster way.  All it will take is some time, money, and some hard thinking...

Thursday 18 October 2012

State Of The Map US - 2012

Thanks to OpenGeo, last weekend I attended the OpenStreetMap State of the Map (US) conference in Portland, Oregon. It was a great conference to attend.  Below are some of my takeaways.

[Obligatory disclaimer: these are my opinions, not necessarily those of OpenGeo]

It was an interesting change from being at software conferences like FOSS4G.  The conference was mostly about software, but from a reverse perspective. I
nstead of new software ideas in search of applications and data, the perspective was of a huge, freely-available, rapidly growing dataset and how this encourages the development of innovative software to manage, display and analyze it.  One result of this was less of a "tribal" split among the attendees. (A tribal rift that might have appeared is choice of mapping engine, but just about everyone there was in the Mapnik tribe. I had to dig to find GeoServer or MapServer users.)

What was similar to FOSS4G was the notable excitement about transforming established business models - but with different players, organizations which use (and usually pay for) road and transit data, rather than established software vendors. Also similar was the feeling of community, the energy and passion of the people involved, and the sense of being involved in something that is a radically new and more empowering way of building something that is essential in people’s lives.
(See the presentation When Google Maps Gives you Lemons, make Lemonade for a great example of why and how this is happening).

Incidentally, the pre-conference party was at GeoLoqui, who later that weekend announced their acquisition by ESRI. Apparently the big money hadn't started flowing their way yet, since they ran out of beer just after we got there.

  • about 225 attendees
  • 3 tracks of presentations over 2 days
  • 3rd SOTM-US.  First one was 2 years ago, with only 40 attendees

The list of talks with links to some presentations.

Comments on Selected Talks

The other thing similar to FOSS4G is that there are many more good talks than one person can possibly take in. Here's notes from ones I did attend:

  • Steve Coast (formerly of Cloudmade, now at Microsoft) gave his 1000th "Founder of OSM" keynote.  He said that the top thing OSM needs to continue to grow is addressing.  He singled out Google Mapmaker as being a clear and present danger to the growth of OSM.
  • Check the great video Address is Approximate, mentioned by Henk Hoff in his keynote
  • Dane Springmeyer and Artem Pavlenko from MapBox presented on new features of Mapnik.  
    • Ability to define raster compositing operations between map layers.  Showed using it to obtain tint bands, as well as some other examples which are maybe more interesting than essential.  
    • Vertex converters, similar to GeometryTransformations in GeoServer.  Uses include simplification (various simplification methods were tested - not clear which are available) and smoothing
    • Format support: CSV, GeoJSON, Python
  • Nathan Kelso and Michal Migurski from MapBox presented on how to prepare OSM and terrain data to make a purty base map.  A lot of work is required, and most of it is slow/complex and thus has to be done offline (the beauty of tiles...).  They have a clever simulated-annealing based labeller with a cool video.
  • AJ Ashton from MapBox talked about OSM data preparation for effective cartography.  It makes the good point that good multi-scale cartography takes a lot of pre-rendering data prep, as well as additional data sources to OSM (such as NaturalEarth for placename priority).  
  • MapBox and the Knight Foundation announced their big grant ($575K) to “improve OSM tools”.  There was a palpable sense of concern in the community attending about what this would mean for the current developer community.  Apparently there are still hard feelings about past experiences with CloudMade and MapQuest.  
  • Tom Macwright from MapBox talked about OSM infrastructure.  I had no idea it was run on such a minimal infrastucture (3 servers in London).  The codebase (known as “RailsPort”) is pretty Rube(y) Goldbergian.  The bus factor is a bit too low for comfort.  The good news is that the data fits in a terabyte, so there are mirrors all over the place.
  • Steve Coast raised an interesting question wondering if it would improve the OSM codebase if it had reusability/portability as an explicit goal.  You might think it would be rare to find other people who need to run a massive crowd-sourced map of the world, but Eric Wolf of USGS says that they run a fork of OSM internally to support their mapping.  He did say that it was difficult to contribute back to the trunk.
  • Nathan Van Der Wilt presented his ArgyleTiles project to create a tiled base map of earth imagery.  A great idea, but one that’s been tried before (where have you gone, OpenAerialMap?).
  • David Turner from OpenPlans discussed using OpenTripPlanner with OSM.  They have highly-tuned route planning rules which work with OSM tags.  OpenTripPlanner is a very interesting and useful project.  They make a fair bit of use of JTS and spatial algorithms (such as linear referencing and concave hulls)
  • Portland TriMet presented on their use of OSM and OpenTripPlanner to provide extremely high-quality routing (much better than Google in many cases, particularly for multi-modal involving human-powered transport).  They did a lot of work to improve OSM data in Portland to make routing work better. They emphasized that this was not that expensive - 4 interns over a few months to improve and QA the entire area.  This is an ongoing project, and is done interatively in conjunction with the OpenTrip Planner, to identify routes which show up as taking longer than expected. TriMet also presented on their extensive use of open data for mapping.
  • There were a couple of talks from TeleNav, a commercial traffic reporting and routing company. They have 30M cllients globally reporting and using traffic data. Sounds like they are switching to use OSM data.  They are putting a lot of work into error detection and data cleaning.  They have created an open alternative to the TeleAtlas/NavTeq TMC spec (which is licensed) call TTL.  This creates a standard set of road segments which is the basis for traffic data collection and reporting.  
  • Martijn Van Exel is a longtime OSM user and advocate, now based in Salt Lake City.  He is doing a lot of work on error detection and reporting in OSM, notable creating the Remap-A-Tron.  He’s using GeoServer (yay!).  I told him about the new heatmap rendering, which might be of interest for visualization
  • Jeff Meyer presented on spatio-temporal applications in education and the humanities.  He made a plea to model and capture temporal attributes in OSM, to preserve its value into the future.
  • Abe Usher had a hilarious presentation on Heatmaps for data visualization.  
  • Ben Standefer from Urban Airship talked about how location-based push to mobile devices is becoming big business.  They are using OSM for things like identifying neighbourhood polygons, and POI (points of interest) polygons (eg arenas).  They’re using the JTS STRtree spatial indexing (yay! Blows me away that it is suitable for hard production use)
  • Alex Barth from MapBox talked about Carmen, an open-source geocoder they are developing.  It seems more like a reverse geocoder - most of what he talked about was how to identify named locations from points.  Nothing was said about the hard parts of address geocoding (such as parsing, error handling, address models, and fuzzy matching)

Monday 10 September 2012

Halton sequences, at last

A while ago I posted on generating random point sets, and Sean Gillies suggested Halton sequences as a way of generating nice-looking point distributions.  I finally got around to looking into this idea. Indeed he was correct - Halton sequences provide a very pleasing appearance for point distributions (much better than purely random distributions).
The above shows the first 1000 elements of the Halton sequence with bases (2,3).  Larger bases exhibit progressively more coherence, which may or may not be desirable depending on application. Here's H(7,9):
 The code to generate Halton sequences is pretty trivial - see the Wikipedia article for some pseudo-code.  (Although I have to say that the explanation of their derivation is pretty poor.  I've seen this sad trend on some other Wikipedia mathematical articles as well - in particular the one for Barnes surface interpolation.  I suppose the rejoinder would be to get in there and improve it!)

Keying off something else that Sean showed on his blog, I also played around with using Halton sequences to generate quasi-random sets of polygons.  Here's one example, using a slightly different approach to Sean's:

And the same for H(5,7), which looks slightly nicer I think:

Might come in handy someday for generating test data, or possibly as a visual texture.

The complexity of simple

Recently Sean Gillies and others have been looking at the concept of geometric simplicity as implemented in JTS/GEOS, specifically in the case of polygons.  The TL;DR is that JTS has a bug because it reports the following polygon as simple:

POLYGON ((39.14513303 23.75100816, 97.05372209 40.54550088, 105.26527014 48.13302213, 100.91752685 58.43336815, 71.56081448 83.55114801, 60.71189168 86.25316099, 62.00469808 75.1478176, 83.16310007 42.82071673, 92.82305862 37.19175582, 95.99401129 26.47051246, 106.22054482 15.51975192, 39.14513303 23.75100816))

This polygon is invalid, because the shell ring self-intersects.  This means the ring is non-simple - but it's not so clear-cut for the polygon itself.

In fact, reporting the polygon as simple isn't a bug - it was a deliberate design decision.  The original rationale was:

  • the OGC Simple Features Specification in section 2.1.10 defining polygon geometries states that "Polygons are simple geometries".  (This is consistent with the requirement that valid polygons contain no self-intersections).  
  • JTS usually follows the principle that computed results are only well-defined when the input geometries are valid.  This is because it can be expensive to check for validity in order to handle invalid inputs gracefully.  Validity testing is factored out into a separate method (isValid) in order to allow the client to decide when to incur the code of executing it.
For these reasons the initial JTS implementation of isSimple returned true for all polygonal inputs, thus avoiding the cost of checking for self-intersections.

However, perhaps it would be more useful to actually carry out some testing of simplicity.  This raises the question of what exactly are the semantics of isSimple for polygons?  Looking to the SFS for guidance, it has the following somewhat imprecise general definition of simplicity:
the Geometry has no anomalous geometric points, such as self intersection or self tangency. The description of each instantiable geometric class will include the specific conditions that cause an instance of that class to be classified as not simple.
For polygons the spec simply states:
Polygons are simple geometries
(In contrast, for linear geometry the specification has a rigorous mathematical definition of simplicity.)

Apart from the "always simple" semantics, there are at least two other options:
  1. Polygons are simple if their component rings are simple (i.e. have no self-intersections)
  2. Polygons are simple if they have no topological anomalies.  This is equivalent to checking whether the polygon is valid.
Because option #2 simply replicates the semantics of isValid, it seems more useful to adopt option #1 to provide more possibilities for analyzing polygon topology (but I'm definitely open to suggestions about why #2 would be better!)

This change to the semantics of isSimple has been implemented in JTS and will appear in the 1.13 release.  The code will also be extended to handle GeometryCollections, with the semantics that they are simple if all their components are simple.

We're all in deep map now

A few thoughts on the Atlantic article on GooMaps...

  • I love the term "the deep map".  I'm not doing GIS any more - I'm doing deep mapping!
  • Yay!  Geospatial conflation gets a mention in the MSM!
  • The article name-checks Borge's map.  The original story pointed out the paradox of a 1:1 scale map - but this is only paradoxical in 2-space.  In the infinite-dimension virtual world it's possible to have maps that are more detailed than the physical world (in that they can represent relationships across time and other more abstract dimensions)
  • That is some great synergy between StreetView data acquisition and Google's work on self-driving cars
  • The author is convinced that Google has an unassailable lead on building the virtual map of the world.  But without knowing what's going on at Apple (and perhaps Microsoft) I wouldn't be so sure about that.  (And even Google isn't as transparent as this realtime view of OSM edits.)
As always, James Fee has a pithy comment and a great picture. The good news is you get a ping pong table in the office.  The bad news is that 2000 more map edits just came in, so get back to work!

Saturday 18 August 2012

Battle of the SSLs - Round 3

After the warmup of rounds 1 & 2 of the Battle of the Spatial Scripting Languages, it's time to pick up the pace!  The last two rounds were relatively simple tasks which any ETL system worth its keep should be able to tackle.  So here's a more interesting problem:
Given a KML file of volcano locations, and a Shapefile of country boundaries, determine the countries with the greatest density of volcanoes.

It's slightly contrived, but contains a nice variety of spatial and data manipulation tasks, including reading spatial data formats, spatial joins, and grouping and sorting.

Here's the JEQL script:

KMLReader tvol
  file: "http://www.volcano.si.edu/ge/GVPWorldVolcanoes-List.kmz";
ShapefileReader tworld file: "geocommons-world.shp";
Mem tworld;

tc = select NAME, first(AREA) area, count(*) num 
    from tvol join tworld 
      on GeomPrep.contains(tworld.GEOMETRY, tvol.geometry) 
    group by NAME;

td = select NAME, num, area, Val.toDouble(num)/area density 
    from tc where area > 0 
    order by density desc 
    limit 20;

HtmlWriter td file: "vol_density.html";

A few points to notice:
  • JEQL can read KML/KMZ files from the Web
  • JEQL's way of  chaining select statements together is more readable and maintainable than the regular SQL nested syntax
  • The first() aggregate function in JEQL allows selecting a value from a non-aggregated column.  This is often very awkward and inefficient to do in regular SQL.
  • The GeomPrep function set uses the JTS PreparedGeometry API to optimize repeated geometry predicate evaluation
And here's the places where you might not want to invest in hillside property:

Dominica              4    75 0.0533333333333333
Grenada               1    34 0.0294117647058824
Tonga                 2    72 0.0277777777777778
St. Vincent and the   1    39 0.0256410256410256
St. Lucia             1    61 0.0163934426229508
Martinique            1   106 0.0094339622641509
Comoros               2   223 0.0089686098654709
El Salvador          17  2072 0.0082046332046332
Western Samoa         2   283 0.0070671378091873
Guadeloupe            1   169 0.0059171597633136
Reunion               1   250 0.004
Vanuatu               4  1219 0.0032813781788351
Cape Verde            1   403 0.0024813895781638
Iceland              24 10025 0.0023940149625935
Solomon Islands       6  2799 0.0021436227224009
Guatemala            21 10843 0.0019367333763719
Costa Rica            9  5106 0.0017626321974148
Japan                64 36450 0.0017558299039781
Armenia               4  2820 0.0014184397163121
Nicaragua            16 12140 0.0013179571663921

Word frequency using JEQL

Ryan Tomayko has a post on how Ruby recapitulates AWK (or to be more biologically accurate, how it carries vestigial traits which reveal its evolutionary lineage from AWK down through Perl).

He gives an example of how curl, AWK, and sort can be chained together to compute word counts for Swift's A Modest Proposal:

curl -s http://www.gutenberg.org/files/1080/1080.txt |
ruby -ne '
  BEGIN { $words = Hash.new(0) }

  $_.split(/[^a-zA-Z]+/).each { |word| $words[word.downcase] += 1 }

  END {
    $words.each { |word, i| printf "%3d %s\n", i, word }
' |
sort -rn

Back in the day I was an enthusiastic user of AWK.  I was happy to discover that  JEQL can be handily used for similar kinds of text processing, when equipped with suitable string handling and RegEx functions. Here's the word count functionality in JEQL (using a source for the text that is more bot-friendly than Project Gutenberg):

TextReader t file: 

t = select String.toLowerCase(splitValue) word from t 
      split by RegEx.splitByMatch(line, "[a-zA-Z]+" );

Print select word, count(*) cnt from t 
        group by word order by cnt desc; 

AWK had a bit of a rep for being somewhat write-only.  To my SQL-attuned eyes the JEQL version is more understandable.

Thursday 2 August 2012

JTS helps ESRI with big data problem

Can't help but feeling a little smug about this post on using JTS in a Hadoop process for generating heatmaps for demographic data.

He only seems to be using JTS for generating point buffers, which is hardly a challenging use case.  But if having a fast point buffer algorithm is a key metric I'm not going to complain.

As usual the link for JTS is pointing to the old Vivid site - the current one is this.

Tuesday 3 July 2012

Global Earthquakes since 1898

This post shows off a very cool image of global earthquakes since 1898.  

No link to the source dataset, though.  That's unfortunate, because it would be really cool to see this as an animation, and as a KML file (with timestamps, to allow temporal scrolling).  Hmmm... sounds like a job for JEQL.

Update:  the post does give a link to ANSS, which has a service making the data available in a variety of formats, including KML.  It doesn't have a timestamp for temporal scrolling, however - and perhaps a different way of styling might be interesting.  A heatmap, perhaps?

Wednesday 27 June 2012

The Roshambo-bots are coming!

The Japanese have figured out a way to let me avoid taking the garbage out on rainy winter evenings - a Rock-Paper-Scissors robot that always wins!

It actually cheats, though - it reads your mind to determine what move you're going to throw.

Now, enough playing with the easy stuff - get back to work making a Go program!  Or at least Rock-Paper-Scissors-Lizard-Spock...

Thursday 17 May 2012

A scientific basis for Open Source Software

Stefan Steineger of the OpenJUMP project pointed out this great paper in Nature on The case for open compute programs.  The paper raises the argument for open source software to a higher plane, that of being a necessary component of scientific proof.  It points out that the increasing use of computational science as a basis for scientific discovery implies that open source must become a standard requirement for documentation.  Apparently some journals such as Science already require source code to be supplied along with submissions of articles.  Amongst other advantages, access to source code is an essential element of peer review.

An interesting example they mention is the infamous HadCRUT and CRUTEM3 meteorological datasets.  One of the (few) salient criticisms levelled at this information during Climategate was the inability to  reproduce the results by re-running the software. (Mind you, the software was probably a pile of crufty old Fortran programs mashed up by Perl scripts, so maybe it's just as well).

I'm looking forward to seeing JTS get cited in academic papers (actually, it already has been).  Maybe I even have a finite Erdos number!

It's maybe too much to ask that mere scientists be coding hipsters, but I noticed that SourceForge is presented as the leading example of collaborative software development.  Someone should introduce them to GitHub - which truly walks the talk.  Researchers in bioinformatics should be especially appreciative of the sweeping effect of recombinant software development it enables.

Why we were stronger in the 80's

I love this picture. Nowadays this functionality weighs only a few ounces, and we carry it in a pocket.

Sunday 22 April 2012

Grandpa's Googler

A peerless pastiche by SWWTMTOTH...  Try it, it works!  Go ahead, make it your homepage!

Hands up all those who remember modem whistles.  Takes me back...    For maximum verisimilitude, every other time you hit this site it should give you a busy signal.

Little known fact - DSL and cable modems actually make the same sounds, they're just pitched too high to hear  8^

Wednesday 18 April 2012

FOSS4G-NA 2012 review

Last week I was at FOSS4G-NA in Washington DC.  It was my first time in DC, and my first FOSS4G as a member of the OpenGeo team. Both were very pleasant experiences!

One of the keynotes was by Josh Berkus of PostgresQL talking about "firehose databases".  FOSS4G is always a firehose conference, and this one was no exception.  I missed many more talks than I wanted to take in, for various reasons - the 3 simultaneous tracks, my own commitments to talks and OpenGeo work, and often because it's just such a good chance to talk to great people from all around the FOSS4G world.

Here are some of the my conference highlights:
  • another inspiring keynote from Paul Ramsey, on the origins and future of open source
  • the Ignite Spatial talks, which were fast, fascinating and funny
  • a panel discussion which hit on the need for a replacement for the shapefile (PLEASE!)
  • hearing Josh Berkus talk about the impressive new features coming in PostgresQL 9.2 (JSON, index-only scans, SP_GIST, to name a few)
  • Andrew Turner's fastpacedpresentationonGeoIQ
  • meeting Kevin Web and hearing more about OpenTripPlanner
  • discussing skeletonization, long-distance hiking, and sewage treatment with Stephen Mather 
  • talking to Josh Marcus about GeoTrellis, an engine for scalable geoprocessing
  • talks on using NoSQL DBs such as MongoDB and HBase for storing and querying spatial data
  • a discussion with Javier  de la Torre of CartoDB about the need for a spatially-sensitive sampling technique (a la Oracle's SAMPLE clause) in PostGIS.  We ended up discussing this with Josh Berkus (since he was sitting at the next table!) and apparently there is some work starting on this in PostgreSQL.  Hopefully this will extend to spatial datasets as well.
  • and of course, meeting lots of people using JTS!

Here's a few other reviews from around the web:

Friday 9 March 2012

Battle of the SSLs - Round 2

In a comment to Battle of the Spatial Scripting Languages, Sean suggested that a slightly more in-depth test of language expressiveness might be this task, involving a sequence of CRS transformation and geometry cleaning.  Here's the JEQL equivalent:

ShapefileReader t file: "test_uk.shp";

trans = select * except GEOMETRY, 
               CRS.transform(GEOMETRY, "epsg:4326", "epsg:27700") geom from t;

clean = select * except geom, isValid ? 1 : 0 isValid, cleanGeom
    with { 
        isValid = Geom.isValid(geom);
        cleanGeom = isValid ? geom : geom; //Geom.buffer(geom, 0);
    } from trans;

ShapefileWriter clean file: "test_uk_valid.shp";

The code is spread out a bit for clarity.  It could have been done in single SELECT statement, apart from the isValid flag added for reporting purposes.  Note the use of the JEQL extension with clause, which improves code clarity, and also allows a more imperative style of coding which can sometimes be preferable.

Fiona gets points for being able to define any function directly in the language, but there's a price to be paid in code complexity.

Here's the result (seen in the new JEQL Workbench Geometry Viewer):

The TI-58/59 programmable calculators

Mobile computing is nothing new!  My first programmable device was a TI-58 calculator:

I spend many happy hours punching in the code for Lunar Lander (and less time actually playing it).  Wish I still had my coding sheets for the various apps I dreamed up.  (For some reason, the convention then was to write the key codes vertically, rather than horizontally.)

TI had their own app store, selling Solid State SoftwareTM on plug-in ROM modules.  Of course I got the Leisure one, which had a few lame games on it.  Then as now, though, I had a lot more fun coding up new apps than playing games...

The calculator cognoscenti will notice the image is actually the TI-59, with the built-in magnetic card reader for offline permanent storage.  Had one of those too, once I got tired of eternally punching in programs. The mag card reader was pretty finicky, but it sure was cool filling up that little vinyl case with all that software that never got used...

A couple of years later HP released the iPhone of the calculator era - the HP-41C, which had an alphanumeric display!  HP was just as opinionated as Apple, in their own way - if you wanted one of their gorgeous units, you had to grok RPN notation. Kinda reminds me of my current effort of learning git.  I never did get one of those units, because by that time it was a lot more rewarding to punch code into a real microcomputer....

Reminiscing about the PC Pre-Cambrian explosion

With the mobile meteorite looming ever larger over the landscape of Planet PC, it seems like a good time to recall some memories of the last major transition in the computer world. I call this the Pre-Cambrian explosion of Personal Computers.  It took place from the mid-70's to the early 80's, when dozens of entrepreneurial companies sprang up all hoping to hit the magic combination of hardware and software and sell a zillion units.

In a way, it was exactly the inverse of what is happening now. Back then, a myriad of bizarre life forms sprang up to fill the niche exposed by the advent of the microprocessor, only to be driven into extinction by the arrival of a behemoth with superior reproductive capability.

Whereas today, the new ecological zone was initially colonized by a single fierce predator, who may yet out-compete all other arrivals trying to crawl out onto the shore of the new territory.

On to a series of occasional posts on the Pre-Cambrian PC era...

Thursday 8 March 2012

Java's vision for the future

Just noticed this slideshow from QCon 2012 on the vision for Java's evolution over the next few versions.

Some highlights of JDK 8:
  • closures via lambda expressions
  • filter-map-reduce framework
  • modules
and further out:
  • JNI goes away!
  • support for big in-memory data (arrays) and data structure optimizations
A key driver of these enhancements is the transition to multi-core and the end of the free performance lunch.  This is a great graph that puts it in perspective - those flat lines are the lunch bill arriving.

QCon always has excellent presentations, so I'm sure there's some other thought-provoking slideware on that site as well.

Tuesday 6 March 2012

Battle of the Spatial Scripting Languages!

The gloves are down in this exchange on gis.stackexchange, over LOC counts for Fiona, JEQL, and GeoScript-JS.

Fiona takes one on the chin in the first round!  But Sean recovers quickly, and makes a hit by pointing out that Fiona allows defining new functions right in the scripting language.  This would be feasible in JEQL as well, by providing hooks to into JVM-based scripting languages.  And it's also very easy to develop extensions as simple Java functions or classes.

Now, this is really just a schoolyard scuffle over a very simple use case for spatial scripting.  What the fans really want to see is a face-off over a more complex task...

Wednesday 22 February 2012

A cool application of JTS Voronoi diagrams

I just noticed this series of posts by Stephen Mather on generating medial axes of polygons by creating Voronoi diagrams on the densified linework. 

If I read correctly, he's doing this via Python GeoScript, which uses JTS for the heavy lifting.

As he says, this is a venerable technique ( hack) for creating medial axis approximations.  Regardless of its computational soundness (and hey, whatever works), it's a great example of using a series of complex geometric operations to compute a useful result.

I'll have to try this in JEQL....

(Stephen seems to have moved on to implementing the Scale Axis Transform in PGSQL - very impressive!)

Tuesday 21 February 2012

Barnes Analysis for Surface interpolation

Recently I've been working on generating surfaces from irregular sets of data observations, for thematic rendering purposes.  The data I'm working with is meteorological measurements (e..g max and min temperature, average rainfall, etc.).  Here's a sample:

Maximum Daily Temperatures for November 30, 2010

There are many, many different ways of interpolating surfaces - Wikipedia has a long list.  It turns out that a classic approach for interpolation of meteorological data is Barnes Interpolation (also called Barnes Analysis).  Barnes Interpolation is a surface estimating technique which uses an initial pass to provide a smoothed estimate of the surface, and optionally further refinement passes to improve the surface to more closely match the observation values.  The initial pass produces an averaged (smoothed) surface, using a summation of exponential (Gaussian) decay functions around each observation point.  Subsequent refinement passes compute an error surface using the delta between the previous estimated surface and the observations.  The error surface is added to the previous estimate surface to refine the estimate (by reducing the delta to the observations).

I'm more of a vector guy than a raster guy, but I figured that Google and Wikipedia would make short work of this - after all, how hard can raster be?  Unfortunately, the Wikipedia article is not very helpful - the description is unclear, and it doesn't even define all the symbols used in the presented equations.  I found some better references here and here, but they still contained some confusing discrepancies in the equations.  Barnes' original paper is online as well, but as is often the case with primary sources, it is more concerned with the physical properties of the analysis, and doesn't have a very crisp description of the actual algorithm.

Triangulating between the various sources, I eventually arrived at the following algorithm definition (provided herewith without warranty!):

Barnes Interpolation Algorithm

For the first pass, the estimated value Eg at each grid point gx,y is:

   Eg = ( wi * oi) / ( wi )

  • oi is the value of the i'th observation point
  • wi is the weight (decay function) value for the i'th observation point, defined as
      wi = exp(- di2   /   L2C  )

  • di is the distance from the grid point to the i'th observation point
  • L is the length scale, which is determined by the observation spacing and the natural scale of the phenomena being measured.  
  • C is the convergence factor, which controls how much refinement takes place during each refinement step.  In the first pass the convergence factor is 1. For subsequent passes a value in the range 0.2 - 0.3 is effective.
During refinement passes the estimate at each grid point is re-computed using:

  E'g = Eg + ( wi * (oi - Ei) ) / ( wi )

  • Ei is the estimated value at the grid cell containing the i'th observation point

The target development platform is Java (of course).  For prototyping purposes I used JEQL to read sample datasets, run the algorithm with various parameter choices, and produce plots to visualize the results.  Here's the surface plot for the dataset above:

This looks pretty reasonable, and provides a nice visualization of the large-scale data trend. (Surprise!  It's hotter down south...)  The ultimate goal is much more exciting.  The algorithm will be provided as a Rendering Transformation in GeoServer, allowing surfaces to be generated dynamically from any supported datastore.  This will open the way to providing all kinds of further surface visualizations, including other interpolation techniques - as well as everyone's favourite, Heat Maps (aka Multivariate Kernel Density Estimation).

Now, the pure mathematical description of the algorithm is sufficient for a proof-of-concept implementation producing single images.  But to make it usable in a dynamic way in a high-demand environment, there are more details which need to be worked out to improve performance and functionality.  For instance:
  • How large an extent of data points is necessary to produce a stable surface for a given area?  This is important for dynamic presentation, since zooming, panning and tiling should not change the appearance of the surface.  Of course, all data points could always be used, but more selectivity improves performance.
  • How far should the surface be extrapolated into areas of few or no data observations?  Apart from the validity implications, limiting the extent of the computed surface provides a boost in performance. 
  • A further performance increase may be gained by computing the surface on a coarse grid, then up-sampling in an appropriate way to the display resolution.
  • For data which is global in scope, how does the algorithm need to be changed to accommodate computation on the spheroid?  It seems likely that it's necessary to use a geodetic distance calculation, and there may be other impacts as well.

Friday 27 January 2012

Let's do it again at FOSS4G-NA!

FOSS4G 2011 in Denver last September was such a good time, we're going to do it all over again at FOSS4G North America 2012

It will be interesting to see who and how many attend.  Some conference budgets might have been blown last year - but note that this event is cleverly timed to fall into a new fiscal!  And being in D.C. there should be lots of suits and spooks in attendance.

I'm definitely looking forward to building on the momentum from the last conference, especially given my new gig with OpenGeo (ok, I'm definitely not booth-babe material - but being a booth-geek is probably more fun).