Saturday 18 August 2012

Battle of the SSLs - Round 3

After the warmup of rounds 1 & 2 of the Battle of the Spatial Scripting Languages, it's time to pick up the pace!  The last two rounds were relatively simple tasks which any ETL system worth its keep should be able to tackle.  So here's a more interesting problem:
Given a KML file of volcano locations, and a Shapefile of country boundaries, determine the countries with the greatest density of volcanoes.

It's slightly contrived, but contains a nice variety of spatial and data manipulation tasks, including reading spatial data formats, spatial joins, and grouping and sorting.

Here's the JEQL script:

KMLReader tvol
  file: "";
ShapefileReader tworld file: "geocommons-world.shp";
Mem tworld;

tc = select NAME, first(AREA) area, count(*) num 
    from tvol join tworld 
      on GeomPrep.contains(tworld.GEOMETRY, tvol.geometry) 
    group by NAME;

td = select NAME, num, area, Val.toDouble(num)/area density 
    from tc where area > 0 
    order by density desc 
    limit 20;

HtmlWriter td file: "vol_density.html";

A few points to notice:
  • JEQL can read KML/KMZ files from the Web
  • JEQL's way of  chaining select statements together is more readable and maintainable than the regular SQL nested syntax
  • The first() aggregate function in JEQL allows selecting a value from a non-aggregated column.  This is often very awkward and inefficient to do in regular SQL.
  • The GeomPrep function set uses the JTS PreparedGeometry API to optimize repeated geometry predicate evaluation
And here's the places where you might not want to invest in hillside property:

Dominica              4    75 0.0533333333333333
Grenada               1    34 0.0294117647058824
Tonga                 2    72 0.0277777777777778
St. Vincent and the   1    39 0.0256410256410256
St. Lucia             1    61 0.0163934426229508
Martinique            1   106 0.0094339622641509
Comoros               2   223 0.0089686098654709
El Salvador          17  2072 0.0082046332046332
Western Samoa         2   283 0.0070671378091873
Guadeloupe            1   169 0.0059171597633136
Reunion               1   250 0.004
Vanuatu               4  1219 0.0032813781788351
Cape Verde            1   403 0.0024813895781638
Iceland              24 10025 0.0023940149625935
Solomon Islands       6  2799 0.0021436227224009
Guatemala            21 10843 0.0019367333763719
Costa Rica            9  5106 0.0017626321974148
Japan                64 36450 0.0017558299039781
Armenia               4  2820 0.0014184397163121
Nicaragua            16 12140 0.0013179571663921

Word frequency using JEQL

Ryan Tomayko has a post on how Ruby recapitulates AWK (or to be more biologically accurate, how it carries vestigial traits which reveal its evolutionary lineage from AWK down through Perl).

He gives an example of how curl, AWK, and sort can be chained together to compute word counts for Swift's A Modest Proposal:

curl -s |
ruby -ne '
  BEGIN { $words = }

  $_.split(/[^a-zA-Z]+/).each { |word| $words[word.downcase] += 1 }

  END {
    $words.each { |word, i| printf "%3d %s\n", i, word }
' |
sort -rn

Back in the day I was an enthusiastic user of AWK.  I was happy to discover that  JEQL can be handily used for similar kinds of text processing, when equipped with suitable string handling and RegEx functions. Here's the word count functionality in JEQL (using a source for the text that is more bot-friendly than Project Gutenberg):

TextReader t file: 

t = select String.toLowerCase(splitValue) word from t 
      split by RegEx.splitByMatch(line, "[a-zA-Z]+" );

Print select word, count(*) cnt from t 
        group by word order by cnt desc; 

AWK had a bit of a rep for being somewhat write-only.  To my SQL-attuned eyes the JEQL version is more understandable.

Thursday 2 August 2012

JTS helps ESRI with big data problem

Can't help but feeling a little smug about this post on using JTS in a Hadoop process for generating heatmaps for demographic data.

He only seems to be using JTS for generating point buffers, which is hardly a challenging use case.  But if having a fast point buffer algorithm is a key metric I'm not going to complain.

As usual the link for JTS is pointing to the old Vivid site - the current one is this.