Lin.ear th.inking: computation

Showing posts with label computation. Show all posts

Monday, 24 October 2011

(r-i-p (john mccarthy))

John McCarthy, inventor of LISP, has died at age 84.

Having finally balanced his parentheses, his atoms can now be returned to the free list.

Jobs, Ritchie, McCarthy - so many of the greats passing....

Thursday, 7 April 2011

Slope/Aspect/Elevation using JTS

David Skea is a longtime colleague, JTS contributor, and all-around geospatial guru. He's developing a Slope/Aspect/Elevation service to be used in forestry-related applications here in British Columbia. His work is a nice example of using JTS to perform real-world spatial processing.

The key to computing slope, aspect and elevation is to have a Digital Elevation Model (DEM) available for the terrain in the area of interest. In this case, the terrain data is provided by the TRIM irregular DEM, which is a dataset of over 500 million mass points covering all of British Columbia. To compute the required values for a given polygon, the first step is to extract only the mass points in the immediate region of the polygon (i.e. using the polygon envelope with a suitable buffer distance. The JTS Delaunay Triangulation API is used to compute a TIN triangulation from the TRIM mass points.

Using a JTS PointOnGeometryLocator, the triangles whose centroids lie in the polygon are selected. Since each mass point has an elevation value, each TIN triangle is located in 3D space. The normal vector can be computed for each triangle, which provides the slope and aspect values. The elevation is the height of the triangle centroid.

The SEA values for the triangles are averaged to compute the overall slope, aspect and elevation value for the polygon.

Here's a screenshot showing the results, using Google Earth as a convenient 3D visualization tool (click for full-size image).

Tuesday, 28 December 2010

Goodbye LAMP, hello SMAQ

A great article on the emerging SMAQ stack for big data.

I wonder if PIG does spatial?

Friday, 9 July 2010

Is JSON the CSV of the 21st Century?

It strikes me that JSON might be the CSV of the 21st century. Consider these similarities:

They both use POT (plain old text) as their encoding
The basic datatypes are strings and numbers. JSON adds booleans and nulls - Yay for progress!
The only schema metadata supported is field names

JSON has the major advance of supporting hierarchical and array structures. Of course, this makes it correspondingly more difficult to parse.

CSV has stood the test of time extraordinarily well. According to good 'ol Wikipedia it's been around since at least 1967 - that's over 40 years!

And CSV is still well supported whereever tabular data is used. Let's see if JSON is still around in 2035... I suspect not, because the half-life of technologies is a lot shorter these days. (Maybe CSV is the stromatolite of file formats!)

It would be nice if JSON had a standard schema notation. (There is JSON-Schema. It copies the XML Schema idea of encoding the schema in JSON. It remains to be seen whether this makes it as easy to use and popular as XML Schema has been. 8^)

And why, o why, do field names have to be in quotes? Ok, I know why technically - they're just strings, and JSON strings need to be in quotes, because they need to be embeddable in JavaScript code. But this is a classic case of a vestigial artifact which has a detrimental effect in a new environment. Nobody should be evaling JSON as just another chunk of Javascript anyway, for obvious security reasons. And in the wider world of JSON use cases following Javascript synax is completely irrelevant.

YAML seems to have a lot advantages over JSON as a rich textual format. For instance, it has minimal use of quotes, and a richer, extensible set of datatypes including timestamps and binary (WKB, anyone?). But it's going to be pretty hard to dislodge JSON, which is solidly entrenched for all the wrong reasons.

Tuesday, 15 June 2010

Improving DoubleDouble performance with self-modifying methods

In a previous post I discussed how the DoubleDouble extended-precision API provides more robust computation for the Delaunay inCircle test. Quite rightly there were questions about the performance impact of using DoubleDouble. Of course, there is a performance penalty, but it's not as large as you might think, since the inCircle computation is only a portion of the cost of the overall Delaunay algorithm . For larger datasets the performance penalty is about 2x - which seems like an acceptable tradeoff for obtaining robust computation.

But after a bit of thought I realized that there was a simple way to improve the performance of using DoubleDouble. I originally designed the DoubleDouble API to provide value semantics, since this is a 100% safe way of evaluating expressions. However, this requires creating a new object to contain the results of every operation. The alternative is to provide self-modifying methods, which update the value of the object the method is called on. In many arithmetic expressions, it's possible to use self-methods for most operations. This avoids a lot of object instantiation and provides a significant performance benefit (even with Java's ultra-efficient object allocation code).

I added self-method versions of all the basic operations to the DoubleDouble API. While I was at it I threw in operations which took double arguments as well, since this is a common use case and allows avoiding even more allocations (as well as simplifying the code). The self-methods are all prefixed with the word "self", to make them stand out since they are potentially dangerous if used incorrectly. (I suppose a more Java-esque term would be "this-methods", but in this case the Smalltalk argot seems more elegant).

I also decided to rename the class to DD, to improve the readability of the code. Another enhancement might be to shorten the method names to 3 characters (e.g. "mul" instead of "multiply").

As an example, here's the inCircle code in the original implementation and using the improved API. Note the use of the new DD.sqr(double) function to improve readability as well as performance.

Original code


public static boolean isInCircleDDSlow(
 Coordinate a, Coordinate b, Coordinate c,
 Coordinate p) {
DD px = DD.valueOf(p.x);
DD py = DD.valueOf(p.y);
DD ax = DD.valueOf(a.x);
DD ay = DD.valueOf(a.y);
DD bx = DD.valueOf(b.x);
DD by = DD.valueOf(b.y);
DD cx = DD.valueOf(c.x);
DD cy = DD.valueOf(c.y);

DD aTerm = (ax.multiply(ax).add(ay.multiply(ay)))
   .multiply(triAreaDDSlow(bx, by, cx, cy, px, py));
DD bTerm = (bx.multiply(bx).add(by.multiply(by)))
   .multiply(triAreaDDSlow(ax, ay, cx, cy, px, py));
DD cTerm = (cx.multiply(cx).add(cy.multiply(cy)))
   .multiply(triAreaDDSlow(ax, ay, bx, by, px, py));
DD pTerm = (px.multiply(px).add(py.multiply(py)))
   .multiply(triAreaDDSlow(ax, ay, bx, by, cx, cy));

DD sum = aTerm.subtract(bTerm).add(cTerm).subtract(pTerm);
boolean isInCircle = sum.doubleValue() > 0;

return isInCircle;
}

Improved code


public static boolean isInCircleDDFast(
 Coordinate a, Coordinate b, Coordinate c,
 Coordinate p) {
DD aTerm = (DD.sqr(a.x).selfAdd(DD.sqr(a.y)))
   .selfMultiply(triAreaDDFast(b, c, p));
DD bTerm = (DD.sqr(b.x).selfAdd(DD.sqr(b.y)))
   .selfMultiply(triAreaDDFast(a, c, p));
DD cTerm = (DD.sqr(c.x).selfAdd(DD.sqr(c.y)))
   .selfMultiply(triAreaDDFast(a, b, p));
DD pTerm = (DD.sqr(p.x).selfAdd(DD.sqr(p.y)))
   .selfMultiply(triAreaDDFast(a, b, c));

DD sum = aTerm.selfSubtract(bTerm).selfAdd(cTerm).selfSubtract(pTerm);
boolean isInCircle = sum.doubleValue() > 0;

return isInCircle;
}

The table below summarizes the performance increase provided by using self-methods for inCircle. The penalty for using DD for improved robustness is now down to only about 1.5x slowdown.

Timings for Delaunay triangulation using inCircle predicate implemented using double-precision (DP), DoubleDouble (DD), and DoubleDouble with in-place operations (DD-self). All times in milliseconds.

# pts	DP	DD-self	DD
1,000	30	47	80
10,000	334	782	984
10,000	7953	12578	16000

Monday, 26 April 2010

Late night link roundup

Here's a few interesting links that occupied my late-night browsing...

Facebook's Graph API: The Future Of Semantic Web?

This is interesting for two reasons. One is that this is a hardball play by Facebook to subvert many other sites whose business model is linking social networking to categories of cultural artifacts. The other is the implications for the Semantic Web. I'm not so sure that the latter is quite so easily accomplished, but perhaps the 80-20 rule will truly turn out to be key here.

Mahout 0.3: Open Source Machine Learning

Neat stuff. Pulls together some fascinating technologies like Hadoop and clustering techniques. Perhaps the best thing about this is the chance to learn more about how this stuff actually works in practice. It would be great if there's a spatial component to this.

Colt - a set of Open Source Libraries for High Performance Scientific and Technical Computing in Java. One more nail in the old "Java is too slow" canard.

Saturday, 6 March 2010

Open Source Geocoders

One of the more interesting projects we have going on here at Refractions is to build a geocoder for use in a crime-mapping application we are developing for a client. We do have an existing geocoder codebase developed for another project. But we're not 100% happy with its performance and customizability, so we decided to look into developing a new library specifically for this project.

Of course the first thing we did was carry out a technology review of all the open-source geocoders we could find. Here's a list of all the ones we looked at:

Geo-Coder-US - A Perl module developed by the ubiquitous Schuyler Erle. "For geocoding US addresses, that is, estimating the latitude and longitude of any street address or intersection in the United States, using the TIGER/Line data set". Probably no longer being developed, since it has been superseded by

GeoCommons Geocoder::US - a rewrite of Geo-Coder-US into Ruby (and also requiring C and SQLite). "Although it is primarily intended for use with the US Census Bureau’s free TIGER/Line dataset, it uses an abstract US address data model that can be employed with other sources of US street address range data"

JGeoCoder - A Java API loosely modelled after Geo::Coder::US. Works against a SQL database loaded with TIGER data (an H2 image is supplied). Last activity in 2008.

Explorer GeoCoder by SRC - A C++ library for "a data and country independent geocoding engine" which can "assign latitude and longitude coordinates to any United States street address or intersection". Has an active mailing list.

Frost Tiger Geocoder by Stephen Frost et al - a Postgres SQL library for geocoding against TIGER data

(This is summarized in tabular form on this Tsusiat page).

Some observations:

All of the engines implement parsing and matching logic purely in code. None of them provide a declarative description language to allow easy modification of parsing, standardization, and matching rules. (To be fair, this is bit of a tall order. And it's not clear that it's even possible to provide an understandable declarative language for the fully general case. For example, the ArcMap geocoder (which appears to be the old MatchWare engine) provides a geocoding definition language (actually 5 different ones) - but the languages look scarily complex! Nonetheless, this is an important feature for easy of maintenance and customization.)
JGeoCoder uses a large number of complex regular expressions to perform parsing. This looks like it would be difficult to customize, due to the well-known opaqueness of large REs, and perhaps also to the relative inflexibility of the RE paradigm
The GeoCoder::US Ruby module seems to be the simplest code base. (I ended up almost understanding its parsing algorithm 8^) It uses REs, but in a saner amount. However, it's unclear how well it deals with erroneous input data, and how easy it would be to modify for a different address model.
The Explorer geocoder uses a large amount of fairly complex C++ code. It also looked quite challenging to understand and modify.
In all the projects the parser design appears to be fairly ad-hoc and poorly documented. This situation doesn't inspire confidence that it would be possible to modify the parser to support a different address model, or to handle particular kinds of input errors. (GeoCoder::US is a possible exception to this - it has a relatively simple parsing algorithm with at least some documentation).

In the end we decided not to use any of these projects. I'll talk about what we did do in another post.

Tuesday, 21 October 2008

Nostalgic Trivia

When I was but a wee nerdling, I took a course taught by a grizzled veteran of the computer industry. I can no longer remember the subject matter of the course, but I do remember that at one point he referred to the main players in the computer business as "IBM and D'BUNCH". D'BUNCH were DEC, Burroughs, Univac, NCR, Control Data, and Honeywell. (And yes, I had to resort to Wikipedia to remember all these names.)

The moral of the story? Computer manufacturers come and go, but IBM remaineth eternal, apparently.

Dating myself even more, in the early part of my career I used machines made by the first 3 of these. The Univac had the distinction of having the most obtuse, unwieldy, difficult-to-use OS I have ever encountered. DEC, in contrast, had the best OS (of proprietary ones, that is - *nix blows 'em all away).

Thursday, 14 August 2008

Revolution is Happening Now

This great graph shows why we're all going to wrassling with parellelization for the rest of our coding lives:

Source: Challenges and Strategies for High End Computing, Kathy Yelick

Also see this course outline for an sobering/inspiring view of where computation is headed.

Monday, 4 August 2008

Krusty kurmudgeon Knuth kans kores & kommon kode

Andrew Binstock has an interesting interview with Don Knuth.

Professor Knuth makes a few surprising comments, including a low opinion of the current trend towards multicore architectures. Knuth says:

To me, it looks more or less like the hardware designers have run out of ideas, and that they’re trying to pass the blame for the future demise of Moore’s Law to the software writers...

But then he goes on to admit:

I haven’t got many bright ideas about what I wish hardware designers would provide instead of multicores...

Which seems to me to make his complaint irrelevant, at best. (Not that I don't sympathize with his frustration about banging our heads against the ceiling of sequential processing speed. And as the sage of combinatorial algorithms he must be more aware than most of us about the difficulties of taking advantage of concurrency.)

Another egregious Knutherly opinion is that he is "biased against the current fashion for reusable code". He prefers what he calls "re-editable code". My thought is that open source gives you both options, and personally I am quite happy to reuse, say, the Java API rather than rewriting it. But I guess when most of your code is developed in your own personal machine architecture (Knuth's MIX) then it's a good thing to enjoy rewriting code!

In the end, however, I have to respect the opinions of a man who has written more lines of code and analyzed more algorithms than most of us have had hot dinners. And I still place him in the upper levels of the pantheon of computer science, whose books all programmers would like to have seen gracing their shelves - even if most of us will never read them!

And yes, this post is mostly an excuse for some korny alliteration - but the interview is still worth a read. (Binstock's blog is worth scanning too - he has some very useful posts on aspects of Java development.)

Thursday, 1 May 2008

Database Architecture monograph

In the era of cloud computing, map/reduce, dynamic languages and the semantic web, the stalwart Relational Database Management System is looking a bit fusty. But RDBMSes were perhaps the earliest widely-deployed example of many of the techniques of distributed computing, concurrent programming, and query optimization that are still highly relevant.

Hellerstein, Stonebraker and Hamilton have published a monograph on Architecture of a Database System. With names like that involved you'd expect high quality and some deep insight, and the article delivers. It's a good, accesible summary of the state-of-the-art in RDBMS technology.

JAQL pegs the cool technology mashup meter

JAQL is a query language and engine which has XQuery-like syntax, SQL-like operators, JSON as a native data format, and runs using the Hadoop ma/preduce framework. (Although not mentioned explicitly, it's probably great for social networking as well...)

You might think that JAQL and JEQL were separated at birth, but they actually have no genetic material in common. But it's interesting to see the J*QL acronym space being rapidly populated. The best two vowels are now gone - who's going to be next to pile in?

Wednesday, 30 April 2008

Taxonomy of Software Bugs

This software bug taxonomy is great! Heisenbugs, Bohr Bugs, and the dreaded Schroedinbugs....

Tuesday, 29 April 2008

Ted Nedward takes on the Tower of Babel

Here's another (as usual) fascinating, detailed, doesn't-this-guy-work-for-a-living post from Ted Nedward. This one starts as a meta-critique of Groovy VS Ruby and morphs into an interesting summary of what the Tower Of Babel IT department is using this year.

Money quote:

I wish I could get back to [C++]for a project in the same way that guys fantasize about running into an old high school girlfriend on a business trip.

Personally, my reaction to my old C++ girlfriend would be "TG I didn't get hitched to this chick - she's way too high-maintenance". Although Ted says she's changed...

Friday, 18 April 2008

Who's conspicuously absent from the PaaS fray?

Here's a hint: who was using the slogan "The Network is the Computer" 10 years ago? And who was the first company to deliver a RIA technology?

So why have they been MIA in the PaaS goldrush?

Let's think about this another way. What database are people most likely to run on their slice -o'Linux-in-the-cloud? mySQL perhaps? Which was just bought by...?

The Register has an article about a possible JavaOne announcement about how this situation might change (with a leaked slide presentation! Fell off the back of an ftp packet, I guess..)

(The weird thing is is that the presentation talks only about PostgreSQL. An old file? Or a different corporate camp? Didn't get the memo maybe?)

Tuesday, 15 April 2008

Is that cloud on the horizon going to start raining applications?

Timothy O'Brien speculates that the transition to cloud-based computing is happening sooner than expected. He's talking about the new integration between Salesforce.com (which is apparently the poster child for SaaS) and Google Apps (the poster child for desktop replacement by the Web). And he generalizes this to include EC2, SimpleDB, and the "twenty or thirty other companies that are going to join the industry".

He also warns here that this transition could transform the model for software development in ways uncomfortable for IT professionals.

He could be right. Cloud computing does seem to be poised to finally provide the right platform to suck the juice out of corporate data centres. The idea of virtual everything certainly has an appeal (especially to someone like me who is basically a software guy).

But questions occur... Salesforce and Google seem like a perfect match - but what about the other companies that want a piece of this action? Does it matter that you will have to commit everything to a given cloud platform? And what happens if that platform goes away? The more advantage you take of the cloud, the bigger the pain when it disappears. And what about apps which are a bit more specific than CRM (which in my naive view seems like just a fancy Contacts list - and hence an obvious and easy thing to integrate with an office suite).

Tim would probably call these kinds of questions "self-interested observations from one with the most to lose". He mentions a Salesforce meeting where business types applaud a sign showing "Software" with a big red slash through it... Well, maybe. Last I noticed no-one has quite managed to automate generating code from requirements documents (let alone automating the generation of implementable requirments documents out of people's heads 8^). So I would say it's more like "different software" than "no software".

One thing's for sure.. there's going to be some gigantic platform turf wars going on up there in the stratosphere.

(One big disappointment - it sounds like the Salesforce platform is based on their proprietary Apex language. Ugh. Just what the world needs - one more language to debate over. At least Google App Engine picked a real language for their launch!)

Saturday, 22 March 2008

Has Automation affected YOUR job?

Can't resist posting this (from Modern Mechanix).

I'm definitely enjoying not having grimy hands, but what happened to the shorter work week and more leisure time?

Wednesday, 30 January 2008

The End of an Architectural Era?

Stonebraker et al have an interesting paper pointing out the antiquity and consequent limitations of the classic relational database architecture in today's world of massive disk/cycles/core.

If they are correct (and in spite of the recent MapReduce blunder Stonebraker has made a lot of great calls in the DB world), the world of data management is going to get awfully interesting in the coming years. The DBA's & DA's of this world have been living a relatively comfortable existence compared to those who are wandering across the stormy badlands of the middle tier. But Stonebraker postulates at least 5 radically variant database architectures to address specific use cases of data management. This would seem to lead to a much more complex world for data architects. But maybe a windfall for the consultants who find their niche?

Monday, 19 November 2007

Enumerating permutations using inversion vectors and factoradic numbering

Permutations are a fundamental concept in discrete mathematics, and crop up in numerous places in computer science. The set of all possible permutations of n items is usually denoted S_n (since the set forms a group which is isomorphic to the Symmetric group of order n). The number of permutations in this set is n!.

This apparently simple concept poses some unexpected challenges. For instance, the following useful operations are messy to compute via the brute-force way of enumerating successive permutations:

List all the permutations in S_n (in lexicographic order)
Find the i'th permutation in S_n

This would be straightforward if there was a way of effectively enumerating all permutations; i.e. a mapping between S_n and the integers {1, 2, ..., n!}.

It turns out that this can be done using a pair of ingenious concepts known as the inversion vector and factoradic numbering. These can be used to provide an enumeration of S_n,conveniently in lexicographic order. (Inversion vectors are also known as inversion tables and Lehmer codes - the term vector seems to me to be most appropriate for the nature of the structure.)

A permutation can be written as a sequence of numbers

P = (p₁ p₂ ... p_n) where p_i is in {1, 2, ..., n}

An inversion is a pair (p_i, p_j) where i < j but p_i > p_j. We define d_i to be the number of inversions in the sequence where the first element is p_i:

d_i = | { p_j : i < j , p_i > p_j } |

Then the inversion vector for the permutation is the sequence

D = (d₁, d₂, ..., d_n) (d_i is in the range {0, 1, ..., n-i})

It is fairly straightforward to recover the original permutation from the inversion vector. This paper gives a definition of the forward and reverse mapping algorithms.

There are n! possible sequences D. They are in one-to-one relationship with the set of permutations S_n, and their natural order gives a lexicographic ordering of S_n. So we're halfway to the enumeration that we are looking for.

Inversion vectors can in turn be mapped to the integers in {1, ..., n!} by using the concept of factoradic numbering. A factoradic number is a number made up of a sequence of digits where the radix for digit d_i is (n-i)!. This Wikipedia article explains the concept in detail.

Here's an example:

P = (4 6 2 5 3 1 8 7) is a permutation in S₈.

Its inversion vector is L(P) = [3,4,1,2,1,0,1,0].

The factoradic representation of this is

3×7! + 4×6! + 1×5! + 2×4! + 1×3! + 0×2! + 1×1! + 0×0! = 18175

So inversion vectors and factoradic numbering provide an elegant way of enumerating permutations. But what does this have to do with geospatial processing? That will be a subject for another post...

Thursday, 13 September 2007

Quote of the day - C. A. R. Hoare

There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.

- C. A. R. Hoare

Monday, 24 October 2011

Thursday, 7 April 2011

Tuesday, 28 December 2010

Friday, 9 July 2010

Tuesday, 15 June 2010

Monday, 26 April 2010

Saturday, 6 March 2010

Tuesday, 21 October 2008

Thursday, 14 August 2008

Monday, 4 August 2008

Thursday, 1 May 2008

Wednesday, 30 April 2008

Tuesday, 29 April 2008

Friday, 18 April 2008

Tuesday, 15 April 2008

Saturday, 22 March 2008

Wednesday, 30 January 2008

Monday, 19 November 2007

Thursday, 13 September 2007

About Me

Martin Davis

Popular Posts

Blog Roll

Blog Archive

Labels

Links

Books

Blog Roll

Cloudy Blog Roll