Saturday 18 August 2012

Word frequency using JEQL


Ryan Tomayko has a post on how Ruby recapitulates AWK (or to be more biologically accurate, how it carries vestigial traits which reveal its evolutionary lineage from AWK down through Perl).

He gives an example of how curl, Ruby (run AWK-style with -ne, BEGIN and END), and sort can be chained together to compute word counts for Swift's A Modest Proposal:

curl -s http://www.gutenberg.org/files/1080/1080.txt |
ruby -ne '
  BEGIN { $words = Hash.new(0) }

  $_.split(/[^a-zA-Z]+/).each { |word| $words[word.downcase] += 1 }

  END {
    $words.each { |word, i| printf "%3d %s\n", i, word }
  }
' |
sort -rn

Back in the day I was an enthusiastic user of AWK, so I was happy to discover that JEQL can be handily used for similar kinds of text processing when equipped with suitable string handling and RegEx functions. Here's the word count functionality in JEQL (using a source for the text that is more bot-friendly than Project Gutenberg):

TextReader t file: 
  "http://www.victorianweb.org/previctorian/swift/modest.html";

t = select String.toLowerCase(splitValue) word from t 
      split by RegEx.splitByMatch(line, "[a-zA-Z]+" );

Print select word, count(*) cnt from t 
        group by word order by cnt desc; 
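
For readers who don't know JEQL, the query above can be read as three steps: split each line into words, lowercase them, then group, count and sort. Here is a rough Ruby sketch of those same steps on an in-memory sample. It assumes RegEx.splitByMatch yields each substring that matches the pattern; the word_counts helper and the alphabetical tie-break are my own illustrative additions and are not part of JEQL or of Tomayko's post.

```ruby
# Word frequency as explicit steps, loosely mirroring the JEQL query.
# word_counts is a hypothetical helper name used only for illustration.
def word_counts(lines)
  # Extract every run of letters from every line (assumed to correspond
  # to the "split by RegEx.splitByMatch" step), then lowercase each word.
  words = lines.flat_map { |line| line.scan(/[a-zA-Z]+/) }.map(&:downcase)

  # Count occurrences per word (the "group by word" with count(*) step).
  counts = words.each_with_object(Hash.new(0)) { |word, h| h[word] += 1 }

  # Most frequent first (the "order by cnt desc" step); ties are broken
  # alphabetically, which is an assumption added to make output stable.
  counts.sort_by { |word, cnt| [-cnt, word] }
end

sample = ["A modest proposal", "a Modest idea"]
word_counts(sample).each { |word, cnt| printf "%3d %s\n", cnt, word }
```

Unlike the one-liner at the top of the post, this version extracts the matches of [a-zA-Z]+ rather than splitting on the non-letters between them, so it never produces an empty word for a line that starts with punctuation.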

AWK had a bit of a rep for being somewhat write-only.  To my SQL-attuned eyes the JEQL version is more understandable.

4 comments:

Unknown said...

Really nice blog. I really appreciate your concern about this topic, and I want to share something about frequency distribution tables: in statistics, a frequency distribution is an arrangement of the values that one or more variables take in a sample.

Frank Hardisty said...

Martin,

Inspirational work as always! JEQL looks very interesting, particularly the implementation of table-based programming. Thank you for describing it.

If JEQL is ever released as open source, I'd be over the moon and start to use it immediately. So, please do sing out if that ever looks possible.

regards,
-Frank

Frank Hardisty said...

Martin and All,

Just a quick follow-up. This document:
http://foss4g-na.org/wp-content/uploads/2012/03/JEQL_Language_for_Spatial_Processing_2012.pdf

indicates that JEQL will be open source "soon". So that is very encouraging.

regards,
-Frank

http://tsusiatsoftware.net/jeql/main.html

Dr JTS said...

Frank,

Glad you like the look of JEQL. I am definitely intending to open-source it - and I'll try and make that happen ASAP.