Saturday 18 July 2009

Parsing Logs with Map/Reduce

Map/Reduce is a software pattern for processing large data sets. Popularized by Google, it has since inspired other frameworks, such as Hadoop and CouchDB.

Map/Reduce is based on two very simple functions. Map converts values from one domain to another. Reduce (also known as fold) takes a sequence of values and combines them into a single value; for example, (reduce + [1 2 3 4 5]) translates to 1 + 2 + 3 + 4 + 5. Map is simple to distribute because it is purely functional (no side effects; transforming one value doesn't affect any other). Reduce usually operates per key, which allows the results for different keys to be computed in parallel (though this requires some logic when distributing the workload).
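
A quick illustration in Clojure (inc increments its argument):


(map inc [1 2 3 4 5])    ; => (2 3 4 5 6), each element transformed independently
(reduce + [1 2 3 4 5])   ; => 15, the whole sequence folded into one value
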

I recently had to parse some log files and get an idea of the distribution of some timing information. Usually I'd knock together a Perl script or just use Excel, but this time I thought I'd try it with a functional slant, using Clojure.

The log file was in the following format:


Event 23435
3.4
Event 23511
32


And so on. The actual events are of no consequence; I just wanted to get an idea of what the distribution looked like. With clojure.contrib.duck-streams, reading the file is very simple (read-lines returns a lazy sequence of lines from the file).


(use 'clojure.contrib.duck-streams) ; brings read-lines into scope
(map second (partition 2 (read-lines filename)))


(Update: alternatively, (take-nth 2 (drop 1 (read-lines filename))) does the same thing and feels nicer.)

This returns a lazy sequence of all the timing information. Now all I need to do is build the distribution: use reduce to build up a map from each value (floored into a bucket) to the number of times it occurs. The code below bunches the timings into 10-second chunks and counts the occurrences in each.
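
A sketch of that reduce, assuming each timing parses with Double/parseDouble and values are floored into 10-second buckets (the histogram name is just illustrative; read-lines is the same duck-streams function as above):


(defn histogram
  "Builds a map from 10-second bucket to the number of timings falling in it."
  [filename]
  (reduce (fn [counts line]
            (let [t      (Double/parseDouble line)
                  bucket (* 10 (int (/ t 10)))]   ; floor to the nearest 10 seconds
              (assoc counts bucket (inc (get counts bucket 0)))))
          {}
          (take-nth 2 (drop 1 (read-lines filename)))))


For the sample file above, (histogram filename) returns {0 1, 30 1}: 3.4 falls in the 0 bucket and 32 in the 30 bucket.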

Less than 10 lines of code to produce a histogram of a silly log file format. And using Map/Reduce, I could scale the exact same example up to run over hundreds of gigabytes of data on huge clusters of computers.

This is exactly why, as a software developer, you should be excited about programming languages (not just functional programming!). The quote below is from Joel Spolsky and sums it up well.


Without understanding functional programming, you can't invent MapReduce, the algorithm that makes Google so massively scalable. The terms Map and Reduce come from Lisp and functional programming. MapReduce is, in retrospect, obvious to anyone who remembers from their 6.001-equivalent programming class that purely functional programs have no side effects and are thus trivially parallelizable. (The Perils of Java Schools)