Wednesday 29 April 2009

Levenshtein Distance in Clojure

The Levenshtein distance is the edit distance between two strings: a number representing how many substitutions, deletions and insertions are needed to transform one string into the other. This has applications in fields as diverse as bioinformatics (the distance between two genome sequences) and spell checking (finding out what you meant to type).

The algorithm can be stated very simply for two sequences a and b:

  1. If empty(a), then length(b) (length(b) insertions are required)

  2. If empty(b), then length(a) (length(a) insertions are required)

  3. otherwise, calculate the minimum of:

    1. the distance between (rest a) and (rest b), plus 1 if a[0] ≠ b[0] (substitution)
    2. the distance between (rest a) and b, plus 1 (insertion)
    3. the distance between a and (rest b), plus 1 (deletion)



We can state this naively in Clojure.


(defn cost [a b]
  (if (= a b) 0 1))

(defn levenshtein-distance
  "Calculates the edit-distance between two sequences"
  [seq1 seq2]
  (cond
    (empty? seq1) (count seq2)
    (empty? seq2) (count seq1)
    :else (min
           (+ (cost (first seq1) (first seq2))
              (levenshtein-distance (rest seq1) (rest seq2))) ;; substitution
           (inc (levenshtein-distance (rest seq1) seq2))      ;; insertion
           (inc (levenshtein-distance seq1 (rest seq2))))))   ;; deletion



At first glance, this appears OK. The code works:


user> (levenshtein-distance "kitten" "sitting")
3
user>


And even better, the code actually looks like the specification of the problem. Looking closer, though, there are a number of problems. It's not tail recursive, so the stack is definitely going to be blown for larger strings. Secondly, it has incredibly bad performance, because the recursion repeats the same work over and over.
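
The first problem is easy to see: evaluating the substitution branch first costs one stack frame per character before anything returns, so (depending on your JVM's stack size) something along these lines will die with a StackOverflowError rather than produce an answer:


user> (levenshtein-distance (apply str (repeat 100000 \a))
                            (apply str (repeat 100000 \b)))
;; java.lang.StackOverflowError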

How do we quantify "incredibly bad"? Let's assume that the cost of a single call (ignoring its recursive calls) is 1. Then we can express the cost for any given pair of lengths as:


T(X,Y) = 1 + T(X-1,Y-1) + T(X-1,Y) + T(X,Y-1)


X and Y represent the lengths of the two strings. We can express this in Clojure too:


(defn calculate-cost [x y]
  (cond
    (= x 0) 1
    (= y 0) 1
    :else (+ (calculate-cost (dec x) (dec y))
             (calculate-cost x (dec y))
             (calculate-cost (dec x) y)
             1)))


Notice that the structure of this function is almost exactly the same as the structure of the function we're trying to measure! So how expensive is this implementation of Levenshtein:


user> (calculate-cost 3 3)
94
user> (calculate-cost 4 4)
481
user> (calculate-cost 6 6)
13483
user> (calculate-cost 7 7)
72958
user> (calculate-cost 8 8)
398593


This is an exponential cost function (from the numbers above, each extra character multiplies the work by a factor of more than five), which renders the implementation all but useless for anything other than toy-sized strings.
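
And remember that "kitten" vs "sitting" is only a 6-by-7 problem; plugging those lengths into the same cost model (if I've run the numbers right) gives:


user> (calculate-cost 6 7)
29737


Nearly thirty thousand calls to compare two short words.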

So how can we improve this algorithm whilst keeping the property that it actually looks like the definition of the problem? That can wait for next time!

Upgrading to Ubuntu 9.04

Ubuntu 9.04 was released in April. It doesn't have a huge number of new features, mostly incremental updates (among them improved multi-monitor support, which was my biggest pain point before).

Upgrading from 8.10 was completely seamless and just involved one click (take note, Windows XP upgrades!). And best of all, it comes with the Dust theme available, which looks pretty awesome.

Ubuntu Dust Theme

Sunday 26 April 2009

The Mod Operator and Zeller's Congruence

Java's built-in mod operator (%) behaves differently from Clojure's mod function with negative operands:

Java:    System.out.println("-7 % 2 = " + (-7 % 2)); // -1
Clojure: (mod -7 2) ; 1

Java:    System.out.println("-2 % 7 = " + (-2 % 7)); // -2
Clojure: (mod -2 7) ; 5
Clojure: (rem -2 7) ; -2


The Java behaviour is explained more clearly by the JLS, here. rem is the Clojure equivalent if you want the same behaviour as Java.

Why would the behaviour of mod with negative numbers ever be useful? One algorithm that relies on it is Zeller's Congruence, which allows you to determine the day of the week from a date.

Defining the algorithm in Clojure is simple:


(def days
  {0 'Saturday
   1 'Sunday
   2 'Monday
   3 'Tuesday
   4 'Wednesday
   5 'Thursday
   6 'Friday})

(defn zeller
  "Zeller's congruence. Note that year is the year of the century (0-99)
   and century is the zero-based century (e.g. 19 for 1963)."
  [day-of-month month year century]
  (get days
       (mod (+ day-of-month
               (int (/ (* 26 (inc month)) 10))
               year
               (int (/ year 4))
               (int (/ century 4))
               (- (* 2 century)))
            7)))
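
To see why mod rather than rem matters here, here's a worked example of my own: for 1 March 2000 (day-of-month 1, month 3, year 0, century 20), the sum inside the mod comes out negative, and only mod maps it back to a valid key in the days map:


user> (+ 1 (int (/ (* 26 (inc 3)) 10)) 0 0 (int (/ 20 4)) (- (* 2 20)))
-24
user> (mod -24 7)
4  ; 4 is 'Wednesday, which is correct
user> (rem -24 7)
-3 ; not a key in the days map at all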


The algorithm is slightly strange in that January and February need to be treated as the 13th and 14th months of the previous year. We can encapsulate this ugliness behind a nicer interface.


(def months
  ['January
   'February
   'March
   'April
   'May
   'June
   'July
   'August
   'September
   'October
   'November
   'December])

(def month-to-number
  (zipmap months [13 14 3 4 5 6 7 8 9 10 11 12]))


(defn day-of-week
  [day month year]
  (let [month-num (get month-to-number month)
        ;; January and February count as part of the previous year
        year-int (if (or (= month-num 13) (= month-num 14)) (dec year) year)]
    (zeller day month-num (mod year-int 100) (quot year-int 100))))


Now we can try this out:


user> (day-of-week 22 'November 1963) ; JFK assassination
Friday

user> (day-of-week 20 'July 1969) ; Apollo Moon landing
Sunday

user> (day-of-week 27 'April 2009)
Monday


Of course, any sane person would actually use a built-in library function to do this! My rambling point is just that Java's % is not the same as Clojure's mod!

Wednesday 22 April 2009

Understanding the Y Combinator

Like monads, the Y combinator is a rite of passage for the aspiring functional programmer. So here's my take on it, using Clojure.

A fixed point of a function f is a value p such that f(p) = p. A fixed-point combinator is a function g which produces such a fixed point p for any function f:


g(f)=p
f(p)=p


Why is this important or interesting for a programming language? It's important because it allows you to describe recursion in terms of rewrites, rather than computation steps. This allows for anonymous recursive functions.

So how do we write the Y combinator in Clojure? Taking the lambda-calculus definition and switching the word lambda for fn gives us this incomprehensible mess!


(defn Y [r]
  ((fn [f] (f f))
   (fn [f]
     (r (fn [x] ((f f) x))))))


But what does it actually mean? Probably the best way to understand is with a real example and to do that, we need a recursive function. Given that everyone else uses factorial, we'll use a simpler recursive function that sums up a sequence.


(defn sum-seq [x]
  (if (empty? x)
    0
    (+ (first x) (sum-seq (rest x)))))


In the definition above, we've written the function with explicit recursion. We'll use the Y combinator to get rid of that!

The first step is to rethink the function in terms of a series of functions. Our goal is to express this computation as a series of function calls. We want to write a function that, given a function to compute the sum of a sequence, gives us the next function to compute the sum of a sequence.


(defn sum-seq-fn-gen [func]
  (fn [s]
    (if (empty? s)
      0
      (+ (first s) (func (rest s))))))


So now we have such a function. We can already use it:


user> ((sum-seq-fn-gen nil) [])
0

user> ((sum-seq-fn-gen (sum-seq-fn-gen nil)) [9])
9

user> ((sum-seq-fn-gen (sum-seq-fn-gen (sum-seq-fn-gen nil))) [1 9])
10


It's not that useful yet, as we'd have to type in an awful lot of characters to sum a large list. What we really want is some way of calculating the fixed point of such a function, and that's exactly what the Y combinator gives us.


user> ((Y sum-seq-fn-gen) [1 2 3 4 5])
15

user> ((Y sum-seq-fn-gen) (range 0 1000))
499500


So the Y combinator has given us what we need: given a function and an input, it finds the fixed point. Note that there is no explicit recursion going on; sum-seq-fn-gen could just as well be an anonymous function.


user> ((Y (fn [func]
            (fn [s]
              (if (empty? s)
                0
                (+ (first s) (func (rest s)))))))
       [1 2 3 4 5])
15
15


The best way to understand the Y combinator is to see how it's used and then run through the fun exercise of expanding it by hand and seeing what happens.
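
As a taster of that exercise, here's a sketch (my own informal expansion, not something the REPL prints) of what happens when Y is applied to sum-seq-fn-gen:


;; Let g = (fn [f] (sum-seq-fn-gen (fn [x] ((f f) x))))
;;
;; (Y sum-seq-fn-gen)
;; => ((fn [f] (f f)) g)
;; => (g g)
;; => (sum-seq-fn-gen (fn [x] ((g g) x)))
;;
;; So (Y sum-seq-fn-gen) is sum-seq-fn-gen applied to a function that,
;; when finally called, rebuilds (g g) -- i.e. expands one more layer of
;; (sum-seq-fn-gen ...), exactly like the nested calls we typed by hand.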

Monday 20 April 2009

Intentional Software

Intentional Software is a software company set up by former Microsoft bod, Charles Simonyi. It aims to revolutionize software by minimizing the impedance mismatch between a business problem and the code that solves it. Big aims indeed.

They gave a talk recently which Martin Fowler has blogged/bliki'd(?) about here, together with some nonsensical tweets:


To gauge the reaction, take a look at Twitter.

  • @pandemonial Quite impressed! This is sweet! Multiple domains, multiple langs, no question is going unanswered
  • @csells OK, watching a live electrical circuit rendered and working in a C# file is pretty damn cool.
  • @jolson Two words to say about the Electronics demo for Intentional Software: HOLY CRAPOLA. That's it, my brain has finally exploded.
  • @gblock This is not about snazzy demos, this is about completely changing the world we know it.
  • @twleung ok, the intellisense for the actuarial formulas is just awesome
  • @lobrien This is like seeing a 100-mpg carburetor : OMG someone is going to buy this and put it in a vault!

    Afterwards a couple of people said it was the most important demo they'd ever seen, comparing it even to the Mother of all Demos. For many there was a sense that the whole world of software development had just changed.


This all seems a bit much!

The approaches advocated by Intentional, JetBrains MPS, Model-Driven Architecture and Software Factories all suggest that you can specify what you want at some higher level than code (i.e. in a business domain language), turn the key and get your working product. The biggest problem I see is that requirements change frequently. How does agility fit in with the language workbench model?

Monday 6 April 2009

Proebsting's Law

An interesting law from Todd Proebsting which I hadn't heard before:

advances in compiler optimizations double computing power every 18 years

Compare this to Moore's Law (a doubling roughly every 18 months, so around a 4000x improvement over the same 18 years) and it suggests that research into compiler optimization isn't going to gain us as much as, say, programmer productivity (better tools for writing, analysing and understanding code). This is similar to the arguments made in favour of functional programming languages: it doesn't matter if functional languages are slower at run-time, because the development-time improvement means they win overall.

Saturday 4 April 2009

Twitter using Scala

Twitter was previously using Ruby for large parts of the application, but has recently rewritten some parts in Scala.

An interview with some of the developers over at Artima discusses some of the reasoning behind the move.

It's an interesting decision, seemingly motivated by the need for performance and static typing. Let's look at the static typing reasoning:

As our system has grown, a lot of the logic in our Ruby system sort of replicates a type system, either in our unit tests or as validations on models. I think it may just be a property of large systems in dynamic languages, that eventually you end up rewriting your own type system, and you sort of do it badly. You’re checking for null values all over the place. There’s lots of calls to Ruby’s kind_of? method, which asks, “Is this a kind of User object? Because that’s what we’re expecting. If we don’t get that, this is going to explode.”

I'm not sure I agree with the Twitter guys' reasoning here. Firstly, Ruby code with a bunch of kind_of? calls littered everywhere is almost certainly bad. Perhaps the initial design decisions were not quite right, and the system has evolved and accrued too much technical debt in the process. With the benefit of hindsight you could probably rewrite bits of it in any language and see improvements.

Secondly, what does replicating a type system mean? Before we can replicate a type system, we need to know what one is:

[A type system is a] tractable syntactic method for proving the absence of certain program behaviors by classifying phrases according to the kinds of values they compute. (Pierce 2002)

This sounds an awful lot like how I define a test: I'm looking to show that certain behaviours hold true under whatever conditions I write my test for.

Has using Scala's static typing meant that Twitter can write fewer tests? The flavour of errors that static typing can detect is pretty large, but it's not exhaustive; you've still got to write tests to show that some behaviours work as expected. I can't find any information on this, but it would be really good information to have!

The other argument seems to be performance. Scala (and Clojure) leverage the JVM, and so get the advantage of 10+ years of development on that platform. There's a version of Ruby for the JVM (JRuby), but it was rejected because some third-party Ruby libraries depend on native C extensions that aren't compatible with it.

I'm not sure where I sit yet in the dynamic / static typing debate, and to be honest I'm not sure it's that interesting given the tests-as-types argument (unless the number of tests decreases significantly when you use types). I don't think the Artima article gives us any real answers about whether one is better than the other for Twitter. As I mentioned before, rewriting in any language is probably going to result in cleaner, clearer code, just because of the knowledge accumulated from the first implementation.

Implementing Minilight in Clojure (2)

Any 3D rendering rubbish seems to require a 3D vector class of some description. A quick Google Code Search brings up thousands of results.

The Minilight code is no exception.

For now, my Clojure version will just use the built-in vector type. I'm sure I'll find I need raw Java array access eventually, but I don't want to start with the ugly stuff.

ANSI Common Lisp mentions this explicitly: "Lisp is really two programming languages, a language for writing fast programs and a language for writing programs fast." I think the same applies to Clojure: I know I can always get the raw speed of Java when I need it, but I can stay nice and high-level most of the time (80/20 rule and all that).

So here's the most basic set of vector-like functions I could write for Clojure. The cross product definition is the ugliest out there, probably because it's only valid in 3 (or 7, according to Wikipedia!) dimensions.



(defn dot
  "Dot product of two vectors"
  [v1 v2]
  (reduce + (map * v1 v2)))

(defn cross
  "Cross product of two 3-vectors"
  [v1 v2]
  (vector
   (- (* (get v1 1) (get v2 2)) (* (get v1 2) (get v2 1)))
   (- (* (get v1 2) (get v2 0)) (* (get v1 0) (get v2 2)))
   (- (* (get v1 0) (get v2 1)) (* (get v1 1) (get v2 0)))))

(defn unary-minus
  "Negation of a vector"
  [v]
  (vec (map - v)))

(defn unitize
  "Scale v to unit length; a zero vector is returned unchanged"
  [v]
  (let [len (Math/sqrt (dot v v))
        overlen (if (not (zero? len)) (/ 1 len) 0)]
    (vec (map (fn [x] (* overlen x)) v))))

(defn between
  "Clamp x to the range [mini, maxi]"
  [mini maxi x]
  (min (max x mini) maxi))

(defn get-clamped
  "Clamp each element of v within the corresponding bounds in mini and maxi"
  [v mini maxi]
  (vec (map between mini maxi v)))
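
A few quick REPL checks of my own (not from Minilight) to show the functions behaving sensibly:


user> (dot [1 2 3] [4 5 6])
32
user> (cross [1 0 0] [0 1 0])
[0 0 1]
user> (get-clamped [5 -2 9] [0 0 0] [4 4 4])
[4 0 4]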