Wednesday, 29 April 2009

Levenshtein Distance in Clojure

The Levenshtein Distance is an algorithm for computing the edit distance between two strings. The algorithm gives you a number which represents the number of substitutions, removals and insertions needed to transform one string into another. This has applications in fields as diverse as bioinformatics (distance between two genome sequences) and spell checking (finding out what you meant to type).

The algorithm can be stated very simply for two sequences a and b

  1. If empty(a), then length(b) (length(b) insertions are required)

  2. If empty(b), then length(b) (length(a) insertions are required)

  3. otherwise, calculate the minimum of:

    1. the distance between (rest a) (rest b) + 1 if a[0] = b[0] (substitution)
    2. the distance between a (rest b) (insertion)
    3. the distance between (rest a) a (deletion)

We can state this naively in Clojure.

(defn cost [a b]
(if (= a b) 0 1))

(defn levenshtein-distance
"Calculates the edit-distance between two sequences"
[seq1 seq2]
(empty? seq1) (count seq2)
(empty? seq2) (count seq1)
:else (min
(+ (cost (first seq1) (first seq2)) (levenshtein-distance (rest seq1) (rest seq2))) ;; substitution
(inc (levenshtein-distance (rest seq1) seq2)) ;; insertion
(inc (levenshtein-distance seq1 (rest seq2)))))) ;; deletion

At first glance, this appears OK. The code works:

user> (levenshtein-distance "kitten" "sitting")

And even better, the code actually looks like the specification of the problem. Looking closer, there are a number of problems. It's not tail recursive so the stack is definitely going to be blown for larger strings. Secondly, it has incredibly bad performance due to the recursion repeating the work.

How do we quantify incredibly badly? Let's assume that the cost of a single calculation (without the recursive calls) is 1. Then we can express the cost for any given pair as:

T(X,Y) = 1 + T(X-1,Y-1) + T(X-1,Y) + T(X,Y-1)

X,Y represent the length of the string. We can express this too in some Clojure code.

(defn calculate-cost [x y]
(= x 0) 1
(= y 0) 1
(calculate-cost (dec x) (dec y))
(calculate-cost x (dec y))
(calculate-cost (dec x) y)

Notice that the structure of this function is almost exactly the same as the structure of the function we're trying to measure! So how expensive is this implementation of Levenshtein:

user> (calculate-cost 3 3)
user> (calculate-cost 4 4)
user> (calculate-cost 6 6)
user> (calculate-cost 7 7)
user> (calculate-cost 8 8)

This is an exponential cost function. This basically renders this implementation all but useless for anything other than toy size strings.

So how can we improve this algorithm whilst keeping the behaviour that it actually looks like the definition of the problem? That can wait for next time!