Saturday, 14 February 2009

Huffman Encoding

Huffman Encoding is a simple technique for lossless data compression. The idea is simple; replace frequently occuring symbols with short bit patterns and infrequently occuring symbols with longer ones.

Firstly, we must produce a frequency table that gives weights for each symbols. Here's a version (update 14/2/9) Turns out there was a much better implementation of doing this (frequencies) already in Clojure contrib so I'll use that and save the world from seeing my bad version. Notes on why it was bad at end.

For example:


user> (frequencies "aaaabbbbcdddde")
{\e 1, \d 4, \c 1, \b 4, \a 4}


Once we've got the frequencies, we can construct a Huffman Coding Tree. The algorithm description (from Wikipedia) is:



  1. Create a leaf node for each symbol and add it to the priority queue.
  2. While there is more than one node in the queue:

    1. Remove the node of highest priority (lowest probability) twice to get two nodes.
    2. Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes' probabilities.
    3. Add the new node to the queue.

  3. The remaining node is the root node and the tree is complete.



This tree has the property that the path to each node has a unique prefix. We can translate this directly into Clojure as:


(defn coding-tree
"Given an ordered frequency list, create an encoding tree"
[open]
(prn open)
(if (> (count open) 1)
(let [new-node (apply tree-node (take 2 open))]
(recur (add-to-queue new-node (drop 2 open))))
(first open)))


Where add-to-queue simply inserts a node in the right place. See huffman.clj for full code.

The coding tree isn't enough on its own, we have to change this in to a map from symbol to bit-pattern. To get the bit pattern for any node we start from the root and follow a route to the symbol in question. When we take a left node we get a "1" and a right branch gets a "0". The lookup function takes an encoding tree and flattens it into a map.


(defn- lookup-helper
[tree path]
(if (nil? tree)
nil
(let [v (first (first tree))]
(lazy-cat (if (= v \*) nil (list [v path] ))
(lookup-helper (left-node tree) (cons 0 path))
(lookup-helper (right-node tree) (cons 1 path))))))

(defn lookup
[tree]
(into {} (lookup-helper tree nil)))


Lazy functions ensure that we don't get a stack overflow. The defn- indicates that lookup-helper is a private function.

Finally we need a function that given a sequence and an encoding table gives us the encoded series of bits.


(defn huffman-compress
[s table]
(mapcat (partial get table) s))


Note that the sequence and the encoding table don't have to be the same. If, for example, the data to compress was in the English language, then you could use a known Huffman table based on Frequency Analysis of a typical corpus.

So how much compression can we get? Let's look at an example:



user> (let [x "busy busy bee"]
(compress x (huffman-coding-table x)))
(1 0 0 0 1 1 1 0 1 1 1 0 1 1 0 0 0 1 1 1 0 1 1 1 0 1 1 0 0 0 1 0 0 1)

user> (count *1)
34


So "busy busy bee" encoded to 34 bits (*1 is used to refer to the last evaluated expression at the REPL). Compared to the 13*8 bits this would take with ASCII this is a good saving. How do we fair with bigger texts? Let's try Hamlet.


user> (time (let [x (slurp "/home/jfoster/Desktop/2ws2610.txt")]
(count (compress x (huffman-coding-table x)))))
"Elapsed time: 592.317906 msecs"
921595

user> (* 8 (count (slurp "2ws2610.txt")))
1544656


A pretty big saving again (down from ~1.5 million bits to 900000 bits). Note that in all these savings I'm not including the size of the tree!

In this use a symbol is a character, we could use words instead to get bigger savings (we wouldn't have to change the code at all). PKZIP use Huffman in their arsenal of compression techniques (see LZ77 and LZ78 for other examples).

As a side note, why was my version of frequencies less than good? (bad version preserved for posterity here).

  • I didn't know assoc took an optional argument representing the default, this means I could have avoided special casing the first occurrence of a symbol.
  • I didn't use a standard idiom (reduce) when I could have done. Counting frequencies is simply taking a big sequence and making a different one.


I should spend some more time reading source code - any other improvements that you can see are gratefully accepted!