Firstly, we must produce a frequency table that gives a weight for each symbol. There's already a frequencies function in Clojure contrib, so I'll use that and save the world from seeing my bad version (notes on why it was bad are at the end). For example:
user> (frequencies "aaaabbbbcdddde")
{\e 1, \d 4, \c 1, \b 4, \a 4}
Once we've got the frequencies, we can construct a Huffman Coding Tree. The algorithm description (from Wikipedia) is:
- Create a leaf node for each symbol and add it to the priority queue.
- While there is more than one node in the queue:
  - Remove the node of highest priority (lowest probability) twice to get two nodes.
  - Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes' probabilities.
  - Add the new node to the queue.
- The remaining node is the root node and the tree is complete.
This tree has the property that the path to each symbol is unique and no symbol's code is a prefix of another's, so the bit patterns can be decoded unambiguously. We can translate this directly into Clojure as:
(defn coding-tree
  "Given an ordered frequency list, create an encoding tree"
  [open]
  (prn open) ; debug output: show the queue at each step
  (if (> (count open) 1)
    (let [new-node (apply tree-node (take 2 open))]
      (recur (add-to-queue new-node (drop 2 open))))
    (first open)))
Where add-to-queue simply inserts a node into the queue at the right place to keep it ordered (see huffman.clj for the full code). The coding tree isn't enough on its own: we have to turn it into a map from symbol to bit pattern. To get the bit pattern for any symbol we start from the root and follow the route down to that symbol; taking the left branch adds a 0 and the right branch adds a 1.
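The helper functions (tree-node, add-to-queue, left-node and right-node) aren't shown in the post, so here's a minimal sketch of what I assume they look like. The node representation ([[symbol weight] left right], with \* marking internal nodes) is my guess from how lookup-helper reads it, not the actual huffman.clj code:

;; Sketch only: my guess at the helpers huffman.clj provides.
;; A node is [[symbol weight] left right]; leaves have no children and
;; internal nodes use \* as a placeholder symbol.
(defn- weight [node] (second (first node)))

(defn tree-node
  "Combine two nodes into an internal node weighted by their sum."
  [a b]
  [[\* (+ (weight a) (weight b))] a b])

(defn add-to-queue
  "Insert a node into the queue, keeping it ordered by weight."
  [node queue]
  (sort-by weight (cons node queue)))

(defn left-node  [tree] (get tree 1))
(defn right-node [tree] (get tree 2))

;; The initial queue is then a weight-ordered seq of leaves, e.g.
;; (coding-tree (sort-by weight (map (fn [[s n]] [[s n]]) (frequencies "aaaabbbbcdddde"))))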
The lookup function takes an encoding tree and flattens it into a map:
(defn- lookup-helper
  [tree path]
  (if (nil? tree)
    nil
    (let [v (first (first tree))]
      (lazy-cat (if (= v \*) nil (list [v path]))
                (lookup-helper (left-node tree) (cons 0 path))
                (lookup-helper (right-node tree) (cons 1 path))))))

(defn lookup
  [tree]
  (into {} (lookup-helper tree nil)))
Lazy functions ensure that we don't get a stack overflow. The defn- indicates that lookup-helper is a private function. Finally, we need a function that, given a sequence and an encoding table, gives us the encoded series of bits:
(defn huffman-compress
  [s table]
  (mapcat (partial get table) s))
Note that the encoding table doesn't have to be built from the sequence being compressed. If, for example, the data to compress was English text, then you could use a known Huffman table based on frequency analysis of a typical corpus.
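As a rough sketch of that idea, using the compress and huffman-coding-table names from the REPL session below (the corpus string is made up, and symbols that never appear in the corpus get no code, so a real implementation would need an escape mechanism):

(let [corpus "a reasonably representative sample of english text"
      table  (huffman-coding-table corpus)]
  ;; Every character of "a sample of text" appears in the corpus, so each
  ;; one has an entry in the table; characters with no entry would simply
  ;; be dropped by mapcat, which is why an escape code matters in practice.
  (compress "a sample of text" table))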
So how much compression can we get? Let's look at an example:
user> (let [x "busy busy bee"]
        (compress x (huffman-coding-table x)))
(1 0 0 0 1 1 1 0 1 1 1 0 1 1 0 0 0 1 1 1 0 1 1 1 0 1 1 0 0 0 1 0 0 1)
user> (count *1)
34
So "busy busy bee" encoded to 34 bits (
*1
is used to refer to the last evaluated expression at the REPL). Compared to the 13*8 bits this would take with ASCII this is a good saving. How do we fair with bigger texts? Let's try Hamlet.
user> (time (let [x (slurp "/home/jfoster/Desktop/2ws2610.txt")]
              (count (compress x (huffman-coding-table x)))))
"Elapsed time: 592.317906 msecs"
921595
user> (* 8 (count (slurp "2ws2610.txt")))
1544656
A pretty big saving again (down from ~1.5 million bits to ~920,000 bits). Note that in all these savings I'm not including the size of the tree!
Here a symbol is a character; we could use words as symbols instead to get bigger savings (we wouldn't have to change the code at all). PKZIP uses Huffman coding as part of its arsenal of compression techniques (see LZ77 and LZ78 for other examples).
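For instance, here's a sketch of the word-based variant. The whitespace split is my own addition, and it assumes huffman-coding-table and compress accept any seq of symbols, which is the point: the code doesn't care whether a symbol is a character or a string.

(require '[clojure.string :as str])

;; Same pipeline as before, but the symbols are whole words rather than
;; characters. Only the tokenisation step is new.
(let [words (str/split (slurp "2ws2610.txt") #"\s+")]
  (count (compress words (huffman-coding-table words))))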
As a side note, why was my version of frequencies less than good? (The bad version is preserved for posterity here.)
- I didn't know that get takes an optional default argument; using it would have avoided special-casing the first occurrence of a symbol.
- I didn't use a standard idiom (reduce) when I could have done. Counting frequencies is simply reducing a big sequence into a map of counts (sketched below).
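For what it's worth, here's roughly what the reduce version looks like (my own paraphrase, not the contrib source); note the default argument to get doing the work that the special case used to do:

(defn my-frequencies
  "Count how often each element occurs in coll."
  [coll]
  (reduce (fn [counts x]
            (assoc counts x (inc (get counts x 0)))) ; default of 0: no special case
          {}
          coll))

;; user> (my-frequencies "aaaabbbbcdddde")
;; {\a 4, \b 4, \c 1, \d 4, \e 1}   (entry order may differ)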
I should spend some more time reading source code - any other improvements that you can see are gratefully accepted!
Cool.
If you wrote the Huffman-encoded Hamlet text out to disk as a binary file it would occupy 921595 bits.
However, at run-time each 1 or 0 in that seq is held as a boxed java.lang.Integer inside a Clojure seq node, which costs tens of bytes per bit on the JVM, so the in-memory representation runs to tens of megabytes and is far larger than the original text!
Instead of 1 and 0, you could use (boolean true) or (boolean false). This would likely still be larger than the original text in memory because Clojure's lists are being used. I haven't tried this.
Another solution would be to use Java's BitSet class instead.
http://java.sun.com/javase/6/docs/api/java/util/BitSet.html
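A minimal sketch of that suggestion (not from the original comment): pack the seq of 1s and 0s into a java.util.BitSet so each bit costs a single bit plus a small constant overhead.

(defn bits->bitset
  "Pack a seq of 1s and 0s into a java.util.BitSet."
  [bits]
  (let [bs (java.util.BitSet. (count bits))]
    (doseq [[i b] (map-indexed vector bits)]
      (when (= b 1)
        (.set bs i)))
    bs))

;; e.g. (bits->bitset (compress x (huffman-coding-table x)))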