Sunday, 8 February 2009

Bit Shifting in Clojure

Base64 encoding is a way of converting a stream of binary data into a printable form. The name comes from the 64 allowable characters ([a-z][A-Z][0-9]+/=) that are used.

The algorithm is very simple. Get 3 bytes at a time (if you can't, just pad with a character, typically =), munge them together (making 24 bits). We then split this 24 bits into 4 lots of 6 bits which allows us to pick one of the 64 allowable characters. This involves dealing with a few of the bit operators in Clojure, as shown below:


(def *encode-table*
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=")

; Daft and way too slow
(defn encode-num
[num]
(let [a (bit-and num 63)
b (bit-shift-right (bit-and num 4032) 6)
c (bit-shift-right (bit-and num 258048) 12)
d (bit-shift-right (bit-and num 16515072) 18)]
(map (fn [x] (nth *encode-table* x )) (list d b c a))))

(defn str-pad [x size padchar]
(let [d (rem (count x) size)]
(if (zero? d)
x
(concat x (take (- size d) (repeat padchar))))))


(defn encode
"Lazily encode a sequence as base64"
[s]
(if (nil? s)
nil
(let [x (map int (str-pad (take 3 s) 3 \=))
num (+ (nth x 2) (* 256 (nth x 1)) (* 256 256 (first x)))]
(lazy-cat (encode-num num) (encode (drop 3 s))))))


The magic numbers in the bit-and allow us to select the right parts of the integer (63 is 111111, 4032 is 111111000000 and so on). This could be improved substantially with more bit-twiddling (see for example String Encoders).

One nice property of this is that by using lazy-cat we can deal with infinite sequences (just as long as you don't try and print the result!)


user> (apply str (take 10 (encode (range 0 1000000000000000000))))
"AEACAQwFBc"


In case you're wondering, I am really scraping the bottom of the barrel for little programming tasks to learn something new at the moment! I've ordered myself a copy of The Princeton Companion to Mathematics which'll hopefully provide more inspiration when it arrives.

(Update 24/2/2009) As Cubix pointed out, there's a bug in the code above as padding should be applied after encoding, not before. The full code is in the comments, but the main point is to change the code such that if we don't get three elements we don't do any encoding and instead use a helper function to do the last bytes. I'm sure there must be a further improvement as padding duplicates functionality in encode.


(defn encode
"Lazily encode a sequence as base64"
[s]
(if s
(let [x (map int (take 3 s))]
(if (= 3 (count x))
(let [num (+ (nth x 2) (* 256 (nth x 1)) (* 256 256 (first x)))]
(lazy-cat (encode-num num) (encode (drop 3 s))))
(padding x))))) ;;; helper function, see comments