Sunday 8 February 2009

Bit Shifting in Clojure

Base64 encoding is a way of converting a stream of binary data into a printable form. The name comes from the 64 allowable characters ([a-z][A-Z][0-9]+/=) that are used.

The algorithm is very simple. Get 3 bytes at a time (if you can't, just pad with a character, typically =), munge them together (making 24 bits). We then split this 24 bits into 4 lots of 6 bits which allows us to pick one of the 64 allowable characters. This involves dealing with a few of the bit operators in Clojure, as shown below:


(def *encode-table*
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=")

; Daft and way too slow
(defn encode-num
[num]
(let [a (bit-and num 63)
b (bit-shift-right (bit-and num 4032) 6)
c (bit-shift-right (bit-and num 258048) 12)
d (bit-shift-right (bit-and num 16515072) 18)]
(map (fn [x] (nth *encode-table* x )) (list d b c a))))

(defn str-pad [x size padchar]
(let [d (rem (count x) size)]
(if (zero? d)
x
(concat x (take (- size d) (repeat padchar))))))


(defn encode
"Lazily encode a sequence as base64"
[s]
(if (nil? s)
nil
(let [x (map int (str-pad (take 3 s) 3 \=))
num (+ (nth x 2) (* 256 (nth x 1)) (* 256 256 (first x)))]
(lazy-cat (encode-num num) (encode (drop 3 s))))))


The magic numbers in the bit-and allow us to select the right parts of the integer (63 is 111111, 4032 is 111111000000 and so on). This could be improved substantially with more bit-twiddling (see for example String Encoders).

One nice property of this is that by using lazy-cat we can deal with infinite sequences (just as long as you don't try and print the result!)


user> (apply str (take 10 (encode (range 0 1000000000000000000))))
"AEACAQwFBc"


In case you're wondering, I am really scraping the bottom of the barrel for little programming tasks to learn something new at the moment! I've ordered myself a copy of The Princeton Companion to Mathematics which'll hopefully provide more inspiration when it arrives.

(Update 24/2/2009) As Cubix pointed out, there's a bug in the code above as padding should be applied after encoding, not before. The full code is in the comments, but the main point is to change the code such that if we don't get three elements we don't do any encoding and instead use a helper function to do the last bytes. I'm sure there must be a further improvement as padding duplicates functionality in encode.


(defn encode
"Lazily encode a sequence as base64"
[s]
(if s
(let [x (map int (take 3 s))]
(if (= 3 (count x))
(let [num (+ (nth x 2) (* 256 (nth x 1)) (* 256 256 (first x)))]
(lazy-cat (encode-num num) (encode (drop 3 s))))
(padding x))))) ;;; helper function, see comments

9 comments:

  1. Shouldn't you get 24 bits when munging 3 bytes together, then split that into 4 lots of 6 bits?

    ReplyDelete
  2. Ahem, thanks for the correction I've updated the post!

    ReplyDelete
  3. I haven't tested this myself but I believe you can change the encode funtion to be:

    (defn encode
    "Lazily encode a sequence as base64"
    [s]
    (if s
    (let [x (map int (str-pad (take 3 s) 3 \=))
    num (+ (nth x 2) (* 256 (nth x 1)) (* 256 256 (first x)))]
    (lazy-cat (encode-num num) (encode (drop 3 s))))))

    ReplyDelete
  4. I like code,could I be allowed to peep into bit-and bit-shift-right functions. And why are you buying Maths text?

    ReplyDelete
  5. d (bit-shift-right (bit-and num 16515072) 18)]m
    What is 'm' doing there? I can't figure it out?gra

    ReplyDelete
  6. Brian - that's a good idea. I think I could of also said

    (when-not (nil? s)
    ;; blah)

    Which might be a nicer way of expressing it (it gets an automatic nil if not defined).

    Emeka, thanks for pointing out the typo, I removed the crazy "m". The bit-shift functions are all standard Clojure functions (see http://www.clojure.org/api).

    ReplyDelete
  7. Please try out Brain Doyle's code, it would make your code to run faster.

    ReplyDelete
  8. With base64, I don't think you're supposed to pad with the '=' characters before encoding, but at the end. (Before encoding you actually pad with zero bits.) This seems to spit out the same result as the Java code I was testing against:

    (def *encode-table* "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=")

    (defn encode-num
    [num]
    (let [a (bit-and num 63)
    b (bit-shift-right (bit-and num 4032) 6)
    c (bit-shift-right (bit-and num 258048) 12)
    d (bit-shift-right (bit-and num 16515072) 18)]
    (map (fn [x] (nth *encode-table* x )) (list d c b a))))

    (defn encode
    "Lazily encode a sequence as base64"
    [s]
    (if s
    (let [x (map int (take 3 s))]
    (if (= 3 (count x))
    (let [num (+ (nth x 2) (* 256 (nth x 1)) (* 256 256 (first x)))]
    (lazy-cat (encode-num num) (encode (drop 3 s))))
    (padding x)))))

    (defn padding [ints]
    (let [ints-zero-pad (take 2 (concat ints '(0)))]
    (let [num (+ (* 256 256 (nth ints-zero-pad 0)) (* 256 (nth ints-zero-pad 1)))]
    (take 4 (concat (take (+ (count ints) 1) (encode-num num)) (repeat \=))))))

    ReplyDelete
  9. Thanks Cubix, you're right I shouldn't be padding there and thanks for the solution!

    ReplyDelete