Tuesday 24 February 2009

Bloom Filters

Bloom Filters are an efficient data structure for determining whether an item is a member of a set. It's a probabilistic set, it's guaranteed to never return a false negative BUT can sometimes falsely report that an item is in the set.

The Bloom filter doesn't store the data in the structure, instead it uses a bit array. There are two operations on a Bloom filter:

  1. Add - as the name suggests adds a new element to the filter.
  2. Query - returns whether an item is in the set or not


A series of hash functions (ideally independent) are used to calculate a number of indices within the bit-array. When we add something we set the various bit indices, and when we query we check whether all these bit indices are set.


(defstruct bloom-filter :hashfns :value)

(defn make-bloom-filter
([n] (struct bloom-filter md5-hashes (bit-array n)))
([n fns] (struct bloom-filter fns (bit-array n))))

(defn add!
[bloom n]
(let [hashes (map (fn [x] (x n)) (bloom :hashfns))]
(doseq [x hashes] (set-bit! (bloom :value) x 1))
bloom))

(defn query
[bloom n]
(let [hashes (map (fn [x] (x n)) (bloom :hashfns))]
(reduce bit-and (map (fn [z] (get-bit (bloom :value) z)) hashes))))


The gotcha for me was remembering to use doseq for side effects. If instead I'd used map I'd have (and was) in trouble because it wasn't evaluated. doseq forces the evaluation.

One simple choice for the hashes is to use MD5 hash values and split it. MessageDigest allows you to calculate various hash functions.


(ns bloom
(:use bitarray)
(:use clojure.contrib.test-is)
(:import (java.security MessageDigest)))

(defn pad [n s]
(let [padding (- n (count s))]
(apply str (concat (apply str (repeat padding "0")) s))))

(defn md5-hash [s]
(let [m (MessageDigest/getInstance "MD5")]
(.update m (.getBytes (str s)) 0 (count s))
(let [x (.toString (BigInteger. 1 (.digest m)) 16)]
(pad 32 x))))


So how well does this work?


(deftest test-bloom
(let [teststrs (map (fn [x] (str x)) (range 0 1000))
bloom (make-bloom-filter 0xFFFF)]
(doseq [x teststrs]
(is (= 0 (query bloom x)))
(add! bloom x)
(is (= 0 (query bloom (str "not" x))))
(is (query bloom x)))))


In this example, I've used hash functions which break MD5 down into 4 lots of 4 hex characters which gives a range of 65536.

Running this test gives


bloom> (run-tests 'bloom)
Ran 1 tests containing 3000 assertions.
0 failures, 0 errors.


Awesome, no false positives. Taking it down to 0xFFF gives 82 false positives which is inline(ish!) with the figures here Bloom filter error table.

The primary use case is caching - check it's in some storage mechanism before doing something expensive (BigTable and Squid Cache both use bloom filters.).

Rapleaf write about how using a Bloom filter saved some serious time (see here).

5 comments:

  1. Unable to resolve symbol: md5-hashes in this context (bloom_filter.clj:20)

    ReplyDelete
  2. Hmm sorry about that.

    I did have to make a few changes to get it to compile in the latest Clojure code. Clojure Contrib test-is is now in the main package so in both bloom.clj and bitarray.clj you need to make sure you (:use clojure.test) instead of the clojure-contrib.

    You'll also need to make sure that bloom.clj and bitarray.clj are on your Java classpath.

    Let me know if that doesn't resolve things!

    ReplyDelete
  3. Thanks, I did have all that set up as you described. It seems that 'md5-hashes' still needs to be defined ?

    ReplyDelete
  4. Ahem! I missed the function out in the page didn't I? Here it is (excuse poor formatting)

    (def md5-hashes
    (list
    (fn [x] (BigInteger. (apply str (take 3 (md5-hash x))) 16))
    (fn [x] (BigInteger. (apply str (take 3 (drop 4 (md5-hash x)))) 16))
    (fn [x] (BigInteger. (apply str (take 3 (drop 8 (md5-hash x)))) 16))
    (fn [x] (BigInteger. (apply str (take 3 (drop 12 (md5-hash x)))) 16))))

    I've also pushed the full code out to my git repository. Hopefully that'll sort things out! Sorry for the confusion.

    ReplyDelete
  5. Thanks Jeff !

    btw - I've really enjoyed following all your blog posts with Clojure and Haskell.

    -Justin

    ReplyDelete