Tuesday, 24 February 2009

Bloom Filters

Bloom Filters are an efficient data structure for determining whether an item is a member of a set. It's a probabilistic set, it's guaranteed to never return a false negative BUT can sometimes falsely report that an item is in the set.

The Bloom filter doesn't store the data in the structure, instead it uses a bit array. There are two operations on a Bloom filter:

  1. Add - as the name suggests adds a new element to the filter.
  2. Query - returns whether an item is in the set or not

A series of hash functions (ideally independent) are used to calculate a number of indices within the bit-array. When we add something we set the various bit indices, and when we query we check whether all these bit indices are set.

(defstruct bloom-filter :hashfns :value)

(defn make-bloom-filter
([n] (struct bloom-filter md5-hashes (bit-array n)))
([n fns] (struct bloom-filter fns (bit-array n))))

(defn add!
[bloom n]
(let [hashes (map (fn [x] (x n)) (bloom :hashfns))]
(doseq [x hashes] (set-bit! (bloom :value) x 1))

(defn query
[bloom n]
(let [hashes (map (fn [x] (x n)) (bloom :hashfns))]
(reduce bit-and (map (fn [z] (get-bit (bloom :value) z)) hashes))))

The gotcha for me was remembering to use doseq for side effects. If instead I'd used map I'd have (and was) in trouble because it wasn't evaluated. doseq forces the evaluation.

One simple choice for the hashes is to use MD5 hash values and split it. MessageDigest allows you to calculate various hash functions.

(ns bloom
(:use bitarray)
(:use clojure.contrib.test-is)
(:import (java.security MessageDigest)))

(defn pad [n s]
(let [padding (- n (count s))]
(apply str (concat (apply str (repeat padding "0")) s))))

(defn md5-hash [s]
(let [m (MessageDigest/getInstance "MD5")]
(.update m (.getBytes (str s)) 0 (count s))
(let [x (.toString (BigInteger. 1 (.digest m)) 16)]
(pad 32 x))))

So how well does this work?

(deftest test-bloom
(let [teststrs (map (fn [x] (str x)) (range 0 1000))
bloom (make-bloom-filter 0xFFFF)]
(doseq [x teststrs]
(is (= 0 (query bloom x)))
(add! bloom x)
(is (= 0 (query bloom (str "not" x))))
(is (query bloom x)))))

In this example, I've used hash functions which break MD5 down into 4 lots of 4 hex characters which gives a range of 65536.

Running this test gives

bloom> (run-tests 'bloom)
Ran 1 tests containing 3000 assertions.
0 failures, 0 errors.

Awesome, no false positives. Taking it down to 0xFFF gives 82 false positives which is inline(ish!) with the figures here Bloom filter error table.

The primary use case is caching - check it's in some storage mechanism before doing something expensive (BigTable and Squid Cache both use bloom filters.).

Rapleaf write about how using a Bloom filter saved some serious time (see here).