Thursday, 16 September 2010

Let's get hashing.

The usual job of a hash function is to convert a large bit of data (such as a string) into a simple integer representation.  A perfect hash function guarantees that each bit of data you provide to the hash function gives you a unique integer.  In the real world, hash functions aren't perfect, but usually the goal is to minimize the different sets of data that has to the same value.

One application for hashes such as MD5 or SHA1 is verification that a downloaded file is the same as the one of the server.  A hash signature is given on the server, and this can be verified locally after downloading.  A small change in the downloaded item results in a huge change in the computed hash function value.  For example using an MD5 hash:
"A Quick Brown Fox" hashes to 138a8e4a3e0fa3d62211e11a08917072
"A Quick Brown Fix" hashes to  e6907da1d6917f6ce3befcc628d332f7
This is also great for detecting duplicates - files with the same binary content can be found simply by comparing their checksums.  But what if you want to detect near duplicates?  Documents that are more or less similar, but not quite?

Locality Sensitive Hashing is a simple algorithm where similar features hash to similar values.  Sim-hash [PostScript file] is once such algorithm by Moses Charikar.  I hope that the algorithm can be roughly described in the following steps for a feature vector V

  1. Hash each element of the feature vector using a standard hash algorithm 
  2. Create a weight vector the same length in bits as the hash value for each hash.  Values are set to 1 if the bit is set, and -1 otherwise.  Sum these to produce a single weight vector for the whole feature vector
  3. The similarity hash is created by turning this weight vector into a big binary number with >0 meaning a bit is set.

Converting this over to Haskell is pretty simple (especially thanks to this Erlang implementation and this explanation).

I chose to use the Jenkins hash function purely because Real World Haskell has the code for it (see the chapter on Advanced Library Design).

So we can test this out simply now. Similar feature vectors hash to similar values.

  > computeHash (FV "A Quick Brown Fox")

  > computeHash (FV "A Quick Brown Fix")

  > computeHash (FV "Hopefully Different")

Because of the way the hash is calculated, the ordering of the feature vector is irrelevant, so hash [1,2,3] is the same as [3,1,2] and so on. To measure the distance between two hashes, the Hamming Distance is used. This is simply just a count of the number of bits that differ. A naive implementation of this is shown below (using the Data.Bits) package.

This code is easy to follow and looks right, but it's not particularly fast. A much faster way to count the number of bits is shown in the K&R C Programming Language. The implementation in the C language is discussed in more detail here and this translates to the following with the algorithm running in time proportional to the number of bits set.

We can verify that the implementation performs exactly the same functionality by challenging QuickCheck to prove us wrong.

> quickCheck (\x y -> hammingDistance x y == hammingDistance2 x y)
  +++ OK, passed 100 tests.

Note that you'll need an appropriate definition of arbitrary for Word64, such as the one here. Next up, is it actually any faster? Criterion tells me that it's quite a bit faster (about 50%).

benchmarking hamming/Naive
  collecting 100 samples, 229 iterations each, in estimated 956.2516 ms
  bootstrapping with 100000 resamples
  mean: 42.66142 us, lb 42.52916 us, ub 42.95010 us, ci 0.950

  benchmarking hamming/Ritchie
  collecting 100 samples, 388 iterations each, in estimated 955.6707 ms
  bootstrapping with 100000 resamples
  mean: 24.59006 us, lb 24.53693 us, ub 24.65282 us, ci 0.950

Ok, basics sorted, so what can we do with sim-hash? One application is near duplicate detection, and to test this out I grabbed a dataset from RIDDLE with the Zagat and Fodor food guides and the associated restaurant address. The addresses are similar, but not quite. For example

  "Bel-Air Hotel 701 Stone Canyon Rd. Bel Air 310-472-1211 Californian"

  "Hotel Bel-Air 701 Stone Canyon Rd. Bel Air 310/472-1211 Californian\STX"

Pretty similar but not quite the same. A quick (and hugely inefficient - I compare everything against everything) function to compare the two is shown below, which returns the list of all near dupes according to the hamming distance specified.

A more efficient implementation of navigating the hamming space is given in the paper Detecting Near-Duplicates for Web Crawling [PDF] by by Gurmeet Manku, Arvind Jain, and Anish Sarma. For this little daft program, I can tolerate waiting a few seconds!

According to the readme include with the data set [tar.gz] there are 112 matches between the two sets of data. Running with various hamming distances I get the following results:
  -- Compare the lists, deduplicate and run for various hamming distances
 > map (\x -> length $ Data.List.nub (map fst $ compareLists zagats fodors x)) [0..5]

So with a Hamming distance of 0, three results are returned (i.e. there are only 3 exact duplicates between the two data sources), whereas with a Hamming distance of 5 then way too many duplicates are returned. Eye-balling the results, a Hamming distance of 2 seems to give the best set of actual closest matches (but still with a fair few false matches).

  -- Good
  "Bel-Air Hotel 701 Stone Canyon Rd. Bel Air 310-472-1211 Californian",
  "Hotel Bel-Air 701 Stone Canyon Rd. Bel Air 310/472-1211 Californian\STX"

  -- Good
  "Cafe Bizou 14016 Ventura Blvd. Sherman Oaks 818-788-3536 French Bistro",
  "Cafe Bizou 14016 Ventura Blvd. Sherman Oaks 818/788-3536 French\STX"

  -- False match
  "Cassell's 3266 W. Sixth St. LA 213-480-8668 Hamburgers",
  "Jack Sprat's Grill 10668 W. Pico Blvd. Los Angeles 310/837-6662 Health Food\STX"

   -- Good
  "Granita 23725 W. Malibu Rd. Malibu 310-456-0488 Californian",
  "Granita 23725 W. Malibu Rd. Malibu 310/456-0488 Californian\STX"

This is almost certainly because the choice of the ASCII values of each item in the string is a poor choice of feature vector! Given that we know the data set is a list of addresses, we could take advantage and cheat a little bit and base the feature vector solely on the numbers. Using just the numbers gives 95 perfect matches with a Hamming distance of zero. A hamming distance of 1 covers all the matches (and some false matches too - since it's order invariant and there aren't a whole host of numbers). Sim-hash is better suited for large documents with decent feature vectors!