Thursday 16 September 2010

Let's get hashing.

The usual job of a hash function is to convert a large bit of data (such as a string) into a simple integer representation.  A perfect hash function guarantees that each bit of data you provide to the hash function gives you a unique integer.  In the real world, hash functions aren't perfect, but usually the goal is to minimize the different sets of data that has to the same value.

One application for hashes such as MD5 or SHA1 is verification that a downloaded file is the same as the one of the server.  A hash signature is given on the server, and this can be verified locally after downloading.  A small change in the downloaded item results in a huge change in the computed hash function value.  For example using an MD5 hash:
"A Quick Brown Fox" hashes to 138a8e4a3e0fa3d62211e11a08917072
"A Quick Brown Fix" hashes to  e6907da1d6917f6ce3befcc628d332f7
This is also great for detecting duplicates - files with the same binary content can be found simply by comparing their checksums.  But what if you want to detect near duplicates?  Documents that are more or less similar, but not quite?

Locality Sensitive Hashing is a simple algorithm where similar features hash to similar values.  Sim-hash [PostScript file] is once such algorithm by Moses Charikar.  I hope that the algorithm can be roughly described in the following steps for a feature vector V

  1. Hash each element of the feature vector using a standard hash algorithm 
  2. Create a weight vector the same length in bits as the hash value for each hash.  Values are set to 1 if the bit is set, and -1 otherwise.  Sum these to produce a single weight vector for the whole feature vector
  3. The similarity hash is created by turning this weight vector into a big binary number with >0 meaning a bit is set.

Converting this over to Haskell is pretty simple (especially thanks to this Erlang implementation and this explanation).

I chose to use the Jenkins hash function purely because Real World Haskell has the code for it (see the chapter on Advanced Library Design).

So we can test this out simply now. Similar feature vectors hash to similar values.

  > computeHash (FV "A Quick Brown Fox")

  > computeHash (FV "A Quick Brown Fix")

  > computeHash (FV "Hopefully Different")

Because of the way the hash is calculated, the ordering of the feature vector is irrelevant, so hash [1,2,3] is the same as [3,1,2] and so on. To measure the distance between two hashes, the Hamming Distance is used. This is simply just a count of the number of bits that differ. A naive implementation of this is shown below (using the Data.Bits) package.

This code is easy to follow and looks right, but it's not particularly fast. A much faster way to count the number of bits is shown in the K&R C Programming Language. The implementation in the C language is discussed in more detail here and this translates to the following with the algorithm running in time proportional to the number of bits set.

We can verify that the implementation performs exactly the same functionality by challenging QuickCheck to prove us wrong.

> quickCheck (\x y -> hammingDistance x y == hammingDistance2 x y)
  +++ OK, passed 100 tests.

Note that you'll need an appropriate definition of arbitrary for Word64, such as the one here. Next up, is it actually any faster? Criterion tells me that it's quite a bit faster (about 50%).

benchmarking hamming/Naive
  collecting 100 samples, 229 iterations each, in estimated 956.2516 ms
  bootstrapping with 100000 resamples
  mean: 42.66142 us, lb 42.52916 us, ub 42.95010 us, ci 0.950

  benchmarking hamming/Ritchie
  collecting 100 samples, 388 iterations each, in estimated 955.6707 ms
  bootstrapping with 100000 resamples
  mean: 24.59006 us, lb 24.53693 us, ub 24.65282 us, ci 0.950

Ok, basics sorted, so what can we do with sim-hash? One application is near duplicate detection, and to test this out I grabbed a dataset from RIDDLE with the Zagat and Fodor food guides and the associated restaurant address. The addresses are similar, but not quite. For example

  "Bel-Air Hotel 701 Stone Canyon Rd. Bel Air 310-472-1211 Californian"

  "Hotel Bel-Air 701 Stone Canyon Rd. Bel Air 310/472-1211 Californian\STX"

Pretty similar but not quite the same. A quick (and hugely inefficient - I compare everything against everything) function to compare the two is shown below, which returns the list of all near dupes according to the hamming distance specified.

A more efficient implementation of navigating the hamming space is given in the paper Detecting Near-Duplicates for Web Crawling [PDF] by by Gurmeet Manku, Arvind Jain, and Anish Sarma. For this little daft program, I can tolerate waiting a few seconds!

According to the readme include with the data set [tar.gz] there are 112 matches between the two sets of data. Running with various hamming distances I get the following results:
  -- Compare the lists, deduplicate and run for various hamming distances
 > map (\x -> length $ Data.List.nub (map fst $ compareLists zagats fodors x)) [0..5]

So with a Hamming distance of 0, three results are returned (i.e. there are only 3 exact duplicates between the two data sources), whereas with a Hamming distance of 5 then way too many duplicates are returned. Eye-balling the results, a Hamming distance of 2 seems to give the best set of actual closest matches (but still with a fair few false matches).

  -- Good
  "Bel-Air Hotel 701 Stone Canyon Rd. Bel Air 310-472-1211 Californian",
  "Hotel Bel-Air 701 Stone Canyon Rd. Bel Air 310/472-1211 Californian\STX"

  -- Good
  "Cafe Bizou 14016 Ventura Blvd. Sherman Oaks 818-788-3536 French Bistro",
  "Cafe Bizou 14016 Ventura Blvd. Sherman Oaks 818/788-3536 French\STX"

  -- False match
  "Cassell's 3266 W. Sixth St. LA 213-480-8668 Hamburgers",
  "Jack Sprat's Grill 10668 W. Pico Blvd. Los Angeles 310/837-6662 Health Food\STX"

   -- Good
  "Granita 23725 W. Malibu Rd. Malibu 310-456-0488 Californian",
  "Granita 23725 W. Malibu Rd. Malibu 310/456-0488 Californian\STX"

This is almost certainly because the choice of the ASCII values of each item in the string is a poor choice of feature vector! Given that we know the data set is a list of addresses, we could take advantage and cheat a little bit and base the feature vector solely on the numbers. Using just the numbers gives 95 perfect matches with a Hamming distance of zero. A hamming distance of 1 covers all the matches (and some false matches too - since it's order invariant and there aren't a whole host of numbers). Sim-hash is better suited for large documents with decent feature vectors!

Wednesday 8 September 2010

The Beginnings of a 6502 Emulator in Haskell

It's been ages since I've written some assembly code, but I'm also enjoying learning Haskell. Therefore, my next random coding exercise is to code a simple CPU emulator.

The 6502 Processor is an 8-bit processor introduced in 1975 and it's still used in embedded systems. It was a hugely popular chip used by such classic machines as the Atari 2600, the original Nintendo Entertainment System and the BBC Micro.
The 6502 Micro Processor!

A CPU is defined by its instruction set architecture (ISA). According to Computer Architecture - A Quantitative Approach (seems to be the authoritative book on Computer Architecture) an ISA is defined by:

  1. Class  - Most processors today are general-purpose register architectures where operands are either registers or memory locations.  General purpose GPUs are a different class of ISA
  2. Memory Addressing - Almost every processor uses byte addressing to access memory.  Alignment can be an issue for performance and some processors such as the Itanium have more specific alignment requirements
  3. Addressing Modes - An ISA can specify various ways of addressing memory.  Examples include register, constant and displacement.  The 6502 processor supports a dozen or so different addressing modes.  A modern day Intel 64-bit processor supports even more including another level of indirection known as segment addressing. 
  4. Types and Sizes of operands - As an 8bit processor the 6502 just supports 8 bit operands.  A more sophisticated architecture like the 80x86 supports various sizes of integers and floating point
  5. Operations - There used to be a divide amongst RISC / CISC. Now it seems most modern processors are a combination of the two, though I guess GPUs are the emergence of simple instructions again?
  6. Control Flow Instructions - The various choices for branching.  The 6502 supports a simple selection of branching instructions that move the program counter based on some arithmetic test
  7. Encoding refers to how the assembler instructions are encoded into byte code.  There are two choices here, fixed length encoding which is easier to read but may take up more space and variable length encoding which is slower to deal with.
Thankfully the 6502 is one of the simplest processors available.  After reading this great description of the registers, I modelled the CPU with the following simple data structure.

The RAM (a maximum addressing space of 64Kb) is a mutable Data.Vector. The CPU consists of a number of registers.  The program counter indicates where the next instruction to execute is going to come from.  Jump instructions move the program counter to a new location. The xr and yr registers are commonly used to hold offsets or counters for memory addressing. The accumulator (ac) register is used by most arithmetic and logical operations. The status register contains various flags represented by Flag. These flags can be set as instructions are executed and are hopefully self-explanatory. Finally the stack pointer contains a pointer to the next free location on the stack. The stack is held within a 256 byte stack between 0x0100 and 0x01FF.

In order to access the memory we need to understand how the various addressing modes work, again the 6502 site provides a very clear description.

Accumulator indicates that the instruction works directly on the accumulator. Immediate is an 8 bit constant value within the constructor. In assembler this is usually indicated with #VAL. A ZeroPage address is a byte offset relative to 0, so this only allows indexing into the first 256 bytes of memory. ZeroPageX and ZeroPageY addressing modes use a zero page address together with the offset specified in the xr and yr registers. Relative addressing is only used by the branch instructions and gives a signed 8 bit number to indicate where the program counter should jump to. Absolute, AbsoluteX and AbsoluteY are like ZeroPage addressing but allow access to the full memory range because it supports a full 16 bit address range. Finally there are 3 indirect addressing modes. Indirect gives a 16 bit address which identifies the location of the LSB of another byte that contains the real target of the instruction. IndexedIndirect and IndirectIndexed is used similar to indirect but uses the xr and yr registers to index with offset.

Once we've got our address mode we need a handful of functions to read from the various memory addresses.

These functions do exactly as they say on the tin. With a few small exceptions... For example, readWord16 should never be called with an AddressMode of accumulator (you can read 16 bits from an 8 bit value). For now I've just left these functions with "error" definitions until I can think of a better way of expressing it.

After getting these bits and pieces in place it's time to look at the CPU instructions, apparently developed by Cyberdyne Systems and used in "The Terminator".

Instructions consist of a three letter op code, together with an optional argument.  For example, LDA loads the supplied value into the accumulator and sets the zero or negative fields appropriately. For some simple instructions no argument is needed (for example, CLC clears the carry flag but requires no argument).

Currently, I've represented instructions as just the three letter mnemonic and an optional address mode. In the future I'll try and make it more a specification by including the flags the operation is allowed to set, together with restricting to only the allowed addressing mode (for example, some instructions don't support all addressing modes, LDA wouldn't make much sense with a relative address).

Armed with this, all that needs to be done is implement the various operations. A simple function execute :: CPU -> Instruction -> IO () simply matches the instruction and executes it.

Finally after all that we can execute a really simple lump of code to see if it works.

In the example above we load a value into the accumulator. Shift it left (i.e. multiply by 2), store this value in some location in memory. Shift it left again (multiply by 4) and left once more (multiple by 8). We then add the original multipled by 8 with the value in the accumulator. For this really simple lump of code, it seems to work :)

My next plans are to:

  • Make things more type-safe (e.g. no wrong addressing modes, no modifying the wrong flags)
  • Write a simple parser using Parsec so that I can write little bits of assembler in a nicer syntax
  • Finish implementing the remainder of the operations (need to get the design right before I go much further)
  • Encode the instructions so that I actually use the program counter and potentially load in pre-existing code.
  • Adding some IO to actually make it useful...