Fatvat

Thursday, 16 September 2010

Let's get hashing.

The usual job of a hash function is to convert a large bit of data (such as a string) into a simple integer representation. A perfect hash function guarantees that each bit of data you provide to the hash function gives you a unique integer. In the real world, hash functions aren't perfect, but usually the goal is to minimize the different sets of data that has to the same value.

One application for hashes such as MD5 or SHA1 is verification that a downloaded file is the same as the one of the server. A hash signature is given on the server, and this can be verified locally after downloading. A small change in the downloaded item results in a huge change in the computed hash function value. For example using an MD5 hash:

"A Quick Brown Fox" hashes to 138a8e4a3e0fa3d62211e11a08917072

"A Quick Brown Fix" hashes to e6907da1d6917f6ce3befcc628d332f7

This is also great for detecting duplicates - files with the same binary content can be found simply by comparing their checksums. But what if you want to detect near duplicates? Documents that are more or less similar, but not quite?

Locality Sensitive Hashing is a simple algorithm where similar features hash to similar values. Sim-hash [PostScript file] is once such algorithm by Moses Charikar. I hope that the algorithm can be roughly described in the following steps for a feature vector V

Hash each element of the feature vector using a standard hash algorithm
Create a weight vector the same length in bits as the hash value for each hash. Values are set to 1 if the bit is set, and -1 otherwise. Sum these to produce a single weight vector for the whole feature vector
The similarity hash is created by turning this weight vector into a big binary number with >0 meaning a bit is set.

Converting this over to Haskell is pretty simple (especially thanks to this Erlang implementation and this explanation).

I chose to use the Jenkins hash function purely because Real World Haskell has the code for it (see the chapter on Advanced Library Design).

So we can test this out simply now. Similar feature vectors hash to similar values.

  > computeHash (FV "A Quick Brown Fox")
  10515216444980845459

  > computeHash (FV "A Quick Brown Fix")
  10519438844509313944

  > computeHash (FV "Hopefully Different")
  9706485515645876927

Because of the way the hash is calculated, the ordering of the feature vector is irrelevant, so hash [1,2,3] is the same as [3,1,2] and so on. To measure the distance between two hashes, the Hamming Distance is used. This is simply just a count of the number of bits that differ. A naive implementation of this is shown below (using the Data.Bits) package.

This code is easy to follow and looks right, but it's not particularly fast. A much faster way to count the number of bits is shown in the K&R C Programming Language. The implementation in the C language is discussed in more detail here and this translates to the following with the algorithm running in time proportional to the number of bits set.

We can verify that the implementation performs exactly the same functionality by challenging QuickCheck to prove us wrong.

> quickCheck (\x y -> hammingDistance x y == hammingDistance2 x y)
  +++ OK, passed 100 tests.

Note that you'll need an appropriate definition of arbitrary for Word64, such as the one here. Next up, is it actually any faster? Criterion tells me that it's quite a bit faster (about 50%).

benchmarking hamming/Naive
  collecting 100 samples, 229 iterations each, in estimated 956.2516 ms
  bootstrapping with 100000 resamples
  mean: 42.66142 us, lb 42.52916 us, ub 42.95010 us, ci 0.950

  benchmarking hamming/Ritchie
  collecting 100 samples, 388 iterations each, in estimated 955.6707 ms
  bootstrapping with 100000 resamples
  mean: 24.59006 us, lb 24.53693 us, ub 24.65282 us, ci 0.950

Ok, basics sorted, so what can we do with sim-hash? One application is near duplicate detection, and to test this out I grabbed a dataset from RIDDLE with the Zagat and Fodor food guides and the associated restaurant address. The addresses are similar, but not quite. For example

  "Bel-Air Hotel 701 Stone Canyon Rd. Bel Air 310-472-1211 Californian"
 
    vs.

  "Hotel Bel-Air 701 Stone Canyon Rd. Bel Air 310/472-1211 Californian\STX"

Pretty similar but not quite the same. A quick (and hugely inefficient - I compare everything against everything) function to compare the two is shown below, which returns the list of all near dupes according to the hamming distance specified.

A more efficient implementation of navigating the hamming space is given in the paper Detecting Near-Duplicates for Web Crawling [PDF] by by Gurmeet Manku, Arvind Jain, and Anish Sarma. For this little daft program, I can tolerate waiting a few seconds!

According to the readme include with the data set [tar.gz] there are 112 matches between the two sets of data. Running with various hamming distances I get the following results:

  -- Compare the lists, deduplicate and run for various hamming distances
 > map (\x -> length $ Data.List.nub (map fst $ compareLists zagats fodors x)) [0..5]
 [3,9,27,59,124,197]

So with a Hamming distance of 0, three results are returned (i.e. there are only 3 exact duplicates between the two data sources), whereas with a Hamming distance of 5 then way too many duplicates are returned. Eye-balling the results, a Hamming distance of 2 seems to give the best set of actual closest matches (but still with a fair few false matches).

  -- Good
  "Bel-Air Hotel 701 Stone Canyon Rd. Bel Air 310-472-1211 Californian",
  "Hotel Bel-Air 701 Stone Canyon Rd. Bel Air 310/472-1211 Californian\STX"

  -- Good
  "Cafe Bizou 14016 Ventura Blvd. Sherman Oaks 818-788-3536 French Bistro",
  "Cafe Bizou 14016 Ventura Blvd. Sherman Oaks 818/788-3536 French\STX"

  -- False match
  "Cassell's 3266 W. Sixth St. LA 213-480-8668 Hamburgers",
  "Jack Sprat's Grill 10668 W. Pico Blvd. Los Angeles 310/837-6662 Health Food\STX"

   -- Good
  "Granita 23725 W. Malibu Rd. Malibu 310-456-0488 Californian",
  "Granita 23725 W. Malibu Rd. Malibu 310/456-0488 Californian\STX"

This is almost certainly because the choice of the ASCII values of each item in the string is a poor choice of feature vector! Given that we know the data set is a list of addresses, we could take advantage and cheat a little bit and base the feature vector solely on the numbers. Using just the numbers gives 95 perfect matches with a Hamming distance of zero. A hamming distance of 1 covers all the matches (and some false matches too - since it's order invariant and there aren't a whole host of numbers). Sim-hash is better suited for large documents with decent feature vectors!

Wednesday, 8 September 2010

The Beginnings of a 6502 Emulator in Haskell

It's been ages since I've written some assembly code, but I'm also enjoying learning Haskell. Therefore, my next random coding exercise is to code a simple CPU emulator.

The 6502 Processor is an 8-bit processor introduced in 1975 and it's still used in embedded systems. It was a hugely popular chip used by such classic machines as the Atari 2600, the original Nintendo Entertainment System and the BBC Micro.

The 6502 Micro Processor!

A CPU is defined by its instruction set architecture (ISA). According to Computer Architecture - A Quantitative Approach (seems to be the authoritative book on Computer Architecture) an ISA is defined by:

Class - Most processors today are general-purpose register architectures where operands are either registers or memory locations. General purpose GPUs are a different class of ISA
Memory Addressing - Almost every processor uses byte addressing to access memory. Alignment can be an issue for performance and some processors such as the Itanium have more specific alignment requirements
Addressing Modes - An ISA can specify various ways of addressing memory. Examples include register, constant and displacement. The 6502 processor supports a dozen or so different addressing modes. A modern day Intel 64-bit processor supports even more including another level of indirection known as segment addressing.
Types and Sizes of operands - As an 8bit processor the 6502 just supports 8 bit operands. A more sophisticated architecture like the 80x86 supports various sizes of integers and floating point
Operations - There used to be a divide amongst RISC / CISC. Now it seems most modern processors are a combination of the two, though I guess GPUs are the emergence of simple instructions again?
Control Flow Instructions - The various choices for branching. The 6502 supports a simple selection of branching instructions that move the program counter based on some arithmetic test
Encoding refers to how the assembler instructions are encoded into byte code. There are two choices here, fixed length encoding which is easier to read but may take up more space and variable length encoding which is slower to deal with.

Thankfully the 6502 is one of the simplest processors available. After reading this great description of the registers, I modelled the CPU with the following simple data structure.

The RAM (a maximum addressing space of 64Kb) is a mutable Data.Vector. The CPU consists of a number of registers. The program counter indicates where the next instruction to execute is going to come from. Jump instructions move the program counter to a new location. The xr and yr registers are commonly used to hold offsets or counters for memory addressing. The accumulator (ac) register is used by most arithmetic and logical operations. The status register contains various flags represented by Flag. These flags can be set as instructions are executed and are hopefully self-explanatory. Finally the stack pointer contains a pointer to the next free location on the stack. The stack is held within a 256 byte stack between 0x0100 and 0x01FF.

In order to access the memory we need to understand how the various addressing modes work, again the 6502 site provides a very clear description.

Accumulator indicates that the instruction works directly on the accumulator. Immediate is an 8 bit constant value within the constructor. In assembler this is usually indicated with #VAL. A ZeroPage address is a byte offset relative to 0, so this only allows indexing into the first 256 bytes of memory. ZeroPageX and ZeroPageY addressing modes use a zero page address together with the offset specified in the xr and yr registers. Relative addressing is only used by the branch instructions and gives a signed 8 bit number to indicate where the program counter should jump to. Absolute, AbsoluteX and AbsoluteY are like ZeroPage addressing but allow access to the full memory range because it supports a full 16 bit address range. Finally there are 3 indirect addressing modes. Indirect gives a 16 bit address which identifies the location of the LSB of another byte that contains the real target of the instruction. IndexedIndirect and IndirectIndexed is used similar to indirect but uses the xr and yr registers to index with offset.

Once we've got our address mode we need a handful of functions to read from the various memory addresses.

These functions do exactly as they say on the tin. With a few small exceptions... For example, readWord16 should never be called with an AddressMode of accumulator (you can read 16 bits from an 8 bit value). For now I've just left these functions with "error" definitions until I can think of a better way of expressing it.

After getting these bits and pieces in place it's time to look at the CPU instructions, apparently developed by Cyberdyne Systems and used in "The Terminator".

Instructions consist of a three letter op code, together with an optional argument. For example, LDA loads the supplied value into the accumulator and sets the zero or negative fields appropriately. For some simple instructions no argument is needed (for example, CLC clears the carry flag but requires no argument).

Currently, I've represented instructions as just the three letter mnemonic and an optional address mode. In the future I'll try and make it more a specification by including the flags the operation is allowed to set, together with restricting to only the allowed addressing mode (for example, some instructions don't support all addressing modes, LDA wouldn't make much sense with a relative address).

Armed with this, all that needs to be done is implement the various operations. A simple function execute :: CPU -> Instruction -> IO () simply matches the instruction and executes it.

Finally after all that we can execute a really simple lump of code to see if it works.

In the example above we load a value into the accumulator. Shift it left (i.e. multiply by 2), store this value in some location in memory. Shift it left again (multiply by 4) and left once more (multiple by 8). We then add the original multipled by 8 with the value in the accumulator. For this really simple lump of code, it seems to work :)

My next plans are to:

Make things more type-safe (e.g. no wrong addressing modes, no modifying the wrong flags)
Write a simple parser using Parsec so that I can write little bits of assembler in a nicer syntax
Finish implementing the remainder of the operations (need to get the design right before I go much further)
Encode the instructions so that I actually use the program counter and potentially load in pre-existing code.
Adding some IO to actually make it useful...

Sunday, 29 August 2010

Speeding up the Ants program

In my previous post, I looked at implementing Ants in Haskell. As comments on the blog indicated, things started off OK, but performance rapidly got worse. This is the first time I've had this kind of problem before, so I thought I'd document how I started to solve it.

Haskell has a huge range of performance testing tools to help find out what goes wrong. Real World Haskell has a greater chapter on "Profiling and Optimization" that helps identify the tools available.

The first tool in the chain is simply to run the application and collect some basic statistics about the runtime. You can do this with the command ./AntsVis +RTS -sstderr. Anything after the "+RTS" is a Haskell runtime parameter. This produced the following output:



1,835,837,744 bytes allocated in the heap

328,944,448 bytes copied during GC

2,908,728 bytes maximum residency (25 sample(s))

142,056 bytes maximum slop

9 MB total memory in use (1 MB lost due to fragmentation)



Generation 0:  3483 collections,     0 parallel,  1.54s,  1.54s elapsed

Generation 1:    25 collections,     0 parallel,  0.09s,  0.07s elapsed



INIT  time    0.00s  (  0.00s elapsed)

MUT   time    3.04s  (  3.13s elapsed)

GC    time    1.63s  (  1.61s elapsed)

RP    time    0.00s  (  0.00s elapsed)

PROF  time    0.00s  (  0.00s elapsed)

EXIT  time    0.00s  (  0.00s elapsed)

Total time    4.67s  (  4.74s elapsed)



%GC time      34.9%  (34.0% elapsed)



Alloc rate    603,893,994 bytes per MUT second



Productivity  65.1% of total user, 64.2% of total elapsed

Ouch... The program spent 35% of the time doing garbage collection and only 65% of the time doing useful stuff. It also allocated 603MB of memory / second! That's clearly not acceptable and I'm definitely allocating more memory than I should!

I lack the intuition to understand where the problems are from looking at the code, so time to break out the profiling program. GHC gives quite a few compiler options, but for profiling these seem to be the important ones:

-prof - Enables profiling
-caf-all - Constant Applicative form for all top-level items (constant costs, one for each module.)
-auto-all - Cost-centre analysis for every top-level function

After you compile with those functions, you can run the program again with ./AntsVis +RTS -hc -p to get heap information together with profiling. That gives me the following picture. Flat things are good, things going up at a steep gradient and bad.

First Run of the Profiling Tool

As you might be able to see from the image (but click to see the full one) the finger is pointed squarely at the chain of functions involving updateTVar and evaporate, the code for which is shown below:

So how come this tiny inoffensive lump of code is allocating a fuck-ton of memory? The blame lies solely in the record update function (c {pheromone = pheromone c * evapRate}). Laziness means that this isn't evaluated fully, which means it keeps a reference to the previous value of the cell. I'm not interested in the value, but because there is a reference to it, the runtime can't though the value away. Time for some strictness annotations. Note that I've also done a few refactorings (the World data type was just a wrapper around Vector, so I got rid of the wrapper, and typing pheromone in 100 times was driving me nuts).

The seq primitive is of type a -> b -> b and evaluates the first argument to head-normal form and returns the second. Hopefully, this should force the evaluation of the cells and free the memory up immediately. Running with the profiler again gives a much better picture:

A little bit of strictness added

Much better! Maximum memory usage is significantly down and when I run it my CPU usage is significantly down. seq isn't a panacea though. Earlier attempts at adding bang patterns everywhere led me astray - if you make everything strict, then you can get unintended consequences (I was having problems with fromJust being called on Nothing - not entirely sure why). I think I'll stick to putting ~~seq~~ strict annotations based on the information from the profiling tools, at least I until I have a better intuition for how it will affect the final result.

Updates thanks to very helpful comments from augustss, don and gtllama!

seq is not the right way to achieve the strictness, see comments from Don Stewart about funbox-strict-fields and ! patterns for strictness.

I've removed the Criterion graphs as I don't think the way I generated them was reliable. I believe my use of "seq" was completely bogus and laziness bit me. The faster times almost certainly came from not evaluating the expression correctly once seq was in place. Rather embarrassing!. From the comments, "Computing a value of type STM () does not execute it. STM code is only executed if it "meets up with" an 'atomically' that is executed."

I also go to the bottom of the slowness updating. Most definitely not an STM bug. The problem occured because I was trying to do a single STM transaction to do evaporation - this meant that evaporation could only succeed if nothing else wrote over the pheromones used in the evaporation. Since many ants are marching and dropping pheromones this seems to mean that this transaction would need to be retried many times. Switching this over to use a sequence of atomic operations (one per cell) removed this problem. Performance is now consistent and when I run with profiling I get a flat heap output as shown below:

Flat as a Pancake

This was definitely the most complicated Haskell program I've attempted so far, but also the one I've learnt the most from! Probably still some more mistakes in the code to find, but I've pushed the latest code to the repository.