Fatvat: 2010

Saturday 11 December 2010

Word Ladders

Word Ladders were invented by Lewis Carroll as a form of word play. The idea is that you start with a word, change one letter at a time, and end up with a new word. All words in the chain must be valid words in the dictionary (otherwise it'd be a bit rubbish). For example, you can turn beer into wine with the following ladder:

  BEER
  BEAR
  BEAD
  BEAK
  BEAT
  BELT
  WELT
  WILT
  WILE
  WINE

Let's write a simple program that calculates word ladders for a simplified version of the game where each step changes exactly one letter (rather than allowing deletes and inserts too). Finding a word ladder can be thought of as a graph problem. Each node is a word, and each edge represents a connection to another valid word. The graph below is a partial representation of the graph starting with BEER. The real graph is much more complicated!

First things first, how do we build the graph? For starters we need to know if two words are neighbours or not. The following simple functions determine the difference between two strings. Remember that I'm only working on the simplest possible distance metric at the moment, that is allowing a single character to change.

  neighbour :: String -> String -> Bool
  neighbour x y = difference x y == 1

  difference :: String -> String -> Int
  difference [] [] = 0
  difference (x:xs) (y:ys) | x == y = difference xs ys
                           | otherwise = 1 + difference xs ys
  difference _ _ = error "Two strings must be the same length"

Next thing we need to grab is a huge list of words. We'll store these words in a Set because we'll want to test for membership frequently (an O(lg(N)) operation). An alternative would be to use a perfect hash (this is an option because the dictionary is fixed). This would give O(1) lookup times, and it looks like there is (as almost always) a library on Hackage that does just that. The simple distance metric chosen means we can limit the dictionary to words of the same length as the input word.

  import Data.Char (isAlpha, toLower)
  import qualified Data.Set as S

  type WordSet = S.Set String

  wordListPath :: String
  wordListPath = "/usr/share/dict/british-english"

  -- Keep only the lower-cased, purely alphabetic words of length n
  createDictionary :: Int -> IO WordSet
  createDictionary n = do
    file <- readFile wordListPath
    return $ S.fromList $ filter (\x -> length x == n && all isAlpha x) (map (map toLower) $ words file)

Once we've got a dictionary, all we need to do is build the graph. Since Haskell is lazy, we don't need to worry about the space complexity of the graph - we'll just build it lazily and only the bit that is explored will be resident in memory.

Each node contains the word it represents and the links to the child elements. The graph is built by starting at a root, and filling all the valid neighbours. Each time we place a word in the graph we remove it from the dictionary, otherwise we'll get cycles in the graph.

  data Node = Node String [Node] deriving Show

  buildGraph :: WordSet -> String -> Node
  buildGraph wordset top = Node top (map (buildGraph smaller) neighbours)
    where
      neighbours = S.toList (S.filter (neighbour top) smaller)
      smaller = S.delete top wordset

The graph is *huge*, so we need to find some way to limit the search space. The most obvious way is to give a restriction on the depth of the search. A word ladder that is 10000 rungs high is probably not much fun to complete. We can also cut the search short if the word is too many changes away given the maximum depth (for example, if 4 characters need to change in a 5 letter word and the maximum left to search is 3 then we can prune this search branch).

  import Data.List (minimumBy)
  import Data.Ord (comparing)

  search :: Node -> Int -> String -> [String]
  search graph maxDepth goal = search' graph maxDepth goal []

  search' :: Node -> Int -> String -> [String] -> [String]
  search' (Node end children) maxDepth goal path
    | end == goal    = end : path
    | null children  = []
    | length path >= maxDepth = [] -- too deep
    | difference end goal >= maxDepth - length path = [] -- too much difference
    | otherwise = first -- swap in quickest for the shortest route
      where
        childRoutes = filter (not . null) $
                      map (\child -> search' child maxDepth goal (end : path)) children
        first | null childRoutes = []
              | otherwise        = head childRoutes
        quickest | null childRoutes = []
                 | otherwise = minimumBy (comparing length) childRoutes

The way we search the children is important. In this case we've used first, which takes the first available route that satisfies the depth limit but isn't guaranteed to be the shortest route. quickest, on the other hand, calculates all child routes and picks the one with the minimum length.

Finally, we can put this all together and write a simple search function.

  makeLadder :: Int -> String -> String -> IO [String]
  makeLadder maxDepth start end
    | length start /= length end = error "Only two strings of equal length are currently supported."
    | otherwise = do
        dict <- createDictionary (length start)
        if (S.member start dict && S.member end dict)
          then return $ search (buildGraph dict start) maxDepth end
          else return []

The complete code for this version is available on my GitHub repo here. This version has several problems.

It's too slow - searching for the minimal path can take considerable time
It's not very flexible

(this part completely edited from the original post, thanks to insightful comment about something *very* daft I was doing) Let's fix that. In order to be flexible, we need to support various distance metrics. For example, it'd be nice to allow insertions and deletions as well as character transposition. We just need a generic function that calculates the distance between any given strings. In order to improve performance, instead of searching the entire dictionary to see whether any are neighbours, we can generate the neighbours and find out which ones are in the dictionary. That means we need both a distance metric and a way of calculating the edits.

  -- Word is assumed to be a synonym for String (type Word = String), not the Prelude's Word
  data DistanceMetric = DistanceMetric (Word -> Word -> Int) (Word -> WordSet)

  difference :: Word -> Word -> Int                     
  difference x y 
    | length x /= length y = 999999
    | otherwise = sum $ zipWith (\c1 c2 -> if c1 == c2 then 0 else 1) x y
                
  transposeChar :: Word -> [Word] 
  transposeChar [] = []
  transposeChar (x:xs) = map (:xs) (validChars \\ [x])
                
  deleteChar :: Word -> [Word]                       
  deleteChar [] = []
  deleteChar (x:xs) = [xs]

  insertChar :: Word -> [Word]
  insertChar w = map (:w) validChars  -- also handles the empty suffix, so a letter can be appended

  differenceEdit :: Word -> WordSet
  differenceEdit x = edit' x [transposeChar]
    
  editDistanceEdits :: Word -> WordSet
  editDistanceEdits x = edit' x [insertChar,transposeChar,deleteChar]

  edit' :: Word -> [Word -> [Word]] -> WordSet
  edit' w fns = S.fromList $ concat $ 
                zipWith (\x y -> map (\z -> x ++ z) (concatMap (\x -> x y) fns)) 
                (inits w) (tails w)

  simple :: DistanceMetric
  simple = DistanceMetric difference differenceEdit

  edits :: DistanceMetric
  edits = DistanceMetric editDistance editDistanceEdits

This gives two distance functions and two ways of generating edits. The Levenshtein distance of 1 is generated by transposing, deleting and inserting characters from the original word. This gives us the flexibility, because another distance metric could be put in place (anagrams perhaps?). Next to performance.
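The listing above refers to an editDistance function without showing it. A minimal sketch of one plausible definition follows, assuming it is the standard Levenshtein distance (insertions, deletions and substitutions each cost 1); the author's actual definition may differ.

```haskell
import Data.List (foldl')

-- Levenshtein distance, computed one row of the edit table at a time.
-- This is an assumed definition, not the one from the post's repository.
editDistance :: String -> String -> Int
editDistance s t = last (foldl' nextRow [0 .. length t] s)
  where
    -- prev is the row for the prefix of s seen so far; c is the next character of s.
    nextRow [] _ = []
    nextRow prev@(p : ps) c = scanl cell (p + 1) (zip3 t prev ps)
      where
        -- left: the entry just computed in this row; diag and up: entries from prev.
        cell left (tc, diag, up) =
          minimum [left + 1, up + 1, diag + (if tc == c then 0 else 1)]

main :: IO ()
main = do
  print (editDistance "beer" "bear")       -- 1: one substitution
  print (editDistance "cat" "cats")        -- 1: one insertion
  print (editDistance "kitten" "sitting")  -- 3
```

Only two rows of the table are alive at any time, so memory use is proportional to the length of the second word.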

  buildGraph :: DistanceMetric -> WordSet -> Word -> Node 
  buildGraph d@(DistanceMetric dist edits) wordset top = Node top (map (buildGraph d smaller) neighbours)
    where
      possibleNeighbours = edits top
      neighbours = S.toList (smaller `S.intersection` possibleNeighbours)
      smaller = S.delete top wordset 
    
  search :: DistanceMetric -> Node -> Int -> Word -> [Word]
  search (DistanceMetric dist _) graph maxDepth goal = search' graph []
    where 
      search' (Node end children) path 
        | end == goal    = end : path 
        | length path >= maxDepth = [] -- too deep
        | dist end goal >= maxDepth - length path = [] -- too much difference
        | otherwise = first
          where
            -- Find the best node to search by comparing it against the goal
            costForNextChild :: [(Int,Node)]
            costForNextChild = zip (map (\(Node x _) -> dist x goal) children) children
            bestFirst = map snd $ sortBy (comparing fst) costForNextChild
        
            -- Best first search
            childRoutes = filter (not . null) $ map (\child -> search' child (end : path)) bestFirst
      
            first | null childRoutes = []
                  | otherwise        = head childRoutes

Two things have changed from the original code. The first is that the graph is built by comparing the edits against the dictionary, rather than the word against the whole dictionary. This is the main saving and makes it *hugely* faster (thanks to jkkramer for the pointer and this post.) The only other change is that we decide which node to search next based on how close it is to the goal (a best-first search). With these changes it can now solve all of the problems I've tried at wordchains.com. Neat. The complete code is available here.

Friday 12 November 2010

Google Refine

Google Refine is a toy for playing with data, originally developed by the team behind Freebase. It's a tool for taking messy data sources and getting some structure into them. This is exactly the same sort of thing that I need for my super-secret-world-changing-awesome-side-project. Ok, it's not that super yet, not released and will probably not change the world, but it is fun and it does require adding some structure to the messy data sets that can easily be found on the web.

Refine can import data from a variety of sources from either the web or a data file on disk. To have a quick play with this, I grabbed the list of UK Prime Ministers from here and spent a couple of minutes in Emacs to reformat it as a CSV file. Once you've imported that data into Refine, you can automatically reconcile this data with structured information available online (such as Freebase). This gives you the best matches for each name. The screen below shows the results after reconciling with Freebase.

I'm not sure if it's entirely clear from the image but about 40% of the prime ministers have automatically been correlated with the appropriate entry from Freebase! A few more clicks and the entire data set can be reconciled against an on-line source and then exported in a variety of formats including HTML, CSV and JSON. This is exactly the kind of data matching I'm after, as once you've got a Freebase ID you can look up all the extra information very easily. So easy, it almost feels like cheating!

Saturday 23 October 2010

Duplicate symbol my_inet_ntoa when upgrading Yesod

Upgraded Yesod today, but found I could no longer run the application.

  GHCi runtime linker: fatal error: I found a duplicate definition for symbol
     my_inet_ntoa
  whilst processing object file
     /usr/local/lib/network-2.2.1.7/ghc-6.12.3/HSnetwork-2.2.1.7.o
  This could be caused by:
     * Loading two different object files which export the same symbol
     * Specifying the same object file twice on the GHCi command line
     * An incorrect `package.conf' entry, causing some object to be
       loaded twice.
  GHCi cannot safely continue in this situation.  Exiting now.  Sorry.

Haskell packages are described here. Running ghc-pkg list showed me that network-2.2.1.7 was mentioned in the ghc package.conf.d under /usr/local/lib and that a later version, network-2.2.1.9, was also present.

To solve this problem, it seems like I had to simply do cabal upgrade http. Perhaps a dependency is missing? Not entirely sure, but thought I'd note the problem here so someone else encountering it will find it!

Thursday 16 September 2010

Let's get hashing.

The usual job of a hash function is to convert a large bit of data (such as a string) into a simple integer representation. A perfect hash function guarantees that each distinct piece of data you provide to the hash function gives you a unique integer. In the real world, hash functions aren't perfect, but usually the goal is to minimize the number of different inputs that hash to the same value.

One application for hashes such as MD5 or SHA1 is verification that a downloaded file is the same as the one on the server. A hash signature is given on the server, and this can be verified locally after downloading. A small change in the downloaded item results in a huge change in the computed hash function value. For example using an MD5 hash:

"A Quick Brown Fox" hashes to 138a8e4a3e0fa3d62211e11a08917072

"A Quick Brown Fix" hashes to e6907da1d6917f6ce3befcc628d332f7

This is also great for detecting duplicates - files with the same binary content can be found simply by comparing their checksums. But what if you want to detect near duplicates? Documents that are more or less similar, but not quite?

Locality Sensitive Hashing is a simple algorithm where similar features hash to similar values. Sim-hash [PostScript file] is one such algorithm by Moses Charikar. I hope that the algorithm can be roughly described in the following steps for a feature vector V:

Hash each element of the feature vector using a standard hash algorithm
Create a weight vector the same length in bits as the hash value for each hash. Values are set to 1 if the bit is set, and -1 otherwise. Sum these to produce a single weight vector for the whole feature vector
The similarity hash is created by turning this weight vector into a big binary number with >0 meaning a bit is set.

Converting this over to Haskell is pretty simple (especially thanks to this Erlang implementation and this explanation).

I chose to use the Jenkins hash function purely because Real World Haskell has the code for it (see the chapter on Advanced Library Design).
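A minimal sketch of the three steps, under some loud assumptions: features are the characters of a string, the per-feature hash is a 64-bit FNV-1a stand-in rather than the Jenkins hash used in the post, and the names hashFeature, bitWeights and simHash are made up for illustration.

```haskell
import Data.Bits (shiftL, testBit, xor, (.|.))
import Data.Char (ord)
import Data.List (foldl')
import Data.Word (Word64)

-- Step 1: hash one feature. FNV-1a is a stand-in here (an assumption);
-- the post uses the Jenkins hash, and any well-mixed hash would do.
hashFeature :: Char -> Word64
hashFeature c = (offsetBasis `xor` fromIntegral (ord c)) * prime
  where
    offsetBasis = 14695981039346656037
    prime       = 1099511628211

-- Step 2: one weight per bit position, +1 if the bit is set and -1 if not.
bitWeights :: Word64 -> [Int]
bitWeights h = [if testBit h i then 1 else -1 | i <- [0 .. 63]]

-- Step 3: sum the weight vectors and set each bit whose total is positive.
simHash :: String -> Word64
simHash features = foldl' setIfPositive 0 (zip [0 ..] totals)
  where
    totals = foldl' (zipWith (+)) (replicate 64 0)
                    (map (bitWeights . hashFeature) features)
    setIfPositive acc (i, w)
      | w > 0     = acc .|. (1 `shiftL` i)
      | otherwise = acc

main :: IO ()
main = do
  print (simHash "A Quick Brown Fox")
  print (simHash "A Quick Brown Fix")
  -- The weights are summed, so the order of the features does not matter.
  print (simHash "abc" == simHash "cab")
```

Because this uses a different per-feature hash, the numbers it prints will not match the ones shown in the post.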

So we can test this out simply now. Similar feature vectors hash to similar values.

  > computeHash (FV "A Quick Brown Fox")
  10515216444980845459

  > computeHash (FV "A Quick Brown Fix")
  10519438844509313944

  > computeHash (FV "Hopefully Different")
  9706485515645876927

Because of the way the hash is calculated, the ordering of the feature vector is irrelevant, so hash [1,2,3] is the same as [3,1,2] and so on. To measure the distance between two hashes, the Hamming Distance is used. This is simply a count of the number of bits that differ. A naive implementation of this is shown below (using the Data.Bits package).

This code is easy to follow and looks right, but it's not particularly fast. A much faster way to count the number of bits is shown in the K&R C Programming Language. The implementation in the C language is discussed in more detail here and this translates to the following with the algorithm running in time proportional to the number of bits set.
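As a rough sketch of the two approaches described above (the names match the QuickCheck property further down, but the bodies are assumptions rather than the post's own code): the naive version tests every bit position, while the second clears the lowest set bit on each pass so that it loops once per set bit.

```haskell
import Data.Bits (testBit, xor, (.&.))
import Data.Word (Word64)

-- Naive version: XOR the two hashes, then test each of the 64 bits in turn.
hammingDistance :: Word64 -> Word64 -> Int
hammingDistance x y = length (filter (testBit (x `xor` y)) [0 .. 63])

-- K&R-style version: v .&. (v - 1) clears the lowest set bit of v,
-- so the loop runs once per set bit rather than once per bit position.
hammingDistance2 :: Word64 -> Word64 -> Int
hammingDistance2 x y = go 0 (x `xor` y)
  where
    go :: Int -> Word64 -> Int
    go n 0 = n
    go n v = go (n + 1) (v .&. (v - 1))

main :: IO ()
main = do
  print (hammingDistance 5 3)   -- 2: binary 101 and 011 differ in two places
  print (hammingDistance2 5 3)  -- 2
```

Data.Bits also exports popCount, so popCount (x `xor` y) is a third way to get the same number.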

We can verify that the implementation performs exactly the same functionality by challenging QuickCheck to prove us wrong.

  > quickCheck (\x y -> hammingDistance x y == hammingDistance2 x y)
  +++ OK, passed 100 tests.

Note that you'll need an appropriate definition of arbitrary for Word64, such as the one here. Next up, is it actually any faster? Criterion tells me that it's quite a bit faster (about 50%).

  benchmarking hamming/Naive
  collecting 100 samples, 229 iterations each, in estimated 956.2516 ms
  bootstrapping with 100000 resamples
  mean: 42.66142 us, lb 42.52916 us, ub 42.95010 us, ci 0.950

  benchmarking hamming/Ritchie
  collecting 100 samples, 388 iterations each, in estimated 955.6707 ms
  bootstrapping with 100000 resamples
  mean: 24.59006 us, lb 24.53693 us, ub 24.65282 us, ci 0.950

Ok, basics sorted, so what can we do with sim-hash? One application is near duplicate detection, and to test this out I grabbed a dataset from RIDDLE with the Zagat and Fodor food guides and the associated restaurant address. The addresses are similar, but not quite. For example

  "Bel-Air Hotel 701 Stone Canyon Rd. Bel Air 310-472-1211 Californian"
 
    vs.

  "Hotel Bel-Air 701 Stone Canyon Rd. Bel Air 310/472-1211 Californian\STX"

Pretty similar but not quite the same. A quick (and hugely inefficient - I compare everything against everything) function to compare the two is shown below, which returns the list of all near dupes according to the hamming distance specified.
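A minimal sketch of such an all-pairs comparison. The name nearDuplicates and the explicit hash argument are assumptions made so the block stands alone; the post's compareLists takes just the two lists and the distance. The toyHash in main is only there to make the example runnable and is not a similarity hash.

```haskell
import Data.Bits (popCount, xor)
import Data.Char (ord)
import Data.Word (Word64)

-- Every pair (x, y) whose hashes differ in at most maxDist bits.
-- Each string is hashed once, but the comparison itself is O(n * m).
nearDuplicates :: (String -> Word64) -> [String] -> [String] -> Int -> [(String, String)]
nearDuplicates hash xs ys maxDist =
  [ (x, y)
  | (x, hx) <- hashedXs
  , (y, hy) <- hashedYs
  , popCount (hx `xor` hy) <= maxDist
  ]
  where
    hashedXs = [(x, hash x) | x <- xs]
    hashedYs = [(y, hash y) | y <- ys]

-- Toy hash for the demo only: the sum of the character codes.
toyHash :: String -> Word64
toyHash = fromIntegral . sum . map ord

main :: IO ()
main = print (nearDuplicates toyHash ["ab"] ["ba", "zz"] 0)
-- [("ab","ba")]
```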

A more efficient implementation of navigating the hamming space is given in the paper Detecting Near-Duplicates for Web Crawling [PDF] by Gurmeet Manku, Arvind Jain, and Anish Sarma. For this little daft program, I can tolerate waiting a few seconds!

According to the readme included with the data set [tar.gz] there are 112 matches between the two sets of data. Running with various hamming distances I get the following results:

  -- Compare the lists, deduplicate and run for various hamming distances
  > map (\x -> length $ Data.List.nub (map fst $ compareLists zagats fodors x)) [0..5]
  [3,9,27,59,124,197]

So with a Hamming distance of 0, three results are returned (i.e. there are only 3 exact duplicates between the two data sources), whereas a Hamming distance of 5 returns far too many duplicates. Eye-balling the results, a Hamming distance of 2 seems to give the best set of actual closest matches (but still with a fair few false matches).

  -- Good
  "Bel-Air Hotel 701 Stone Canyon Rd. Bel Air 310-472-1211 Californian",
  "Hotel Bel-Air 701 Stone Canyon Rd. Bel Air 310/472-1211 Californian\STX"

  -- Good
  "Cafe Bizou 14016 Ventura Blvd. Sherman Oaks 818-788-3536 French Bistro",
  "Cafe Bizou 14016 Ventura Blvd. Sherman Oaks 818/788-3536 French\STX"

  -- False match
  "Cassell's 3266 W. Sixth St. LA 213-480-8668 Hamburgers",
  "Jack Sprat's Grill 10668 W. Pico Blvd. Los Angeles 310/837-6662 Health Food\STX"

  -- Good
  "Granita 23725 W. Malibu Rd. Malibu 310-456-0488 Californian",
  "Granita 23725 W. Malibu Rd. Malibu 310/456-0488 Californian\STX"

This is almost certainly because the choice of the ASCII values of each item in the string is a poor choice of feature vector! Given that we know the data set is a list of addresses, we could take advantage and cheat a little bit and base the feature vector solely on the numbers. Using just the numbers gives 95 perfect matches with a Hamming distance of zero. A hamming distance of 1 covers all the matches (and some false matches too - since it's order invariant and there aren't a whole host of numbers). Sim-hash is better suited for large documents with decent feature vectors!

Wednesday 8 September 2010

The Beginnings of a 6502 Emulator in Haskell

It's been ages since I've written some assembly code, but I'm also enjoying learning Haskell. Therefore, my next random coding exercise is to code a simple CPU emulator.

The 6502 Processor is an 8-bit processor introduced in 1975 and it's still used in embedded systems. It was a hugely popular chip used by such classic machines as the Atari 2600, the original Nintendo Entertainment System and the BBC Micro.

The 6502 Micro Processor!

A CPU is defined by its instruction set architecture (ISA). According to Computer Architecture - A Quantitative Approach (seems to be the authoritative book on Computer Architecture) an ISA is defined by:

Class - Most processors today are general-purpose register architectures where operands are either registers or memory locations. General purpose GPUs are a different class of ISA
Memory Addressing - Almost every processor uses byte addressing to access memory. Alignment can be an issue for performance and some processors such as the Itanium have more specific alignment requirements
Addressing Modes - An ISA can specify various ways of addressing memory. Examples include register, constant and displacement. The 6502 processor supports a dozen or so different addressing modes. A modern day Intel 64-bit processor supports even more including another level of indirection known as segment addressing.
Types and Sizes of operands - As an 8-bit processor the 6502 just supports 8-bit operands. A more sophisticated architecture like the 80x86 supports various sizes of integers and floating point
Operations - There used to be a divide amongst RISC / CISC. Now it seems most modern processors are a combination of the two, though I guess GPUs are the emergence of simple instructions again?
Control Flow Instructions - The various choices for branching. The 6502 supports a simple selection of branching instructions that move the program counter based on some arithmetic test
Encoding - How the assembler instructions are encoded into byte code. There are two choices here: fixed length encoding, which is easier to read but may take up more space, and variable length encoding, which is slower to deal with.

Thankfully the 6502 is one of the simplest processors available. After reading this great description of the registers, I modelled the CPU with the following simple data structure.

The RAM (a maximum address space of 64 KB) is a mutable Data.Vector. The CPU consists of a number of registers. The program counter indicates where the next instruction to execute is going to come from. Jump instructions move the program counter to a new location. The xr and yr registers are commonly used to hold offsets or counters for memory addressing. The accumulator (ac) register is used by most arithmetic and logical operations. The status register contains various flags represented by Flag. These flags can be set as instructions are executed and are hopefully self-explanatory. Finally the stack pointer contains a pointer to the next free location on the stack. The stack is held within a 256-byte page between 0x0100 and 0x01FF.
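A sketch of what such a structure could look like. The field names xr, yr and ac and the Flag type come from the description above; everything else (pc, sr, sp, initCPU, the flag constructor names) is assumed, and the RAM here is an IOUArray from the array package rather than a mutable Data.Vector, purely to keep the example free of extra dependencies.

```haskell
import Data.Array.IO (IOUArray, newArray, readArray, writeArray)
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import Data.Word (Word16, Word8)

-- The 6502 status flags; constructor names are illustrative.
data Flag = Carry | Zero | InterruptDisable | DecimalMode | Break | Overflow | Negative
  deriving (Show, Eq)

data CPU = CPU
  { ram :: IOUArray Word16 Word8  -- 64 KB of byte-addressable memory
  , pc  :: IORef Word16           -- program counter
  , xr  :: IORef Word8            -- X index register
  , yr  :: IORef Word8            -- Y index register
  , ac  :: IORef Word8            -- accumulator
  , sr  :: IORef Word8            -- status register holding the Flag bits
  , sp  :: IORef Word8            -- stack pointer, an offset into page 0x0100
  }

-- All memory and registers zeroed; the stack pointer starts at the top of
-- the stack page (a choice made for this sketch).
initCPU :: IO CPU
initCPU =
  CPU <$> newArray (0x0000, 0xFFFF) 0
      <*> newIORef 0
      <*> newIORef 0
      <*> newIORef 0
      <*> newIORef 0
      <*> newIORef 0
      <*> newIORef 0xFF

main :: IO ()
main = do
  cpu <- initCPU
  writeArray (ram cpu) 0x0200 0x2A       -- poke a byte into memory
  readArray (ram cpu) 0x0200 >>= writeIORef (ac cpu)
  readIORef (ac cpu) >>= print           -- 42
```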

In order to access the memory we need to understand how the various addressing modes work, again the 6502 site provides a very clear description.

Accumulator indicates that the instruction works directly on the accumulator. Immediate is an 8 bit constant value held within the constructor. In assembler this is usually indicated with #VAL. A ZeroPage address is a byte offset relative to 0, so this only allows indexing into the first 256 bytes of memory. ZeroPageX and ZeroPageY addressing modes use a zero page address together with the offset specified in the xr and yr registers. Relative addressing is only used by the branch instructions and gives a signed 8 bit offset indicating where the program counter should jump to. Absolute, AbsoluteX and AbsoluteY are like the ZeroPage modes but take a full 16 bit address, so they can reach the whole memory range. Finally there are 3 indirect addressing modes. Indirect gives a 16 bit address which identifies the location of the least significant byte of another 16 bit address, and that second address is the real target of the instruction. IndexedIndirect and IndirectIndexed are similar to Indirect but use the xr and yr registers respectively to index with an offset.
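The modes described above map naturally onto a sum type. The constructor names below are the ones used in the text; the operand types and the directAddress helper are assumptions for illustration. The helper only covers modes whose target can be computed without reading memory, and returns Nothing for the rest instead of calling error.

```haskell
import Data.Int (Int8)
import Data.Word (Word16, Word8)

data AddressMode
  = Accumulator
  | Immediate Word8
  | ZeroPage Word8
  | ZeroPageX Word8
  | ZeroPageY Word8
  | Relative Int8          -- signed offset used by branch instructions
  | Absolute Word16
  | AbsoluteX Word16
  | AbsoluteY Word16
  | Indirect Word16
  | IndexedIndirect Word8
  | IndirectIndexed Word8
  deriving (Show, Eq)

-- Effective address given the xr and yr register values (hypothetical helper).
directAddress :: Word8 -> Word8 -> AddressMode -> Maybe Word16
directAddress x y mode = case mode of
  ZeroPage a  -> Just (fromIntegral a)
  ZeroPageX a -> Just (fromIntegral (a + x))  -- Word8 addition wraps within the zero page
  ZeroPageY a -> Just (fromIntegral (a + y))
  Absolute a  -> Just a
  AbsoluteX a -> Just (a + fromIntegral x)
  AbsoluteY a -> Just (a + fromIntegral y)
  _           -> Nothing                      -- needs a memory read, or has no address

main :: IO ()
main = do
  print (directAddress 0x10 0 (ZeroPageX 0xF8))  -- Just 8
  print (directAddress 1 0 (AbsoluteX 0x1234))   -- Just 4661
  print (directAddress 0 0 Accumulator)          -- Nothing
```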

Once we've got our address mode we need a handful of functions to read from the various memory addresses.

These functions do exactly as they say on the tin. With a few small exceptions... For example, readWord16 should never be called with an AddressMode of Accumulator (you can't read 16 bits from an 8 bit value). For now I've just left these functions with "error" definitions until I can think of a better way of expressing it.

After getting these bits and pieces in place it's time to look at the CPU instructions, apparently developed by Cyberdyne Systems and used in "The Terminator".

Instructions consist of a three letter op code, together with an optional argument. For example, LDA loads the supplied value into the accumulator and sets the zero or negative flags appropriately. For some simple instructions no argument is needed (for example, CLC clears the carry flag but requires no argument).

Currently, I've represented instructions as just the three letter mnemonic and an optional address mode. In the future I'll try and make it more a specification by including the flags the operation is allowed to set, together with restricting to only the allowed addressing mode (for example, some instructions don't support all addressing modes, LDA wouldn't make much sense with a relative address).

Armed with this, all that needs to be done is implement the various operations. A simple function execute :: CPU -> Instruction -> IO () simply matches the instruction and executes it.

Finally after all that we can execute a really simple lump of code to see if it works.

In the example above we load a value into the accumulator, shift it left (i.e. multiply by 2) and store this value in some location in memory. We then shift it left again (multiply by 4) and left once more (multiply by 8). Finally we add the value stored in memory (the original multiplied by 2) to the accumulator (the original multiplied by 8), which gives ten times the original. For this really simple lump of code, it seems to work :)
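To make the walkthrough concrete, here is a small pure sketch of the same idea. It is not the post's execute :: CPU -> Instruction -> IO (): the Machine record, the five-instruction subset, the fixed addressing modes and the exact program are all assumptions, and the zero and negative flags and decimal mode are left out.

```haskell
import Data.Bits (shiftL, testBit)
import qualified Data.Map as M
import Data.Word (Word16, Word8)

data Instruction
  = LDA Word8   -- load an immediate value into the accumulator
  | ASL         -- shift the accumulator left by one bit
  | STA Word16  -- store the accumulator at an absolute address
  | CLC         -- clear the carry flag
  | ADC Word16  -- add the byte at an absolute address, plus carry, to the accumulator
  deriving Show

data Machine = Machine
  { acc   :: Word8
  , carry :: Bool
  , mem   :: M.Map Word16 Word8
  } deriving Show

execute :: Machine -> Instruction -> Machine
execute m (LDA v) = m { acc = v }
execute m ASL     = m { acc = acc m `shiftL` 1, carry = testBit (acc m) 7 }
execute m (STA a) = m { mem = M.insert a (acc m) (mem m) }
execute m CLC     = m { carry = False }
execute m (ADC a) = m { acc = fromIntegral total, carry = total > 0xFF }
  where
    operand = M.findWithDefault 0 a (mem m)
    total :: Int
    total = fromIntegral (acc m) + fromIntegral operand + (if carry m then 1 else 0)

-- Multiply by ten as (v * 8) + (v * 2).
timesTen :: Word8 -> [Instruction]
timesTen v = [LDA v, ASL, STA 0x0200, ASL, ASL, CLC, ADC 0x0200]

main :: IO ()
main = print (acc (foldl execute (Machine 0 False M.empty) (timesTen 5)))  -- 50
```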

My next plans are to:

Make things more type-safe (e.g. no wrong addressing modes, no modifying the wrong flags)
Write a simple parser using Parsec so that I can write little bits of assembler in a nicer syntax
Finish implementing the remainder of the operations (need to get the design right before I go much further)
Encode the instructions so that I actually use the program counter and potentially load in pre-existing code.
Add some IO to actually make it useful...

Sunday 29 August 2010

Speeding up the Ants program

In my previous post, I looked at implementing Ants in Haskell. As comments on the blog indicated, things started off OK, but performance rapidly got worse. This is the first time I've had this kind of problem, so I thought I'd document how I started to solve it.

Haskell has a huge range of performance testing tools to help find out what goes wrong. Real World Haskell has a great chapter on "Profiling and Optimization" that helps identify the tools available.

The first tool in the chain is simply to run the application and collect some basic statistics about the runtime. You can do this with the command ./AntsVis +RTS -sstderr. Anything after the "+RTS" is a Haskell runtime parameter. This produced the following output:



  1,835,837,744 bytes allocated in the heap
  328,944,448 bytes copied during GC
  2,908,728 bytes maximum residency (25 sample(s))
  142,056 bytes maximum slop
  9 MB total memory in use (1 MB lost due to fragmentation)

  Generation 0:  3483 collections,     0 parallel,  1.54s,  1.54s elapsed
  Generation 1:    25 collections,     0 parallel,  0.09s,  0.07s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    3.04s  (  3.13s elapsed)
  GC    time    1.63s  (  1.61s elapsed)
  RP    time    0.00s  (  0.00s elapsed)
  PROF  time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    4.67s  (  4.74s elapsed)

  %GC time      34.9%  (34.0% elapsed)

  Alloc rate    603,893,994 bytes per MUT second

  Productivity  65.1% of total user, 64.2% of total elapsed

Ouch... The program spent 35% of the time doing garbage collection and only 65% of the time doing useful stuff. It also allocated 603MB of memory / second! That's clearly not acceptable and I'm definitely allocating more memory than I should!

I lack the intuition to understand where the problems are from looking at the code, so time to break out the profiling program. GHC gives quite a few compiler options, but for profiling these seem to be the important ones:

-prof - Enables profiling
-caf-all - Constant Applicative form for all top-level items (constant costs, one for each module.)
-auto-all - Cost-centre analysis for every top-level function

After you compile with those options, you can run the program again with ./AntsVis +RTS -hc -p to get heap information together with profiling. That gives me the following picture. Flat things are good, things going up at a steep gradient are bad.

First Run of the Profiling Tool

As you might be able to see from the image (but click to see the full one) the finger is pointed squarely at the chain of functions involving updateTVar and evaporate, the code for which is shown below:

So how come this tiny inoffensive lump of code is allocating a ridiculous amount of memory? The blame lies solely in the record update function (c {pheromone = pheromone c * evapRate}). Laziness means that this isn't evaluated fully, which means it keeps a reference to the previous value of the cell. I'm not interested in the old value, but because there is a reference to it, the runtime can't throw it away. Time for some strictness annotations. Note that I've also done a few refactorings (the World data type was just a wrapper around Vector, so I got rid of the wrapper, and typing pheromone in 100 times was driving me nuts).
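A small sketch of the difference, outside STM to keep it short. The record update is the one quoted above; the Cell fields other than pheromone, the value of evapRate and the function names are assumptions.

```haskell
import Data.List (foldl')

data Cell = Cell { food :: Int, pheromone :: Double } deriving Show

evapRate :: Double
evapRate = 0.99  -- assumed value

-- Lazy: the new field is an unevaluated thunk that still refers to the old cell.
evaporateLazy :: Cell -> Cell
evaporateLazy c = c { pheromone = pheromone c * evapRate }

-- Forced: seq evaluates the new pheromone level before the new cell is returned,
-- so nothing in the result refers back to the old cell.
evaporate :: Cell -> Cell
evaporate c = p `seq` c { pheromone = p }
  where
    p = pheromone c * evapRate

main :: IO ()
main = do
  let final = foldl' (\c _ -> evaporate c) (Cell 0 1) [1 .. 1000 :: Int]
  print (pheromone final)
```

With evaporateLazy in the fold instead, each step would wrap the previous cell in another thunk, and the whole chain would only collapse when the final pheromone value is printed. Declaring the field strict (pheromone :: !Double) has a similar effect without an explicit seq.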

The seq primitive is of type a -> b -> b and evaluates the first argument to weak head normal form and returns the second. Hopefully, this should force the evaluation of the cells and free the memory up immediately. Running with the profiler again gives a much better picture:

A little bit of strictness added

Much better! Maximum memory usage is significantly down and when I run it my CPU usage is significantly down. seq isn't a panacea though. Earlier attempts at adding bang patterns everywhere led me astray - if you make everything strict, then you can get unintended consequences (I was having problems with fromJust being called on Nothing - not entirely sure why). I think I'll stick to putting ~~seq~~ strict annotations based on the information from the profiling tools, at least until I have a better intuition for how it will affect the final result.

Updates thanks to very helpful comments from augustss, don and gtllama!

seq is not the right way to achieve the strictness, see comments from Don Stewart about funbox-strict-fields and ! patterns for strictness.

I've removed the Criterion graphs as I don't think the way I generated them was reliable. I believe my use of "seq" was completely bogus and laziness bit me. The faster times almost certainly came from not evaluating the expression correctly once seq was in place. Rather embarrassing! From the comments: "Computing a value of type STM () does not execute it. STM code is only executed if it 'meets up with' an 'atomically' that is executed."

I also got to the bottom of the slow updating. Most definitely not an STM bug. The problem occurred because I was trying to do a single STM transaction to do evaporation - this meant that evaporation could only succeed if nothing else wrote over the pheromones used in the evaporation. Since many ants are marching and dropping pheromones, this transaction would need to be retried many times. Switching this over to use a sequence of atomic operations (one per cell) removed this problem. Performance is now consistent and when I run with profiling I get a flat heap output as shown below:

Flat as a Pancake
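The change from one big transaction to one transaction per cell can be sketched as follows. This is a simplification under stated assumptions: each cell is reduced to a TVar Double holding its pheromone level, the names are invented, and the evaporation rate is a made-up constant. It uses Control.Concurrent.STM from the stm package.

```haskell
import Control.Concurrent.STM
  (STM, TVar, atomically, newTVarIO, readTVar, readTVarIO, writeTVar)

type TCell = TVar Double  -- simplified: just the pheromone level

evaporateCell :: TCell -> STM ()
evaporateCell cell = do
  p <- readTVar cell
  writeTVar cell $! p * 0.99  -- 0.99 is an assumed evaporation rate

-- One big transaction: a conflicting write to any cell it has touched
-- forces the whole sweep to be retried.
evaporateAll :: [TCell] -> IO ()
evaporateAll cells = atomically (mapM_ evaporateCell cells)

-- One small transaction per cell: a conflict only retries that one cell.
evaporateEach :: [TCell] -> IO ()
evaporateEach = mapM_ (atomically . evaporateCell)

main :: IO ()
main = do
  cells <- mapM newTVarIO [1, 2, 3]
  evaporateEach cells
  mapM readTVarIO cells >>= print
```

The per-cell version gives up the guarantee that observers see the whole world evaporate at once, which is fine for a simulation like this one.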

This was definitely the most complicated Haskell program I've attempted so far, but also the one I've learnt the most from! Probably still some more mistakes in the code to find, but I've pushed the latest code to the repository.

Friday 27 August 2010

Ants and Haskell

Software Transactional Memory (STM) is a concurrency control mechanism designed to simplify programming for shared memory computers. Beautiful Code contains a great introduction to the concepts of STM in the Haskell language.

In most languages, locks and condition variables are the main mechanisms for controlling access, but these are notoriously hard to get right. Java Concurrency in Practice is a good read to understand just how many ways there are to shoot yourself in the foot (too few locks, too many locks, wrong locks, race conditions, wrong order, error conditions, deadlock, livelock, live stock, brain explosion). STM simplifies shared memory programming by providing database-like semantics for changing memory. Reads/Writes to shared memory happen within a transaction - each memory access appears to happen in isolation from the others and appears atomic to observers. If a transaction conflicts with another, then one of the transactions is retried. An implementation typically records the memory accesses somehow and then can decide whether there was a conflict. Languages that restrict mutability (like Clojure and Haskell) have a significantly simpler implementation than imperative languages such as C/C++.

Composability is another advantage for STM. For example, take java.util.Hashtable - what if you want to do an insert/delete as a single atomic operation and only make the contents visible to other threads once finished? As the original design didn't do this, you're on your own. In contrast STM composes well.

Both Clojure and Haskell feature support for STM. The canonical example in Clojure is Ants.clj that demonstrates STM via a simple simulation of foraging ants (see also Flocking about with Clojure). As a learning exercise I thought it'd be neat to try to convert this over to Haskell!

Ants in Haskell

To model the ants' world, I use the following data structures. Transactional variables (TVars) are used to hold a reference to a mutable variable. For the Ants simulation, I use a Vector of TCells to represent the ants' world.

TVars can only be modified from within the STM monad. To work with the value inside a TVar you can use newTVar, readTVar and writeTVar. Oddly there didn't seem to be a primitive operation to update a TVar based on its current value, so I wrote updateTVar, which updates the TVar by applying a function to the value inside.
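As a rough sketch of what this might look like (the Cell fields here are guesses for illustration, not the definitions from the repository), an updateTVar helper only needs readTVar and writeTVar:

```haskell
import Control.Concurrent.STM

-- Hypothetical cell contents; the real simulation may track more than this.
data Cell = Cell
  { food      :: Int
  , pheromone :: Double
  , hasAnt    :: Bool
  } deriving (Show, Eq)

type TCell = TVar Cell

-- Apply a function to the value inside a TVar, all within one transaction.
updateTVar :: TVar a -> (a -> a) -> STM ()
updateTVar tv f = readTVar tv >>= writeTVar tv . f
```

Later versions of the stm package ship modifyTVar and modifyTVar', which do the same job with the arguments in the same order.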

check verifies that a condition is true and, if it isn't, the transaction is retried. The key point is that the transaction is only retried when there's a reason to do so (e.g. one of the variables it read has been written to) so you aren't just heating the CPU whilst the condition is being validated. As an example, when we move an ant forward, we want to check that there is not an ant in the way. If there is an ant in the way, we'll wait till the coast is clear before moving.
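A minimal sketch of that idea, assuming (purely for illustration) that each cell is reduced to a TVar Bool saying whether an ant is standing on it:

```haskell
import Control.Concurrent.STM

-- Move an ant from one cell to the next. If the destination is occupied,
-- check makes the transaction block until the destination TVar changes.
moveAnt :: TVar Bool -> TVar Bool -> STM ()
moveAnt from to = do
  occupied <- readTVar to
  check (not occupied)
  writeTVar from False
  writeTVar to True
```

Because moveAnt stays in STM, it can be combined with other actions (dropping pheromone, picking up food) and the whole lot run with a single atomically.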

At some point you have to run your STM actions. atomically runs the STM transactions from the IO monad (e.g. the top-level) and returns the result. A very important point is that you want your actions to be in the STM monad as much as possible. If all of your functions are run within the IO monad then you lose the composability aspect. The pattern to use is make everything use the STM monad and glue together randomly and you won't have a threading problem (you still have other problems though, it's not magic).

The Clojure code used agents to represent each ant. I'm not sure what the most idiomatic translation to Haskell is, but I spawned a thread for each ant using Control.Concurrent and forkIO. Haskell threads are incredibly light-weight so spawning even thousands of them is not a problem. Each ant thread simply evaluates the behaviour, moves, sleeps and repeats.
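The thread-per-ant loop can be sketched like this. It is an illustration under assumptions: spawnAnt is a made-up name, and the ant's behaviour is passed in as an arbitrary STM action rather than the real behaviour function.

```haskell
import Control.Concurrent (ThreadId, forkIO, threadDelay)
import Control.Concurrent.STM
import Control.Monad (forever)

-- Fork a lightweight thread that runs the ant's behaviour as one
-- transaction, sleeps for the given number of microseconds, and repeats.
spawnAnt :: Int -> STM () -> IO ThreadId
spawnAnt delayMicros behave = forkIO . forever $ do
  atomically behave
  threadDelay delayMicros
```

Spawning the whole colony is then just a mapM_ over the ants, and the threads are killed automatically when the main thread exits.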

The rest of the code is more or less a direct translation from the Clojure. It's pretty verbose so I won't bother posting it here, but the full code is on my GitHub page. You should be able to compile it with ghc -lglut --make -main-is AntsVis AntsVis.hs. Any hints on how to make it suck less appreciated!

Performance seems very good with the default compiler options; I'm able to run with 100+ ant agents all running concurrently. The programming is very simple and once I'd found out about check and added the appropriate logic into move, everything worked properly.

Hurrah for simple concurrent programming!

(update 30/8/2010 - after finding out the performance sucked after a while with a few more ants than I'd tested with I looked at speeding up the Ants program).

Tuesday 17 August 2010

Freebasing with Haskell

Freebase is a collection of structured data, queryable via a powerful API and recently acquired by Google.

Data is categorized according to different types. A Topic represents a thing (such as a physical entity, a substance or a building). Each Topic is associated with a number of Types, for example the Salvador Dali topic might be associated with Painting and Surrealism. A group of related Types are organized into a Domain. For example, the books domain contains types for book, literature subject and publisher. Domains, Types and Properties are further organized into namespaces which can be thought of as top-level containers to reference items. For example, the /en namespace contains human-readable IDs and URLs for popular topics. Similarly, the /wikipedia/en namespace gives a key that can be used to formulate a URL representing the corresponding article on Wikipedia. This extra structure allows precise concepts to be expressed, with no ambiguity.

Freebase has a number of web services you can use to interrogate the database. The web services accept and return JSON. The most basic services just give you information about the status and version in JSON format. The Text.JSON package provides a method for reading and writing JSON in Haskell. This, coupled with Network.HTTP, makes basic requests very simple.

More advanced requests use MQL to express queries to Freebase. The underlying database is best thought of as a directed graph of nodes and relationships. Each node has a unique identifier and a record of who created the node. A node has a number of outgoing edges which are either relationships to other nodes or primitive values. For example the /en/iggy_pop node might be linked to the /en/the_passenger node via the /music/album/artist property. Relationships between nodes can have property values associated with them, for example /en/the_passenger might be linked to /music/track/length with 4:44.

The Query Editor provides a way of running these queries in a web browser and this, together with the Schema Explorer, allows you to explore the data available in Freebase. Queries are expressed as JSON and consist of a JSON object in which the blanks are filled in. As an example, if I submit the following query with a null placeholder, then the returned value has the null filled in with the correct details (apparently Indiana Jones was released 23rd May 1984).


  {
     "query": {
         "type": "/film/film",
         "name": "Indiana Jones and the Temple of Doom",
         "initial_release_date": null
     }
  }

The Freebase documentation gives an example of a basic web app built using PHP and I thought it'd be a good learning exercise to convert this over to a functional programming language. Haskell has quite a few web frameworks, including Snap, Happstack and Yesod.

Yesod had the strangest name, so it seemed like the logical choice. Yesod uses some extensions to Haskell: Type Families, quasi-quoting and Template Haskell. Quasi-quoting and TH are particularly exciting as they allow a HAML-like syntax to be used as statically compiled Haskell.

Each Yesod application has a site type that is passed to all functions and contains arguments applicable to the whole site, such as database connections and global settings. URLs are handled by a routing table which, again, is statically compiled and means you cannot create an invalid internal link. Neat!

For the basic album lister app, we define three really simple routes: a home page, a URL to call to get the albums, and a link to the static files. Note that the static files are again checked at compile time. If the relevant JS/CSS files don't exist, then the application fails to compile!

The getHomeR route handler (which ends in capital R by convention) simply renders a Hamlet template.

A little bit of JavaScript in script.js makes an Ajax call to retrieve the list of bands.

And the basic example from the MQLRead documentation is converted over. I've only spent a few hours with Yesod but I'm definitely impressed so far: good documentation and easy to get something simple done! The complete code is on my GitHub page.

Friday 13 August 2010

A Brief History of Java

In 1995, Sun released the Java programming language as a part of a broader strategy known as the Java platform. The "write once, run anywhere" (WORA) motto initially promised to make Java the everywhere language, running on everything from wrist watches to cell phones and laptops to supercomputers.

The initial reception of Java was mixed. For every Java sucks, an evangelist promised that it would change the world, and Java powered toasters were just around the corner.

As Java evolved, it accrued more baggage as it fought to keep WORA. Deprecated methods sprang up everywhere, but Sun had to keep these features in place to provide backwards compatibility. The java.util.Date class became synonymous with dysfunctional design and bad naming conventions were fixed for life (is it size or is it length?).

Microsoft's .NET began to encroach on Java's domain. Microsoft's team introduced delegates, a type-safe function pointer that makes event handling considerably simpler. Java needed a competitor, and fast, so in version 1.1, Java grew inner classes, a way of achieving a similar effect, but in a more limited, laboured fashion. A Java white-paper concluded "Bound method references are simply unnecessary.... They detract from the simplicity and unity of the Java language". Meanwhile, efforts continue to this day for Java to adopt bound method references.

When Java reached version 1.4, Sun decided that a new approach was required to compete with Microsoft's .NET strategy. Sun thought long and hard and branded the next version, 1.5, as "5" in an attempt to move ahead of .NET 2.0.

This was accompanied by a decision to implement generics, a way of achieving additional type safety. Unfortunately, type-safety came at the cost of typing. As engineers adopted generics they were often heard cursing as they bashed out yet another List<Foo> foos = new ArrayList<Foo>() statement.

Universities quickly adopted Java; no longer did students have to learn the intricacies of manual memory management and pointers, instead they could rely on Java to do the heavy lifting and focus on solving problems. Unfortunately, this led to the production of a league of developers known as the "Patternistas" who only had a hammer and everything was a nail. Under their leadership, Java naming conventions became increasingly ridiculous. When class names such as RequestProcessorFactoryFactory became commonplace some developers began to question the wisdom of the infinite tower of abstraction.

As developers realized they were just shuffling thousands of lines of code around each day, they needed a word to justify their existence. That word was refactoring. The patternistas rejoiced; not only could they apply factory factories, singletons and visitors to solve problems, but they could repeatedly change their mind and justify it with a buzzword!

A whole industry evolved to satisfy the patternistas. RSI injuries were becoming increasingly commonplace amongst Java veterans so a new breed of development environment was built. IntelliJ and Eclipse were built with the aim of minimizing the damage to developers. Advanced code completion and refactoring meant that developers merely had to press key combinations instead of typing in verbose code constructs for missing language features.

If all you have is a hammer, everything looks like a nail

Java began the push for the Enterprise by employing some of the leading architecture astronauts to come up with a paradigm-shifting, synergistic approach for empowering enterprise software. The result was nothing short of a revolution; beans were born. Beans are apparently a server-side component architecture for the modular construction of enterprise applications.

Having softened up resistance to angle brackets through the use of generics, Java moved forward and jumped on the XML bandwagon. By using XML, developers were able to express concise concepts as huge, verbose angular nightmares. This has the advantage that XML files (unlike other files) can be easily read by other computers. The small price of being unreadable for humans was felt to be worth paying. Services such as Ant and JBoss led the way with executable XML and powerful deployment descriptors.

Meanwhile, in a galaxy far away from architecture astronauts, a new breed of programmer decided that getting shit done was more important than typing shit all day long. This led to the birth of frameworks such as Rails which are designed to get out-of-the-way and let you solve problems, an approach popularized as convention over configuration. Rails received positive feedback and at least some former Java addicts kicked the habit for Ruby. The first signs of Java's grip loosening were becoming apparent.

In August 2006 the Java 7 project began. Many developers are pushing for a feature known as lambda expressions that would undoubtedly simplify many of the common coding tasks that Java makes so painful. Unfortunately, four years later the Java committee is still arguing over the nuances of this feature and it's possible that this will be dropped. The lack of movement on Java 7 has led to the birth of new, sexier languages such as Clojure and Scala, designed to run within the Java ecosystem, but without the need for the language itself.

The final nail in the coffin started being hammered in April 2009 when Oracle announced plans to acquire Sun. Headed by "Larry, Prince of Darkness", Oracle is an acquisition machine that specializes in Enterprise Software and making money. When Oracle's lawyers uncovered a set of software patents, they picked a big target and started a fight. Targets come no bigger than Google (a leading advertising organization), so Oracle's lawyers pounced and battle has commenced.

Where does this leave Java? From its humble beginnings 15 years ago, Java has risen to the top of the pile of popular programming languages under Sun's stewardship. Under Oracle, it's unclear whether the next version of Java will ever get released, let alone contain the features developers crave. Is this the beginning of the end for Java?

Saturday 7 August 2010

Sniffing out Pacman

How would you write the AI for ghosts in Pac-Man? Your first thought is probably to centralize the logic in a ghost object/function and go from there. This approach has quite a few traps, and a lot of complexity: for example, how do you deal with dead-ends? Eventually the algorithm is going to get pretty complicated (something like A* search).

The excellent paper Programming Anti-Objects shows a very simple way to solve this problem with a bit of lateral thinking. Instead of centralizing the logic for ghost behaviour, why not distribute it? Reconceptualizing the problem in this way brings the background objects to the fore - the behaviour does not live in the ghost; it lives in the background tiles. The simple approach described in the paper is based on diffusion - each target has a scent that is distributed via the background tiles. Complex behaviour can emerge because some tiles distribute the scent (paths) whereas some block scent (obstacles).

I've represented the tiles as a map from Point to a list of Agents. The list is really a stack where the top-most agent is the only one that really matters. I've assumed that the tiles are arranged in a square, hence the one size variable and I keep the current location of the pursuers and goal handy.

The algorithm consists of two stages; diffusing the scent over the grid followed by updating the pursuers using a simple hill-climbing approach (e.g. move to the surrounding tile with the highest scent).

I've looked at the diffusion algorithm before in Barely Functional Fluid Dynamics, but I implemented it in an imperative style that makes it complicated to understand. This time around, I looked at a purely functional solution. Calculating the diffusion for any cell is a matter of doing some simple calculations on the four cells around it (also known as the von Neumann neighbourhood). The equation below is taken from the paper:

Diffusion Equation from Collaborative Diffusion - Programming AntiObjects
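As I understand the paper, the new scent of a tile is its old scent plus a diffusion coefficient D times the sum of the differences between each neighbour's scent and its own. Here is a minimal sketch of one such step. It is an assumption-laden illustration, not the code from the repository: it stores scent in a plain Map, treats any point missing from the map as an obstacle, and computes every tile from the old grid only.

```haskell
import qualified Data.Map as Map
import Data.Maybe (mapMaybe)

type Point = (Int, Int)
type Scent = Map.Map Point Double  -- path tiles only; obstacles are absent

-- The four tiles of the von Neumann neighbourhood.
neighbours :: Point -> [Point]
neighbours (x, y) = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]

-- One diffusion step with coefficient d: u0' = u0 + d * sum (ui - u0).
diffuse :: Double -> Scent -> Scent
diffuse d old = Map.mapWithKey step old
  where
    step p u0 = u0 + d * sum [u - u0 | u <- mapMaybe (`Map.lookup` old) (neighbours p)]
```

In a full simulation the goal tile would have its scent reset to a fixed high value after every step, so that there is always something to diffuse.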

One way to solve this in Haskell would be to use some mutable state and just write some imperative-style code to mutate it in place. This would probably give the best performance, but at the cost of making the code destructive and difficult to follow. Alternatively, we can express the whole thing functionally. When I looked at the Floyd Warshall algorithm, it made use of dynamic programming and in particular defining a data structure recursively. We can use a similar trick to define updates to the grid in terms of previously calculated values.

In the image above, if we are calculating the diffusion value for the red-tile, then we can reference the "new" diffused values for the green tiles, but we still must use the original values for the blue tiles.

Ignoring the update pursuers function, we update the board by constructing a map based on the previous environment and itself.

Updating the pursuers is simple: just find the neighbouring path cell with the highest scent and move to it. Putting this all together, we can run some little tests. The example below shows a pursuer (in green) solving a little maze. It takes a while to get going whilst the scent is diffused, but once it is it homes in on a solution very quickly.
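The hill-climbing step can be sketched in a few lines. As before this is an illustration under assumptions: scent lives in a Map from Point to Double, obstacles are simply absent from the map, and bestMove is a made-up name.

```haskell
import qualified Data.Map as Map
import Data.List (maximumBy)
import Data.Ord (comparing)

type Point = (Int, Int)
type Scent = Map.Map Point Double

neighbours :: Point -> [Point]
neighbours (x, y) = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]

-- Move to the adjacent tile with the strongest scent; stay put if the
-- pursuer is completely boxed in.
bestMove :: Scent -> Point -> Point
bestMove scent p =
  case [(q, s) | q <- neighbours p, Just s <- [Map.lookup q scent]] of
    []         -> p
    candidates -> fst (maximumBy (comparing snd) candidates)
```

Tiles occupied by other pursuers would be treated like obstacles here, which is what produces the collaborative behaviour.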

As explained in the paper, collaborative behaviour emerges because pursuers block the scent from the goal. This means if you have multiple paths to the goal, the pursuers will each choose a different route purely because once one pursuer is on a route it blocks out the scent for those that follow. In the example scenario below, the agents will never take the same route and will always ensure there's no escape for the target.

The code for this is available on my github page together with a simple GUI that allows you to build an environment and run the simulation.

Wednesday 28 July 2010

Stop the Traffic

Traffic flow is a strange phenomenon. The complex interaction of drivers results in strange emergent behaviour, such as snarled up traffic even in what seem like ideal driving conditions.

A microsimulation is a modelling technique that models individual units. Microsimulations are used in a number of fields, including health sciences, pedestrian modelling and traffic simulation.

My simple traffic microsimulation is based on a simple car-following model. If there's nothing nearby, drivers keep a constant speed. Seeing a car ahead, drivers have the irresistible urge to catch up with them, and thus put their foot down. Seeing a car ahead that's a little too close, drivers slam their brakes on. Overtaking is strictly prohibited! This is a deliberately simple model, just to see if the behaviour emerges; there are a number of much more sophisticated models available such as Gipps' or the Intelligent Driver Model.

The simplest model I could think of is a giant roundabout. Cars circle the roundabout as fast as possible, assuming the basic model described above. This results in a nice traffic shockwave much like the one shown in the video below.

The main data types are shown below and are (hopefully?) self-explanatory. Rather than store the position of a car directly, I just store the distance to destination. This makes placing the cars and determining the distance between them considerably simpler. I've used an infinite list of randomness to provide a source of noise to the simulation.

The actual logic of the simulation is equally simple. All we have to do is reposition the cars based on their speed. I've made lots of simplifying assumptions. For example, I'm only considering cars on the same route as being near and I'm not allowing cars to overtake.
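To make the car-following rules concrete, here is a minimal sketch. It is not the code from the gist: the Car fields, the function names and all of the thresholds are invented for illustration, and the random noise is left out.

```haskell
data Car = Car
  { distanceToGo :: Double  -- distance left to the destination
  , speed        :: Double
  } deriving (Show, Eq)

-- Illustrative constants only.
tooClose, inSight, maxSpeed, acceleration, braking :: Double
tooClose     = 5
inSight      = 30
maxSpeed     = 10
acceleration = 0.5
braking      = 2

-- Adjust speed given the gap to the car ahead (Nothing if the road is clear).
adjustSpeed :: Maybe Double -> Car -> Car
adjustSpeed Nothing car = car
adjustSpeed (Just gap) car
  | gap < tooClose = car { speed = max 0 (speed car - braking) }
  | gap < inSight  = car { speed = min maxSpeed (speed car + acceleration) }
  | otherwise      = car

-- Reposition the car after a time step of dt.
advance :: Double -> Car -> Car
advance dt car = car { distanceToGo = distanceToGo car - speed car * dt }
```

One tick of the simulation is then adjustSpeed followed by advance for every car on the route.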

Putting this together with some OpenGL code gives the following.

The full code for this is on my GitHub page and (as always!) any suggestions on how to improve the code are gratefully received! I have a sneaking suspicion I should be using a State monad somewhere in here, but it's not quite clicked how yet!

(edit 3rd August 2010)

I updated the code in the gist with the corrections that Edward Kmett kindly provided. Much better looking results now, though my screen capturing skills still leave a little to be desired. If you download the code you can now change the speed / number of cars interactively and see what happens.

Thursday 15 July 2010

Foreign Exchange Arbitrage

Arbitrage exploits a price difference between two or more markets to make money with zero risk.

Sporting bets are perhaps the easiest example. Given a sporting event with two possible outcomes, you look for an opportunity where two bookmakers give slightly different odds and try to exploit it. For example, consider a case with two outcomes, say Federer vs. Nadal. One bookie offers odds of 5/2 on Federer, whereas another offers odds of 3/5 on Nadal. If I bet £32 at 5/2, and £68 at 3/5, then regardless of the outcome I'm guaranteed to make a little money (a minimum of 8% in this case: £32 at 5/2 odds gives me £112 and £68 at 3/5 gives me £108.80). Wikipedia's page on arbitrage betting explains it better than I can!
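The arithmetic in the betting example is easy to check with a few lines of Haskell. This is just a worked calculation; payout and guaranteedReturn are names invented here, and fractional odds of a/b are taken to mean a profit of a for every b staked, with the stake returned on a win.

```haskell
import Data.Ratio ((%))

-- Total returned (stake plus winnings) for a winning bet at fractional odds.
payout :: Rational -> Rational -> Rational
payout odds stake = stake + stake * odds

-- Given one (odds, stake) bet per outcome, the amount returned in the worst case.
guaranteedReturn :: [(Rational, Rational)] -> Rational
guaranteedReturn = minimum . map (uncurry payout)
```

With the bets above, guaranteedReturn [(5 % 2, 32), (3 % 5, 68)] is 544 % 5, i.e. £108.80 back on a total stake of £100.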

Foreign exchange (forex) markets are another opportunity for the arbitrageur. By finding mispriced currencies you can get money for nothing. For example I could transfer my money from GBP to USD via CHF and end up with a profit if someone got these exchange rates wrong! Now all I need to do is find a way of finding these opportunities and I'll be rich!

The exchange rates can be represented as a weighted graph with each node representing a currency and each directed edge representing the exchange rate. To find an arbitrage opportunity we need to find a path through the graph (starting and ending at the same node) where the product of edge weights is greater than 1.0. The shortest path is the best one, because we want to make as few trades as possible. The Floyd Warshall algorithm is an efficient algorithm for finding the shortest paths in a weighted graph.
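As a small illustration of the "product of edge weights" condition, here is a sketch that checks a single candidate loop. The types and function names are assumptions made for this example, and any rates you feed it are up to you.

```haskell
import qualified Data.Map as Map

type Currency = String
type Rates = Map.Map (Currency, Currency) Double

-- Multiply the rates along a path such as ["GBP", "USD", "CHF", "GBP"].
-- Returns Nothing if any leg of the path has no quoted rate.
cycleProduct :: Rates -> [Currency] -> Maybe Double
cycleProduct rates path =
  product <$> mapM (`Map.lookup` rates) (zip path (drop 1 path))

-- A loop is an arbitrage opportunity if one unit comes back as more than one.
isArbitrage :: Rates -> [Currency] -> Bool
isArbitrage rates path = maybe False (> 1) (cycleProduct rates path)
```

For example, with made-up rates of 1.5 for GBP to USD, 1.0 for USD to CHF and 0.7 for CHF to GBP, the loop multiplies out to roughly 1.05, so isArbitrage returns True.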

In the graph below we can make a trade 1 -> 2 -> 4 -> 3 -> 1 and make a profit.

Floyd Warshall is an example of dynamic programming. A dynamic programming solution has three components (from The Algorithm Design Manual):

A recurrence relation or recursive algorithm for expressing the answer
Show that the recursive version is bounded by a small polynomial
An order of evaluation for the recurrence that means you always have partial results available when you need them.

We can define the arbitrage problem recursively. First we label the nodes as being numbered from 1 to N. shortestPath(i,j,k) gives the shortest path from i to j using only nodes 1 to k as intermediate nodes. The base case is shortestPath(i,j,0) where the value is simply the edge weight between i and j. The recursive relation is shortestPath(i,j,k) = min(shortestPath(i,j,k-1), shortestPath(i,k,k-1) + shortestPath(k,j,k-1)). The recurrence relation just says that each new node that we consider only helps if there is a shortest path that goes through it.

In order to code this up in Haskell we'll need some representation of a graph. A graph is simply a collection of vertices and some edges. In order to make things easier I've said that each vertex should be an enum and have an integer representation.

I've needed two language extensions (multi-parameter type classes and functional dependencies), which seems overkill! A functional dependency is used in Graph a b | a -> b to state that b is uniquely defined by a. Any suggestions on a simpler solution would be greatly appreciated!
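For readers who haven't met these extensions, a class of this shape might look something like the sketch below. Only the class head Graph a b | a -> b comes from the description above; the method names, the Currency type and the RateTable instance are all hypothetical.

```haskell
{-# LANGUAGE MultiParamTypeClasses, FunctionalDependencies #-}
import qualified Data.Map as Map

-- a is the graph type, b is its vertex type; knowing a determines b.
class Enum b => Graph a b | a -> b where
  vertices :: a -> [b]
  edge     :: a -> b -> b -> Maybe Double

-- Hypothetical vertex and graph types for illustration.
data Currency = GBP | USD | CHF
  deriving (Show, Eq, Ord, Enum, Bounded)

newtype RateTable = RateTable (Map.Map (Currency, Currency) Double)

instance Graph RateTable Currency where
  vertices _ = [minBound .. maxBound]
  edge (RateTable m) from to = Map.lookup (from, to) m
```

The functional dependency is what lets the compiler work out that vertices applied to a RateTable must return a list of Currency.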

Dynamic programming solutions can be expressed very neatly in lazy functional programming languages like Haskell. See here for more information, but the basic idea is to define a data structure in terms of itself. If we initialize the data structure in the correct order, then all the previous results are available when they are needed. This makes for a concise solution to the problem.
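Here is a minimal sketch of that trick applied to the shortestPath recurrence. It is a plain version without the path reconstruction, and it makes some assumptions of its own: nodes are numbered 1 to n, the edge weights arrive as a function, and a missing edge is represented by infinity.

```haskell
import Data.Array

-- table ! (i, j, k) is the length of the shortest path from i to j using
-- only nodes 1..k as intermediates. The array is defined in terms of
-- itself; laziness means each cell is computed on demand, exactly once.
shortestPaths :: Int -> (Int -> Int -> Double) -> Array (Int, Int, Int) Double
shortestPaths n weight = table
  where
    table = array ((1, 1, 0), (n, n, n))
      [ ((i, j, k), cell i j k) | i <- [1 .. n], j <- [1 .. n], k <- [0 .. n] ]
    cell i j 0 = weight i j
    cell i j k = min (table ! (i, j, k - 1))
                     (table ! (i, k, k - 1) + table ! (k, j, k - 1))
```

The answer for a pair (i, j) is then shortestPaths n weight ! (i, j, n).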

The Floyd Warshall algorithm has been extended a little above to find the minimum solution and to provide path reconstruction information. Once we have the path reconstruction matrix we can work backwards to reconstruct the series of trades that lead to an arbitrage opportunity. There's only an opportunity if there is a loop from a starting currency that returns with a value greater than 1.

So, I've now got the ability to find an arbitrage path if one exists, now all I need to do is hook my computer up to a live feed of exchange rates, put a little money in and I'll be rich by the end of the day. Just in case it's not quite that easy, I thought I'd better run it on some test data!

There's a good source of test data at GAIN Capital that's available here. I downloaded January's historical data and after a while ended up with 1.3GB of data to sort through. The data is in a simple CSV format, with columns representing the ticket, currencies, date/time, bid rate and ask rate. This gave me an opportunity to use Parsec which is apparently an industrial strength, monadic parser combinator library. Thanks to RWH Chapter 16 it was pretty simple to use and understand.

Now that I've got a way of reading in the historical data, I simply need to convert this data into a form that the Floyd Warshall algorithm can use and look for paths that give a profit. I could have done that directly with the parser code, but I might want to reuse that again for some other cunning plan. Instead I'll convert to a simple map representation, with the keys being pairs of vertices and the values being the exchange rate between those vertices. The following code takes a list of foreign exchange records, converts them into exchanges and looks for arbitrage opportunities.

Assuming that I've understood the format properly (anyone know any good books to get this kind of basic information?), then running this code through the 21 million (!) ticker updates in January 2010 yields absolutely ZERO arbitrage opportunities. With my limited knowledge, I think this is because of the efficient-market hypothesis. Prices reflect the information available and the updates are incredibly quick (otherwise currency traders would be going bankrupt on a daily basis!).

Back to the drawing board with my money making schemes!

(image taken from Flickr)

Saturday 12 June 2010

Orbit Simulator in Haskell

Uncle Bob recently wrote an article showing a simple orbital simulator in Clojure. Aside from the strange formatting, the code is pretty easy to follow, though there are large amounts of duplicate code (see vector.clj vs. position.clj). I guess the reasoning behind this is to try to stop mixing up units?

A phantom type is one way to avoid this duplication in Haskell. A phantom type is only ever used to construct types. Its sole purpose is for validation by the type checker. The example below uses phantom types to encode position, velocity and force. This means we have no code repetition for each of the vector operations.
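A minimal sketch of the idea is below. The names Vec, add, Position and Force match the ghci session further down, but the definitions themselves are a reconstruction for illustration rather than the original code.

```haskell
-- Tag types with no values; they exist only for the type checker.
data Position
data Velocity
data Force

-- The type parameter never appears on the right-hand side: it is a phantom.
data Vec a = Vec Double Double
  deriving (Show, Eq)

-- One implementation serves every kind of vector, but both arguments
-- must carry the same tag.
add :: Vec a -> Vec a -> Vec a
add (Vec x1 y1) (Vec x2 y2) = Vec (x1 + x2) (y1 + y2)

scale :: Double -> Vec a -> Vec a
scale k (Vec x y) = Vec (k * x) (k * y)
```

Adding two Vec Position values type checks; adding a Vec Force to a Vec Position does not.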

When we do the physical calculations to move the objects around the phantom types do their job. As an example, if I try to add two unrelated vectors together I get a compile time error showing that I can't. Here's an example from a ghci session.

  Prelude Orbit> let a = Vec 0 0 :: Vec Force
  Prelude Orbit> let b = Vec 1 1 :: Vec Position
  Prelude Orbit> add a b

  :1:6:
    Couldn't match expected type `Force'
           against inferred type `Position'
      Expected type: Vec Force
      Inferred type: Vec Position
    In the second argument of `add', namely `b'
    In the expression: add a b

The code to do the physical calculations is very simple. The record update syntax is new to me and means you can just change the selected fields in an object, rather than creating a new one and copying across all the fields (see the section on named fields).
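For anyone else new to record update syntax, here is a tiny sketch. The Object fields are made up for illustration and are not the ones in the repository.

```haskell
data Object = Object
  { position :: (Double, Double)
  , velocity :: (Double, Double)
  , mass     :: Double
  } deriving (Show, Eq)

-- Only position is mentioned in the update; velocity and mass are
-- carried over unchanged into the new value.
move :: Double -> Object -> Object
move dt o = o { position = (x + vx * dt, y + vy * dt) }
  where
    (x, y)   = position o
    (vx, vy) = velocity o
```

Note that this still builds a new Object under the hood; the syntax just saves you from copying the untouched fields by hand.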

I think collideAll looks substantially nicer than the Clojure implementation. From the comments on the blog, I suspect it still suffers from a period of inaccuracy if three objects collide simultaneously (they won't get merged in the first round, but almost certainly will in the second). nubBy is an incredibly useful function I hadn't seen before and it allows you to remove duplicates according to your own definition of equality.
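As a quick illustration of nubBy (this is not the collideAll code; the Body type and the one-unit collision radius are invented for the example):

```haskell
import Data.List (nubBy)

type Body = (Double, Double)

-- Two bodies count as "the same" if they are within the given distance.
closeTo :: Double -> Body -> Body -> Bool
closeTo r (x1, y1) (x2, y2) = sqrt ((x1 - x2) ^ 2 + (y1 - y2) ^ 2) < r

-- Keep the first body of every group that overlaps.
distinctBodies :: [Body] -> [Body]
distinctBodies = nubBy (closeTo 1.0)
```

For example, distinctBodies [(0, 0), (0.5, 0), (10, 10)] keeps (0, 0) and (10, 10) and drops (0.5, 0).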

The Clojure version has a bunch of unit tests (as you might expect given the author!). The tests are in the 1 + 1 = 2 style by which I mean give some inputs and validate an output. These are the sort of tests you might capture as part of a REPL transcript, but they give you very little assurance that the code is actually correct.

QuickCheck gives an incredibly powerful way of testing applications by making assertions about the code and challenging QuickCheck to falsify them with generated data. In the case of this simple physics model, we can make assertions like the total amount of energy in the system must be conserved and let QuickCheck worry about generating the cases where this doesn't hold.

In order to do this we first have to give QuickCheck a way of generating random objects to simulate. The Arbitrary type class defines arbitrary, so we just need to make Object an instance of this type class and implement the method.

A few simple QuickCheck properties are shown below. These assert that for any given vector (apart from one with zero length), the magnitude of the corresponding unit vector is 1.0 (subject to a rounding error or two) and that for any given set of objects, the kinetic energy is conserved. This is a very simple example of how powerful QuickCheck can be.

Running QuickCheck against these properties shows that they hold for a large number of automatically generated test cases. This gives a strong hint that this code actually works.

The final task is to visualize this data. I used OpenGL again and knocked up a quick UI, the source code for which is here. The (terrible) video below gives an example of it in action. If anyone knows of a better way of capturing OpenGL I'd love to know. I'm currently using Istanbul which works, but brings my system to its knees.

The complete source for this is available on my GitHub page. Any comments or suggested improvements greatly appreciated!

Finally, the company I work for is currently recruiting in Cambridge, UK. If you're the sort of person who is looking for challenges such as analysing petabytes of data, writing high performance servers, or developing the richest pure JavaScript interfaces imaginable (and a whole lot more in between) then drop me a note at jeff.foster AT acm.org and I'll give you more information. Oh, and if you're accepted you'll get your choice of an iPad or the latest Android phone.

Wednesday 2 June 2010

Monte Carlo Methods and the World Cup

A Monte Carlo method is a method of solving problems by statistical sampling. A typical approach to solving a Monte-Carlo problem is (according to Wikipedia):

Define a domain of possible inputs.
Generate inputs randomly from the domain using a certain specified probability distribution.
Perform a deterministic computation using the inputs.
Aggregate the results of the individual computations into the final result.

One of the simplest examples of a Monte-Carlo method is estimating pi. The approach below hurls darts at a 1x1 square. Some of these land within the unit circle, and some outside. The square covers a quarter of the circle, so the fraction of darts landing inside approximates pi/4, and multiplying that ratio by four gives an estimate of pi.
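A minimal, self-contained sketch of the dart-throwing approach follows. To avoid depending on a random number package it uses a small linear congruential generator, which is an assumption of this example rather than what the original code did; it also assumes a 64-bit Int. The names nextSeed, uniforms and estimatePi are invented here.

```haskell
import Data.List (foldl')

-- A simple linear congruential generator (modulus 2^31).
nextSeed :: Int -> Int
nextSeed s = (1103515245 * s + 12345) `mod` 2147483648

-- An infinite stream of pseudo-random numbers in [0, 1).
uniforms :: Int -> [Double]
uniforms seed = map toUnit (drop 1 (iterate nextSeed seed))
  where toUnit s = fromIntegral s / 2147483648

-- Throw n darts at the unit square and count those inside the quarter
-- circle; the hit ratio approximates pi / 4.
estimatePi :: Int -> Int -> Double
estimatePi seed n = 4 * fromIntegral hits / fromIntegral n
  where
    hits = foldl' count (0 :: Int) (take n (pairs (uniforms seed)))
    count acc (x, y)
      | x * x + y * y <= 1 = acc + 1
      | otherwise          = acc
    pairs (a : b : rest) = (a, b) : pairs rest
    pairs _              = []
```

Calling estimatePi with a seed and a sample count returns the estimate; the exact digits depend on the generator and seed, so they won't match the table in this post.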

The accuracy of calculating pi depends on the number of samples taken. As you can see below it's not until very large amounts of samples are taken that the number even begins to approach pi.


    Samples    Estimate
    10         3.2
    100        3.08
    1000       3.204
    10000      3.142
    100000     3.14072
    1000000    3.143136
    10000000   3.1407732

We can also apply the Monte Carlo method to sporting events. The 2010 World Cup is almost upon us. The current FIFA World Rankings make Brazil or Spain the overwhelming favourites. The draw (PDF) has an element of randomness though, so it's possible that two strong teams could find themselves meeting earlier and thus one of them going home. A Monte Carlo simulation sounds like the ideal way of settling this. Or at least, it's going to be just as wildly inaccurate as the pundits' predictions!

The World Cup consists of two stages: the group stage, where four teams play each other and the top two advance, and the knockout stage, in which the remaining teams play until the final is reached.

I've defined a class to represent the model that will be used to simulate the matches. For this example, I've simply based things on the FIFA World Rankings, but there are much more sophisticated techniques that could be used (whether these are demonstrably any more accurate or not, I'm not sure!).

The basic implementation of this type class is incredibly simple. It uses the world rankings and simply compares these to get a result. In the event of a draw, the home team is assumed to win! This could obviously be substantially "improved".

To simulate randomness, I've assumed that each team can play up to 30% better or worse than their ranking would suggest. This means that Brazil will always beat New Zealand, whereas teams that are closer (England) may have some chance!
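A rough sketch of what such a model might look like is below. Everything here is hypothetical: the class and method names, the RankingModel type, and the convention that each team arrives paired with a form factor between 0.7 and 1.3 drawn elsewhere from the random source.

```haskell
import qualified Data.Map as Map

type Team = String

-- A model decides the winner of a match. Each side is a team paired
-- with a form factor (0.7 to 1.3 for the 30% swing described above).
class Model m where
  winner :: m -> (Team, Double) -> (Team, Double) -> Team

-- Ranking points per team; unknown teams score zero.
newtype RankingModel = RankingModel (Map.Map Team Double)

instance Model RankingModel where
  winner (RankingModel points) (home, homeForm) (away, awayForm)
    | score away awayForm > score home homeForm = away
    | otherwise                                 = home  -- draws go to the home team
    where
      score team form = form * Map.findWithDefault 0 team points
```

With any pair of ratings where one side has more than 1.3 / 0.7 (roughly 1.86) times the points of the other, the stronger side wins regardless of form.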

The code below simulates the world cup - hopefully it makes sense as I've tried to be as self-documenting as possible! rules gives the knockout stage structure - it keeps getting "folded in half" as teams play each other and hopefully accurately represents the real fixtures of the world cup.
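The "folding in half" idea can be sketched generically. This is my reading of the description, not the original rules function: it assumes the number of teams is a power of two, and that in each round the first team plays the last, the second plays the second-to-last, and so on.

```haskell
-- Play one round: fold the list in half and play each pair.
playRound :: (t -> t -> t) -> [t] -> [t]
playRound match teams = zipWith match front (reverse back)
  where
    (front, back) = splitAt (length teams `div` 2) teams

-- Keep playing rounds until a single winner remains.
knockout :: (t -> t -> t) -> [t] -> Maybe t
knockout _ []       = Nothing
knockout _ [champ]  = Just champ
knockout match teams = knockout match (playRound match teams)
```

The match function is whatever decides a single game, so the same fold works for a real model or for a trivial stand-in such as max on a list of numbers.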

So who's going to win the World Cup according to the simulation? I ran for 100K simulations and got the following results:

France (124)

Argentina (365)

Greece (7)

England (239)

USA (4)

Germany (577)

Serbia (1)

Netherlands (3706)

Italy (2240)

Brazil (48025)

Portugal (5249)

Spain (39460)

Chile (3)

Unsurprisingly, this still has Brazil coming up winners almost half the time. The number of times each team wins is mostly related to the ranking (as you'd suspect). Changing the simulation slightly so that teams draw in the group stage if they have a rating within 25 points changes the results a little, but they are still broadly in line with the above.

Still, doesn't matter what the simulation says! England will still march-forward past the group stage and then lose in "unlucky" circumstances (penalties, that goal, penalties again).