Sunday, 29 August 2010

Speeding up the Ants program

In my previous post, I looked at implementing Ants in Haskell. As comments on the blog indicated, things started off OK, but performance rapidly got worse. This is the first time I've had this kind of problem before, so I thought I'd document how I started to solve it.

Haskell has a huge range of performance testing tools to help find out what goes wrong. Real World Haskell has a greater chapter on "Profiling and Optimization" that helps identify the tools available.

The first tool in the chain is simply to run the application and collect some basic statistics about the runtime. You can do this with the command ./AntsVis +RTS -sstderr. Anything after the "+RTS" is a Haskell runtime parameter. This produced the following output:


1,835,837,744 bytes allocated in the heap
328,944,448 bytes copied during GC
2,908,728 bytes maximum residency (25 sample(s))
142,056 bytes maximum slop
9 MB total memory in use (1 MB lost due to fragmentation)

Generation 0: 3483 collections, 0 parallel, 1.54s, 1.54s elapsed
Generation 1: 25 collections, 0 parallel, 0.09s, 0.07s elapsed

INIT time 0.00s ( 0.00s elapsed)
MUT time 3.04s ( 3.13s elapsed)
GC time 1.63s ( 1.61s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 4.67s ( 4.74s elapsed)

%GC time 34.9% (34.0% elapsed)

Alloc rate 603,893,994 bytes per MUT second

Productivity 65.1% of total user, 64.2% of total elapsed


Ouch... The program spent 35% of the time doing garbage collection and only 65% of the time doing useful stuff. It also allocated 603MB of memory / second! That's clearly not acceptable and I'm definitely allocating more memory than I should!

I lack the intuition to understand where the problems are from looking at the code, so time to break out the profiling program. GHC gives quite a few compiler options, but for profiling these seem to be the important ones:
  • -prof - Enables profiling
  • -caf-all - Constant Applicative form for all top-level items (constant costs, one for each module.)
  • -auto-all - Cost-centre analysis for every top-level function
After you compile with those functions, you can run the program again with ./AntsVis +RTS -hc -p to get heap information together with profiling. That gives me the following picture. Flat things are good, things going up at a steep gradient and bad.

First Run of the Profiling Tool
First Run of the Profiling Tool

As you might be able to see from the image (but click to see the full one) the finger is pointed squarely at the chain of functions involving updateTVar and evaporate, the code for which is shown below:



So how come this tiny inoffensive lump of code is allocating a fuck-ton of memory? The blame lies solely in the record update function (c {pheromone = pheromone c * evapRate}). Laziness means that this isn't evaluated fully, which means it keeps a reference to the previous value of the cell. I'm not interested in the value, but because there is a reference to it, the runtime can't though the value away. Time for some strictness annotations. Note that I've also done a few refactorings (the World data type was just a wrapper around Vector, so I got rid of the wrapper, and typing pheromone in 100 times was driving me nuts).



The seq primitive is of type a -> b -> b and evaluates the first argument to head-normal form and returns the second. Hopefully, this should force the evaluation of the cells and free the memory up immediately. Running with the profiler again gives a much better picture:
A little bit of strictness added

Much better! Maximum memory usage is significantly down and when I run it my CPU usage is significantly down. seq isn't a panacea though. Earlier attempts at adding bang patterns everywhere led me astray - if you make everything strict, then you can get unintended consequences (I was having problems with fromJust being called on Nothing - not entirely sure why). I think I'll stick to putting seq strict annotations based on the information from the profiling tools, at least I until I have a better intuition for how it will affect the final result.

Updates thanks to very helpful comments from augustss, don and gtllama!

seq is not the right way to achieve the strictness, see comments from Don Stewart about funbox-strict-fields and ! patterns for strictness.

I've removed the Criterion graphs as I don't think the way I generated them was reliable.  I believe my use of "seq" was completely bogus and laziness bit me. The faster times almost certainly came from not evaluating the expression correctly once seq was in place.  Rather embarrassing!. From the comments, "Computing a value of type STM () does not execute it. STM code is only executed if it "meets up with" an 'atomically' that is executed."

I also go to the bottom of the slowness updating. Most definitely not an STM bug. The problem occured because I was trying to do a single STM transaction to do evaporation - this meant that evaporation could only succeed if nothing else wrote over the pheromones used in the evaporation.  Since many ants are marching and dropping pheromones this seems to mean that this transaction would need to be retried many times.  Switching this over to use a sequence of atomic operations (one per cell) removed this problem. Performance is now consistent and when I run with profiling I get a flat heap output as shown below:

Flat as a Pancake
This was definitely the most complicated Haskell program I've attempted so far, but also the one I've learnt the most from!  Probably still some more mistakes in the code to find, but I've pushed the latest code to the repository.