Fatvat: August 2009

Sunday 30 August 2009

Some Haskell Data Structures

So far I've seen the list data structure, but Haskell also supports items like Maps and Sets.

An association list is a list of tuples of keys to values. For example:


  alist :: [(String,Double)]
  alist = [("pi", 3.14159265), ("e", 2.71828183), ("phi", 1.61803398874)]

  getConstant :: String -> Maybe Double
  getConstant name = lookup name alist

lookup is a prelude function that returns the value (if existing) for the supplied key. Association lists are useful for small lists, but the lookup time is O(N) in the size of elements. Enter Map which provides the same key/value abstraction but with efficient lookup. Data.Map provides operations for insertion, deletion and inspection of keys. For example:


  -- Given an association list, make a map
  Map.fromList aList

  -- Insert a new key/value into the empty map
  Map.insert "pi" 3.14159265 $ Map.empty

  -- Is a key in a map?
  Map.member key map

  -- lookup and findWithDefault allow you to find values.

$ is the function application operator - it's right associative so it means less brackets.

Similarly, Data.Set provides a functional implementation of sets.

We can put these together and change the implementation of the anagrams to create a map of search key to a set of results. This is hideously inefficient initially, but once the data structure is built up finding anagrams should be a little quicker.


  anagramList :: String -> IO (Map String (Set String))
  anagramList file = do
    filecontent <- readFile file
    return (foldl (\x y -> Map.insertWith Set.union (stringToKey y) (Set.singleton y) x)
                  Map.empty 
                  (filter validWord $ lines filecontent))

  anagramsOf :: String -> IO ()
  anagramsOf word = do
    anagrams <- anagramList wordfile
    putStrLn (show (Map.lookup (stringToKey word) anagrams))

Saturday 29 August 2009

IO And Haskell

All functions in Haskell are pure - how do you deal with side effects such as input/output?


  Prelude> :t putStr "hello world"      
  putStr "hello world" :: IO ()

The IO () means that this returns an IO action that has a result type of () (known as unit).

The do syntax can be used to glue actions together. For example:


  saygoodbye = do
    putStrLn "Hi - who are you?"
    name <- getLine
    putStrLn ("Goodbye Mr. " ++ name)

  *Main> saygoodbye
  Hi - who are you?
  Bond
  Goodbye Mr. Bond

  *Main> :t saygoodbye
  saygoodbye :: IO ()

<- performs the getLine application and binds the resulting value to name. Given that getLine has type IO String this gives name a type of String. IO types and normal types can't be mixed so "Die " ++ getLine is an illegal statement (IO String and String don't mix).

return is like the opposite of <- - it takes a pure value and constructs an action out of it. return is nothing like its use in other languages such as Java and C. return doesn't do anything with the execution path, code continues to the next line; return is purely used to construct actions.

So how'd you escape from your IO action? You don't... There's no escape!:

There's one final detail about IO actions that you should be aware of: there is no escape! The only way to get a result from an IO action is to invoke the IO action (through main) and have its result used to affect the outside world through another IO action. There is no way to take an IO action and extract just its results to a simple value (an inverse-return). The only places where an IO action's results appear unwrapped are within a do-block.

Let's try and put this all together to write a quick program that gives all anagrams of a word. The implementation idea is taken from Programming Pearls - load up a dictionary, sort the characters (so "banana" becomes "aaabnn"), then shove it all into an association list. Then given a word, apply the same sort and simply look up the associations.

Unix distros come with a word list file (/usr/share/dict), but it's full of words with punctuation and so on. We need to filter this list to remove invalid words, then build up an association list (list of tuples).


  import Data.Char
  import List

  wordfile = "/usr/share/dict/words"

  stringToKey :: String -> String
  stringToKey = sort.(map toLower)

  validWord :: String -> Bool
  validWord s = (not (null s)) && 
                length s <= 10 && 
                not (any (not.isAlpha) s)

  anagramList :: String -> IO [(String,String)]
  anagramList file = do
    filecontent <- readFile file
    return (map (\x -> ((stringToKey x),x)) (filter validWord (lines filecontent)))

  matchingKeys :: String -> [(String,String)] -> [String]
  matchingKeys k l = map snd (filter ((== k).fst) l)

  anagramsOf :: String -> IO ()
  anagramsOf word = do
    anagrams <- anagramList wordfile
    putStrLn (show (matchingKeys (stringToKey word) anagrams))

stringToKey is a function of one argument which converts a string to a key by making the string lower-case and then sorting the characters. readFile does exactly what is says on the tin (it's lazy too). lines is a built in function which breaks up a string into separate lines. And it seems to work first time too!


  *Main> anagramsOf "least"
  ["Stael","Tesla","least","slate","stale","steal","tales","teals"]

My current understanding (and it's all very foggy!) is that if a function uses IO actions then they will always boil up to the top level. main is the only place where these can be hidden. For example, I could put the loading of the anagram list in one place, use <- in main and then not have to riddle the rest of the program with IO actions. Good design isolates the IO in the smallest segment of the program possible.

Friday 28 August 2009

Testing Times

My last program didn't have anything in the way of tests, purely because I had no idea to write them in Haskell.

There seem to be two major test frameworks in Haskell. The first one, HUnit, is based on the xUnit family. Create test cases which assert various properties of the functions you're testing, bundle them into a test suite and run them.


  foo :: (Num a) => a -> a -> a -> a
  foo a b c = a * b + c

  test1 = TestCase (assertEqual "* has higher precedence" 26 (foo 2 10 6))

  tests = TestList [TestLabel "Foo test" test1]

  -- From the REPL
  *Main> runTestTT tests      
  Cases: 1  Tried: 1  Errors: 0  Failures: 0
  Counts {cases = 1, tried = 1, errors = 0, failures = 0}

Tests like this always feel a bit smelly - the only way to verify the test is to write the code again twice. Whilst measure twice cut once works for carpentry, it doesn't feel right for programming...

Enter an alternative testing framework, QuickCheck. The idea is simple, instead of testing arbitrary assertions about your code, specify the invariants associated with your function and let QuickCheck see if it can generate a failing test case.

As a simple example, let's say we write a function to add two numbers together:


  addNum :: (Num a) => a -> a -> a
  addNum a b = a + b

  prop_AddNum a b = (addNum a b) >= b && (addNum a b) >= a

We specify the invariant that if we add numbers together the result is bigger than either argument. Running Quick Check shows that this is (obviously wrong!) and gives an example set of arguments that fail the test.


  *Main> quickCheck prop_AddNum
  Falsifiable, after 5 tests:
  -1
  -2

The convention is that the invariants are usually specified as beginning with "prop_" in case you're wondering where the weird naming comes from.

QuickCheck generates random instances satisfying the types and validates the properties. Generators exist for the basic types and can be extended to your own.

Taking an example from the ray tracing functions we can specify an invariant that the distance between any two points is constant after a linear transform.


  square :: (Num a) => a -> a
  square x = x * x

  distance :: Point -> Point -> Float
  distance p1 p2 = sqrt(square ((x p1)-(x p2)) + square ((y p1)-(y p2)))

  prop_distance :: Point -> Point -> Float -> Float -> Bool
  prop_distance p1 p2 d1 d2 = 0.001 > abs (distance p1 p2 - 
                                           distance (Point ((x p1) + d1) ((y p1) + d2))
                                                    (Point ((x p2) + d1) ((y p2) + d2)))

Note that the abs is just to deal with rounding errors that occur when dealing with floating point types results from the square root. The code won't compile as is, because QuickCheck doesn't know how to generate Point objects. We can solve this problem by creating an instance of Arbitrary specialized (is that the right word?) for Point types.


  instance Arbitrary Point where
      arbitrary = do
        x <- choose(1,1000) :: Gen Float
        y <- choose(1,1000) :: Gen Float
        return (Point x y)

do is used to provide sequencing of statements. We can now run quickCheck and verify that the invariant holds.


  *Main> quickCheck prop_distance
  OK, passed 100 tests.

I'm still not quite understanding some aspects of this (e.g. why can't I write Point choose(1,1000) choose(1,1000) instead of sequencing?), but this is a pretty neat way of writing tests and definitely gives me further reason to try and understand Haskell in more depth.

Wednesday 26 August 2009

Haskell Arrays

Arrays in Haskell:

may be thought of as functions whose domains are isomorphic to contiguous subsets of the integers.

Arrays in Java:

An array object contains a number of variables. The number of variables may be zero, in which case the array is said to be empty. The variables contained in an array have no names; instead they are referenced by array access expressions that use nonnegative integer index values. These variables are called the components of the array. If an array has n components, we say n is the length of the array; the components of the array are referenced using integer indices from 0 to n - 1, inclusive.

This probably sums up the difference between the languages very well. Haskell says it in one sentence, whereas Java waffles a little bit more!

In Haskell arrays are constructed with the array constructor. The first argument specifies a pair of bounds so we can create an array mapping the integers 3, 4 and 5 to 10 times their value.


  *Main> array (3,5) [ (i,i*10) | i <- [3..5]]
  array (3,5) [(3,30),(4,40),(5,50)]

Arrays can also be multidimensional in which case more bounds need to be provided.


  *Main> (array ((0,0),(1,1)) [ ((i,j),i*2+j) | i <- [0..1], j <- [0..1]]) 
  array ((0,0),(1,1)) [((0,0),0),((0,1),1),((1,0),2),((1,1),3)]

I struggled to get the PPM package in Hackage working last time because I didn't understand arrays in the slightest. Now that I have a little understanding (only a little!) I can actually visualize the ray tracing code...

One thing I'm still not understanding is how I should format Haskell code. Here's the code to render the image and save it as a PPM.


image :: [Sphere] -> Point -> Int -> Int -> Array (Int,Int) Int 
image world eye width height = 
    array 
      ((0,0),(width,height)) 
      [((i,j),truncate (255 * (value (colorAt world eye (fromIntegral i) (fromIntegral j))))) | 
        i <- [0..width], j<- [0..height]]

imageWord16 :: Array (Int,Int) Int -> Array (Int,Int) Word16
imageWord16 image = fmap (fromIntegral :: Int -> Word16) image

saveImage :: String -> [Sphere] -> Point -> Int -> Int -> IO ()
saveImage filename world eye width height = arrayToFile filename (imageWord16 (image world eye width height))

Note the horrible usage of fromIntegral and truncate to convert an integer to a float and back again. I think this is because I should have been more general in my types on the Point data type and specified it as a number rather than a float.

The type of saveImage looks a little funny but this is because it returns an IO action rather than a value. This value is a monad, but again I'll ignore this and hope repeatedly using them will lead to understanding! For now, I just grok that its type indicates it does something rather than returns something.

The finished code weighs in at about 100 lines, which is more or less exactly the same as Clojure, but with the advantage of static typing. I found static typing to be in equal measure incredibly useful and incredibly frustrating! Hopefully it'll lean towards useful as I understand things a bit more.

Tuesday 25 August 2009

Some Ray Tracing Functions in Haskell

I'm still trying to find my way in Haskell, and I've always thought the best way is to implement some real code that does something. I thought I'd port across the little Clojure ray tracer example that I ported from ANSI Common Lisp.

The first version I wrote handily ignored type definitions, used tuples everywhere. I didn't write out the type signatures of my functions with the net result that they often didn't do what I thought. Lesson learnt. Haskell is a strongly typed language so to work with it, I should use types! I used the following types which are hopefully self explanatory.


data Point = Point { x :: Float
                   , y :: Float
                   , z :: Float
                   } deriving (Show)
 
data Sphere = Sphere { color :: Float
                     , radius :: Float
                     , centre :: Point
                     } deriving (Show)
 
data ObjectHit = ObjectHit { object :: Sphere
                           , location :: Point
                           } deriving (Show)
 
data Brightness = Brightness { value :: Float } deriving (Show)

deriving Show is is used to say that this type definition is printable and should be printed out in the default way. The names of the types inside can be used as accessor functions (e.g. x p gives you the x co-ordinate of Point p).

Firstly we need some basics functions to compute various number and vector properties.


square :: (Num a) => a -> a
square x = x * x
 
magnitude :: Point -> Float
magnitude p = sqrt ((square (x p)) + (square (y p)) + (square (z p)))
 
unitVector :: Point -> Point
unitVector p = let d = magnitude p
               in Point ((x p)/d) ((y p)/d) ((z p)/d)
 
pointSubtract :: Point -> Point -> Point
pointSubtract p1 p2 = Point (x p1-x p2) (y p1-y p2) (z p1-z p2)
 
distance :: Point -> Point -> Float
distance p1 p2 = magnitude (pointSubtract p1 p2)
 
sphereNormal :: Sphere -> Point -> Point
sphereNormal s p = unitVector (pointSubtract (centre s) p)
 
lambert :: Sphere -> Point -> Point -> Float
lambert s i r = let n = sphereNormal s i
                in max 0 ((x r * x n) + (y r * y n) + (z r * z n))

The definitions of the next set of functions is slightly more interesting.


minroot :: Float -> Float -> Float -> Maybe Float
minroot a b c
    | a == 0 = Just ((- c) / b)
    | otherwise = let disc = (square b) - (4 * a * c)
                  in if (disc > 0)
                     then Just (min (((-b) + sqrt disc) / (2 * a)) (((-b) - sqrt disc) / (2 * a)))
                     else Nothing
 
sphereIntersect :: Sphere -> Point -> Point -> Maybe ObjectHit
sphereIntersect s pt r = let c = centre s
                             n = minroot (square (x r) + square (y r) + square (z r))
                                         (2 * ((x r * (x pt - x c)) + (y r * (y pt - y c)) + (z r * (z pt - z c))))
                                         ((square (x pt - x c)) + (square (y pt - y c)) + (square (z pt - z c)) - (square (radius s)))
                         in if (isNothing n)
                            then Nothing
                            else Just (ObjectHit
                                       s
                                       (Point
                                        ((x pt) + (fromJust n) * (x r))
                                        ((y pt) + (fromJust n) * (y r))
                                        ((z pt) + (fromJust n) * (z r))))
 
spheresHit :: [Sphere] -> Point -> Point -> [ObjectHit]
spheresHit sw pt r = mapMaybe (\x -> sphereIntersect x pt r) sw
                                
nearestHit :: [Sphere] -> Point -> Point -> Maybe ObjectHit
nearestHit sp pt r = let hitSpheres = spheresHit sp pt r
                     in
                         case hitSpheres of
                           [] -> Nothing
                           x -> Just (head (sortBy
                                             (\h1 h2 -> (compare (distance (location h1) pt) (distance (location h2) pt)))
                                             x))

Maybe is a type that might be null (Nothing in Haskell). The type system enforces that you handle both cases. This is useful for solving the quadratic equation (minroot) because we can indicate that an equation has no solution without having to resort to either exceptions or picking a sentinel value and hoping it never occurs. Similarly, spheresIntersect returns the intersection of a ray and a sphere, but that intersection might never occur, hence we can return Nothing. mapMaybe only performs the map if the element exists and throws out the results of anything with Nothing as a value.

Apparently Maybe is a monad too. I'm deliberately trying to avoid understanding the deep meaning of monad and trying to use it instead. There's already too many articles about monads!

Finally we can perform the actual ray tracing with:



sendRay :: [Sphere] -> Point -> Point -> Brightness
sendRay world src ray = let hit = nearestHit world src ray
                  in if (isNothing hit)
                     then (Brightness 0)
                     else let sp = object (fromJust hit) in
                          (Brightness ((color sp) * (lambert sp src ray)))
 
colorAt :: [Sphere] -> Point -> Float -> Float -> Brightness
colorAt world eye x y = let ray = unitVector (pointSubtract (Point x y 0) eye)
                        in (Brightness (255 * value (sendRay world eye ray)))

Now we get to the slight downside of Haskell (at least for a beginner like me). I have my code, and I'm confident it is correct - how do I write this image out as a PNG? Hackage lists lots of graphics packages, but they are all quite heavy going to a newbie to understand. At the moment I'm trying to write out a PPM file (the approach taken here), but it feels painful. Clojure doesn't suffer from this problem because Java has a library for everything (and I know Java!). I'm sure this is a temporary roadblock though, so time for me to try and find a suitable package on Hackage and understand enough bits and pieces to be able to use it!

Thursday 20 August 2009

Hoogle

Hoogle is a search engine for the Haskell API. As well as being able to search plain function names, you can also search function signatures. It's really cool!

The manual details how to get Emacs integration working (as well as integration with GHCi).

Monday 17 August 2009

More Basics of Haskell.

Last time I had a whistle-stop tour of some of interpreter and basic syntax of Haskell, but completely dodged the types.

One thing I missed before was comments. A single line comment begins with two or more consecutive -'s. Nested comments use "{-" and "-}" to end blocks.

Names of types are capitalized, such as Num and Bool. The names of values are not typed. True and False are types, not values. The :: notation should be read as "has type". Haskell uses type inference to infer types which is why we didn't need to give type definitions before. Haskell function definitions are curried so a function of N arguments is treated as N functions of one argument, hence f x y z = x+y+z would have type f :: (Num a) => a -> a -> a -> a.

For example, consider the function add5 which is a poor man's map, permanently specialized to add 5 to each element in the list:


  add5 [] = []
  add5 (x:xs) = (x+5):add5(xs)

Without any type definitions added the compiler infers the types as add5 :: (Num a) => [a] -> [a]. This reads as a function where a is of type Num returning a list of a. If I wanted to write the type definition explicitly I could add add5 :: [Int] -> [Int] which will now restrict it to only mapping from Int to Int. Attempting to use this on any other type will be an error. The type signature itself is also checked by the compiler, so putting something nonsensical here will be caught at compile time.

Putting this all together we can write map as:


  mymap :: (x -> y) -> [x] -> [y]
  mymap f [] = []
  mymap f (x:xs) = f x:mymap f xs

Function application has high precedence so this minimizes the amount of parenthesis you have to use (that's a difficult habit to break from Lispy languages).

On a side note, one of the neat features is pattern matching with a guard. The guard is specified after the pattern with the | (pipe) symbol (I read this as "where"). In the example below, the guard clause means we can choose whether to include things in the filtered list or not.


  myfilter :: (x -> Bool) -> [x] -> [x]
  myfilter f [] = []
  myfilter f (x:xs) | f x = x:myfilter f xs
                    | otherwise =  myfilter f xs

Sunday 16 August 2009

The (very) basics of Haskell

(just some notes as I plunder online tutorials to get the information I want on the most basic parts of Haskell).

Haskell's REPL is the ghci tool. It works in a similar manner to Clojure's. The "Prelude" prefix indicates the module that we are currently in (similar to the namespace idea in Clojure). Prelude is the standard Haskell library consisting of standard definitions for various types and functions.

:? (or :help) gives help on the available commands in the REPL. The most useful I've found so far are:

:quit - exit a GHCI session

:info - tells you about a specific type


  Prelude> :info True
  data Bool = ... | True  -- Defined in GHC.Bool

:type - gives you the type of the supplied type


  Prelude> :type 1 + 2
  1 + 2 :: (Num t) => t

Haskell uses infix (compared to Lisp's prefix) notation (though you can enclose an operator with parenthesis and make it prefix (e.g. (+) 1 1 is 2). A type is named with an initial capital letter, for example Boolean values use True and False. C style &&, == and || operators are used and do exactly what you'd expect! != is written /=.

The list data structure in Haskell is created using square brackets ([]). The biggest difference between this and Lisp lists is that the items in the list must be of the same type. For example, lists of numbers are fine but mixing numbers and characters is forbidden (tuples are the solution to this).


  Prelude> :type [1,2]
  [1,2] :: (Num t) => [t]

  Prelude> :type [1,2.0]
  [1,2.0] :: (Fractional t) => [t]

  Prelude> :type [1.0,2.0]
  [1.0,2.0] :: (Fractional t) => [t]

  Prelude> :type [1,"a"] 
  :1:1:
    No instance for (Num [Char])
      arising from the literal `1' at :1:1

head and tail can be used to get (as you'd expect) the first and rest of the list (car / cdr) (head [1,2,3] ==> 1, tail [1,2,3] ==> [2,3]). Performing head/tail on an empty list raises an exception. ++ is for appending lists together ([1,2,3] ++ [4,5,6] ==> [1,2,3,4,5,6]). The number of items in a list can be determined using length. A range of numbers can be created with the .. notation (e.g. [1..5] ==> [1,2,3,4,5]). If the first two elements are provided then this gives the "step" for the range (e.g. [1,3..10] ==> [1,3,5,7,9]).

To define functions within GHCI, use let to introduce a name.


  Prelude> let myadd x y = x + y
  Prelude> myadd 7 14
  21

This doesn't scale very nicely (you have to use :{ and :} directives in order to have a multiple line function), so the preferred way is to put it in a file and use :load to bring it in. Note that the function is compiled as it is loaded.

Another key feature of Haskell is pattern matching. Multiple function definitions can be written and the body is executed for the pattern that it matches. For example, we can define our own length function as:


  mylength [] = 0
  mylength (x:xs) = 1 + mylength xs

This is a recursive function, so what happens to the stack? Running mylength [1..10000000000] results in a stack overflow, why's that? This is because of laziness and is covered in detail here. The problem is that because Haskell is a lazy language the computation is only evaluated when needed, this builds up a giant "thunk" (pending computations) for all the items in the list which in turn requires them to all be in memory. Laziness is hard to reason about - practise will make perfect.

List comprehensions provide a way for a list to be generated from a series of functions. In the example below <- is a generator function meaning use one of these values to substitute in for each value.


  Prelude> [(x,y) | x <- [1,2,3], y <- [4,5,6]]
  [(1,4),(1,5),(1,6),(2,4),(2,5),(2,6),(3,4),(3,5),(3,6)]

List comprehensions are evaluated left to right, so in the example above x is fixed until all values of y have been exhausted. Hence don't make y an infinite range otherwise it'll never get to the next value of x...

Next on the list, exploring the basics of types!

Time for a change!

Over the last 8 months or so, I've learned loads about functional programming and Clojure in particular. I've implemented some toy solutions to problems (Game of Life, Ray Tracing and so on). But one thing Clojure misses is strong typing and I don't know whether I should care about that or not!.

The Type System article on Wikipedia gives a good overview of the various type systems. For the toy examples I've written so far, I can honestly say that a lack of compile time checking wasn't a problem, however it seems only fair to explore the other side too. At least then I can contribute to the debates having seen both sides. So this isn't good bye to Clojure, it's peeking over the fence to see if the grass is greener (for me) on the static side.

In order to explore type systems I had a number of options including Scala, Objective Caml or Haskell. I decided to go for Haskell on the basis that it's different (it's not on the JVM) and it seems to have a healthy community behind it.

Haskell now has a standard distribution which includes a bunch of standard libraries for things like IO, Graphics and so on. Installing this on Ubuntu was dead simple following the instructions here.

Onwards into a brave new world!

JVisual VM Blogging Competition

Woo, I actually won something!

The slight downside is that a $50 American Express voucher is a little tricky to spend in the UK....

Friday 7 August 2009

Man or Boy Test

I have written the following simple routine, which may separate the 'man-compilers' from the 'boy-compilers'
— Donald Knuth

The Man or Boy Test was created to beat Algol 60 compilers into submission as at the time there were apparently few compilers which implemented recursion and non-local references properly.

Translating the code into Clojure proved pretty challenging!

This can be compared against the Lisp version on Rosetta Code. Implementing this required finding letfn which allows you do declare local functions that can refer to themselves (e.g. recursive).

The local function b refers to itself. This torturous structure means that b will appear as all the positions of x1 to x5 arguments during evaluation. b can only be evaluated when it is in position x4 or x5. At this point, the evaluation of b changes the variable k which should be reflected in the recursive call that follows AND any further evaluation within that call frame.

For this reason, k needs to be mutable, hence it's an atom. The first 16 values are shown below.


user> (map man-or-boy (range 16))    
  (1 0 -2 0 1 0 1 -1 -10 -30 -67 -138 -291 -642 -1446 -3250)

Now for the 17th value the stack overflows. I can't quite work out how to avoid that (yet).

Thursday 6 August 2009

Less Lazy about Understanding Laziness

A recent post by Uncle Bob made me realize how little I understand about laziness and how it affects the computational complexity of a solution. The quote below from the post sums it up well:

You don’t get the benefit of lazy sequences unless you use the lazy sequences. Creating a brand new one each time may generate the right answer, but it’s lazy, and can burn a lot of execution cycles.

Two macros in clojure.contrib.seq-utils help to understand this in more detail. rec-seq and rec-cat that are used to create sequences that refer to themselves without generating new copies each time.

You can use these macros to efficiently calculate values with lazy sequences. For example, the Lucas Numbers can be calculated as:


  user> (take 10 (rec-cat s [2 1] (map + s (next s))))
  (2 1 3 4 7 11 18 29 47 76)

So how do rec-seq and rec-cat hold onto the same copy of the lazy sequence without creating new ones? An atom (not sure why it's not a local var?) holds onto the current sequence for the whole function, swapping the value as necessary.

So the atom is set to the next value of the body (in this case map). Since the next value of the map expression is defined in terms of itself this sets up a recursive definition which, because of the atom and reset! uses the same lazy sequence for the whole duration.

I think that's going to take a little while to sink in...

Tuesday 4 August 2009

Arc's Accum Macro and Transient

Arc is Paul Graham's shot at writing a new Lisp. Recently there was a comment on Y Combinator about how Clojure's transients would enable writing an efficient version of Arc's accum macro. Original comment here.

Having never used Arc, I looked up the definition in the latest release of the source.

The code is very readable and it's clear to see that this creates an accumulator (with a unique symbol) and names the symbol the user provides as an function that pushes onto the accumulator. This means that you can write simple functions that iterate over data and just keep accumulating results. For example, the functions to get the keys and the maps of map are defined as iteration over the key value pairs and pushing the values into the accumulator.

So, how do we translate this into Clojure, and how do transients help?

The first version I did had a subtle bug in it; relying on the fact that the accumulators identity is persistent. This is an implementation detail and shouldn't be relied on. This unfortunately results in the messy with-local-vars. This is definitely very verbose compared to the Arc code! Is there a better way to do it?

The definition is (obviously) very similar to Arc's code (they have common ancestry). The reader macro for unique symbol definition (trailing #) feels more concise than the w/uniq of Arc (but perhaps I'm missing something?).

Once you've seen this code it becomes clear why transient data structures help. Without them, it'd be inefficient otherwise, creating additional copies of the structure. This way around we're making minimal changes to the data and creating less garbage.

Even from reading a tiny bit of Arc it's clear there's lots to learn from the source and ideas of structuring code / macros. Time to do learn some more Arc me thinks!

Monday 3 August 2009

Clojure Transients

Clojure has recently added a new of concurrency primitive, transients. A transient collection provides you with a way of doing controlled mutation for efficiency. The rationale sums it up well:

If a tree falls in the woods, does it make a sound?

If a pure function mutates some local data in order to produce an immutable return value, is that ok?

The goal is to tackle situations in which you are mutating data internally. As an example, consider a version of map that works on vectors. Internally you build up a new vector, constructed piece meal from the old vector. Something like this:

Now the problem with this is that I'm creating lots of garbage. Under the hood the garbage collector is having to create and destroy many vector objects (one per loop). This overhead can be significant.

One of the key aspects of functional programming is referential transparency. Given the same arguments to the same function, you'd expect to get the same results each time. As long as we don't break referential transparency, we can still consider the program in functional terms.

Transient allows you to do this in Clojure. Writing the transient version of the map function is simple and it follows exactly the same structure as before.

transient constructs a mutable version of a collection in O(1) time. A transient collection can only be used with supported functions. To turn it back into a standard list, you use persistent!.


  user> (map inc transient [1 2 3 4])
  ; Evaluation aborted.
  user> (map inc (persistent! (transient [1 2 3 4])))
  (2 3 4 5)

Once a collection has been persisted, it can no longer be mutated.


  user> (let [a (transient [1 2 3])]
          (conj! a 4)
          (persistent! a)
          (conj! a 5))
  ; Evaluation aborted.
  ;   Mutable used after immutable call

Transient collections are only mutable from the thread that created them. Enforcing thread isolation eliminates a large source of errors. A transient can be mutable for however long you like, without locking, until you call persist!. Compare this to Hashtable and Vector in Java, where each individual method is synchronized.

So how does this affect performance? Using (vector-map-slow inc (vec (range 1000000))) takes about 550 milliseconds on my box (averaged after a few runs). In comparison, using vector-map-transient takes about 300 ms. That's a huge improvement for a micro-benchmark.

There's a Paul Graham quote somewhere about Lisp being two languages, one for writing code fast, and one for writing fast code. I think that's now becoming true for Clojure too. Work purely functionally in the prototyping stage, and if you do need the performance this is another tool in your arsenal.

Saturday 1 August 2009

Iterated Function Systems

An Iterated Function System (IFS) is a technique for creating fractal images. The name comes from the simple way in which they work. Start at an arbitrary point and colour it in, perform a transformation of that point to some new point. Colour that in and repeat until done. The translation function typically chooses between one of any number of choices. This is also called the Chaos Game.

The classic example if the Sierpinski Triangle (which incidentally has a much cooler Clojure implementation [using a different technique] in 3d here). My 2D rendering kind of pales in comparison!

This flavour of IFS is generated using a simple two dimensional affine transform. (this limits it to scale, rotation and translation, but not shearing transforms). The equation for this is shown below:

An implementation of this in Clojure is shown below. The formula above is expanded out in calculate-point

The second example draws a simple fern leaf pattern known as Barnsley's Fern.

The code to do the rendering uses a single agent to provide the animation (by repeatedly setting individual pixels). It's a bit yucky because I've no idea how to check the bounds of an arbitrary set of functions, so I make a guess by checking the first 10000 values to get an idea of min/max (get-bounds). The ugliness is compounded by the floor, abs and inc strategically located which get around the rounding errors. (big thanks to Hoeck from #clojure who suggested using take-while instead of the hideous gumph I had before which threw an exception to abort).

Obviously the time taken is dependent on how many equations are used, but images are generated very quickly. The get-bounds function on my box computes 10000 iterations of the Sierpinski set in 30 ms!

Full code available here in the misc folder.