Monday 3 August 2009

Clojure Transients

Clojure has recently added a new of concurrency primitive, transients. A transient collection provides you with a way of doing controlled mutation for efficiency. The rationale sums it up well:

If a tree falls in the woods, does it make a sound?

If a pure function mutates some local data in order to produce an immutable return value, is that ok?

The goal is to tackle situations in which you are mutating data internally. As an example, consider a version of map that works on vectors. Internally you build up a new vector, constructed piece meal from the old vector. Something like this:

Now the problem with this is that I'm creating lots of garbage. Under the hood the garbage collector is having to create and destroy many vector objects (one per loop). This overhead can be significant.

One of the key aspects of functional programming is referential transparency. Given the same arguments to the same function, you'd expect to get the same results each time. As long as we don't break referential transparency, we can still consider the program in functional terms.

Transient allows you to do this in Clojure. Writing the transient version of the map function is simple and it follows exactly the same structure as before.

transient constructs a mutable version of a collection in O(1) time. A transient collection can only be used with supported functions. To turn it back into a standard list, you use persistent!.

user> (map inc transient [1 2 3 4])
; Evaluation aborted.
user> (map inc (persistent! (transient [1 2 3 4])))
(2 3 4 5)

Once a collection has been persisted, it can no longer be mutated.

user> (let [a (transient [1 2 3])]
(conj! a 4)
(persistent! a)
(conj! a 5))
; Evaluation aborted.
; Mutable used after immutable call

Transient collections are only mutable from the thread that created them. Enforcing thread isolation eliminates a large source of errors. A transient can be mutable for however long you like, without locking, until you call persist!. Compare this to Hashtable and Vector in Java, where each individual method is synchronized.

So how does this affect performance? Using (vector-map-slow inc (vec (range 1000000))) takes about 550 milliseconds on my box (averaged after a few runs). In comparison, using vector-map-transient takes about 300 ms. That's a huge improvement for a micro-benchmark.

There's a Paul Graham quote somewhere about Lisp being two languages, one for writing code fast, and one for writing fast code. I think that's now becoming true for Clojure too. Work purely functionally in the prototyping stage, and if you do need the performance this is another tool in your arsenal.