Friday 29 May 2009

JVisualVM and Clojure

JVisualVM is a Java Virtual Machine monitoring tool that gives you a live view of the memory, CPU and threads of a running JVM. In this post, I'll look at how easy it is to profile Clojure code using JVisualVM.

First step is to get jvisualvm installed. If you've got a recent JDK then it's already available in the bin directory of the JDK install. If not, then upgrade to the latest JDK here. Once installed, fire it up and you should see the following.

jvisualvm startup screen

The left hand side shows the applications you can connect to. You should see a VisualVM instance that allows you to monitor the VisualVM itself. Very meta. The local applications will include all locally running Java applications. Since Clojure itself is a Java application we can fire up a REPL and use that to profile code.

Firstly we'll look at garbage collection by using the memory profiler. People often associate immutable data structures (such as Clojure's) with poor performance, since every time we modify a data structure we are actually creating a completely new structure (in reality that's not quite true: structure that is unchanged is often shared). Using JVisualVM we can profile how fast the JVM can allocate objects. Let's use a really simple example which sums the numbers from 0 to 100K. Without JVisualVM attached this takes about 40ms. In those 40 milliseconds we create at least 100K objects.


user> (time (reduce + (range 100000)))
"Elapsed time: 40.677795 msecs"
4999950000


If we instrument this we can see that 165K java.lang.Integer objects are created, together with 35K java.lang.Long objects. In this example, then, we're creating roughly 4 million objects per second! Pretty impressive (I realize this is totally non-scientific, I'm just aiming to get a feel for the numbers!).

REPL profile
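As a rough point of comparison (my addition, not part of the original experiment), here's the same sum written as an explicit loop/recur rather than a reduce over a lazy range. Whether the loop locals end up boxed depends on the Clojure version, which is exactly the kind of thing the memory profiler can confirm:

;; Sketch only: sum 0..n-1 with loop/recur instead of reducing over a range.
(defn sum-to [n]
  (loop [i 0 acc 0]
    (if (< i n)
      (recur (inc i) (+ acc i))
      acc)))

user> (time (sum-to 100000)) ; compare the allocations under the memory profiler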

Another area in which you can apply JVisualVM is looking for optimizations and monitoring threads. In one of my previous posts, I looked at Mandelbrot fractal generation. We can use jvisualvm to understand the performance characteristics of the application. The only interesting part of the code for this experiment is shown below:



(defn calculate-pixels []
  (let [pixels (range 0 (* *width* *height*))]
    (pmap (fn [p]
            (let [row (rem p *width*) col (int (/ p *height*))]
              (get-color (process-pixel (/ row (double *width*)) (/ col (double *height*))))))
          pixels)))


Running this with a single-threaded map instead of pmap gives us the following CPU information.

Single-threaded

The time is dominated by mathematical operations, which is what you'd expect. You'd also (naively) expect that using multiple threads would speed things up, since the operations on each pixel are independent. What happens if we switch to pmap?

Using multiple threads

There's a big change: the time is now dominated by ThreadPoolExecutor$Worker. My guess is that we're context switching far too much. Using jvisualvm we can get a view of what the threads were doing:

Lots of threads

The image shows that the threads spent a lot of time waiting and not a lot of time doing processing. Opportunities for simultaneous processing were few because of the time spent spawning off tiny amounts of work.

Let's partition off chunkier bits of work in an effort to make the CPU work harder for its money. In the function below I use partition to break the set of pixels to render into 16 discrete chunks (a number chosen at random). We use pmap again, but this time it's got chunkier bits of work to do. doall is used to force the evaluation of the inner map.



(defn calculate-pixels-2 []
  (let [n (* *width* *height*)
        work (partition (/ n 16) (range 0 n))
        result (pmap (fn [x]
                       (doall (map
                               (fn [p]
                                 (let [row (rem p *width*) col (int (/ p *height*))]
                                   (get-color (process-pixel (/ row (double *width*)) (/ col (double *height*))))))
                               x)))
                     work)]
    (doall (apply concat result))))
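As a quick aside (my addition, not from the original post), this is all partition is doing; each worker now gets a long run of pixels rather than a single one:

user> (partition 4 (range 12))
((0 1 2 3) (4 5 6 7) (8 9 10 11))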


The difference this makes is very visible. The graphic below shows two runs. Each time we can see that the threads are fully busy, not blocked and able to fully utilize the CPU. The worker threads are indicated at the bottom in the image below.

Full use of threads when using pmap

So using JVisualVM helped quantify various improvements. Whilst premature optimization is the root of all evil, if you are going to optimize then it should be guided by a tool rather than your intuition (which, if you are anything like me, is more often than not crap).

Monday 25 May 2009

Neural Networks and Clojure

Artificial neural networks (ANNs) attempt to "learn" by modelling the behaviour of neurons. Although neural networks sound cool, there is no magic behind them!

Invented in 1957 by Frank Rosenblatt, the single-layer perceptron (SLP) is the simplest type of neural network. It is able to act as a binary classifier for any linearly separable data set.

Single Layer Perceptron graphic from Wikipedia

The SLP is nothing more than a collection of weights and an output value. The Clojure code below allows you to create a network (initially with zero weights) and get a result from the network given some weights and an input. Not very interesting.



(defn create-network
  [in]
  (repeat in 0))

(defn run-network
  [input weights]
  (if (pos? (reduce + (map * input weights))) 1 0))
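As a quick illustration (mine, not from the original post), run-network is just a weighted sum pushed through a threshold. With hand-picked weights of [0 1 0] it only fires when the second input is active:

user> (run-network [1 0 0] [0 1 0])
0
user> (run-network [0 1 0] [0 1 0])
1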



The clever bit is adapting the weights so that the neural network learns. This process is known as training and is based on a set of data with known expectations. The learning algorithm for SLPs is shown below. Given an error (1, 0 or -1 in this case), adjust the weights based on the size of the inputs. The learning-rate decides how much to vary the weights; too high and the algorithm won't converge, too low and it'll take forever to converge.



(def learning-rate 0.05)

(defn- update-weights
  [weights inputs error]
  (map
   (fn [weight input] (+ weight (* learning-rate error input)))
   weights inputs))
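To make the rule concrete, here's a hypothetical single step (my example, evaluated in the namespace where update-weights is defined, since it's private). Starting from zero weights with input [1 0 0], an expected output of 1 and an actual output of 0, the error is 1 and only the weight attached to the active input moves:

user> (update-weights [0 0 0] [1 0 0] 1)
(0.05 0.0 0.0)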


Finally, we can put this all together with a simple training function. Given a series of samples and the expected values, repeatedly update the weights until the training set is empty.



(defn train
  ([samples expecteds] (train samples expecteds (create-network (count (first samples)))))
  ([samples expecteds weights]
     (if (empty? samples)
       weights
       (let [sample (first samples)
             expected (first expecteds)
             actual (run-network sample weights)
             error (- expected actual)]
         (recur (rest samples) (rest expecteds) (update-weights weights sample error))))))



So we have our network now. How can we use it? Firstly, let's define a couple of data sets, one linearly separable and one not. jiggle adds some random noise to each sample. Note the cool #() syntax for a short function definition (I hadn't seen it before).



(defn jiggle [data]
  (map (fn [x] (+ x (- (rand 0.05) 0.025))) data))

(def linearly-separable-test-data
  [(concat
    (take 100 (repeatedly #(jiggle [0 1 0])))
    (take 100 (repeatedly #(jiggle [1 0 0]))))
   (concat
    (repeat 100 0)
    (repeat 100 1))])

(def xor-test-data
  [(concat
    (take 100 (repeatedly #(jiggle [0 1])))
    (take 100 (repeatedly #(jiggle [1 0])))
    (take 100 (repeatedly #(jiggle [0 0])))
    (take 100 (repeatedly #(jiggle [1 1]))))
   (concat
    (repeat 100 1)
    (repeat 100 1)
    (repeat 100 0)
    (repeat 100 0))])


If we run these in the REPL we can see that the results are perfect for the linearly separable data.


> (apply train linearly-separable-test-data)
(0.04982859491606148 -0.0011851610388172009 -4.431771581539448E-4)

> (run-network [0 1 0] (apply train linearly-separable-test-data))
0

> (run-network [1 0 0] (apply train linearly-separable-test-data))
1


However, for the non-linearly separable data they are completely wrong:


> (apply train xor-test-data)
(-0.02626745010362212 -0.028550312499346104)

> (run-network [1 1] (apply train xor-test-data))
0

> (run-network [0 1] (apply train xor-test-data))
0

> (run-network [1 0] (apply train xor-test-data))
0

> (run-network [0 0] (apply train xor-test-data))
0


The neural network algorithm shown here is really just a gradient descent optimization that only works for linearly separable data. Instead of calculating the solution in an iterative manner, we could have just arrived at an optimal solution in one go.

More complicated networks, such as the multi-layer perceptron, have more classification power and can work for non-linearly separable data. I'll look at them next time!

Friday 22 May 2009

How to improve coding

Reddit has gone downhill (it's almost as bad as Digg), but there are still some insightful comments. Take this answer regarding how to become better at programming:





  • "Buy lots of cans of paint and start doodling. Only artsy types can be true hackers.

  • Learn everything there is to know about manhole covers. This shows you have vital knowledge outside of programming.

  • Study to be an architect, find patterns everywhere and name them. Write a programming book about it and bask in the fame.

  • Become proficient in programming languages you will never use seriously. Only the best can afford to waste time so.

  • Keep up with the fads of the day while holding on to religious tenets of the past. You will be flexible yet dependable.

  • Read a book about motorcycle maintenance and see its deep connections to programming. You will seem mysterious and wise.

  • Know deep in your heart that you're in the top 10% of all programmers. Superiority is a state of mind.

  • When you once get the hang of programming, you know everything. There is no elephant, only tree trunks all alike."



This touches on most of the memes regarding improving as a programmer that I've seen: Hackers and Painters, crazy interview questions, read SICP, extreme programming and the 10x gap in performance.

    Monday 18 May 2009

    Clojure and Robocode

Robocode is an educational programming game originally provided by IBM. Users write code to control miniature tanks in battles with other user-written robots. Each tank has the same components but different programming: users write code to control the movement, targeting and radar features of their robot.

    Robocode battle!

    Robocode uses Java as the programming language, and robots are shared by packaging them into Jar files. This means we should, with very little effort, be able to use Clojure to write a robot. Start by downloading the latest version of Robocode, available here.

    The first fly in the ointment is that Robocode restricts the use of third-party JARs. If you try to just add clojure.jar to the class path you'll get an error message like this:


    Preventing uk.co.fatvat.robot.FatRobot (1) from access: (java.io.FilePermission clojure.jar read):
    You may only read files in your own root package directory.
    SYSTEM: An error occurred during initialization of uk.co.fatvat.robot.FatRobot (1)
    SYSTEM: java.lang.ExceptionInInitializerError


To fix this, edit the startup script (robocode.bat or robocode.sh, depending on whether you are running Windows or not) and disable the security manager by adding -DNOSECURITY=true to the startup line. This means there are no restrictions on what the robot can do; the security manager is in place in case you are a sentient robot-killing machine. Be warned.

    Without further ado, here's the most basic robot imaginable, borrowed entirely from My First Robot and converted over to Clojure. This robot moves back and forth, turns the gun around and fires at anything that looks at him slightly funny.


(ns uk.co.fatvat.robot.FatRobot
  (:gen-class :extends robocode.Robot))

(defn -run
  "Infinite loop whilst robot is alive"
  [robot]
  (doto robot
    (.ahead 500)
    (.turnGunRight 360)
    (.back 500))
  (recur robot))

(defn -onScannedRobot
  [robot event]
  (doto robot
    (.fire 1)))


    It's not exactly the most exciting robot in the world and even loses against a robot that sits still and turns the turret around! How can we make it a little bit cleverer?

    The robot below extends from AdvancedRobot which allows non-blocking calls, writing to the file system and custom events.


(ns uk.co.fatvat.robot.NotQuiteAsBad
  (:import (java.awt Color)
           (robocode Rules)
           (robocode.util Utils))
  (:gen-class :extends robocode.AdvancedRobot :init create-robot :state state))


This shows two bits of the :gen-class directive I haven't used before. :init allows you to specify a constructor routine. The return value of this function is unusual: it is always a vector with two elements, the first representing the arguments to the superclass constructor (in this case empty) and the second representing any state that the object needs. The :state keyword lets you name the method used to access that state.


(defstruct target-details :distance :bearing :energy :velocity)

(defn -create-robot
  "Robot records a list of events and performs actions based on these observations"
  []
  [[] (ref [])])


The constructor function simply returns an empty vector of superclass arguments together with a reference to the state, which starts off empty. We also define a structure representing a sighting of the enemy. These sightings will be logged in our state and will be used to determine the strategy.


(defn -run
  "Infinite loop whilst robot is alive"
  [robot]
  (setup-robot robot)
  (loop [x 1] ;; TODO is there a better idiom for an infinite loop?
    (walk robot)
    (recur 1)))

(defn -onScannedRobot
  [robot event]
  (let [distance (.getDistance event)
        name (.getName event)
        energy (.getEnergy event)
        velocity (.getVelocity event)
        bearing (.getBearing event)]
    (dosync
     (alter (.state robot) conj (struct target-details distance bearing energy velocity)))
    (attack robot)))


onScannedRobot now records the details of the last observation: dosync sets up the transaction, and alter applies the given function (conj) with the given arguments.


(defn- setup-robot
  "Ensure robot looks pretty"
  [robot]
  (doto robot
    (.setAdjustRadarForGunTurn true)
    (.setColors Color/RED Color/BLACK Color/RED)))

(defn- attack
  "Based on the accrued events, hurt robots"
  [robot]
  (let [latest (last @(.state robot))]
    (.turnRight robot (get latest :bearing))
    (when (zero? (get latest :velocity))
      (.fire robot 3))
    (.setTurnRadarRight robot 360)))

(defn- walk
  "Go for a walk around the outside of the building"
  [robot]
  (let [x (mod (.getHeading robot) 90)]
    (.ahead robot 50)
    (when (not (zero? x))
      (.turnLeft robot x))
    (when (zero? (.getVelocity robot))
      (.turnRight robot 90))))


    walk simply makes the robot run around the outside of the arena. This is heavily based on the example code here. attack just waits for a stationary robot, turns and fires!

    This at least allows me to beat the Fire default robot 10-0 (most of the time!).

    There are a load more strategies and patterns available on the Robocode Wiki.

This was quite a useful learning exercise because I found out about :init and :state. One problem I ran into was that :init doesn't allow you to call methods on the object that is about to be constructed. This was fixed recently, and now there is a corresponding :post-init hook.

    You can find all the code on GitHub.

    Sunday 17 May 2009

    Back of the Envelope

    "Programming Pearls" is a fantastic book by John Bentley. It covers a wide range of topics from low-level techniques on space-saving to high-level advice on problem solving.

My favourite chapter is called "The Back of the Envelope", which gives great advice on making quick estimates for real problems. The important thing about these rules of thumb is that they are about providing an accurate guess, quickly.

    Enrico Fermi was a master of these back of the envelope calculations. Legend has it that at the testing of a nuclear weapon as part of the Manhattan project, Fermi estimated the yield of the explosion by tossing small pieces of paper in the air and seeing how far they travelled.

    Estimation is also a popular interview question. The routine is that the interviewer gives you an odd-ball question (how many gas stations are there in the UK?) and then wants to trace your back of the envelope calculation through (e.g. there are 70 million people in the UK, about 1 in 5 has a car [and so on!]).

Here are some useful bits of knowledge to have for estimation problems. Obviously, knowing some maths helps too!

    Rule of 72


If you invest a sum of money for y years at an interest rate of r percent, then if r * y = 72 your money will roughly double. For example, 9 years at 8 percent interest will double your money; likewise, $2000 invested for 8 years at 9% yields about $3985.
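That second example is easy to sanity-check at the REPL (my sketch, not from the original post):

user> (* 2000 (Math/pow 1.09 8))  ; 8 years at 9% => roughly 3985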

    Pi seconds is a nanocentury


There are about 3.155 × 10^7 seconds in a year, or, as expressed by Tom Duff, pi seconds is a nanocentury. For example, how long will it take you to execute 500 million runs of a simulation, where each run takes 1/8 of a second? 500M × 0.125 is about 62.5 million seconds, and 62.5 × 10^6 divided by 3.14 × 10^7 is going to be about 2 years.

    Of course you could just plug this into Wolfram Alpha and see that it comes out as 1 year, 11 months, 23 days and 20 hours. Near enough! Couple this with the timings for various operations (see Teach Yourself Programming in Ten Years) and you have a technique for estimating expected timing of an operation.
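The same arithmetic is easy to replay in Clojure (illustrative only, my addition):

user> (/ (* 500e6 0.125) 3.155e7)  ; seconds of work / seconds in a year => about 1.98 years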

    Little's Law



The average number of things in the system is the product of the average rate at which things leave the system and the average time each one spends in the system. We can use this to model queue sizes (and not just in computers, as the Wikipedia article shows). For example, let's say we're writing a video-processing system. If we process 10 videos per hour, and each one takes 30 minutes to be processed, then the average number of videos in the system at any one time is about 5.
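Or, trivially, at the REPL (my sketch):

user> (* 10 0.5)  ; arrival rate (per hour) * time in system (hours) => videos in flight on average
5.0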

    Pareto Principle



80% of the effects come from 20% of the causes. This applies to a wide variety of problems, such as optimization (profile the code to find the slowest 20% of functions and you'll get most of the gain), software design (design for the first 20% and 80% of your users will be happy) and more.

    Amdahl's law



Amdahl's law states that the speed-up of a parallel program is limited by the sequential part of the program. For example, suppose you have a program that takes 30 months to run in its current form, and you've estimated it to be 50% serial and 50% parallel. Can you finish it in 6 months with 8 cores? (No: the serial part alone takes 15 months.) How many cores do you need to finish it in 18 months? (Answer: 5.)
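Here's a tiny helper to replay that example (my sketch, not from the post; it assumes the parallel part scales perfectly with the number of cores):

(defn amdahl-months
  [total serial-fraction cores]
  (let [serial (* total serial-fraction)
        parallel (- total serial)]
    (+ serial (/ parallel cores))))

user> (amdahl-months 30 0.5 8)  ; => 16.875, so 6 months is impossible
user> (amdahl-months 30 0.5 5)  ; => 18.0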

    Does anyone have any more to suggest?

    Saturday 16 May 2009

    Using Clojure and Ext/JS

    Ext/JS is a popular framework for building Ajax interfaces. The model for Ajax applications is dead simple. Build a UI using JavaScript, and populate it using JSON retrieved from the server (I'm pretty sure there's more to it, but I like simple!).

JavaScript and Lisp languages should go together pretty well; Doug Crockford has described JavaScript as Lisp in C's clothing.

    I haven't done enough JavaScript programming to be dangerous just yet, so I'll start with something simple from the Ext tutorial. The goal will be to display a data-grid populated by stuff retrieved using the persistence API I looked at previously. Here's the end goal:

    Ext Datagrid populated with data from Clojure.

Firstly we need to write a servlet that will retrieve all the stories. If we were going to do this properly we'd parametrize this with restrictions (find me all stories between X and Y, or retrieve the top 100 stories). I'm confident that no-one will ever use this site, so I'll just go for brute force and retrieve everything!

Using the persistence API, we prepare a query object to find everything matching the given type, and lazily convert (map) each element using entity-to-map:


(defn get-all
  [type]
  (let [query (Query. (str type))
        service (DatastoreServiceFactory/getDatastoreService)]
    (map entity-to-map (.asIterable (.prepare service query)))))


Now that we can retrieve stories, we need to squirt them back as JSON. There's a package, clojure-json, that does this for you, but I (foolishly) decided to do it the quick and dirty way and print out a string!


(ns news.liststory
  (:use (news appengine))
  (:gen-class :extends javax.servlet.http.HttpServlet))

(defn story-to-json
  [s]
  (str "{\"title\":\"" (get s "title") "\"" ",\"body\":" "\"" (get s "body") "\"},"))

(defn -doGet
  [_ request response]
  (.setContentType response "text/xml")
  (let [w (.getWriter response) stories (get-all "story")]
    (.println w (str "{\"totalCount\":" (count stories) ",\"stories\":["))
    (doseq [story stories]
      (.println w (str \tab (story-to-json story))))
    (.println w (str "]}"))))


So what we're aiming to print out is a JSON representation of the stories that looks a little like this:


    {"totalCount":3,"stories":[
    {"title":"Home of the Clojure Language","body":"http://www.clojure.org/"},
    {"title":"Jeff's web page, full of rubbish!","body":"http://www.fatvat.co.uk"},
    {"title":"Wikipedia on Clojure","body":"http://en.wikipedia.org/wiki/Clojure"},
    ]}
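One obvious fragility with the hand-rolled approach (my aside, not from the original code) is that a title containing a quote breaks the output. Even without pulling in clojure-json, a tiny helper to escape things before splicing them in would help:

;; Hypothetical helper: escape backslashes and double quotes before building
;; the JSON by hand. A proper JSON library makes this unnecessary.
(defn escape-json [s]
  (-> (str s)
      (.replace "\\" "\\\\")
      (.replace "\"" "\\\"")))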


Finally, we need something to render this. I took an example as a starting point and ended up with the code below. (Note that it references localhost because I'm running off a local development environment.)



Ext.onReady(function(){
    var store = new Ext.data.JsonStore({
        root: 'stories',
        totalProperty: 'totalCount',
        idProperty: 'storyId',
        fields: [
            'body', 'title'
        ],
        proxy: new Ext.data.HttpProxy({
            url: 'http://localhost:8080/liststory?'
        })
    });

    var grid = new Ext.grid.GridPanel({
        el: 'story-grid',
        title: 'Clojure News!',
        store: store,
        loadMask: true,
        height: 400,
        columns:[{
            header: 'Link',
            dataIndex: 'body'
        },{
            header: 'Description',
            dataIndex: 'title'
        }]
    });

    grid.render();
    store.load({});
});


That was pretty painless to build. The painful bit was writing out the JSON as a string. Lesson learnt: use a library (or build a decent abstraction).

I'm not sure I like mixing my languages; I'd really like a way to have Lisp goodness across the whole stack. One potential option for this is to use Google Web Toolkit. GWT compiles Java into cross-platform JavaScript, so I could (probably) have Clojure compile to Java which in turn is compiled to JavaScript. That sounds fun!

    Friday 15 May 2009

    Package Management in Emacs

The Emacs Lisp Package Archive (ELPA) is an add-on manager for Emacs. Hadn't seen it before. It's incredibly simple to install.


(let ((buffer (url-retrieve-synchronously
               "http://tromey.com/elpa/package-install.el")))
  (save-excursion
    (set-buffer buffer)
    (goto-char (point-min))
    (re-search-forward "^$" nil 'move)
    (eval-region (point) (point-max))
    (kill-buffer (current-buffer))))


Evaluate this in Emacs (v22 or greater); the default key to evaluate is Ctrl+J. This will retrieve the el file and then evaluate it. Restart (or re-evaluate your .emacs file) and then M-x package-list-packages RET will give you a list of available packages, and M-x package-install RET followed by the name of a package will download and install it.

    I used this to get the highlight-parentheses package (useful when you program in Lisp!). Other cool add-ons include:

    Wednesday 13 May 2009

    Data Persistence in GAE with Clojure

If you want to persist stuff in Java, you've got a bewildering amount of choice.

    There's even an entire book about making the right decision! (Persistence in the Enterprise)

Google App Engine has gone with JDO, using the DataNucleus platform. In GAE this is split again into two APIs: a high-level one for persisting objects, and a lower-level one which allows you to persist raw data.

When using Clojure it makes more sense to go with the lower-level API. The higher-level one would require annotations on objects, which isn't supported at the moment in Clojure (as far as I know!).

    So how do we store data in GAE? The example below saves a story to disk (as a learning exercise, I'm writing a quick and dirty reddit clone).


(ns news.savestory
  (:use (news appengine))
  (:gen-class :extends javax.servlet.http.HttpServlet)
  (:import (com.google.appengine.api.datastore DatastoreServiceFactory Entity Key Query)))

(defn store
  [data type]
  (let [entity (Entity. (.toString type))]
    (doseq [[k v] data]
      (.setProperty entity (.toString k) v))
    (.put (DatastoreServiceFactory/getDatastoreService) entity)
    (.getKey entity)))

(defn -doGet
  [_ request response]
  (let [body (.getParameter request "storyLink")
        title (.getParameter request "storyTitle")]
    (let [w (.getWriter response)]
      (.println w (store {:body body :title title} :story)))))


store takes a map and a type, persists it in the datastore and returns the key associated with the new entity. Visiting the servlet's URL with storyLink and storyTitle parameters persists that data and prints back the key.

    Retrieving the data is much the same.


(ns news.viewstory
  (:use (news appengine))
  (:gen-class :extends javax.servlet.http.HttpServlet)
  (:import (com.google.appengine.api.datastore DatastoreServiceFactory Entity Key Query KeyFactory)))

(defn entity-to-map
  [entity]
  (into (hash-map) (.getProperties entity)))

(defn getEntity
  [id type]
  (let [k (KeyFactory/createKey (.toString type) (Long/valueOf id))]
    (entity-to-map
     (.get (DatastoreServiceFactory/getDatastoreService) k))))

(defn -doGet
  [_ request response]
  (let [id (.getParameter request "storyId")
        story (getEntity id :story)
        w (.getWriter response)]
    (doseq [[k v] story]
      (.println w (str k "=" v)))))


    entity-to-map just converts the properties of the entity into a friendly Clojure type.

So now I know how to authenticate users; the next step is to get some basic UI together. There are a number of choices here (JSP, server-side HTML generation in Clojure, or just go with Ajax). I'm leaning towards the Ajax!

    Tuesday 12 May 2009

    Joel on Software Dev Day in London

    I've been looking for interesting events in the UK for hackers. Not found much so far!

I tried the Cambridge SPA group, but only managed to get to one meeting. There was a good talk on Software Decay, and a cringe-inducing moment as one of the participants suggested that older people couldn't program, only to find himself in the same room as Tony Hoare. The rest of the programme unfortunately doesn't sound as interesting.

    There's also Software East, but that seems to have more of a process theme, rather than in-depth technical gubbins. Not sure what else there is in the East of England.

    Today, Joel on Software announced that there would be a Stack Overflow event in London in October. Sounds exactly what I'm after, so I've signed straight up!

    The conference is for programmers. The conversation is going to be hard core. Speakers are going to be writing code.


    Perfect!

    Sunday 10 May 2009

    User authentication in GAE

    Google App Engine provides a set of standard APIs for common tasks, such as user authentication, caching and persistent storage. This post looks at the user authentication API and creates a simple form that is authenticated against a Google login.

    We'll use exactly the same structure and build scripts as the previous post, as that just makes life easier.

The servlet below checks whether there is a current user logged in. If not, it redirects to a prompt requiring the user to log in; otherwise it just displays the user's nickname.


(ns blogging.login
  (:gen-class :extends javax.servlet.http.HttpServlet)
  (:import (com.google.appengine.api.users User UserService UserServiceFactory)))

(defn greet
  [user response]
  (.setContentType response "text/plain")
  (let [w (.getWriter response)]
    (.println w (str "Hello, " (.getNickname user)))))

(defn -doGet
  [_ request response]
  (let [userService (UserServiceFactory/getUserService)
        user (.getCurrentUser userService)]
    (cond
      (not (nil? user)) (greet user response)
      :else (.sendRedirect response (.createLoginURL userService (.getRequestURI request))))))


    If you deploy on the development server, then you get a screen like that shown below:



    Neat! Next on the list, persisting data.

    Clojure on the Google App Engine

    The Google App Engine offers a complete stack for deploying applications in the cloud. Initially, support only existed for Python, but recently support was announced for Java.

Although the main page announces this as support for the Java language, it's much more than that: it's support for the Java Virtual Machine. The list of languages on the JVM is huge. This means that, in theory, any of these languages can now be hosted in the cloud.

    So how easy is it to get Clojure going in the cloud?

Firstly, register at http://appengine.google.com and get yourself an account. Download the Java AppEngine SDK too, unpack it, and get the development server up and running.

    GAE is based on the Servlet 2.5 specification, so the typical directory structure looks very similar to any JSP/Servlet type application:


    • Root - Top level directory for the project
    • Root/src - Source code
    • Root/war - JSP files and HTML artefacts
• Root/war/WEB-INF - Application configuration files
    • Root/war/WEB-INF/classes - Compiled source files
    • Root/war/WEB-INF/lib - Deployment time dependencies


    As GAE is based on servlets, we need to define a simple servlet for the mandatory hello world demo! This code goes in the src directory:


(ns helloclojure.servlet
  (:gen-class :extends javax.servlet.http.HttpServlet))

(defn -doGet
  [_ request response]
  (let [w (.getWriter response)]
    (.println w "Hello world!")))


:gen-class causes Clojure to emit a class file for this namespace. The - prefix on doGet marks it as the implementation of the servlet's doGet method, which takes three arguments (the first representing "this", unused here and therefore a _ symbol). So all we do for this servlet is write "Hello world!" whenever any request is made.

Next we need to make a standard web.xml descriptor and put that in war/WEB-INF. This registers the servlet and specifies the mapping between a URL pattern and the servlet that deals with the request.



<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE web-app PUBLIC
  "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"
  "http://java.sun.com/dtd/web-app_2_3.dtd">

<web-app xmlns="http://java.sun.com/xml/ns/javaee" version="2.5">
  <servlet>
    <servlet-name>helloclojure</servlet-name>
    <servlet-class>helloclojure.servlet</servlet-class>
  </servlet>
  <servlet-mapping>
    <servlet-name>helloclojure</servlet-name>
    <url-pattern>/helloclojure</url-pattern>
  </servlet-mapping>
  <welcome-file-list>
    <welcome-file>index.html</welcome-file>
  </welcome-file-list>
</web-app>


Also in the WEB-INF directory we need a descriptor for the GAE itself, appengine-web.xml.



<?xml version="1.0" encoding="utf-8"?>
<appengine-web-app xmlns="http://appengine.google.com/ns/1.0">
  <application></application>
  <version>1</version>
</appengine-web-app>


That's all the scaffolding you need. Finally, we need some way to build and deploy this. Thankfully, someone has already looked at this, so I took their build script and made a few modifications (removing the test target, for example) in the name of simplification.

    The net result is the following Ant script.


<project name="helloclojure" basedir="." default="compile">

  <property environment="env" />
  <property name="sdk.dir" location="/home/jfoster/appengine-java-sdk-1.2.0" />
  <property name="classes.dir" value="war/WEB-INF/classes" />
  <property name="lib.dir" value="war/WEB-INF/lib" />
  <property name="src.dir" value="src/" />

  <import file="${sdk.dir}/config/user/ant-macros.xml"/>

  <path id="project.classpath">
    <pathelement path="${classes.dir}" />
    <pathelement path="${src.dir}/helloworld" />
    <fileset dir="${lib.dir}">
      <include name="**/*.jar" />
    </fileset>
    <fileset dir="${sdk.dir}/lib">
      <include name="**/*.jar" />
    </fileset>
  </path>

  <target name="clean">
    <delete dir="${classes.dir}" />
  </target>

  <target name="init">
    <mkdir dir="${classes.dir}" />
  </target>

  <target name="compile" depends="clean,init">
    <java classname="clojure.lang.Compile" classpathref="project.classpath" failonerror="true">
      <classpath path="${src.dir}" />
      <sysproperty key="clojure.compile.path" value="${classes.dir}" />
      <arg value="helloclojure.servlet" />
    </java>
  </target>

  <target name="devserver" description="run local dev appserver" depends="compile">
    <dev_appserver war="war" />
  </target>

  <target name="deploy" description="deploy to appspot" depends="compile">
    <appcfg action="update" war="war" />
  </target>

</project>


    All code is available on my Git Repo

    Saturday 9 May 2009

    Stovepipe Development

Whilst reading The Art of Agile Development (a great book so far), I came across a term I'd not seen before: stovepipe development. Apparently it's:

    The design and creation of a system to solve an immediate problem without regard to future connectivity or integration with other systems [1]

    Wednesday 6 May 2009

    Clojure 1.0 Released!

Only 18 months or so after Clojure came to life, it has now been released as v1.0. See here for the official announcement.

    Sunday 3 May 2009

    Levenshtein Distance in Clojure (II)

In my last post, I looked at the Levenshtein distance and showed that the time complexity of the algorithm (when naively implemented) is exponential. In this post I'll show how we can improve the time complexity with a small series of changes.

Dynamic programming is a technique for solving problems which consist of overlapping subproblems. The Levenshtein algorithm is perfect for this kind of problem, as the final solution is built up from the solutions to smaller subproblems. Imperative languages will often solve this using an M x N matrix, where each item represents the cost of converting one prefix to another. This is explained in much more detail on Wikipedia. That solution is nice, and computationally cheap, but it's not faithful to the definition of the problem. It's an example of the impedance mismatch between writing software and writing specifications for a problem (see my previous post about Intentional Software for one approach to minimizing this mismatch).

So how can we keep faithful to the original specification without the exponential performance cost? Memoization is a technique for caching the results of function calls. Functional programming languages are inherently suitable for memoization, as every pure function will, for the same arguments, yield the same result. Clojure has direct support for memoization via memoize. If we were dealing with a simple example where the function body didn't refer to the function name, we could just use memoize directly, but the Levenshtein distance function refers to itself three times. There are a number of options here: we could either convert to a tail-recursive function and let recur solve the problem, or we could use a forward declaration (thanks to #clojure for pointing that one out, you don't want to know what my plan C was!).
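As a tiny illustration of memoize in isolation (my example, not from the post):

;; memoize wraps a function with a cache keyed on its arguments; a repeated
;; call with the same argument returns the cached value immediately.
(def slow-inc (memoize (fn [x] (Thread/sleep 1000) (inc x))))

user> (time (slow-inc 1))  ; pays the one second sleep
user> (time (slow-inc 1))  ; returns instantly from the cache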

Transforming the algorithm to be tail recursive would be painful and would radically change the way the code looks, so let's go for the forward declaration approach. We declare levenshtein-distance-fast up front, write the definition of levenshtein-distance referring to the fast implementation (as yet unbound), and finally define levenshtein-distance-fast as the memoized version of levenshtein-distance.



(declare levenshtein-distance-fast)

(defn levenshtein-distance
  "Calculates the edit-distance between two sequences"
  [seq1 seq2]
  (cond
    (empty? seq1) (count seq2)
    (empty? seq2) (count seq1)
    :else (min
           (+ (cost (first seq1) (first seq2)) (levenshtein-distance-fast (rest seq1) (rest seq2))) ;; substitution
           (inc (levenshtein-distance-fast (rest seq1) seq2))  ;; insertion
           (inc (levenshtein-distance-fast seq1 (rest seq2)))))) ;; deletion

(def levenshtein-distance-fast (memoize levenshtein-distance))


    This makes a huge difference in performance as you can see below:


user> (time (levenshtein-distance-fast "abcdefghi" "hijklmnop"))
"Elapsed time: 0.191776 msecs"
9

user> (time (levenshtein-distance-fast "abcdefghi" "hijklmnop"))
"Elapsed time: 0.081372 msecs"
9

user> (time (levenshtein-distance-slow "abcdefghi" "hijklmnop"))
"Elapsed time: 668.713644 msecs"
9


    This (probably!) gives the same complexity class as a implementation using dynamic programming. However, it still has one big problem. It isn't tail recursive and hence it won't work for large very length sequences passed as arguments. Altering the definition to use tail recursion is not something I fancy doing!