Tuesday, 27 September 2011

Avoid C# Enums!

C# enums are closer to glorified integers than Java's. In Java, you can avoid the horrible pattern shown below, where the function has to throw an exception in its default case to guard against enum values added later. This pattern is yucky because the compiler can't check it at compile time.

   public enum Foo { Bar, Baz };

   // Has to throw the exception
   public void doSomething(Foo f) {
     switch(f) {
        case Bar:
           System.out.println("!! Bar");
           break;
        case Baz:
           System.out.println("Baz !!");
           break;
        default:
           throw new IllegalArgumentException("Unsupported enum type!");
      }
   }

In Java, you can give the enum behaviour.

   public enum Foo {
     Bar("!! Bar"),
     Baz("Baz !!");

     private final String stringRepresentation;

     private Foo(String s) { stringRepresentation = s; }

     public String getString() { return stringRepresentation; }
   }

   public void doSomething(Foo f) {
      System.out.println(f.getString());
   }

This is cool because the only way you can add a new enum value and get it to compile is to give it a valid string representation.

In C#, enums can't have behaviour, so I think they should be avoided like the plague. Chances are that if you're switching on an enum value, the behaviour you're executing belongs on a class anyway!

You can simulate enums with a class, a private constructor and some static instances of the class representing the enum values.

  public sealed class Foo 
  {
    public static readonly Foo Bar = new Foo("!! Bar");
    public static readonly Foo Baz = new Foo("Baz !!");

    private string m_StringRepresentation;

    private Foo(string stringRepresentation) {
        m_StringRepresentation = stringRepresentation;
    }
  
    public string StringRepresentation { get { return m_StringRepresentation; } } 
  }

Saturday, 4 June 2011

Deploying a Yesod application on Linode

A VPS is a virtual machine that can be used for software hosting. It's typically a lot cheaper than a dedicated server, and it's a good way to spend a few dollars and get a machine on the internet!

I use Linode for my VPS: it's cheap, simple and I've never had any problems with it. The goal of this post is to write down how I deployed an application written using Yesod to a Linode instance. It wasn't as straightforward as I'd hoped, so I've tried to document all the steps I went through.

First off, create yourself a 32-bit Ubuntu 10.04 image (with less than 4GB of RAM, 64-bit just wastes space). Once you've got this, ssh onto the machine and download the latest and greatest GHC and Haskell Platform.

To install GHC

  wget http://www.haskell.org/ghc/dist/7.0.3/ghc-7.0.3-i386-unknown-linux.tar.bz2
  tar xvf ghc-7.0.3-i386-unknown-linux.tar.bz2
  cd ghc-7.0.3
  sudo apt-get install gcc libgmp3-dev
  ./configure
  make install

After this has completed, fire up ghci and verify you can evaluate Haskell commands; print "Hello World" ought to confirm that everything is working (a missing libgmp3-dev causes GHC to install but not run).
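A quick sanity check looks something like this (banner from GHC 7.0.3):

  $ ghci
  GHCi, version 7.0.3: http://www.haskell.org/ghc/  :? for help
  Prelude> print "Hello World"
  "Hello World"

Next, the Haskell Platform.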

  wget http://lambda.galois.com/hp-tmp/2011.2.0.1/haskell-platform-2011.2.0.1.tar.gz
  tar xvf haskell-platform-2011.2.0.1.tar.gz
  cd haskell-platform-2011.2.0.1
  sudo apt-get install zlib1g-dev libglut3-dev
  ./configure
  make
  make install

I had a somewhat bizarre issue where the first round of make failed to build OpenGL. Running make again made the troubles go away. No idea why, but it doesn't seem to have caused any problems. At this stage you should have a completely working Haskell environment, with lots of lovely packages installed from Hackage.

Next step is to actually build and run the application. I've got my code up on a Mercurial instance, so I simply clone the repository onto the build server and use cabal to build locally, ending up with an executable. This reddit comment suggests that building and linking with GHC takes huge amounts of memory. I'm on the cheapest Linode plan, which only has 512MB of RAM, and I had no problems, but then again I'm only deploying toy applications at the moment. As suggested on Reddit, if you do run into problems, build locally on the same architecture and upload the executable to the server.

The recommended way of deploying a Yesod application is to use nginx as a reverse proxy in front of Warp. Getting nginx up and running on Linode is documented here. The nginx in the 10.04 repositories is only at version 0.76, so I built the latest release from source.
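Building from source is the usual dance; something like the below (substitute whatever nginx version is current, and note the --prefix matches the mime.types path in the config that follows):

  wget http://nginx.org/download/nginx-1.0.4.tar.gz
  tar xvf nginx-1.0.4.tar.gz
  cd nginx-1.0.4
  sudo apt-get install libpcre3-dev libssl-dev
  ./configure --prefix=/opt/nginx
  make
  sudo make install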

Below is a simple nginx.conf file adapted from the Yesod web site.

  daemon off; # don't run nginx in the background, good for monitoring app
  events {
      worker_connections 4096;
  }
  http {

      # If multiple domains are hosted on the same box, this can be used to
      # block everything that isn't directed at a specific host
      server {
          listen 80 default_server;
          server_name _;
          return 444;
      }

      server {
           # Important, serve files with the appropriate mime-type
           # Without this some browsers don't interpret JS, CSS correctly
           include /opt/nginx/conf/mime.types;

           listen 80; 
           server_name search.boringtosser.co.uk;
           location / {
               proxy_pass http://127.0.0.1:3000;
           }

           # Used for serving static content
           location /static {
               root /home/jeff/code/keywordSearch;
               expires max;
           }
       }               
}

nginx comes with default start scripts in /etc/init.d: /etc/init.d/nginx start starts the server with the default configuration, and swapping start for stop obviously stops it! To build your Yesod application use cabal, making sure to build the production version with cabal -fproduction configure. As described in the Yesod book, you can use an Upstart script to keep things up and running.
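A minimal Upstart job (the file name and binary path here are illustrative; the app directory matches the nginx config above) might look like:

  # /etc/init/keywordsearch.conf
  description "keyword search Yesod app"

  start on runlevel [2345]
  stop on runlevel [!2345]
  respawn

  script
      cd /home/jeff/code/keywordSearch
      exec ./dist/build/keywordSearch/keywordSearch
  end script

And that should be all there is to it!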

Thursday, 2 June 2011

Writing a simple keyword search engine using Haskell and Redis

The job of a search engine is simple, given some search terms find some relevant documents containing those terms. I thought it'd be fun to write a very simple one and see what I could learn along the way. search.boringtosser.co.uk contains the end result.

The most fundamental data structure in a search engine is the inverted index, which maintains a mapping from terms (such as words or numbers) to the documents that contain them. For example:

  Doc 1 = quick brown fox
  Doc 2 = brown owl
  Doc 3 = fox tango

  quick => { 1 }
  brown => { 1, 2 }
  fox   => { 1, 3 }
  owl   => { 2 }
  tango => { 3 }

An inverted index makes it simple to go from a search term to the documents containing that term. There are (at least!) a couple of problems with building an inverted index blindly from the data, though. First, how do you make sure that variations of the same word map to the same term? For example, if I search for "bananas" I'd like to get documents containing the term "banana" returned as well. Secondly, how do you deal with words like "the" that are going to occur in almost every single document?

Stemming is a process that reduces words to their base form. For example, "eating", "eaten" and "eat" all result in the same stemmed word, "eat". Common words are dealt with by filtering them out: search engines refer to these as stop words, and handling them is as simple as filtering against a known list.

Users communicate with a search engine using a query language. For this toy example, I'll use a simple query language that supports keyword matching and the Boolean operators AND and OR.

To show off the search engine we'll also need a data set. I've used the Jar of Fortunes data, as this gives lots of short, simple documents.


(image generated using Jim's fortune cookie image generator and text from XKCD Fortune Cookies)

Implementation


Where should the inverted index be stored? You could simply keep it in a hash table, but then you have to worry about persisting the data, scaling to large amounts of it and so on. A key/value store provides a very efficient way of storing this map-based data. My current favourite store is Redis, as it's well documented, stable, ultra-fast and has a very simple API usable from a variety of languages, including Haskell.

To make things simpler, I'll also store the documents themselves in Redis using the standard string functions. This is OK for this example, as I'm just storing small strings; indexing larger files would probably call for some disk-based structure. I'll use Redis's sorted set feature to store the word-to-document mapping. A sorted set holds a set of strings where each entry has an associated score, which here records how many times the word occurs in the document. The higher the score, the more relevant the search result.

The first task is to turn a lump of text into a series of terms to index into Redis. Stop word removal is a bit of a no-brainer: just grab a giant list of stop words and remove them, giving isStopWord :: T.Text -> Bool. The complete code is dull, unexciting and here! Stemming is more interesting. Developed in 1980, the Porter stemming algorithm consists of five or so simple rules for stemming a word. Better still, there is already an existing Haskell implementation, which I adapted ever so slightly to use Data.Text. The full version of the stemming code is here.
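For illustration, a minimal version of the stop-word check might look like the below (the real list runs to a few hundred words):

  import qualified Data.Set as S
  import qualified Data.Text as T

  -- A tiny subset of the real stop-word list.
  stopWords :: S.Set T.Text
  stopWords = S.fromList ["a", "an", "and", "in", "is", "of", "the", "to"]

  isStopWord :: T.Text -> Bool
  isStopWord w = T.toLower w `S.member` stopWords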

The Jar of Fortunes data is in two different formats (one '%' separated, the other blank-line separated). A quick script parses those from one giant text string into individual elements, then adds a reference to the document for each term. For example, if the quote "quick fox" is quote #123 in file foo, the entries quick -> (foo#123, 1) and fox -> (foo#123, 1) are created. The code for indexing cookies is below.
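It's sketched here against the hedis client (hedis postdates this post, so the calls stand in for whatever Redis binding was originally used; stem and isStopWord are the helpers described above):

  {-# LANGUAGE OverloadedStrings #-}

  import Data.Monoid ((<>))
  import qualified Data.Text as T
  import qualified Data.Text.Encoding as E
  import Database.Redis

  -- Index one fortune cookie: store its text under doc:<id>, then bump the
  -- document's score in the sorted set kept for each indexable term.
  indexDocument :: Connection -> T.Text -> T.Text -> IO ()
  indexDocument conn docId body = runRedis conn $ do
      _ <- set (key "doc:" docId) (E.encodeUtf8 body)
      mapM_ addTerm (terms body)
    where
      addTerm t = zincrby (key "term:" t) 1 (E.encodeUtf8 docId)
      -- lower-case, stem and strip stop words before indexing
      terms = filter (not . isStopWord) . map stem . T.words . T.toLower
      key prefix t = prefix <> E.encodeUtf8 t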

Next up is a way of performing the search, but before we can proceed we need to define what a search actually is. Most search engines support a simple query language, providing features like boolean expressions and exact phrase matching. For my little toy engine, I'll just stick with supporting exact matches for single words and the boolean operations 'AND' and 'OR'. Implicit ORs are placed between adjacent terms; for example, searching for resplendent fish is treated as resplendent OR fish.

Once the query is parsed we need to evaluate the AST and get the results. The leaf case is simple: given a "Contains" query, the only task is to create the search term from the text and look up the values in the inverted index. The Boolean operators are more interesting. "and" and "or" correspond to set intersection and set union. One way to implement them would be to look up the values for the left-hand side and the right-hand side and calculate the intersection or union in the application, but that's a poor choice because it means shipping all the matching documents from Redis to the application. A better choice is to let Redis do the heavy lifting: it provides ZINTERSTORE and ZUNIONSTORE for just this purpose. Given two (or more) existing keys, they calculate the intersection or union and store the result in a new key.

Storing intermediate results in new keys runs the risk that memory usage will grow unbounded, but it does allow query-fragment caching (as long as the intermediate key identifies the query fragment). Although this doesn't make much difference on such a small data set, on a large one caching frequently used query fragments should give some benefit. Redis provides commands such as ttl, expire and persist to help address the memory consumption; I've given intermediate keys a short ttl, which should (hopefully) mean that memory doesn't grow without bound.

I used Parsec to turn a query string into an abstract syntax tree (mostly based on the example code on the Haskell wiki). The query function processes a search request and returns a key containing the results that can be retrieved using getKeyValues.
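Here's a sketch of the evaluation half in the same hedis style (the Parsec parser just produces Query values; the intermediate key naming is my own convention):

  {-# LANGUAGE OverloadedStrings #-}

  import Data.ByteString.Char8 (ByteString)
  import qualified Data.ByteString.Char8 as B
  import Database.Redis

  -- The abstract syntax tree the parser produces.
  data Query = Contains ByteString
             | And Query Query
             | Or Query Query

  -- Evaluate a query, returning the key of a sorted set holding the result.
  -- Leaves map straight onto existing term keys (the real code stems the
  -- term first); AND and OR become ZINTERSTORE and ZUNIONSTORE.
  evaluate :: Query -> Redis ByteString
  evaluate (Contains t) = return (B.append "term:" t)
  evaluate (And l r)    = combine zinterstore "&" l r
  evaluate (Or l r)     = combine zunionstore "|" l r

  combine :: (ByteString -> [ByteString] -> Aggregate -> Redis (Either Reply Integer))
          -> ByteString -> Query -> Query -> Redis ByteString
  combine store op l r = do
      lk <- evaluate l
      rk <- evaluate r
      -- The destination key encodes the query fragment, so repeated
      -- fragments hit the cached set instead of being recomputed.
      let dest = B.concat ["tmp:", lk, op, rk]
      _ <- store dest [lk, rk] Sum
      _ <- expire dest 60      -- a short ttl keeps memory bounded
      return dest

  -- Pull back the matching documents, highest score first.
  getKeyValues :: ByteString -> Redis (Either Reply [(ByteString, Double)])
  getKeyValues k = zrevrangeWithscores k 0 (-1)

Making the destination key a function of the query fragment is what gives the fragment caching described above for free.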

After all this I have a query engine that works in ghci. That's a bit boring, so time to get it deployed on a web server somewhere. I used Yesod to create a website. For a simple one page website, the only code I had to write was to initialize the Redis connection and the logic to execute the search request.
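The glue looks roughly like this (a self-contained sketch rather than the original source; the real handlers parse the query, run it against Redis and render proper templates):

  {-# LANGUAGE OverloadedStrings, QuasiQuotes, TemplateHaskell, TypeFamilies #-}

  import Yesod
  import Database.Redis (Connection, connect, defaultConnectInfo)

  -- The foundation type carries the Redis connection.
  data App = App { redisConn :: Connection }

  mkYesod "App" [parseRoutes|
  /       RootR   GET
  /search SearchR GET
  |]

  instance Yesod App

  getRootR :: Handler Html
  getRootR = defaultLayout [whamlet|<h1>Keyword search|]

  getSearchR :: Handler Html
  getSearchR = do
      q <- lookupGetParam "q"
      -- run the parsed query against Redis here and render the hits
      defaultLayout [whamlet|<p>You searched for #{maybe "" id q}|]

  main :: IO ()
  main = do
      conn <- connect defaultConnectInfo
      warp 3000 (App conn)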

The getRootR function just loads the templates to display the home page. getSearchR performs the search. The biggest problem I ran into was messages such as <socket: 7>: hFlush: resource vanished (Broken pipe). This was because I had tried to keep a connection permanently open to Redis, instead of using a pool. After some helpful people in #haskell pointed me in the right direction, I switched over to using Data.Pool and all was good.

The deployed application is available at http://search.boringtosser.co.uk and the complete set of code is available on github (and hopefully it comes with a working cabal file this time!). Any suggestions on how to improve the code, or just anything I'm doing stupidly, are much appreciated!

Tuesday, 10 May 2011

Method Names in the Java Development Kit and Spring Framework

"When I use a word," Humpty Dumpty said, in a rather scornful tone, "It means just what I choose it to mean — neither more nor less."

(Lewis Carroll, Through the Looking Glass)

Names are probably the most important abstraction device when writing code. A good name can hide a multitude of sins (who cares how vomit inducing a method implementation is, if the name is right?), but a bad one invokes feelings of rage (I've found a fileExists method that returns false if the file exists...).

Parts of Java have a very rigid naming standard (for example JavaBean conventions). This is a good thing for some reasons (see the number of packages that use reflection to take advantage of this), but a bad thing for others when you end up making up terrible names just to satisfy some naming convention.


In order to explore the sort of names that Java uses, I wrote a quick script that trawled the JavaDoc from the standard library (JDK 6.0), reflectively looked up all the methods and split them into words (assuming camel casing and ignoring anything with $ or _ in the name). And then a bit of manual munging...
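The camel-case splitting is the only vaguely interesting part; a Haskell sketch of the idea (the original was a throwaway script) looks like:

  import Data.Char (isUpper, toLower)

  -- "supportsDataDefinition" -> ["supports","data","definition"]
  splitCamel :: String -> [String]
  splitCamel []     = []
  splitCamel (c:cs) = map toLower (c : word) : splitCamel rest
    where (word, rest) = break isUpper cs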

The most popular starting word for a Java method is (unsurprisingly) get.

The longest method is (deep breath) supportsDataDefinitionAndDataManipulationTransactions. The largest number of words in an individual method name is eight, and there are quite a few of those.

The distribution of method name lengths shows a nice Gaussian shape.

Spring started life as a simpler alternative to the Enterprise JavaBeans framework and is now one of the most popular frameworks for Java. How do the naming conventions of Spring compare to the Java world? Does it have any factory factories? First off, a simple frequency analysis of the first word of methods. As expected, lots of get, set and is.

maybe bind this or target or args from pointcut expression? That's the method with the largest number of words in the Spring framework (or a strange haiku?). The largest number of words in a public method name is the monster set apply web request interceptors to render phase only. The distribution of method name lengths again shows a normal distribution.

With that as inspiration I decided to finally get around to creating methodnamer.com, the new one-stop shop for adding method names to the classes you've named using classnamer.com. You can generate methods based on either Spring or the JDK. Are there any method names in the JDK that are haiku? A haiku should contain 17 syllables in the pattern 5-7-5. I used the Perl module Lingua::EN::Syllable and a bit of hackery, and didn't find anything that was an exact match. The best I could come up with was:
   create selection
   model property change
   listener

Tuesday, 15 March 2011

TODO Write more Posts

Wow, it's been three months since I did anything to my blog. That can't be good, so it's time for a post to remind myself to do something...

Over the next few months, I think I'll try and post more things about algorithms in general. I've been spending my time trying to be a bit better at the sort of algorithm problems that TopCoder and Spoj present. This post is designed to act as a reminder for me to actually do so!

So first up, I'll write something about k-d trees. Hopefully, now that I've written that down, it'll be motivation enough for something to appear over the next few weeks.