Fatvat: June 2011

A VPS is a virtual machine that can be used for software hosting, it's typically a lot cheaper than a dedicated server and it's a good way to spend a few dollars and get a machine on the internet!

I use Linode for my VPS and it's cheap, simple and I've never had any problems with it. The goal of this is for me to write down how I deployed an application written using Yesod to a Linode instance. I found it wasn't as straight-forward as I'd hoped so I've tried to document all the steps I went through.

First off, create yourself a 32bit Ubuntu 10.04 image (with less than 4GB of RAM, 64 bit just wastes space). Once you've got this, ssh onto the machine and download the latest and greatest Haskell and Haskell platform.

To install GHC

  wget http://www.haskell.org/ghc/dist/7.0.3/ghc-7.0.3-i386-unknown-linux.tar.bz2
  tar xvf ghc-7.0.3-i386-unknown-linux.tar.bz2
  cd ghc-7.0.3
  sudo apt-get install gcc libgmp3-dev
  ./configure
  make install

After this has completed, fire up ghci and verify you can evaluate Haskell commands (print "Hello World" ought to confirm that everything is working, missing libgmp3-dev causes ghc to install but not run). Next the Haskell platform.

  wget http://lambda.galois.com/hp-tmp/2011.2.0.1/haskell-platform-2011.2.0.1.tar.gz
  tar xvf haskell-platform-2011.2.0.1.tar.gz
  sudo apt-get install zlib1g-dev libglut3-dev
  make
  make install

I had a somewhat bizarre issue where the first round of make failed to build OpenGL. Running make again made the troubles go away. No idea why, but it doesn't seem to have caused any problems. At this stage you should have a completely working Haskell environment, with lots of lovely packages installed from Hackage.

Next step is to actually build and run the application. I've got my code up on a Mercurial instance, so I simply clone the repository onto the build server, and then I can just use cabal to build locally and end up with an executable. This reddit comment suggests that building and linking with GHC takes huge amounts of memory. I'm on the cheapest plan at Linode which only have 512MB of RAM and I had no problems, but then again I'm only deploying toy applications at the moment. As said on Reddit, if you do run into problems build locally with the same architecture and upload to the server.

The recommended way of deploiying a Yesod application is to use nginx as a reverse proxy, together with Warp. Getting nginx up and running on Linode is documented here. Nginx is the 10.04 repositories is only up to version 0.76, so I built the latest release from source.

The below is simple nginx.conf file adapted from the Yesod web site.

  daemon off; # don't run nginx in the background, good for monitoring app
  events {
      worker_connections 4096;
  }
  http {

      # If multiple domains are hosted on the same box, this can be used to
      # block everything that isn't directed at a specific host
      server {
          listen 80 default_server;
          server_name _;
          return 444;
      }

      server {
           # Important, serve files with the appropriate mime-type
           # Without this some browsers don't interpret JS, CSS correctly
           include /opt/nginx/conf/mime.types;

           listen 80; 
           server_name search.boringtosser.co.uk;
           location / {
               proxy_pass http://127.0.0.1:3000;
           }

           # Used for serving static content
           location /static {
               root /home/jeff/code/keywordSearch;
               expires max;
           }
       }               
}

nginx comes with default start scripts in /etc/init.d. /etc/init.d/nginx start starts the server running with the default configuration. Swapping start for stop obviously stops it! To get your Yesod application built use cabal, making sure to build the production version with cabal -fproduction configure. As described in the Yesod book you can use an Upstart script to keep things up and running. And that should be all there is to it!

The job of a search engine is simple, given some search terms find some relevant documents containing those terms. I thought it'd be fun to write a very simple one and see what I could learn along the way. search.boringtosser.co.uk contains the end result.

The most fundamental data structure in a search engine is the inverted index that maintains a mapping from content (such as words or numbers) to the documents that contain those terms. For example:

  Doc 1 = quick brown fox
  Doc 2 = brown owl
  Doc 3 = fox tango

  quick => { 1 }
  brown => { 1, 2 }
  fox   => { 1, 3 }
  owl   => { 2 }
  tango => { 3 }

An inverted index makes it a simple problem to go from a search term to the documents containing that term. There are (at least!) a couple of problems with just building a reverse index blindly from the data. How do you make sure that variations of the same word go to the same document? For example, if I search for "bananas" I'd like to get documents containing the term "banana" returned as well. Secondly, how do you deal with words like "the" that are going to occur in almost every single document?

Stemming is a process to reduce words to their base form. For example, "eating", "eaten" and "eat" all result in the same stemmed word "eat". Most search engines deal with common words by filtering them out. Search engines refer to these words as stop words and it's as simple as filtering out a known list of common words.

Users communicate with a search engine using a query language. For this toy example, I'll use a simple query language that supports keyword matching and the Boolean operators, AND and OR.

In order to show a search engine we'll also need a data set. I've used the Jar of Fortune data as this gives lot sof short, simple documents.

(image generated using Jim's fortune cookie image generator and text from XKCD Fortune Cookies)

Implementation

Where should the inverted index be stored? You could simply store this data in a hash table, but then you have to worry about persisting the data, scaling to large amounts of data and so on. A Key/Value store provides a very efficient way of storing this map based data. My current favourite store is Redis as it's well documented, stable, ultra-fast and has a very simple API to use from a variety of languages, including Haskell.

To make things simpler, I'll also store the documents themselves in Redis using the standard string functions. This is OK for this example, as I'm just storing small strings. Indexing larger files would probably want to use some disk-based structure. I'll use Redis's sorted set feature to store the word to document mapping. This allows you to store a set of strings, where each entry in the set has an associated score representing how many times that word has occurred. The higher the score, the more relevant the search result.

The first task is to turn a lump of text into a series of terms to index into Redis. Stop word removal is a bit of a no-brainer, just grab a giant list of stop words and remove them. isStopWord :: T.Text -> Bool. The complete code is dull, unexciting and here! Stemming is more interesting. Developed in 1980, the Porter stemming algorithm consists of 5 or so simple rules to stem a word. Better still, there is already an existing Haskell implementation. I used that implementation and adapted it ever so slightly to use Data.Text. The full version of the code for stemming here.

The Jar of Fortunes data is in two different formats (one '%' separated, the other blank line separated). A quick script parses those from one giant text string into individual elements, and adds a reference to the document for each term. For example, if the quote is "quick fox" and the quote is #123 in file foo, then three entries would be created (quick -> (foo#123,1), fox -> (foo#123,1)). The code for indexing cookies is below:

Next up is a way of performing the search, but before we can proceed with that we need to define what a search actually is. Most search engines support a simple query language, providing simple features like boolean expressions and exact phrase matching. For my little toy engine, I'll just stick with supporting exact match for single words and the boolean operations 'AND and 'OR'. Implicit OR's are placed between adjacent terms. For example, searching for resplendent fish is treated as resplendent OR fish.

Once the query is passed we need to evaluate the AST and get the results. The leaf case is simple, given a "Contains" query the only task is to create the search term from the text and look up the values in the inverted index. The Boolean operators are more interesting. As is obvious, "and" and "or" correspond to set intersection and set union. One way to implement this would be to look up the values from the left hand side and the right hand side and calculate the intersection, but this is a very choice because it involves grabbing all documents from Redis to the server. A better choice is to make use of a Redis function to do the heavy lifting. Redis provides ZINTERSTORE and ZUNIONSTORE for just this purpose. Given two (or more) existing keys, they calculate the intersection and union of the keys and store the result in a new key. Storing intermediate results in new keys runs the risk that the memory usage will grow unbounded, but it does allow query fragment caching (as long as the intermediate key represents the query). Although this doesn't make much difference for such a small data set, it's likely that on a large data set caching frequently used query strings will give some benefit. Redis provides ttl commands such as ttl, expire and persist to help address the memory consumption issues. I've made sure that intermediate keys have a short ttl and that should (hopefully) mean that the memory shouldn't grow infinitely.

I used Parsec to turn a query string into an abstract syntax tree (mostly based on the example code on the Haskell wiki). The query function processes a search request and returns a key containing the results that can be retrieved using getKeyValues.

After all this I have a query engine that works in ghci. That's a bit boring, so time to get it deployed on a web server somewhere. I used Yesod to create a website. For a simple one page website, the only code I had to write was to initialize the Redis connection and the logic to execute the search request.

The getRootR function just loads the templates to display the home page. getSearchR performs the search. The biggest problem I ran into was messages such as <socket: 7>: hFlush: resource vanished (Broken pipe). This was because I had tried to keep a connection permanently open to Redis, instead of using a pool. After some helpful people in #haskell pointed me in the right direction, I switched over to using Data.Pool and all was good.

The deployed application is available at http://search.boringtosser.co.uk and the complete set of code is available on github (and hopefully it comes with a working cabal file this time!). Any suggestions on how to improve the code or just anything I'm doing stupidly is much appreciated!

Fatvat

Saturday, 4 June 2011

Deploying a Yesod application on Linode

Thursday, 2 June 2011

Writing a simple keyword search engine using Haskell and Redis

Implementation