Sunday, 12 July 2009

Merging RSS Feeds

In order to investigate XML parsing in Clojure I knocked up a quick utility to aggregate RSS feeds. The idea is that a user provides a number of URLs, runs the program and a merged RSS feed is returned, sorted by publish date.

Firstly, let's write a trivial function to join any number of lists together according to a selection function. This will be used to merge together the XML by selecting the minimum published date each time.


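It only takes a few lines; something like this does the job (a sketch rather than gospel, but it behaves as the examples below show):

(defn join-all
  "Merge sequences together, repeatedly taking whichever head element
   the selection function f picks from the remaining sequences."
  [f & seqs]
  (lazy-seq
   (let [seqs (remove empty? seqs)]
     (when (seq seqs)
       (let [chosen (reduce f (map first seqs))
             [before [src & after]] (split-with #(not= (first %) chosen) seqs)]
         (cons chosen
               (apply join-all f (concat before [(rest src)] after))))))))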

The function f should be a function of two arguments that returns one of the arguments. For example:


user> (join-all min [1 2 3] [1 1 3])
(1 1 1 2 3 3)

user> (join-all min [2 4 6 8 10] [1 3 5 7 9])
(1 2 3 4 5 6 7 8 9 10)


Next, we want some utilities to mess around with the RSS format and select various XML elements. Clojure Contrib provides a lazy XML package, together with some utilities to make XML zip filtering easier (I previously looked at zip filtering here).

Since the examples for clojure.contrib.zip-filter.xml already use RSS this is really trivial:


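A sketch of the sort of thing I mean (the namespace and helper names here are just mine):

(ns uk.co.fatvat.feeds
  (:require [clojure.zip :as zip]
            [clojure.contrib.lazy-xml :as xml])
  (:use [clojure.contrib.zip-filter.xml]))

(defn parse-feed
  "Parse an RSS feed (URI, file or stream) into a zipper."
  [uri]
  (zip/xml-zip (xml/parse-trim uri)))

(defn items
  "All the <item> elements in a parsed feed."
  [feed]
  (xml-> feed :channel :item))

(defn publish-date
  "The pubDate text of a single item."
  [item]
  (first (xml-> item :pubDate text)))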

Note that the code above already handles URLs (if the argument supplied to parse-trim is a URI, it is resolved, retrieved and parsed). Finally, all we need to do is put it together:


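Roughly along these lines (merge-feeds and friends are my own names, and the date pattern may need adjusting for feeds that spell the time zone out as text):

(def rfc822
  "RSS pubDates are RFC 822 dates, e.g. \"Sun, 12 Jul 2009 09:00:00 +0100\"."
  (org.joda.time.format.DateTimeFormat/forPattern "EEE, dd MMM yyyy HH:mm:ss Z"))

(defn item-date
  "Parse an item's pubDate into a Joda Time DateTime."
  [item]
  (.parseDateTime rfc822 (publish-date item)))

(defn earlier-item
  "Of two items, return the one with the earlier publish date."
  [x y]
  (if (.isBefore (item-date x) (item-date y)) x y))

(defn merge-feeds
  "Merge the items of any number of feed URIs into one sequence ordered
   by publish date.  Assumes each feed is already ordered oldest-first;
   flip the comparison in earlier-item for newest-first output."
  [& uris]
  (apply join-all earlier-item (map (comp items parse-feed) uris)))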

Comparing XML dates is very tiresome in Java because its supplied date/time libraries are painful. Thankfully, Joda Time provides a solution (see here for a description of the best way to parse a date time).

Tuesday, 7 July 2009

Searching Graphs

Continuing from the previous post, Finding the Target, it's now time to focus on the basic graph searching algorithms introduced in Chapter 6 of PAIP.

The major difference between searching a graph and searching a tree is that a graph may contain cycles. If you're not careful you'll end up in an infinite loop, constantly going round the same cycle. graph-search handles this by keeping track of old states and taking an equality function that allows you to customize when two states are considered equal (for example, in a board game, states might be equal if they are rotations of each other). new-states generates successors that haven't been seen before.

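A sketch of the idea, following PAIP fairly closely:

(defn new-states
  "Successors of the first state that appear in neither the current
   fringe nor the old states, judged by the same? predicate."
  [states successors same? old-states]
  (remove (fn [s]
            (or (some (partial same? s) states)
                (some (partial same? s) old-states)))
          (successors (first states))))

(defn graph-search
  "Like tree-search, but avoids revisiting states it has already seen.
   same? decides when two states count as equal (it defaults to =)."
  ([states goal? successors combiner]
     (graph-search states goal? successors combiner = []))
  ([states goal? successors combiner same? old-states]
     (cond
       (empty? states) nil
       (goal? (first states)) (first states)
       :else (recur (combiner (new-states states successors same? old-states)
                              (rest states))
                    goal? successors combiner same?
                    (conj old-states (first states))))))
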
This code compares quite well with the ANSI Common Lisp code in PAIP: not needing funcall simplifies things a little, and fn is less verbose than lambda. The implementation of A* search below doesn't fare so well!

The A* search algorithm is a best-first graph search guided by the heuristics you supply: the path with the lowest estimated total cost (cost so far plus the heuristic estimate of the cost remaining) is explored first. The code below gives an implementation heavily based on the one in PAIP. The code feels butt-ugly and I think this is due to the mutable state that I've ported over from the original. There's almost certainly a nicer functional implementation in there somewhere!


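For the gist of the algorithm, here's a small functional sketch (not the mutable port itself): successors is assumed to return [next-state step-cost] pairs, and the fringe is simply re-sorted each time round by cost-so-far plus the heuristic estimate.

(defn a*-search
  "Find the cheapest route from start to a state satisfying goal?.
   successors returns [next-state step-cost] pairs and heuristic
   estimates the remaining cost from a state to the goal."
  [start goal? successors heuristic]
  (loop [fringe [{:state start :route [start] :cost 0}]
         seen   #{}]
    (when (seq fringe)
      (let [sorted (sort-by #(+ (:cost %) (heuristic (:state %))) fringe)
            {:keys [state route cost]} (first sorted)
            others (rest sorted)]
        (cond
          (goal? state) route
          (seen state)  (recur others seen)
          :else (recur (concat others
                               (for [[s step] (successors state)
                                     :when (not (seen s))]
                                 {:state s
                                  :route (conj route s)
                                  :cost  (+ cost step)}))
                       (conj seen state)))))))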

So how can we see what the A* search is doing? I wrote a quick maze application to visualize it, shown in the image below. The app uses the A* algorithm to find a route from the top-left to the bottom-right, avoiding the black walls. It's interactive: left-clicking builds a wall, left-clicking again destroys it, and right-clicking solves the maze if possible, showing the shortest solution in pink.

A* Search Algorithm in Clojure

The complete code can be seen in my Git repository. There's a really nice Javascript visualization out there if you don't fancy building the Clojure code.

Friday, 3 July 2009

Finding the Target

I've been steadily working my way through Paradigms of Artificial Intelligence Programming, which I am thoroughly enjoying. This time round it's simple search algorithms. Norvig introduces all the traditional tree search algorithms based on a single function, shown below:


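In Clojure it's more or less a one-to-one port (this is my sketch of it):

(defn tree-search
  "Find a state that satisfies goal?, starting from the sequence of
   states and generating new ones with successors.  combiner decides
   how new states are merged into the fringe, and therefore defines
   the search strategy."
  [states goal? successors combiner]
  (cond
    (empty? states) nil
    (goal? (first states)) (first states)
    :else (recur (combiner (successors (first states)) (rest states))
                 goal? successors combiner)))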

From this basic algorithm we can express two simple tree search algorithms, depth first and breadth first.


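Both are just different combiner functions passed to tree-search; roughly:

(defn depth-first-search
  "Explore the most recently generated states first."
  [start goal? successors]
  (tree-search [start] goal? successors concat))

(defn breadth-first-search
  "Explore the oldest states on the fringe first."
  [start goal? successors]
  (tree-search [start] goal? successors (fn [new old] (concat old new))))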

For depth-first search, we simply follow one branch of the tree until it reaches the goal or hits a dead end. The combiner function puts the newly generated states at the front of the list, so we always follow a path as far as it goes. Depth-first search can have problems when the structure being searched is infinite, as it will blindly follow one path forever. Breadth-first search works by exploring all nodes at one level before moving on to the next. It doesn't suffer from the same infinite-path problem as depth-first search, but it's potentially slower because it covers each and every node, level by level. Both approaches share the same fundamental disadvantage: they have no knowledge of the state space and just methodically follow the same pattern regardless of the problem.

Best-first search chooses which successor nodes to explore first based on the result of a cost function.


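Something like this, with a sorter helper that keeps the fringe ordered by cost:

(defn sorter
  "A combiner that keeps the whole fringe sorted by cost-fn, cheapest first."
  [cost-fn]
  (fn [new old]
    (sort-by cost-fn (concat new old))))

(defn best-first-search
  "Explore the cheapest-looking state (according to cost-fn) first."
  [start goal? successors cost-fn]
  (tree-search [start] goal? successors (sorter cost-fn)))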

This will still visit all the nodes, but the choice of what to search next is driven by the cost function. If the cost function is right, best-first search can be significantly better than depth- or breadth-first searching. Beam search is an optimization of best-first search that forgoes the requirement of searching every node: instead only a fixed number of alternative states (the beam width) are considered.


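Roughly, it's best-first search with the fringe truncated to the beam width:

(defn beam-search
  "Best-first search that only ever keeps beam-width states on the fringe."
  [start goal? successors cost-fn beam-width]
  (tree-search [start] goal? successors
               (fn [new old]
                 (take beam-width ((sorter cost-fn) new old)))))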

The advantages are lower memory requirements, and that the nodes within the beam can be searched more deeply. However, if the beam width is too small, the search can fall into the trap of endlessly following the wrong path.

To illustrate these search algorithms, we'll use the ridiculous example of the game of darts. Briefly, darts is played by drinking vast quantities of alcohol and then trying to throw darts at a board. You score points between 1 and 20 (either normal, double or triple), and there's also an outer bull (25) and an inner bull (50). You start from 501 and try to get to zero as quickly as possible, with the caveat that the last dart you throw should be a double or a bulls-eye.

The code below models darts as a series of throws:


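A sketch of one way to model it (my own reconstruction, kept consistent with the :throws vectors in the REPL sessions below): a state records the running total after each dart.

(def dart-scores
  "Every score a single dart can make: singles, doubles, trebles and the bulls."
  (distinct (concat (range 1 21)
                    (map #(* 2 %) (range 1 21))
                    (map #(* 3 %) (range 1 21))
                    [25 50])))

(def finishes
  "Scores that may end the leg: any double, or the bull.  Scores are just
   numbers here, so e.g. 20 is treated as D10."
  (set (cons 50 (map #(* 2 %) (range 1 21)))))

(defn start-game
  "The state of a leg before any darts have been thrown."
  [total]
  {:remaining total :throws []})

(defn throw-dart
  "The state after one more dart scoring score."
  [{:keys [remaining throws]} score]
  (let [left (- remaining score)]
    {:remaining left :throws (conj throws left)}))

(defn next-states
  "Every legal state one dart away: you may not go below zero, and you
   may only hit zero with a double or the bull."
  [state]
  (for [score dart-scores
        :let [left (- (:remaining state) score)]
        :when (or (pos? left)
                  (and (zero? left) (finishes score)))]
    (throw-dart state score)))

(defn finished?
  "Has the leg been won?"
  [state]
  (zero? (:remaining state)))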

Now we can use this with the search algorithms described above.

From the descriptions, you'd expect breadth and depth first search to find the right solution, but be a bit rubbish about doing so. Because they have no idea of the means to get to the goal, they'll just chuck darts about randomly and eventually weave their way to a solution. And indeed they do.


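Wiring the model up to the searches is then a couple of one-liners (my reconstruction again, but the names match the REPL sessions):

(defn solve-darts-depth-first [total]
  (depth-first-search (start-game total) finished? next-states))

(defn solve-darts-breadth-first [total]
  (breadth-first-search (start-game total) finished? next-states))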

Solving the game of darts using depth-first search from 501 is not ideal! The first branch it happens to explore is the one containing 1, so the search repeatedly throws 1s until that's no longer a legal move. Not good!

uk.co.fatvat.search> (count (:throws (solve-darts-depth-first 501)))     
500


Let's see if beam search can do any better:


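Something along these lines, though the beam width of 3 is a guess and the original cost function may differ in detail:

(defn dart-cost
  "How close we are to zero, divided by the number of darts thrown so far."
  [state]
  (/ (:remaining state) (max 1 (count (:throws state)))))

(defn solve-darts-beam-search [total]
  (beam-search (start-game total) finished? next-states dart-cost 3))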

The cost function is how close we are to zero, divided by the number of darts thrown so far; the goal is to get to the finish in as few throws as possible.

uk.co.fatvat.search> (:throws (solve-darts-beam-search 501))
[441 381 321 261 201 141 84 30 0]    


We can see that this does achieve the perfect nine-dart finish (T20, T20, T20, T20, T20, T20, T19, T18, D15). Finishing from 1000001 will apparently take 16667 darts!

The approach described in the book is very clear. The key realization for me was that you don't have to have the entire state space in memory at once for a tree algorithm to succeed; you just need to think in terms of the next step to take.