Fatvat

Word Ladders were invented by Lewis Carroll as a form of word play. The idea is that you start with a word, change one letter at a time, and end up with a new word. All words in the chain must be valid words in the dictionary (otherwise it'd be a bit rubbish). For example, you can turn beer into wine with the following

  BEER
  BEAR
  BEAD
  BEAK
  BEAT
  BELT
  WELT
  WILT
  WILE
  WINE

Let's write a simple program that calculates word ladders for a simplified version of the game where each change can only change a single letter at a time (rather than allow deletes and inserts too). Finding a word ladder can be thought about as a graph problem. Each node is a word, and each edge represents a connection to another valid word. The graph below is a partial representation of the graph starting with BEER. The real graph is much more complicated!

First things first, how do we build the graph? For starters we need to know if two words are neighbours or not. The following simple functions determine the difference between two strings. Remember that I'm only working on the simplest possible distance metric at the moment, that is allowing a single character to change.

neighbour :: String -> String -> Bool
  neighbour x y = difference x y == 1
                                     
  difference :: String -> String -> Int                     
  difference [] [] = 0
  difference (x:xs) (y:ys) | x == y = difference xs ys
                           | otherwise = 1 + difference xs ys
  difference _ _ = error "Two strings must be the same length"

Next thing we need to grab is a huge list of words. We'll store these words in a Set because we'll want to test for membership frequently (an O(lg(N)) operation. An alternative would be to use a perfect hash (this is an option because the dictionary is fixed). This would give O(1) lookup times, and it looks like there is (as almost always) a library on Hackage that does just that. The simple distance metric chosen means we can limit the number of words based on the size of the input word.

import qualified Data.Set as S
  type WordSet = S.Set String

  wordListPath :: String
  wordListPath = "/usr/share/dict/british-english"

  createDictionary :: Int -> IO WordSet    
  createDictionary n = do
    file <- readFile wordListPath 
    return $ S.fromList $ filter (\x -> length x == n && all isAlpha x) (map (map toLower) $ words file)

Once we've got a dictionary, all we need to do is build the graph. Since Haskell is lazy, we don't need to worry about the space complexity of the graph - we'll just build it lazily and only the bit that is explored will be resident in memory.

Each node contains the word it represents and the links to the child elements. The graph is built by starting at a root, and filling all the valid neighbours. Each time we place a word in the graph we remove it from the dictionary, otherwise we'll get cycles in the graph.

data Node = Node String [Node] deriving Show

  buildGraph :: WordSet -> String -> Node 
  buildGraph wordset top = Node top (map (buildGraph smaller) neighbours)
    where
      neighbours = S.toList (S.filter (neighbour top) smaller)
      smaller = S.delete top wordset

The graph is *huge*, so we need to find some way to limit the search space. The most obvious way is to give a restriction on the depth of the search. A word ladder that is 10000 words rungs high is probably not much fun to complete. We can also cut the search short if the word is too many changes away given the maximum depth (for example, if 4 characters need to change in a 5 letter word and the maximum left to search is 3 then we can prune this search branch).

search :: Node -> Int -> String -> [String]
  search graph maxDepth goal = search' graph maxDepth goal []

  search' :: Node -> Int -> String -> [String] -> [String]
  search' (Node end children) maxDepth goal path 
    | end == goal    = end : path 
    | null children  = [] 
    | length path >= maxDepth = [] -- too deep
    | difference end goal >= maxDepth - length path = [] -- too much difference
    | otherwise = first
      where
        childRoutes = filter (not . null) $ 
                      map (\child -> search' child maxDepth goal (end : path)) children
        first | null childRoutes = []
              | otherwise        = head childRoutes          
        quickest | null childRoutes = []
                 | otherwise = minimumBy (comparing length) childRoutes

The way we search the children is important. In this case we've used first gone for the first available route that satisfies the depth guarantee, but isn't guaranteed to be the shortest route. quickest on the other hand calculates all child routes and finds the minimum length part.

Finally, we can put this all together and write a simple search search function.

makeLadder :: Int-> String -> String -> IO [String]
  makeLadder maxDepth start end 
    | length start /= length end = error "Only two strings of equal length are currently supported."
    | otherwise = do    
        dict <- createDictionary (length start)
        if (S.member start dict && S.member end dict)
          then return $ search (buildGraph dict start) maxDepth end
          else return []

The complete code for this version available on my git hub repo here. This version has several problems.

It's too slow - searching for the minimal path can take considerable time
It's not very flexible

(this part completely edited from the original post, thanks to insightful comment about something *very* daft I was doing) Let's fix that. In order to be flexible, we need to support various distance metrics. For example, it'd be nice to allow insertions and deletions as well as character transposition. We just need a generic function that calculates the distance between any given strings. In order to improve performance, instead of searching the entire dictionary to see whether any are neighbours, we can generate the neighbours and find out which ones are in the dictionary. That means we need both a distance metric and a way of calculating the edits.

  data DistanceMetric = DistanceMetric (Word -> Word -> Int) (Word -> WordSet)

  difference :: Word -> Word -> Int                     
  difference x y 
    | length x /= length y = 999999
    | otherwise = sum $ zipWith (\c1 c2 -> if c1 == c2 then 0 else 1) x y
                
  transposeChar :: Word -> [Word] 
  transposeChar [] = []
  transposeChar (x:xs) = map (:xs) (validChars \\ [x])
                
  deleteChar :: Word -> [Word]                       
  deleteChar [] = []
  deleteChar (x:xs) = [xs]

  insertChar :: Word -> [Word]
  insertChar [] = []
  insertChar (x:xs) = map (\y -> y:x:xs) validChars

  differenceEdit :: Word -> WordSet
  differenceEdit x = edit' x [transposeChar]
    
  editDistanceEdits :: Word -> WordSet
  editDistanceEdits x = edit' x [insertChar,transposeChar,deleteChar]

  edit' :: Word -> [Word -> [Word]] -> WordSet
  edit' w fns = S.fromList $ concat $ 
                zipWith (\x y -> map (\z -> x ++ z) (concatMap (\x -> x y) fns)) 
                (inits w) (tails w)

  simple :: DistanceMetric
  simple = DistanceMetric difference differenceEdit

  edits :: DistanceMetric
  edits = DistanceMetric editDistance editDistanceEdits

This gives two distance functions and two ways of generating edits. The Levenshtein distance of 1 is generated by transposing, deleting and inserting characters from the original word. This gives us the flexibility, because another distance metric could be put in place (anagrams perhaps?). Next to performance.

  buildGraph :: DistanceMetric -> WordSet -> Word -> Node 
  buildGraph d@(DistanceMetric dist edits) wordset top = Node top (map (buildGraph d smaller) neighbours)
    where
      possibleNeighbours = edits top
      neighbours = S.toList (smaller `S.intersection` possibleNeighbours)
      smaller = S.delete top wordset 
    
  search :: DistanceMetric -> Node -> Int -> Word -> [Word]
  search (DistanceMetric dist _) graph maxDepth goal = search' graph []
    where 
      search' (Node end children) path 
        | end == goal    = end : path 
        | length path >= maxDepth = [] -- too deep
        | dist end goal >= maxDepth - length path = [] -- too much difference
        | otherwise = first
          where
            -- Find the best node to search by comparing it against the goal
            costForNextChild :: [(Int,Node)]
            costForNextChild = zip (map (\(Node x _) -> dist x goal) children) children
            bestFirst = map snd $ sortBy (comparing fst) costForNextChild
        
            -- Best first search
            childRoutes = filter (not . null) $ map (\child -> search' child (end : path)) bestFirst
      
            first | null childRoutes = []
                  | otherwise        = head childRoutes

Two things have changed from the original code. The first is that the graph is built by comparing the edits against the dictionary, rather than the word against the whole dictionary. This is the main saving and makes it *hugely* faster (thanks to jkkramer for the pointer and this post.) The only other change is that we decide which node to search next based on how close it is to the goal (a best-first search). With these changes it can now solve all of the problems I've tried at wordchains.com. Neat. The complete code is available here.

Google Refine is a toy for playing with data, originally developed by the team behind Freebase. It's a tool for taking messy data sources and getting some structure in with them. This is exactly the same sort of thing that I need for my super-secret-world-changing-awesome-side-project. Ok, it's not that super yet, not released and will probably not change the world, but it is fun and it does require adding some structure to the messy data sets that can easily be found on the web.

Refine can import data from a variety of sources from either the web or a data file on disk. To have a quick play with this, I grabbed the list of UK Prime Ministers from here and spent a couple of minutes in Emacs to reformat it as a CSV file. Once you've imported that data into Refine, you can automatically reconcile this data with structured information available online (such as Freebase). This gives you the best matches for each name. The screen below shows the results after reconciling with Freebase.

I'm not sure if it's entirely clear from the image but about 40% of the prime ministers have automatically been correlated with the appropriate entry from Freebase! A few more clicks and the entire data set can be reconciled against an on-line source and then exported in a variety of formats including HTML, CSV and JSON. This is exactly the kind of data matching I'm after, as once you've got a Freebase ID you can look up all the extra information very easily. So easy, it almost feels like cheating!

Fatvat

Tuesday, 15 March 2011

TODO Write more Posts

Saturday, 11 December 2010

Word Ladders

Friday, 12 November 2010

Google Refine