Wow, it's been three months since I did anything to my blog. That can't be good so time for a post to remind myself to do something...
Over the next few months, I think I'll try and post more things about algorithms in general. I've been spending my time trying to be a bit better at the sort of algorithm problems that TopCoder and Spoj present. This post is designed to act as a reminder for me to actually do so!
So first up, I'll write something about Kd-Tree's , and hopefully now I've written that I will that'll be motivation enough for something to appear over the next few weeks.
Tuesday, 15 March 2011
Saturday, 11 December 2010
Word Ladders
Word Ladders were invented by Lewis Carroll as a form of word play. The idea is that you start with a word, change one letter at a time, and end up with a new word. All words in the chain must be valid words in the dictionary (otherwise it'd be a bit rubbish). For example, you can turn beer into wine with the following
Let's write a simple program that calculates word ladders for a simplified version of the game where each change can only change a single letter at a time (rather than allow deletes and inserts too). Finding a word ladder can be thought about as a graph problem. Each node is a word, and each edge represents a connection to another valid word. The graph below is a partial representation of the graph starting with BEER. The real graph is much more complicated!
First things first, how do we build the graph? For starters we need to know if two words are neighbours or not. The following simple functions determine the difference between two strings. Remember that I'm only working on the simplest possible distance metric at the moment, that is allowing a single character to change.
Next thing we need to grab is a huge list of words. We'll store these words in a Set because we'll want to test for membership frequently (an O(lg(N)) operation. An alternative would be to use a perfect hash (this is an option because the dictionary is fixed). This would give O(1) lookup times, and it looks like there is (as almost always) a library on Hackage that does just that. The simple distance metric chosen means we can limit the number of words based on the size of the input word.
Once we've got a dictionary, all we need to do is build the graph. Since Haskell is lazy, we don't need to worry about the space complexity of the graph - we'll just build it lazily and only the bit that is explored will be resident in memory.
Each node contains the word it represents and the links to the child elements. The graph is built by starting at a root, and filling all the valid neighbours. Each time we place a word in the graph we remove it from the dictionary, otherwise we'll get cycles in the graph.
The graph is *huge*, so we need to find some way to limit the search space. The most obvious way is to give a restriction on the depth of the search. A word ladder that is 10000 words rungs high is probably not much fun to complete. We can also cut the search short if the word is too many changes away given the maximum depth (for example, if 4 characters need to change in a 5 letter word and the maximum left to search is 3 then we can prune this search branch).
The way we search the children is important. In this case we've used
Finally, we can put this all together and write a simple search search function.
BEER BEAR BEAD BEAK BEAT BELT WELT WILT WILE WINE
Let's write a simple program that calculates word ladders for a simplified version of the game where each change can only change a single letter at a time (rather than allow deletes and inserts too). Finding a word ladder can be thought about as a graph problem. Each node is a word, and each edge represents a connection to another valid word. The graph below is a partial representation of the graph starting with BEER. The real graph is much more complicated!
First things first, how do we build the graph? For starters we need to know if two words are neighbours or not. The following simple functions determine the difference between two strings. Remember that I'm only working on the simplest possible distance metric at the moment, that is allowing a single character to change.
neighbour :: String -> String -> Bool neighbour x y = difference x y == 1 difference :: String -> String -> Int difference [] [] = 0 difference (x:xs) (y:ys) | x == y = difference xs ys | otherwise = 1 + difference xs ys difference _ _ = error "Two strings must be the same length"
Next thing we need to grab is a huge list of words. We'll store these words in a Set because we'll want to test for membership frequently (an O(lg(N)) operation. An alternative would be to use a perfect hash (this is an option because the dictionary is fixed). This would give O(1) lookup times, and it looks like there is (as almost always) a library on Hackage that does just that. The simple distance metric chosen means we can limit the number of words based on the size of the input word.
import qualified Data.Set as S type WordSet = S.Set String wordListPath :: String wordListPath = "/usr/share/dict/british-english" createDictionary :: Int -> IO WordSet createDictionary n = do file <- readFile wordListPath return $ S.fromList $ filter (\x -> length x == n && all isAlpha x) (map (map toLower) $ words file)
Once we've got a dictionary, all we need to do is build the graph. Since Haskell is lazy, we don't need to worry about the space complexity of the graph - we'll just build it lazily and only the bit that is explored will be resident in memory.
Each node contains the word it represents and the links to the child elements. The graph is built by starting at a root, and filling all the valid neighbours. Each time we place a word in the graph we remove it from the dictionary, otherwise we'll get cycles in the graph.
data Node = Node String [Node] deriving Show buildGraph :: WordSet -> String -> Node buildGraph wordset top = Node top (map (buildGraph smaller) neighbours) where neighbours = S.toList (S.filter (neighbour top) smaller) smaller = S.delete top wordset
The graph is *huge*, so we need to find some way to limit the search space. The most obvious way is to give a restriction on the depth of the search. A word ladder that is 10000 words rungs high is probably not much fun to complete. We can also cut the search short if the word is too many changes away given the maximum depth (for example, if 4 characters need to change in a 5 letter word and the maximum left to search is 3 then we can prune this search branch).
search :: Node -> Int -> String -> [String] search graph maxDepth goal = search' graph maxDepth goal [] search' :: Node -> Int -> String -> [String] -> [String] search' (Node end children) maxDepth goal path | end == goal = end : path | null children = [] | length path >= maxDepth = [] -- too deep | difference end goal >= maxDepth - length path = [] -- too much difference | otherwise = first where childRoutes = filter (not . null) $ map (\child -> search' child maxDepth goal (end : path)) children first | null childRoutes = [] | otherwise = head childRoutes quickest | null childRoutes = [] | otherwise = minimumBy (comparing length) childRoutes
The way we search the children is important. In this case we've used
first
gone for the first available route that satisfies the depth guarantee, but isn't guaranteed to be the shortest route. quickest
on the other hand calculates all child routes and finds the minimum length part.Finally, we can put this all together and write a simple search search function.
makeLadder :: Int-> String -> String -> IO [String] makeLadder maxDepth start end | length start /= length end = error "Only two strings of equal length are currently supported." | otherwise = do dict <- createDictionary (length start) if (S.member start dict && S.member end dict) then return $ search (buildGraph dict start) maxDepth end else return []The complete code for this version available on my git hub repo here. This version has several problems.
- It's too slow - searching for the minimal path can take considerable time
- It's not very flexible
data DistanceMetric = DistanceMetric (Word -> Word -> Int) (Word -> WordSet) difference :: Word -> Word -> Int difference x y | length x /= length y = 999999 | otherwise = sum $ zipWith (\c1 c2 -> if c1 == c2 then 0 else 1) x y transposeChar :: Word -> [Word] transposeChar [] = [] transposeChar (x:xs) = map (:xs) (validChars \\ [x]) deleteChar :: Word -> [Word] deleteChar [] = [] deleteChar (x:xs) = [xs] insertChar :: Word -> [Word] insertChar [] = [] insertChar (x:xs) = map (\y -> y:x:xs) validChars differenceEdit :: Word -> WordSet differenceEdit x = edit' x [transposeChar] editDistanceEdits :: Word -> WordSet editDistanceEdits x = edit' x [insertChar,transposeChar,deleteChar] edit' :: Word -> [Word -> [Word]] -> WordSet edit' w fns = S.fromList $ concat $ zipWith (\x y -> map (\z -> x ++ z) (concatMap (\x -> x y) fns)) (inits w) (tails w) simple :: DistanceMetric simple = DistanceMetric difference differenceEdit edits :: DistanceMetric edits = DistanceMetric editDistance editDistanceEditsThis gives two distance functions and two ways of generating edits. The Levenshtein distance of 1 is generated by transposing, deleting and inserting characters from the original word. This gives us the flexibility, because another distance metric could be put in place (anagrams perhaps?). Next to performance.
buildGraph :: DistanceMetric -> WordSet -> Word -> Node buildGraph d@(DistanceMetric dist edits) wordset top = Node top (map (buildGraph d smaller) neighbours) where possibleNeighbours = edits top neighbours = S.toList (smaller `S.intersection` possibleNeighbours) smaller = S.delete top wordset search :: DistanceMetric -> Node -> Int -> Word -> [Word] search (DistanceMetric dist _) graph maxDepth goal = search' graph [] where search' (Node end children) path | end == goal = end : path | length path >= maxDepth = [] -- too deep | dist end goal >= maxDepth - length path = [] -- too much difference | otherwise = first where -- Find the best node to search by comparing it against the goal costForNextChild :: [(Int,Node)] costForNextChild = zip (map (\(Node x _) -> dist x goal) children) children bestFirst = map snd $ sortBy (comparing fst) costForNextChild -- Best first search childRoutes = filter (not . null) $ map (\child -> search' child (end : path)) bestFirst first | null childRoutes = [] | otherwise = head childRoutesTwo things have changed from the original code. The first is that the graph is built by comparing the edits against the dictionary, rather than the word against the whole dictionary. This is the main saving and makes it *hugely* faster (thanks to jkkramer for the pointer and this post.) The only other change is that we decide which node to search next based on how close it is to the goal (a best-first search). With these changes it can now solve all of the problems I've tried at wordchains.com. Neat. The complete code is available here.
Friday, 12 November 2010
Google Refine
Google Refine is a toy for playing with data, originally developed by the team behind Freebase. It's a tool for taking messy data sources and getting some structure in with them. This is exactly the same sort of thing that I need for my super-secret-world-changing-awesome-side-project. Ok, it's not that super yet, not released and will probably not change the world, but it is fun and it does require adding some structure to the messy data sets that can easily be found on the web.
Refine can import data from a variety of sources from either the web or a data file on disk. To have a quick play with this, I grabbed the list of UK Prime Ministers from here and spent a couple of minutes in Emacs to reformat it as a CSV file. Once you've imported that data into Refine, you can automatically reconcile this data with structured information available online (such as Freebase). This gives you the best matches for each name. The screen below shows the results after reconciling with Freebase.
I'm not sure if it's entirely clear from the image but about 40% of the prime ministers have automatically been correlated with the appropriate entry from Freebase! A few more clicks and the entire data set can be reconciled against an on-line source and then exported in a variety of formats including HTML, CSV and JSON. This is exactly the kind of data matching I'm after, as once you've got a Freebase ID you can look up all the extra information very easily. So easy, it almost feels like cheating!
Refine can import data from a variety of sources from either the web or a data file on disk. To have a quick play with this, I grabbed the list of UK Prime Ministers from here and spent a couple of minutes in Emacs to reformat it as a CSV file. Once you've imported that data into Refine, you can automatically reconcile this data with structured information available online (such as Freebase). This gives you the best matches for each name. The screen below shows the results after reconciling with Freebase.
I'm not sure if it's entirely clear from the image but about 40% of the prime ministers have automatically been correlated with the appropriate entry from Freebase! A few more clicks and the entire data set can be reconciled against an on-line source and then exported in a variety of formats including HTML, CSV and JSON. This is exactly the kind of data matching I'm after, as once you've got a Freebase ID you can look up all the extra information very easily. So easy, it almost feels like cheating!
Subscribe to:
Posts (Atom)