Wednesday, 9 September 2009

Generating Text in Haskell

Previously I implemented an example from ANSI Common Lisp that generated a stream of text based on the word frequencies of a sample document. This produces surprisingly realistic looking text.

The data structures that represent the frequency map are typed as below. type defines a type synonym and are purely used to improve readability (e.g. like typedef in C/C++)

type Followers = Map String Int
type WordSuccessors = Map String Followers

To generate this data structure, we generate all pairs of words, and then add them to an instance of the WordSuccessors structure until complete.

createTrainingSet :: String -> WordSuccessors
createTrainingSet s = foldl' updateMap Map.empty (wordPairs (words s))

updateMap :: WordSuccessors -> (String,String) -> WordSuccessors
updateMap m (x,y) = Map.insert x v m where
q = Map.findWithDefault (Map.singleton y 0) x m
v = Map.insert y (succ (Map.findWithDefault 0 y q)) q

wordPairs :: [String] -> [(String,String)]
wordPairs l = zip l (tail l)

One tip I picked up from Real World Haskell was using undefined. You can use this to sketch out type signatures of functions, prior to putting the body in. For example, updateMap was a pain to get right, but with undefined I can fill in bits of the implementation by replacing undefined with a real implementation. Each time I can use the compiler to validate I haven't gone too far wrong.

updateMap :: WordSuccessors -> (String,String) -> WordSuccessors
updateMap m (x,y) = undefined

-- Refine
updateMap m (x,y) = Map.insert x v m where
q = undefined
v = undefined

-- Refine
updateMap m (x,y) = Map.insert x v m where
q = Map.findWithDefault (Map.singleton y 0) x m
v = undefined

-- Refine
updateMap m (x,y) = Map.insert x v m where
q = Map.findWithDefault (Map.singleton y 0) x m
v = Map.insert y (succ (Map.findWithDefault 0 y q)) q

Types help bound your program to make sure it does what you think. Developing in this way draws parallels with test-driven development. Instead of a suite of tests (which I have to write properly!), I can use the type system to verify the program still does what I think it should do! I realize that types don't pick up everything (maybe dependent types will minimize this in the future), but then neither do unit tests!

Next step is to generate random text based on this word successors information. Random functions aren't pure (obviously) and I went a long way down various wrong roads (mostly involving changing the return type of nextWord to IO(String), propagating everything to type IO and then realizing I'm going down the wrong road). On IRC (#Haskell) someone suggested an idiom of passing an infinite list of random numbers to a function - this keeps it pure (the impurity is handled in generating the list) at the cost of an additional argument.

The algorithm is comically naive, but it's just about fast enough. I couldn't find a better way of selecting an element based on a frequency distribution stored in a map though I'm sure there is one!

nextWord :: [Int] -> WordSuccessors -> String -> ([Int],String)
nextWord seeds fm start = (r, (poss !! (mod s count))) where
successors = fm Map.! start
count = Map.fold (+) 0 successors
poss = Map.foldWithKey (\k v acc -> (replicate v k) ++ acc) [] successors
s = head seeds
r = drop 1 seeds

The function returns the next word given the previous, and also shifts the random number sequence along again.

Finally, we need something to drive this. main returns a type IO(String) because it has to perform two bits of IO (reading the sample file and generating the random number).

maxWordCount :: Int
maxWordCount = 1000000

main :: IO(String)
main = do
text <- readFile exampleFile
gen <- newStdGen
let training = createTrainingSet text
seeds = randomRs (0,maxWordCount) gen
return (unwords (map snd (iterate (\(s,w) -> nextWord s training w) (seeds,"by"))))

So finally, we can generate an infinite sequence of random-ish text. Here's an example, based on "Manual of Surgery". Apparently it was the most popular book downloaded. Weird and slightly disturbing.

The patient from a long bones, and, by the blood threatens to fat. In the lower or near the blood vessel, and a pustular eruption are usually be inserted, and irregularity in Pus from three weeks or to employ considerable size by a soft, rapidly developed over looked. If, however, regain the compression causes suffering from one collection of small saphena system run an ideal manner of the risk of pain; the patient's strength of the best hope of wounds are believed to the toxic symptoms which the suppuration are reduced by ankylosis. In surgical diseases, from repeated on the limb becomes discoloured and is fluid and upwards, and loss of any prospect of age, and is caseation and the existing sinuses with the victims of pes equino-varus. Pressure of calcium salts, and then forms a severe pain, with the ulcer may attain a wound infection leads to enter the end of pus forms, but may be diagnosed before it is then foul and the scalp or destruction of Hands 529 Hectic Fever 62 12. Chart of a synovitis_ with the arsenical preparations of syphilis, must be a stab or bossed, the natural protective contraction which is sometimes of the inflow of cerebral congestion and on the blisters, and irregular, stalactite-like processes of the gum, with the penis, scrotum, excision or by freely open method of the first thirty-six or applies a globular cyst, or spindle-shaped, and sometimes resulting in the myeloma is often met with the shafts. The operation of Contusions._--The usual phenomena associated in primipar, there is seldom be employed with lymph glands, 322 Submental lymph from the dorsum.

Sounds like believable medical gibberish to me!