Friday, 9 January 2009

Generating Text

Often you need a chunk of vaguely real looking text to test some code, web layout, file handling etc. In ANSI Common Lisp (which I'm currently reading if you hadn't guessed) there's an example of how to generate random text.

The idea is simple - read in a list of words, for each pair of words keep a count of the number of occurrences of that pair. Once you've got that data, you can pick a word at random and then pick from a probability distribution what the next word should be. Apply that pattern until you've generated enough text. There's much more sophisticated work based on same ideas.

The example Lisp code is quite nasty and it's written in what feels like an iterative style. In Clojure we have access to a richer standard library (Java). For example, we can read in a list of words thus:


(defn file-as-wordlist [f]
(filter (fn [x] (> (count x) 0)) (.split (slurp f) "\n|[ ]|\r|[.]|[,]|[\"]")))


This takes a file name as an arguments, slurps the entire file into memory and splits it using String.split

Next we need to be a frequency map which has an entry for each word, together with a count of each word that follows it. We use a map of Word => (Word => Count).


(defn build-frequency-map [words]
(let [word-pairs (mapcat (fn [x y] (list [x y])) (cons (last words) words) words)]
(reduce (fn [accum v]
(let [w1 (first v) w2 (second v) val (get accum w1)]
(if (nil? val)
(assoc accum w1 {w2 1})
(let [currentVal (get val w2)]
(if (nil? currentVal)
(assoc accum w1 (conj val {w2 1}))
(assoc accum w1 (conj val {w2 (inc currentVal)})))))))
{} word-pairs)))


This is a beefy function, but I couldn't see how to simplify it. It also assumes that the last word maps to the first, which is daft but then this is only playing around! The idea is to get the two sequences, shift one to the left and then count the two in a map. For example the frequency map of Caroline Kennedy would be, well you know?, something like this:


user> (build-frequency-map ["you" "know" "it's" "great" "you" "know"])
{"know" {"it's" 1, "you" 1}, "great" {"you" 1}, "it's" {"great" 1}, "you" {"know" 2}}


So once we've generated a frequency map we need to work out what the next work should be based on the probability. "know" is followed by 1 occurrence of either "it's" or "you", so each should have a 50/50 chance of getting picked.


(defn frequency-map-count [m word]
(let [v (get m word)]
(if (nil? v)
0
(reduce (fn [x y] (+ x (second y))) 0 v))))

(defn next-word [m word]
(let [following (get m word) p (rand-int (frequency-map-count m word))]
((fn [words prob]
(let [word-count (second (first words))]
(if (>= word-count prob)
(first (first words))
(recur (rest words) (- prob word-count))))) following p)))


frequency-map-count counts the number of occurrences of a word. next-word rolls a dice and selects the next word based on the number of occurrences.

Finally, all we need is a function to apply next-word


(defn generate-text [example n & [start]]
(let [words (file-as-wordlist example) fm (build-frequency-map words)
start-word (if (nil? start) "the" start)]
(apply str (interpose " " (take n (iterate (partial next-word fm) start-word))))))


There's several nice things here from a functional programming point of view:
* Use of partial the function of a single argument (currying)
* Use of iterate/take to have an infinite stream of text (laziness)

So is it any good? Here's 250 words Dracula


user> (generate-text "/home/jfoster/Documents/345.txt" 250)


"the same condition of those which pretend and said Finis and I agree later period He evidently the geologic and women jumped down one whom each side manifestly as to set in them he had pressed upon us? Be not in through my poor Lucy noticed it up the section their swords can be more and had expected A great brain just finished the third person or sea I do for the belt with so sweet little before he woke bright smile of it afterwards when we did anent something like a night till tonight it and on in paragraph 1 October --As I describe when Arthur I have unchecked sway a low shrill call him what will have not friend John was no opinion there would indeed happiness were parted at rest Happily Van Helsing hypnotized me But you? Whilst we may deign to be at rest seemed to the facts of a cluster and mixed up the wide spread out one by the library of whips The soft voice gave them both quiet joy when we were they come to the big swan thought struck with my eyelids and pondered over to take away I took out of the free Nay sir bethink yourself It would like it so strange tongues fierce voice came close to bear you consated his hands Our dear Lucy after you shall not look into my hands our intent I will grace said nothing The opiate of white cloud in spirit Perhaps it"


Well, it's not going to win the Booker Prize, but it's almost believable! Full code available here

Other applications of this:
  • Predictive messaging
  • Generating auto-blog posts by "rewriting" other content
  • Language recognition based on work occurrence / frequency