Tuesday 29 September 2009

Graham Scan in Haskell

Most of the exercises at the end of Chapter 3 of Real World Haskell are fairly simple. Find the length of a list, verify a list is a palindrome, and so on. The last few, however, are pretty tough.

The convex hull of a set of points is the smallest convex polygon that contains every point in the set. The image below from Wikipedia is a much better visual description:

Image from Wikipedia

One approach for finding the convex hull is the Graham Scan algorithm and this is the challenge presented in the book. The algorithm can be broken down into the following steps:


  1. Find the minimum point (the one with the lowest y co-ordinate). In the event of a tie, the lowest x co-ordinate breaks it. Call this point P.
  2. Sort the remaining points by the angle that the line from P to each point makes with the x-axis.
  3. Consider each point in the sorted sequence. For each point, determine whether moving from the previous two points to it makes a left turn or a right turn. If it is a "right turn", the second-to-last point is not in the convex hull. Continue until the list is exhausted.


To solve this, we'll declare some data structures which should be self explanatory.


data Direction = Straight
               | LeftTurn
               | RightTurn
                 deriving (Show,Eq)

data Point = Point Double Double
             deriving (Show,Eq)


Next we define a function that calculates the turn made by three points and returns a Direction. We don't need to calculate the actual angle: the sign of the cross product of the two vectors formed by the points is enough to tell us which way we turned.


cross :: Point -> Point -> Point -> Double
cross (Point x1 y1) (Point x2 y2) (Point x3 y3) = (x2-x1)*(y3-y1) - (x3-x1)*(y2-y1)

dist :: Point -> Point -> Double
dist (Point x1 y1) (Point x2 y2) = sqrt ((x1-x2)^2 + (y1-y2)^2)

turn :: Point -> Point -> Point -> Direction
turn a b c = makeDirection (cross a b c) where
  makeDirection x | x == 0 = Straight
                  | x < 0  = LeftTurn
                  | x > 0  = RightTurn


In order to find the minimum Point and sort the list we need to define two comparator functions for use with sortBy and minimumBy (both members of Data.List).


compareYPoint :: Point -> Point -> Ordering
compareYPoint (Point x1 y1) (Point x2 y2)
  | y1 == y2  = compare x1 x2
  | y1 <= y2  = LT
  | otherwise = GT

compareCross :: Point -> Point -> Point -> Ordering
compareCross pvt a b = if angle == EQ then distance else angle where
  angle    = compare (cross pvt a b) 0
  distance = compare (dist pvt a) (dist pvt b)


Even though compareCross takes 3 arguments, we can still use it where a Point -> Point -> Ordering is expected by partially applying it (currying).
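
A quick GHCi check of my own shows the type after partially applying the pivot:

*Main> :type compareCross (Point 0 0)
compareCross (Point 0 0) :: Point -> Point -> Ordering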


lowestY :: [Point] -> Point
lowestY = minimumBy compareYPoint

grahamScan :: [Point] -> [Point]
grahamScan points = undefined where
  p            = nub points
  pvt          = lowestY p
  sortedPoints = pvt : sortBy (compareCross pvt) (delete pvt p)


nub is a function that removes duplicate elements from a list. This gives the skeleton of the program. Now we have to do the complicated bit. From Wikipedia:

The algorithm proceeds by considering each of the points in the sorted array in sequence. For each point, it is determined whether moving from the two previously considered points to this point is a "left turn" or a "right turn". If it is a "right turn", this means that the second-to-last point is not part of the convex hull and should be removed from consideration. This process is continued for as long as the set of the last three points is a "right turn". As soon as a "left turn" is encountered, the algorithm moves on to the next point in the sorted array


Let's take the simplest example possible. For the following points (0,0), (2,0), (2,2), (0,2), (1,1) the convex hull should be everything apart from (1,1).

Incredibly poor diagram goes here

In order to check how this is going to work, we need to define a quick helper function to make building lists of Point types less verbose.


pointsFromTupleList :: [(Double,Double)] -> [Point]
pointsFromTupleList = map (\(x,y) -> Point x y)

examplePoints :: [Point]
examplePoints = pointsFromTupleList [(0,0),(2,0),(2,2),(0,2),(1,1)]


Now we can explore the functions we're going to use and verify that they work:



-- The minimum point appears correct
*Main> lowestY examplePoints
Point 0.0 0.0

-- The points are sorted as expected
*Main> let sortedPoints = (Point 0 0):(sortBy (compareCross (Point 0 0)) (delete (Point 0 0) examplePoints))

*Main> sortedPoints
[Point 0.0 0.0,Point 0.0 2.0,Point 1.0 1.0,Point 2.0 2.0,Point 2.0 0.0]


The next task is to calculate the turns made.


*Main> getTurns sortedPoints
[LeftTurn,RightTurn,LeftTurn]


That's not much use because I don't know the points which constitute the turns. Let's tuple up the results:


getTurns :: [Point] -> [((Point,Point,Point),Direction)]
getTurns (x:y:z:ps) = ((x,y,z),turn x y z) : getTurns (y:z:ps)
getTurns _ = []

*Main> getTurns sortedPoints
[((Point 0.0 0.0,Point 0.0 2.0,Point 1.0 1.0),LeftTurn),
((Point 0.0 2.0,Point 1.0 1.0,Point 2.0 2.0),RightTurn),
((Point 1.0 1.0,Point 2.0 2.0,Point 2.0 0.0),LeftTurn)]


The algorithm states that I should drop the second-to-last point whenever there is a right turn, and keep it otherwise. Clearly we're close, because the second-to-last point of the only right turn is the one point not in the convex hull. One problem with sortedPoints is that we never return to where we started. Let's change the definition so that we wrap around: appending the pivot twice at the end gives enough padding that every consecutive triple of points is considered.


sortedPoints = pvt : (sortBy (compareCross pvt) (delete pvt p)) ++ [pvt,pvt]

*Main> getTurns sortedPoints
[((Point 0.0 0.0,Point 0.0 2.0,Point 1.0 1.0),LeftTurn),
((Point 0.0 2.0,Point 1.0 1.0,Point 2.0 2.0),RightTurn),
((Point 1.0 1.0,Point 2.0 2.0,Point 2.0 0.0),LeftTurn),
((Point 2.0 2.0,Point 2.0 0.0,Point 0.0 0.0),LeftTurn),
((Point 2.0 0.0,Point 0.0 0.0,Point 0.0 0.0),Straight)]


So now all I need to do to solve the problem is keep the middle point whenever we have a LeftTurn or Straight. I can simplify this further by returning only the middle point in the tuple.


getTurns :: [Point] -> [(Point,Direction)]
getTurns (x:y:z:ps) = (y,turn x y z) : getTurns (y:z:ps)
getTurns _ = []

grahamScan :: [Point] -> [Point]
grahamScan points = map fst (filter (\(x,d) -> d /= RightTurn) (getTurns sortedPoints)) where
  p            = nub points
  pvt          = lowestY p
  sortedPoints = pvt : (sortBy (compareCross pvt) (delete pvt p)) ++ [pvt,pvt]

-- Hurrah, it finally works (for this one example).
*Main> grahamScan examplePoints
[Point 0.0 2.0,Point 2.0 2.0,Point 2.0 0.0,Point 0.0 0.0]


One example works; that'll do me, as I've learnt enough from this exercise to make it worthwhile. I strongly suspect that extending the point list by appending the pivot twice at the end is just a hack rather than the right way to do it. There's another definition here which looks cleaner, but that apparently has some issues too.

The complete code is below. Running hlint over it pointed out the uncurry function, which converts a curried function into a function on pairs; this simplifies the definition of pointsFromTupleList.


import Data.List (delete, minimumBy, nub, sortBy)

data Direction = Straight
               | LeftTurn
               | RightTurn
                 deriving (Show,Eq)

data Point = Point Double Double
             deriving (Show,Eq)

turn :: Point -> Point -> Point -> Direction
turn a b c = makeDirection (cross a b c) where
  makeDirection x | x == 0 = Straight
                  | x < 0  = LeftTurn
                  | x > 0  = RightTurn

cross :: Point -> Point -> Point -> Double
cross (Point x1 y1) (Point x2 y2) (Point x3 y3) = (x2-x1)*(y3-y1)-(x3-x1)*(y2-y1)

dist :: Point -> Point -> Double
dist (Point x1 y1) (Point x2 y2) = sqrt ((x1-x2)^2 + (y1-y2)^2)

compareCross :: Point -> Point -> Point -> Ordering
compareCross pvt a b = if angle == EQ then distance else angle where
  angle    = compare (cross pvt a b) 0
  distance = compare (dist pvt a) (dist pvt b)

getTurns :: [Point] -> [(Point,Direction)]
getTurns (x:y:z:ps) = (y,turn x y z) : getTurns (y:z:ps)
getTurns _ = []

grahamScan :: [Point] -> [Point]
grahamScan points = map fst (filter (\(x,d) -> d /= RightTurn) (getTurns sortedPoints)) where
  p            = nub points
  pvt          = lowestY p
  sortedPoints = pvt : sortBy (compareCross pvt) (delete pvt p) ++ [pvt,pvt]

compareYPoint :: Point -> Point -> Ordering
compareYPoint (Point x1 y1) (Point x2 y2)
  | y1 == y2  = compare x1 x2
  | y1 <= y2  = LT
  | otherwise = GT

compareAngle :: Point -> Point -> Point -> Ordering
compareAngle (Point px py) p1 p2 = compare (angle p2) (angle p1) where
  angle (Point x1 y1) = (y1-py) / (x1-px)

lowestY :: [Point] -> Point
lowestY = minimumBy compareYPoint

pointsFromTupleList :: [(Double,Double)] -> [Point]
pointsFromTupleList = map (uncurry Point)

examplePoints :: [Point]
examplePoints = pointsFromTupleList [(0,0),(2,0),(2,2),(0,2),(1,1)]

Saturday 26 September 2009

Building Basic Types in Haskell

Types are the building blocks of Haskell programs. Algebraic data types (not to be confused with abstract data types) are one category of types. An algebraic data type is either an enumeration (a set of distinct values) or a discriminated union (a choice between multiple alternatives). For example:


-- An enumeration type
data PrimaryColor = Red
                  | Green
                  | Blue

*Main> :type Red
Red :: PrimaryColor

-- A discriminated union type
data MyIntList = Empty
               | List Int MyIntList
                 deriving Show

*Main> :t Empty
Empty :: MyIntList

*Main> :t List 3 Empty
List 3 Empty :: MyIntList


Note that MyIntList is a recursively defined data type because it uses itself in its definition. This kind of structural recursion is just the same as calling a function recursively: as long as we have a base case (in this case Empty) we can easily build the structure. The way an instance of a type was constructed is available at runtime, and pattern matching is a way of breaking a value down into its constituent parts. For example, we could (but should not!) write a function to calculate the length of the MyIntList type as follows:


myIntListLength :: MyIntList -> Integer
myIntListLength Empty = 0
myIntListLength (List _ rest) = 1 + myIntListLength rest


There's a good talk about the use of OCaml for financial trading systems available here. Amongst other things the strengths of the pattern matching idea are discussed. In a large system you can change a type, recompile and find any redundant, missed or impossible cases. In OCaml this seems to be on by default; with GHC you need to use the "-fwarn-incomplete-patterns" option.

Record syntax is an alternative way of specifying a type so that fields can be referred to by name.


data Person = Person {
      firstName :: String
    , lastName  :: String
    , title     :: String
    } deriving (Show)

-- Normal way of creating a Person
*Main> Person "Bill" "Bob" "Mr"
Person {firstName = "Bill", lastName = "Bob", title = "Mr"}

-- Record syntax allows you to specify parameters by name
-- and therefore order does not matter
*Main> Person { title="Mr", firstName="bill", lastName="bob" }
Person {firstName = "bill", lastName = "bob", title = "Mr"}
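
One thing worth adding (not covered above): record syntax also generates an accessor function for each field, so the fields can be read back by name:

*Main> firstName (Person "Bill" "Bob" "Mr")
"Bill"

*Main> title (Person { title="Mr", firstName="bill", lastName="bob" })
"Mr"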


The MyIntList type is very limited: it's only defined for lists of Int. We can improve this by making it a parametrized type. In the example below we define MyList for any type a.


data MyList a = Empty
              | MyList a (MyList a)
                deriving (Show)

-- MyList of numbers
*Main> :t MyList 3 (MyList 4 Empty)
MyList 3 (MyList 4 Empty) :: (Num t) => MyList t

-- MyList of characters
*Main> :t MyList 'a' (MyList 'b' Empty)
MyList 'a' (MyList 'b' Empty) :: MyList Char
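
Because MyList is parametrised over the element type, a single function now works for lists of anything. A minimal sketch (myListLength is my own name, not from the original post):

myListLength :: MyList a -> Integer
myListLength Empty           = 0
myListLength (MyList _ rest) = 1 + myListLength rest

-- *Main> myListLength (MyList 'a' (MyList 'b' Empty))
-- 2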


You can have multiple type parameters. For example, if we wanted to create a list that alternates between two types, we could write:


data MyOddList a b = OddEmpty
                   | MyOddList a (MyOddList b a)
                     deriving (Show)

*Main> :t MyOddList 3 (MyOddList 'c' (MyOddList 4 OddEmpty))
MyOddList 3 (MyOddList 'c' (MyOddList 4 OddEmpty)) :: (Num t) => MyOddList t Char

Monday 21 September 2009

Real World Haskell (2)

Chapter 2 of Real World Haskell gives a quick overview of types and functions. I think I've already picked up most of the bits in here, so this is just some notes on the things I found new / interesting.

One of Haskell's distinguishing features is its strong type system. Real World Haskell says the three interesting aspects are:


  1. Strong types - The type system guarantees that the program does not contain certain errors (treating an integer as a float). In comparison, something like C allows you to reinterpret the bytes as you see fit (weak typing)
  2. Static Types - At the point the code is built, the compiler knows the type of every expression and value. In comparison dynamically typed languages only know the type of an expression at runtime.
  3. Type Inference - The compiler can automatically infer the types of expressions. Type definitions are optional and Haskell can deduce the required types (however, it seems to be good practice to provide the type signature as a documentation aid).


Lists and tuples are the primary way of grouping elements. Items in a list must all be the same type; items in a tuple can be different types. One interesting point made was that each tuple size is a distinct type, and therefore writing a function to "get the second element from any tuple" is almost impossible. Looking at the types of fst :: (a,b) -> a and snd :: (a,b) -> b, it's easy to see why: to get the second element from any tuple we'd have to define the function for an infinite range of tuple types (a,b,c), (a,b,c,d) and so on.

When Haskell evaluates an expression it uses normal-order evaluation: expressions are evaluated only as they are needed. This is also known as lazy evaluation, because expressions that don't contribute to the result are never evaluated. It also avoids the special-case treatment that short-circuiting operators get in other languages.
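
As a small illustration of this point (my own example, not from the book), a hand-rolled conjunction short-circuits for free, because the second argument is only evaluated when it is actually needed:

myAnd :: Bool -> Bool -> Bool
myAnd False _ = False   -- the second argument is never inspected here
myAnd True  b = b

-- myAnd False undefined evaluates to False; the undefined is never forced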

Sunday 20 September 2009

Real World Haskell

I finally got around to ordering Real World Haskell. Currently the only book on Haskell I have is The Haskell School of Expression which is enjoyable, but I've never got around to installing the relevant packages and passive reading only gets you so far!

I thought I'd try to keep notes as I go through the book as that seems to be a good way to prompt myself to actually use books (working through PAIP was very rewarding, so hopefully RWH will be a similar experience!) rather than just adorn my study with expensive door stops. This isn't meant to make particular sense, rather just a brain dump as I read through it.

So on with the book. Chapter 1 covers setting up the environment. There are a few interesting ghci commands I had not seen, like :set prompt and :module + M. :set +t will always print the type of an expression. The "it" that is displayed is the name of a special variable which contains the value of the last expression evaluated. :info displays the precedence of operators; for example, / has higher precedence than +.

Prelude> :info (+)
class (Eq a, Show a) => Num a where
  (+) :: a -> a -> a
  ...
        -- Defined in GHC.Num
infixl 6 +

Prelude> :info (/)
class (Num a) => Fractional a where
  (/) :: a -> a -> a
  ...
        -- Defined in GHC.Real
infixl 7 /


An infix operator can be used as a prefix function by enclosing the operator in parentheses. Similarly, back-ticks turn a prefix function into an infix operator.
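
For example (my own GHCi session):

Prelude> (+) 1 2
3
Prelude> div 10 3
3
Prelude> 10 `div` 3
3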

Enumerations are a way of producing lists of numbers. The Haskell Cheat Sheet gives a great summary of the possibilities.

[1..3] -> [1,2,3]
[1..] -> [1,2,3,4,...] (infinite list)
[10..1] -> [] (ranges only go forward)
[0, -1 ..] -> [0, -1, -2, ...] (infinite list)
[1,3..10] -> [1,3,5,7,9] (list from 1 to 10 with diff of 2)

-- Also applies to any Enum class
['a' .. 'e'] -> "abcde"
['a', 'd' .. 'z'] -> "adgjmpsvy"


Don't use floating points for enumeration because the behaviour can be "quirky" (rounding errors and so on).

The empty string is a synonym for [] of type [Char], so 'a' : "" -> "a".

Rational numbers are defined in the Data.Ratio module, and constructed with % (e.g. 22 % 7 is nearly Pi).

The exercises showed me a function I'd not seen before: interact takes a function from String to String, applies it to everything read from stdin, and writes the result to stdout (interact :: (String -> String) -> IO ()). Also, once I grokked that "wc" was referring to the Unix command it all became a little clearer!
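
As a rough sketch of the shape of a solution (my own guess, not the book's answer), a line-counting program in the spirit of wc -l can be as small as this:

-- Reads everything from stdin and writes the number of lines to stdout
main :: IO ()
main = interact (\input -> show (length (lines input)) ++ "\n")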

Wednesday 16 September 2009

To Type Class or not to Type Class

Last time I wrote a very simple log parser, but it was very specific. The next task was to work out how to generalize it.

My first thoughts were to use type classes to make the code more generic. Type classes represent a constraint on a type variable in a parametrically polymorphic type. Parametric polymorphism simply means that a generic function can be written to handle values identically, independent of their type (the list functions are a good example).

So I started, and ended up typing in gibberish like this:


class Report r where
  printReport :: r -> String

class (Eq l, Show l) => LogProcessor l where
  processLine :: String -> Maybe l
  combineUnit :: (Report r) => l -> r -> r


I then read OOP vs type classes and Learning Haskell Notes, which made me think that type classes are the wrong approach.

All I'm actually trying to do is change the behaviour of a couple of functions (parsing a line and combining the results). I'm not really interested in any particular types, and there should be as few constraints as possible on what you can implement. The simplest way (and probably the right way) is to just pass these functions in:


processFile :: FilePath -> (String -> Maybe t) -> IO([t])
processFile path f = do
  a <- readFile path
  return (Maybe.mapMaybe f (lines a))

reportFile :: FilePath -> (String -> Maybe t) -> ([t] -> String) -> IO()
reportFile path func comb = do
  a <- processFile path func
  print (comb a)
  return ()


There aren't really any constraints here. processFile processes a file path and returns IO([t]); a combining function processes this list and generates a String for the report. Much simpler to use too.
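
For example, wiring in the dpkg parser from the previous post (shown further down this page) might look something like this (a sketch that assumes that post's parseLine and combine are in scope):

-- Hypothetical driver: parse /var/log/dpkg.log and print the combined report
main :: IO ()
main = reportFile "/var/log/dpkg.log" parseLine (show . combine)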

Parsing Logs with Haskell

I spend way too much of my time digging through gigabytes of log files looking for interesting events. I've looked at this before using Clojure, so I thought I would try to translate the basics into Haskell and then (over the course of the next few weeks) try to develop into something general purpose and halfway useful.

Building a report from a log file (or collection of log files) typically requires:

  1. A representation of the events I'm interested in
  2. A predicate for determining whether the line is an event (String -> Bool)
  3. Converting the line into an event (String -> Event)
  4. Folding the results into a single report ([Event] -> Report)


In this example, I wanted to parse my dpkg logs (/var/log/dpkg.log) to see how many times I've upgraded packages on Ubuntu. Firstly I defined a type to represent an upgrade event:


type Package = String

data Upgrade = Upgrade { packageName :: Package
                       , updateTime  :: UTCTime }

instance Show Upgrade where
  show a = show (updateTime a) ++ ":" ++ show (packageName a)


instance Show Upgrade is similar to deriving Show, but allows you to customize how the value is converted to a string.

Next step is to get a predicate and to parse the lines. I decided to combine these into one function and return Maybe Upgrade to indicate success / failure. I used the Data.Time modules to parse the date and time. The parsing is terrible but suffices for now as I'm just trying to get an idea of where I need to generalize. Note to self: read (and ideally understand!) Monadic Parsing Combinators [PDF] and the associated Haskell module.


getTime :: String -> UTCTime
getTime = fromJust . parseTime defaultTimeLocale timeFormat

getPackageName :: String -> String
getPackageName = takeWhile (not . Char.isSpace)

-- Poor man's parsing.
parseLine :: String -> Maybe Upgrade
parseLine s
  | isInfixOf " upgrade " s = Just (Upgrade
                                     (takeWhile (not . Char.isSpace) (drop 28 s))
                                     (getTime (take 20 s)))
  | otherwise               = Nothing
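
The timeFormat value isn't shown above; for dpkg.log timestamps it would presumably be something along these lines (an assumption on my part):

-- Assumed format string for dpkg.log timestamps such as "2009-09-24 07:58:52"
timeFormat :: String
timeFormat = "%Y-%m-%d %H:%M:%S"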


All I need to do now is define a combining function to use with foldl. For this I've defined a report of type Map Day [Package], which associates each day with the names of the packages updated on that day.


processFile :: FilePath -> IO([Upgrade])
processFile s = do
  a <- readFile s
  return (Maybe.mapMaybe parseLine (lines a))

type Report = Map Day [Package]

combine :: [Upgrade] -> Report
combine = foldl addToReport Map.empty

addToReport :: Report -> Upgrade -> Report
addToReport r p = Map.insert day packages r where
  day      = utctDay (updateTime p)
  initVal  = Map.findWithDefault [] day r
  packages = packageName p : initVal

reportFile :: FilePath -> IO()
reportFile f = do
  a <- processFile f
  print (combine a)
  return ()


Hurrah, so now I get output in the right format and I can see that I really shouldn't have added some of the Firefox 3.5 bleeding edge repositories to my Ubuntu upgrade paths. Upgrading Firefox (or Shiretoko) every few days is a bad thing. D'oh!

The next stage is to work out how to generalize this. What I really want to develop next is a simple pluggable framework. It seems that I need to generalize at least the following bits:

  • Parsing a line into a type T
  • Combining [T] to produce a single report

Saturday 12 September 2009

HLint

I found mention of HLint via this post. Installing it was as simple as the instructions provided. The binary ended up in $HOME/.cabal/bin which confused me for a few minutes.

A quick run over my code showed many improvements that I could make. For example, I don't make as much use of eta reduction as I could, and HLint not only tells me this but suggests a solution.


./randomText.hs:15:1: Warning: Eta reduce
Found

createTrainingSet s
= foldl' updateMap Map.empty (wordPairs (words s))

Why not

createTrainingSet = foldl' updateMap Map.empty . wordPairs . words


I also use far too many brackets in my code. That's a combination of not knowing the precedence rules and switching over from Clojure! Again, a warning is provided and the minimum brackets required shown.

HLint also finds common patterns in code and suggests rewriting them in terms of fold or map. For example when I was trying to write my own map:


./myfunctions.hs:11:1: Error: Use map
Found

mymap f [] = []
mymap f (x : xs) = f x : mymap f xs

Why not

mymap f xs = map f xs


And similarly for my own version of length:


./myfunctions.hs:3:1: Warning: Use foldl
Found

mylength [] count = count
mylength (x : xs) count = mylength xs (1 + count)

Why not

mylength xs count = foldl (\ count x -> 1 + count) count xs


I'll definitely be making use of this before I post any more Haskell code!

Wednesday 9 September 2009

Generating Text in Haskell

Previously I implemented an example from ANSI Common Lisp that generated a stream of text based on the word frequencies of a sample document. This produces surprisingly realistic looking text.

The data structures that represent the frequency map are typed as below. type defines a type synonym, which is purely used to improve readability (like typedef in C/C++).


type Followers = Map String Int
type WordSuccessors = Map String Followers


To generate this data structure, we generate all pairs of words, and then add them to an instance of the WordSuccessors structure until complete.


createTrainingSet :: String -> WordSuccessors
createTrainingSet s = foldl' updateMap Map.empty (wordPairs (words s))

updateMap :: WordSuccessors -> (String,String) -> WordSuccessors
updateMap m (x,y) = Map.insert x v m where
  q = Map.findWithDefault (Map.singleton y 0) x m
  v = Map.insert y (succ (Map.findWithDefault 0 y q)) q

wordPairs :: [String] -> [(String,String)]
wordPairs l = zip l (tail l)


One tip I picked up from Real World Haskell was using undefined. You can use it to sketch out the type signatures of functions before writing the bodies. For example, updateMap was a pain to get right, but with undefined I can fill in bits of the implementation a piece at a time, replacing each undefined with a real definition. Each time, the compiler validates that I haven't gone too far wrong.



updateMap :: WordSuccessors -> (String,String) -> WordSuccessors
updateMap m (x,y) = undefined

-- Refine
updateMap m (x,y) = Map.insert x v m where
  q = undefined
  v = undefined

-- Refine
updateMap m (x,y) = Map.insert x v m where
  q = Map.findWithDefault (Map.singleton y 0) x m
  v = undefined

-- Refine
updateMap m (x,y) = Map.insert x v m where
  q = Map.findWithDefault (Map.singleton y 0) x m
  v = Map.insert y (succ (Map.findWithDefault 0 y q)) q


Types help constrain your program to make sure it does what you think. Developing in this way has parallels with test-driven development. Instead of a suite of tests (which I still have to write properly!), I can use the type system to verify the program does what I think it should do! I realize that types don't pick up everything (maybe dependent types will minimize this in the future), but then neither do unit tests!

The next step is to generate random text based on this word-successor information. Random functions aren't pure (obviously), and I went a long way down various wrong roads (mostly involving changing the return type of nextWord to IO(String), propagating IO everywhere and then realizing I was going down the wrong road). On IRC (#haskell) someone suggested the idiom of passing an infinite list of random numbers to the function; this keeps it pure (the impurity is handled when generating the list) at the cost of an additional argument.

The algorithm is comically naive, but it's just about fast enough. I couldn't find a better way of selecting an element based on a frequency distribution stored in a map though I'm sure there is one!


nextWord :: [Int] -> WordSuccessors -> String -> ([Int],String)
nextWord seeds fm start = (r, poss !! (mod s count)) where
  successors = fm Map.! start
  count      = Map.fold (+) 0 successors
  poss       = Map.foldWithKey (\k v acc -> replicate v k ++ acc) [] successors
  s          = head seeds
  r          = drop 1 seeds


The function returns the next word given the previous, and also shifts the random number sequence along again.
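
For what it's worth, one possible alternative to building the replicate list would be to walk the frequency map with a running total and stop when the random index falls inside the current word's count. This is only a sketch of the idea (weightedChoice is my own name), not code from the original program:

weightedChoice :: Int -> Followers -> String
weightedChoice n successors = pick (Map.toList successors) (n `mod` total) where
  total = Map.fold (+) 0 successors
  pick ((word,count):rest) i
    | i < count = word                  -- the index falls inside this word's share
    | otherwise = pick rest (i - count) -- skip past this word's share
  pick [] _     = error "empty frequency map"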

Finally, we need something to drive this. main has type IO(String): it performs two bits of IO (reading the sample file and creating the random-number generator) and returns the generated text.


maxWordCount :: Int
maxWordCount = 1000000

main :: IO(String)
main = do
  text <- readFile exampleFile
  gen  <- newStdGen
  let training = createTrainingSet text
      seeds    = randomRs (0,maxWordCount) gen
  return (unwords (map snd (iterate (\(s,w) -> nextWord s training w) (seeds,"by"))))


So finally, we can generate an infinite sequence of random-ish text. Here's an example, based on "Manual of Surgery". Apparently it was the most popular book downloaded. Weird and slightly disturbing.

The patient from a long bones, and, by the blood threatens to fat. In the lower or near the blood vessel, and a pustular eruption are usually be inserted, and irregularity in Pus from three weeks or to employ considerable size by a soft, rapidly developed over looked. If, however, regain the compression causes suffering from one collection of small saphena system run an ideal manner of the risk of pain; the patient's strength of the best hope of wounds are believed to the toxic symptoms which the suppuration are reduced by ankylosis. In surgical diseases, from repeated on the limb becomes discoloured and is fluid and upwards, and loss of any prospect of age, and is caseation and the existing sinuses with the victims of pes equino-varus. Pressure of calcium salts, and then forms a severe pain, with the ulcer may attain a wound infection leads to enter the end of pus forms, but may be diagnosed before it is then foul and the scalp or destruction of Hands 529 Hectic Fever 62 12. Chart of a synovitis_ with the arsenical preparations of syphilis, must be a stab or bossed, the natural protective contraction which is sometimes of the inflow of cerebral congestion and on the blisters, and irregular, stalactite-like processes of the gum, with the penis, scrotum, excision or by freely open method of the first thirty-six or applies a globular cyst, or spindle-shaped, and sometimes resulting in the myeloma is often met with the shafts. The operation of Contusions._--The usual phenomena associated in primipar, there is seldom be employed with lymph glands, 322 Submental lymph from the dorsum.

Sounds like believable medical gibberish to me!

Friday 4 September 2009

Exploring Haskell's List Functions

Haskell's primary data type is the list, and there is a rich variety of functions for transforming lists. Most of these functions are familiar to me from Clojure's sequences, but there are some useful ones that I haven't seen before.

last, as you'd expect, gives you the last item in a list. It has a corresponding function init that gives you everything but the last item. intersperse inserts an element between each pair of elements (e.g. init (intersperse 'a' "bnnn") => "banana"). Similarly, intercalate inserts a list between lists and concatenates the results. For different arrangements of the items there are two functions: subsequences gives you a list of all subsequences of the argument, and permutations gives all the possible permutations.


Prelude> subsequences "abc"
["","a","b","ab","c","ac","bc","abc"]

Prelude> permutations "abc"
["abc","bac","cba","bca","cab","acb"]


To reduce a list to a single value there are many versions of the fold function (in Clojure there was just reduce!). foldl is the direct equivalent of reduce: it reduces a list to a single value using a binary operator. foldl (+) 0 [1,2,3] is equivalent to (((0 + 1) + 2) + 3). foldl1 is a convenience version without the initial argument that only applies to non-empty lists. foldr and foldr1 are right-associative, so foldr (+) 0 [1,2,3] evaluates to (1 + (2 + (3 + 0))).

The fold family of functions can be extremely powerful - I need to read "A tutorial on the universality and expressiveness of fold" [PDF], which explores this in more detail. Some example functions that operate on lists and are defined as folds in Haskell include and, or, any, all and concatMap.

On a side note, there are strict versions of the foldl functions, suffixed with a '. Why do you need these? Haskell is lazy by default, which can mean you build up a great big thunk (a pending calculation). This can be bad (for example, increased space requirements). The strict versions evaluate the accumulator as the values become available, which stops the huge thunk building up and can be more efficient. There's a good discussion of this here. foldr doesn't have a corresponding strict version, and looking at the expansion it's easy to see why - the first value you can evaluate is right at the end of the list!
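
A rough illustration of the difference (my own example; the exact behaviour depends on GHC version and stack settings):

import Data.List (foldl')

-- foldl builds up the thunk (((0+1)+2)+...) and only forces it at the end,
-- which can blow the stack; foldl' forces the accumulator at each step.
lazySum, strictSum :: Integer
lazySum   = foldl  (+) 0 [1..10000000]
strictSum = foldl' (+) 0 [1..10000000]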

unfoldr is an interesting function. It can be considered the opposite of foldr, as it builds up a list from a seed value. The first argument is a function that returns Nothing if it's finished, or Just (a,b) otherwise: a is prepended to the list and b is used as the seed for the next step. For example, we can generate the Fibonacci sequence:


fibonacci :: [Integer]
fibonacci = unfoldr (\[a,b] -> Just(a+b,[b,b+a])) [0,1]

Prelude> take 10 fibonacci
[1,2,3,5,8,13,21,34,55,89]


iterate can be written as iterate f == unfoldr (\x -> Just (x, f x)). Another paper to add to the reading list is "The under-appreciated unfold".

take and drop are functions for getting prefixes and suffixes of (potentially) infinite lists. splitAt does both at the same time, returning the tuple (take n xs, drop n xs). takeWhile and dropWhile take or drop elements while some predicate holds. Putting these together we can write a function groupN which groups elements into sublists of size n.


groupN :: [a] -> Int -> [[a]]
groupN [] _ = []
groupN xs n = a : groupN b n where
  (a,b) = splitAt n xs

Prelude> groupN [1..7] 2
[[1,2],[3,4],[5,6],[7]]


The Haskell list library is very complete and there's definitely some new ideas for me to absorb there. In particular, understanding unfoldr and folding in more detail seems to be an important thing to do!

Wednesday 2 September 2009

Debugging in Haskell

One thing I was finding a little frustrating is that if I do any IO in a function, its type signature changes. This is good most of the time, but it's a little painful to have to change all your code just to print out a number!

Thankfully, there's a number of solutions.

The GHCi debugger provides a way of inspecting code. :b N sets a breakpoint in the loaded module at a specific line. Taking the anagrams example, I set :b 30 on the line anagrams <- anagramList wordfile.


anagramsOf :: String -> IO ()
anagramsOf word = do
  anagrams <- anagramList wordfile
  putStrLn (show (Map.lookup (stringToKey word) anagrams))


When the program is run and the breakpoint is hit we get the following:


Stopped at anagrams.hs:31:2-57
_result :: IO () = _
anagrams :: Map String (Set String) = _
word :: String = _
[anagrams.hs:31:2-57] *Main> :list
30 anagrams <- anagramList wordfile
31 putStrLn (show (Map.lookup (stringToKey word) anagrams))
32


:list shows the source around the breakpoint, and the free variables in scope (anagrams and word) are available to inspect in the debugger. _result is a binding for the result expression. Once you've hit a breakpoint, you can use :trace to continue to the next breakpoint, recording the history as you go along. :back and :forward let you move up and down the list of evaluated expressions and inspect each one.

The Haskell Wiki has a section devoted to debugging which brought me to the Debug.Trace module. This allows you to print some text and return the evaluation of the next expression e.g. Debug.Trace.trace "1+1=" (1 + 1).
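
A small example of how it might be used (my own, not from the wiki):

import Debug.Trace (trace)

-- Logs each recursive call without changing the function's type signature
factorial :: Integer -> Integer
factorial 0 = 1
factorial n = trace ("factorial called with " ++ show n) (n * factorial (n - 1))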

Initially, I couldn't understand how the type of Debug.Trace.trace could be String -> a -> a, but then I found System.IO.Unsafe. It comes with a large health warning that it doesn't enforce ordering on IO and isn't type safe. Evil, yet useful (at least for writing the trace functions).

Tuesday 1 September 2009

Generating ASCII Art

ASCII art is a way of presenting graphics using standard characters, without the need for images. For example, you could draw a triangle using just slashes and underscores.


/\
/__\


Given an arbitrary picture, how do we convert it to ASCII art? The technique is dead simple: convert the image to grayscale, then replace each pixel in the image with a character representing its brightness value. For example, a * character is darker than a ! character.

The following Haskell program does just that. It uses the PGM package to load an image (so it's already converted to grayscale). All it does is map each pixel to a character using a function and amap, and then do some jiggery-pokery to turn it into an image.


import Graphics.Pgm

import Text.Parsec.Error
import Data.Array.Base

brightness = " .`-_':,;^=+/\"|)\\<>)iv%xclrs{*}I?!][1taeo7zjLu" ++
             "nT#JCwfy325Fp6mqSghVd4EgXPGZbYkOA&8U$@KHDBWNMR0Q";

loadImage :: String -> IO (UArray (Int,Int) Int)
loadImage path = do
  r <- pgmsFromFile path
  case r of
    Left e  -> error "Failed to parse file"
    Right i -> return (head i)

brightnessToChar :: Int -> Int -> Char
brightnessToChar m b = brightness !!
  (round ((fromIntegral b) / (fromIntegral m) * (fromIntegral ((length brightness) - 1))))

imageToAscii :: UArray (Int,Int) Int -> UArray (Int,Int) Char
imageToAscii image = amap (brightnessToChar 255) image

convertImage :: String -> String -> IO ()
convertImage image out = do
  img <- loadImage image
  let ((_,_),(h,w)) = bounds img
  let x = imageToAscii img
  writeFile out (unlines [ [ x ! (i,j) | i <- [0..w] ] | j <- [0..h] ])
  return ()


Interesting learning exercises for me were the nested list comprehension used to walk the array, and the use of Either to represent a choice of return values. Either seems to be used when you want to return more information than a simple "it failed" (for which you'd use Maybe).

Because I can't quite work out how to get a tiny font, it's easier to post a screen shot of some ASCII art (kind of defeats the purpose, I know!). Below is the Ubuntu logo rendered (badly!) as ASCII art.

ASCII Ubuntu Logo