Wednesday, 16 September 2009

Parsing Logs with Haskell

I spend way too much of my time digging through gigabytes of log files looking for interesting events. I've looked at this before using Clojure, so I thought I would try to translate the basics into Haskell and then (over the course of the next few weeks) try to develop into something general purpose and halfway useful.

Building a report from a log file (or collection of log files) typically requires:

  1. A representation of the events I'm interested in
  2. A predicate for determining whether the line is an event (String -> Bool)
  3. Converting the line into an event (String -> Event)
  4. Folding the results into a single report ([Event] -> Report)

In this example, I wanted to parse my dpkg logs (/var/log/dpkg.log) to see how many times I've upgraded packages on Ubuntu. Firstly I defined a type to represent an upgrade event:

type Package = String

data Upgrade = Upgrade { packageName :: Package
, updateTime :: UTCTime }

instance Show Upgrade where
show a = show (updateTime a) ++
":" ++ show (packageName a)

instance Show Upgrade is similar to deriving Show, but allows you to customize how the object will be converted to a string.

Next step is to get a predicate and to parse the lines. I decided to combine these into one function and return Maybe Upgrade to indicate success / failure. I used the Date.Time modules to parse the data and time. The parsing is terrible but suffices for now as I'm just trying to get an idea of where I need to generalize. Note to self read (and ideally understand!) Monadic Parsing Combinators [PDF] and associated Haskell module.

getTime :: String -> UTCTime
getTime = fromJust . parseTime defaultTimeLocale timeFormat

getPackageName :: String -> String
getPackageName = takeWhile (not . Char.isSpace)

-- Poor mans parsing.
parseLine :: String -> Maybe Upgrade
parseLine s
| isInfixOf " upgrade " s = Just
(takeWhile (not . Char.isSpace) (drop 28 s))
(getTime (take 20 s)))
| otherwise = Nothing

All I need to do now is a combining action to perform with foldl. For this I've defined a report of type Map Day [Package] which represents an association between a day and all the names of the packages updated on that day.

processFile :: FilePath -> IO([Upgrade])
processFile s = do
a <- readFile s
return (Maybe.mapMaybe parseLine (lines a))

type Report = Map Day [Package]

combine :: [Upgrade] -> Report
combine = foldl addToReport Map.empty

addToReport :: Report -> Upgrade -> Report
addToReport r p = Map.insert day packages r where
day = utctDay (updateTime p)
initVal = Map.findWithDefault [] day r
packages = packageName p:initVal

reportFile :: FilePath -> IO()
reportFile f = do
a <- processFile f
print (combine a)
return ()

Hurrah, so now I get output in the right format and I can see that I really shouldn't have added some of the Firefox 3.5 bleeding edge repositories to my Ubuntu upgrade paths. Upgrading Firefox (or Shiretoko) every few days is a bad thing. D'oh!

The next stage for me to understand is how I can generalize this. What I really want to develop next is a simple pluggable framework. It seems that I need to generalize at least the following bits:

  • Parsing a line into a type T
  • Combining [T] to produce a single report