Wednesday, 7 January 2009

Regular Expressions in Clojure

Regular Expressions are a incredibly powerful language for performing text manipulation.

ANSI Common Lisp doesn't feature regex support, though there are a large number of libraries that do (see here). Clojure uses the Java implementation of regular expressions (see the regex tutorial) and adds some nice syntactic sugar to the mix.

To define a regex pattern in Clojure you use the #"<regex>" syntax. This is actually the regex pattern, so evaluating it at the REPL gives you itself. The pattern is compiled, so putting anything nonsensical results in an error.

user> #"[0-9]+"

user> #"[]+"
; Evaluation aborted.
; clojure.lang.LispReader$ReaderException: java.util.regex.PatternSyntaxException:
; Unclosed character class near index 2

The Clojure regex functions are well documented. Unfortunately, I didn't really read the documentation! re-find returns either the first match (directly as a string) OR a vector of matches if multiple matches. This is optimized for the normal case where a user enters the text and the regex is well known ahead of time e.g.

user>(re-find #"bar" "bar")

user>(re-find #"(foo)|(bar)" "foo bar")
["foo" "foo" nil]

When learning regular expressions, I found Regex Coach invaluable (which is actually Lisp powered!). It's an application that lets you immediately see how a regular expression matches some text. Let's do the basics of this in Clojure.

Regular Expression Viewer

You have two areas, one for the regular expression and one for the text. As you press keys the matching regular expression (if any) is highlighted in the text area.

Firstly we need a function that given a regular expression and some text returns where to do some highlighting:

(defn first-match [m]
(if (coll? m) (first m) m))

(defn match [regex text]
(let [m (first-match (re-find (re-pattern regex) text))]
(if (nil? m)
[0 0]
(let [ind (.indexOf text m) len (.length m)]
[ind (+ ind len)]))))

first-match is a helper function that gives the first match (handling the case of re-find returning multiple entries). match just gives you a vector [x y] representing the index of the match.

Next a bit of UI that features a couple of new things:

Each time a key is pressed match returns the area to highlight and Highlighter does the rest. The exception handling ensures that if there was an error compiling the regex then that's printed on the status bar.

(defn regexcoach []
(let [frame (JFrame. "Regular Expression Coach") pane (JPanel.) regexText (JTextField.)
targetText (JTextField. "")
statusBar (JLabel. "Match from 0 to 0")
keyHandler (proxy [KeyAdapter] []
(keyTyped [keyEvent]
(let [m (match (.getText regexText) (.getText targetText))
hl (.getHighlighter targetText)
pen (DefaultHighlighter$DefaultHighlightPainter. Color/RED)]
(.removeAllHighlights hl)
(.addHighlight hl (first m) (second m) pen)
(.setText statusBar (format "Match from %s to %s" (first m) (second m))))
(catch PatternSyntaxException e (.setText statusBar (.getMessage e))))))]
(doto regexText
(.addKeyListener keyHandler))
(doto targetText
(.addKeyListener keyHandler))
(doto pane
(.setLayout (BoxLayout. pane BoxLayout/Y_AXIS))
(.add (JLabel. "Regular Expression"))
(.add regexText)
(.add (JLabel. "Target String"))
(.add targetText)
(.add statusBar))
(doto frame
(.add pane)
(.setSize 300 300)
(.setVisible true))))

Full code here.