Skip to content

Writing a Simple Parser

blancas edited this page Feb 12, 2013 · 11 revisions

This tutorial is based on chapter 8, Functional Parsers, from Programming in Haskell, by Graham Hutton. The idea is to parse text by taking one character at a time from the front of an input string and returning a value and the rest of the string. In order to parse anything more than a few characters is necessary to repeat this process on the resulting value:

[input] --> [parser] --> [value rest]
                                  --> [parser] --> [new-value rest]
                                                                --> ...

The second parser will usually combine the result of the first in order to create the desired parsed value while advancing on the input. Thus we have two changing computations: the parsed value and the rest of the input string. We want to set this process in motion while keeping our functions pure of side effects. For that we'll use a custom monad.

For our new monad we need a data type, which we'll define after loading the Morph core namespace.

(use 'blancas.morph.core)

(deftype Parser [f])

(defn parse [p s]
  ((.f p) s))

This is a simple parser and we won't need more than a field for the boxed function. A parser function takes a string and returns a vector of two elements: the value and the rest of the input. The parsed value may have come directly from the immediate input or may be a combination of previous values. The function parse applies a parser by extracting the boxed function.

result is so simple that it may not seem like a real parser. It takes a value and returns a parser that, when run by parse, returns the given value and the input unchanged. The usefulness of this parser will become apparent shortly.

(defn result [v]
  (->Parser (fn [s] [v s])))

(parse (result "DC-10") "foobar")
;; ["DC-10" "foobar"]
(parse (result 747) "a plane")
;; [747 "a plane"]

The function parse extracts the function from the Parser instance and calls it with its second (string) argument.

fail is another primitive parser. It creates a Parser instance that, when called by parse returns an empty vector, regardless of the input. This indicates that the parser failed.

(def fail (->Parser (fn [s] [])))

(parse fail "foobar")
;; []

item creates a Parser instance that, when called by parse returns a vector with two elements. The first element is the first character in the input; the second element is the rest of the input. If the input is empty, the parser's result is the empty vector.

(def item (->Parser (fn [[x & y]] (if x [x y] []))))

(parse item "xyz")
;; [\\x (\\y \\z)]
(parse item "yz")
;; [\\y (\\z)]
(parse item "z")
;; [\\z nil]
(parse item "")
;; []

We now have three basic parsing functions. One creates a new Parse instance; a second fails the parser; and a third extracts the first character from the input. The next thing to do is a way to chain the calls to parsers so that we can process and analyze the input. For example, if we wanted to get the second character, we could do something like this:

(let [[v s](parse item "xyz")]
  (parse item s))
;; [\\y (\\z)]
(parse item ((parse item "xyz") 1))
;; [\\y (\\z)]

But this kind of code would be very limited and not able to scale for repetition and more complex logic. What we'll do instead is take advantage of the plumbing in Morph to do the above sequentially, as in imperative programming. The code below implements the Monad protocol. The function >>= (pronounced "then") takes a Parser instance and a function that takes one value and returns a new Parser instance. Thus with multiple calls to >>= we'll be able to chain parser calls: the fist parser gets chained with the one returned by the function. Let's take a look at our implementation of >>=:

(extend-type Parser
  Monad
    (>>= [m k]
      (->Parser (fn [s]
                  (let [[v s'] (parse m s)]
                    (if v (parse (k v) s') []))))))

We can see that >>= is a high-order function, though the returned function is wrapped in a Parser instance. When run by parse this function in turn calls parse on its first argument; then calls parse on the parser returned by the supplied k function, which takes the first parser's value as its argument. Note that the second call to parse uses the rest of the input.

Once this is implemented, we can use the >>= function to get the second character this way:

(parse (>>= item (fn [x] item)) "xyz")
;; [\\y (\\z)]

The above example gets simpler with the monad macro, which is used for sequential calls to parsers. This macro converts the code to calls to >>= chaining functionality. We can now get the second character much easier:

(parse (monad [_ item] item) "xyz")
;; [\\y (\\z)]

Note that we don't have to write the function k anymore. The monad macro does so automatically as we just have to return a Parser instance as the last value. Using this macro we can now define more useful parsers.

sat enforces whether the next character satisfies the supplied predicate. This parser uses (monad) to evaluate the item parser. The value bound to x will be just the next character, not the vector. If the predicate returns true, the (monad) returns a parser created with (result); or fail otherwise. Using (result x) where x is a single value or a combination of parsed values is a common idiom.

(defn sat [pred]
  (monad [x item]
    (if (pred x) (result x) fail)))

Having now a way to check for a type of character we can look for digits:

(def digit (sat #(Character/isDigit %)))

(parse digit "787")
;; [\\7 (\\8 \\7)]
(parse digit "xyz")
;; []

Just the same we can declare parser of other types of characters:

(def letter   (sat #(Character/isLetter %)))
(def lower    (sat #(Character/isLowerCase %)))
(def upper    (sat #(Character/isUpperCase %)))
(def alphanum (sat #(Character/isLetterOrDigit %)))
(def wspace   (sat #(Character/isWhitespace %)))

(parse letter "xyz")
;; [\\x (\\y \\z)]
(parse lower "xyz")
;; [\\x (\\y \\z)]
(parse upper "ABC")
;; [\\A (\\B \\C)]
(parse alphanum "007")
;; [\\0 (\\0 \\7)]
(parse alphanum "Bond")
;; [\\B (\\o \\n \\d)]
(parse wspace "\\t\\n")
[\\tab (\\newline)]

We now have enough to parse simple letter combinations:

(parse (monad [x letter y digit]
         (result [x y])) "U2")
;; [[\\U \\2] nil]