-
Notifications
You must be signed in to change notification settings - Fork 4
Writing a Simple Parser
This tutorial is based on chapter 8, Functional Parsers, from Programming in Haskell, by Graham Hutton, with some minor simplifications and some additions. The idea is to parse text by taking one or more character at a time from the front of an input string and returning a value and the rest of the string. In order to parse anything more than a few characters is necessary to repeat this process on the resulting value:
[input] --> [parser] --> [value rest]
--> [parser] --> [new-value rest]
--> ...
The second parser will usually combine the resulting value of the first in order to create the desired parsed value while advancing on the input. Thus we have two changing computations: the resulting value and the rest of the input string. We want to set this process in motion while keeping our functions pure of side effects. For that we'll use a custom monad.
For our new monad we need a data type, which we'll define after loading the Morph core
namespace.
(use 'blancas.morph.core)
(deftype Parser [f])
(defn parse [p s]
((.f p) s))
This is a simple parser and we won't need more than a field for the boxed function. A parser function is one that takes a string and returns a vector of two elements: the value and the rest of the input. The parsed value may have come directly from the immediate input or may be a combination of previous values, as we'll see shortly. The function parse
applies a parser by extracting the boxed function.
result
is so simple that it may not seem like a real parser. It takes a value and returns a parser that, when run by parse
, returns the given value and the input unchanged. The usefulness of this parser will become apparent shortly.
(defn result [v]
(->Parser (fn [s] [v s])))
(parse (result "DC-10") "foobar")
;; ["DC-10" "foobar"]
(parse (result 747) "a plane")
;; [747 "a plane"]
The function parse
extracts the function from the Parser instance and calls it with its second (string) argument.
fail
is another primitive parser. It creates a Parser instance that, when called by parse
returns an empty vector, regardless of the input. This indicates that the parser failed.
(def fail (->Parser (fn [s] [])))
(parse fail "foobar")
;; []
item
creates a Parser instance that, when called by parse
returns a vector with two elements. The first element is the first character in the input; the second element is the rest of the input. If the input is empty, the parser's result is the empty vector.
(def item (->Parser (fn [[x & y]] (if x [x y] []))))
(parse item "xyz")
;; [\x (\y \z)]
(parse item "yz")
;; [\y (\z)]
(parse item "z")
;; [\z nil]
(parse item "")
;; []
We now have three basic parsing functions. One creates a new Parse instance; a second fails the parser; and a third extracts the first character from the input. The next thing to do is a way to chain the calls to parsers so that we can process and analyze the input. For example, if we wanted to get the second character, we could be something like this:
(let [[v s](parse item "xyz")]
(parse item s))
;; [\y (\z)]
(parse item ((parse item "xyz") 1))
;; [\y (\z)]
But this kind of code would be very limited and not able to scale for repetition and more complex logic. What we'll do instead is take advantage of the plumbing in Morph to do the above sequentially, as in imperative programming. The code below implements the Monad
protocol. The function >>=
(pronounced "then") takes a Parser instance and a function that takes one value and returns a new Parser instance. Thus with multiple calls to >>=
we'll be able to chain parser calls.
What out implementation needs to do is to call parse
on m
to get the result as [v s']. If we get a value from parsing m
then we call k
passing the value we got from parsing m
as the argument. Then we parse the k
's return value with the rest of the input s'
. Bear in mind that k
must return a Parser
instance.
(extend-type Parser
Monad
(>>= [m k]
(->Parser (fn [s]
(let [[v s'] (parse m s)]
(if v (parse (k v) s') []))))))
Once this is implemented, we can use the >>=
function to get the second character this way:
(parse (>>= item (fn [x] item)) "xyz")
;; [\y (\z)]
The above example gets simpler with the monad
macro, which is used for sequential calls to parsers. This macro converts the code to calls to >>=
chaining functionality. We can now get the second character much easier:
(parse (monad [_ item] item) "xyz")
;; [\y (\z)]
Note that we don't have to write the function k
anymore. The monad
macro does so automatically as we just have to return a Parser instance as the last value. Using this macro we can now define more useful parsers.
sat
enforces whether the next character satisfies the supplied predicate. This parser uses (monad)
to evaluate the item
parser. The value bound to x
will be just the next character, not the vector. If the predicate returns true, the (monad)
returns a parser created with (result)
; or fail
otherwise. Using (result x)
where x
is a single value or a combination of parsed values is a common idiom.
(defn sat [pred]
(monad [x item]
(if (pred x) (result x) fail)))
Having now a way to check for a type of character we can look for digits:
(def digit (sat #(Character/isDigit %)))
(parse digit "787")
;; [\7 (\8 \7)]
(parse digit "xyz")
;; []
Just the same we can declare parser of other types of characters:
(def letter (sat #(Character/isLetter %)))
(def lower (sat #(Character/isLowerCase %)))
(def upper (sat #(Character/isUpperCase %)))
(def alphanum (sat #(Character/isLetterOrDigit %)))
(def wspace (sat #(Character/isWhitespace %)))
(parse letter "xyz")
;; [\x (\y \z)]
(parse lower "xyz")
;; [\x (\y \z)]
(parse upper "ABC")
;; [\A (\B \C)]
(parse alphanum "007")
;; [\0 (\0 \7)]
(parse alphanum "Bond")
;; [\B (\o \n \d)]
(parse wspace "\t\n")
[\tab (\newline)]
We now have enough to parse simple letter combinations:
(parse (monad [x letter y digit]
(result [x y])) "U2")
;; [[\U \2] nil]
(defn choice [p q] (->Parser (fn [s] (let [[v s'] (parse p s)] (if v [v s'] (parse q s))))))
(defn chr [x] (sat #(= x %)))
(defn string x & y (if x (monad [c (chr x) cs (string y)] (result (str c cs))) (result "")))
(defn skip [p] (monad [_ p] (result "")))