-
Notifications
You must be signed in to change notification settings - Fork 4
Writing a Simple Parser
This tutorial is based on chapter 8, Functional Parsers, from Programming in Haskell, by Graham Hutton. The idea is to parse text by taking one character at a time from the front of an input string and returning a value and the rest of the string. In order to parse anything more than a few characters is necessary to repeat this process on the resulting value:
[input] --> [parser] --> [value rest]
--> [parser] --> [new-value rest]
--> ...
The second parser will usually combine the result of the first in order to create the desired parsed value while advancing on the input. Thus we have two changing computations: the parsed value and the rest of the input string. We want to set this process in motion while keeping our functions pure of side effects. For that we'll use a custom monad.
For our new monad we need a data type, which we'll define after loading the Morph core
namespace.
(use 'blancas.morph.core)
(deftype Parser [f])
(defn parse [p s]
((.f p) s))
This is a simple parser and we won't need more than a field for the boxed function. A parser function takes a string and returns a vector of two elements: the value and the rest of the input. The parsed value may have come directly from the immediate input or may be a combination of previous values. The function parse
applies a parser by extracting the boxed function.
result
is so simple that it may not seem like a real parser. It takes a value and returns a parser that, when run by parse
, returns the given value and the input unchanged. The usefulness of this parser will become apparent shortly.
(defn result [v]
(->Parser (fn [s] [v s])))
(parse (result "DC-10") "foobar")
;; ["DC-10" "foobar"]
(parse (result 747) "a plane")
;; [747 "a plane"]
The function parse
extracts the function from the Parser instance and calls it with its second (string) argument.
fail
is another primitive parser. It creates a Parser instance that, when called by parse
returns an empty vector, regardless of the input. This indicates that the parser failed.
(def fail (->Parser (fn [s] [])))
(parse fail "foobar")
;; []
item
creates a Parser instance that, when called by parse
returns a vector with two elements. The first element is the first character in the input; the second element is the rest of the input. If the input is empty, the parser's result is the empty vector.
(def item (->Parser (fn [[x & y]] (if x [x y] []))))
(parse item "xyz")
;; [\\x (\\y \\z)]
(parse item "yz")
;; [\\y (\\z)]
(parse item "z")
;; [\\z nil]
(parse item "")
;; []
We now have three basic parsing functions. One creates a new Parse instance; a second fails the parser; and a third extracts the first character from the input. The next thing to do is a way to chain the calls to parsers so that we can process and analyze the input. For example, if we wanted to get the second character, we could do something like this:
(let [[v s](parse item "xyz")]
(parse item s))
;; [\\y (\\z)]
(parse item ((parse item "xyz") 1))
;; [\\y (\\z)]
But this kind of code would be very limited and not able to scale for repetition and more complex logic. What we'll do instead is take advantage of the plumbing in Morph to do the above sequentially, as in imperative programming. The code below implements the Monad
protocol. The function >>=
(pronounced "then") takes a Parser instance and a function that takes one value and returns a new Parser instance. Thus with multiple calls to >>=
we'll be able to chain parser calls: the fist parser gets chained with the one returned by the function. Let's take a look at our implementation of >>=
:
(extend-type Parser
Monad
(>>= [m k]
(->Parser (fn [s]
(let [[v s'] (parse m s)]
(if v (parse (k v) s') []))))))
We can see that >>=
is a high-order function, though the returned function is wrapped in a Parser instance. When run by parse
this function in turn calls parse
on its first argument; then calls parse
on the parser returned by the supplied k
function, which takes the first parser's value as its argument. Note that the second call to parse
uses the rest of the input.
Once this is implemented, we can use the >>=
function to get the second character this way:
(parse (>>= item (fn [x] item)) "xyz")
;; [\\y (\\z)]
The above example gets simpler with the monad
macro, which is used for sequential calls to parsers. This macro converts the code to calls to >>=
chaining functionality. We can now get the second character much easier:
(parse (monad [_ item] item) "xyz")
;; [\\y (\\z)]
Note that we don't have to write the function k
anymore. The monad
macro does so automatically as we just have to return a Parser instance as the last value. Using this macro we can now define more useful parsers.
sat
enforces whether the next character satisfies the supplied predicate. This parser uses (monad)
to evaluate the item
parser. The value bound to x
will be just the next character, not the vector. If the predicate returns true, the (monad)
returns a parser created with (result)
; or fail
otherwise. Using (result x)
where x
is a single value or a combination of parsed values is a common idiom.
(defn sat [pred]
(monad [x item]
(if (pred x) (result x) fail)))
Having now a way to check for a type of character we can look for digits:
(def digit (sat #(Character/isDigit %)))
(parse digit "787")
;; [\\7 (\\8 \\7)]
(parse digit "xyz")
;; []
Just the same we can declare parser of other types of characters:
(def letter (sat #(Character/isLetter %)))
(def lower (sat #(Character/isLowerCase %)))
(def upper (sat #(Character/isUpperCase %)))
(def alphanum (sat #(Character/isLetterOrDigit %)))
(def wspace (sat #(Character/isWhitespace %)))
(parse letter "xyz")
;; [\\x (\\y \\z)]
(parse lower "xyz")
;; [\\x (\\y \\z)]
(parse upper "ABC")
;; [\\A (\\B \\C)]
(parse alphanum "007")
;; [\\0 (\\0 \\7)]
(parse alphanum "Bond")
;; [\\B (\\o \\n \\d)]
(parse wspace "\\t\\n")
;; [\\tab (\\newline)]
We now have enough to parse simple letter combinations:
(parse (monad [x letter y digit]
(result [x y])) "U2")
;; [[\\U \\2] nil]
chr
expects to parse the character supplied as the argument.
(defn chr [x]
(sat #(= x %)))
(parse (chr \\z) "zip")
;; [[\\z (\\i \\p)]]
string
expects to parse the string supplied as the argument. Note that this is not a double-quoted string literal. This parser uses chr
for the first letter and then calls itself recursively for the rest, then concatenates the result.
(defn string [[x & y]]
(if x
(monad [c (chr x) cs (string y)]
(result (str c cs)))
(result "")))
(parse (string "Airbus") "Airbus A380")
;; ["Airbus" (\\space \\A \\3 \\8 \\0)]
choice
tries parser p
; if it succeeds, it returns the results; if it fails tries q
with the same input.
(defn choice [p q]
(->Parser (fn [s]
(let [[v s'] (parse p s)]
(if v [v s'] (parse q s))))))
skip
parser the supplied parser and ignores the result.
(defn skip [p]
(monad [_ p] (result "")))
(parse (skip (many letter)) "bond 007")
;; ["" (\\space \\0 \\0 \\7)]
We'll now define combinators for parsing repeatedly, which is the parser's version of loops. We'll use a declare
first since many
and many1
are mutually recursive. many
parses p
zero or more times. many1
parses p
one or more times.
(declare many1)
(defn many [p]
(choice (many1 p) (result [])))
(defn many1 [p]
(monad [x p xs (many p)]
(result (concat [x] xs))))
Note that many
always succeeds, even if p
fails the first time. It's OK for p
to fail at some point because that's what stops the loop. Using these looping parsers we can now define more complex ones.
ident
parses an identifier, starting with a letter and then any letter or number.
(def ident (monad [c lower cs (many alphanum)]
(result (apply str (concat [c] cs)))))
(parse ident "sales = 100")
;; ["sales" (\\space \\= \\space \\1 \\0 \\0)]
nat
parses an integer (natural) number.
(def nat (monad [xs (many1 digit)]
(result (read-string (apply str xs)))))
(parse nat "68000")
;; [68000 nil]
whitespace
skips zero or more whitespace characters.
(def whitespace (skip (many wspace)))
(parse whitespace " \t\t\nfoobar")
;; ["" (\f \o \o \b \a \r)]
The following parser can parse previous ones surrounded by whitespace.
(defn token [p] (monad [_ whitespace x p _ whitespace] (result x)))
(def identifier (token ident))
(def natural (token nat))
(defn sym [xs] (token (string xs)))
This parser will read a list of numbers:
(def lst (monad [_ (sym "[")
n natural
ns (many (monad [_ (sym ",") x natural] (result x)))
_ (sym "]")]
(result (concat [n] ns))))
(parse lst "[1,2,3,4,5,6,7,8,9,10")
;; (parse lst "[1,2,3,4,5,6,7,8,9,10]")
[(1 2 3 4 5 6 7 8 9 10) nil]