Functional programming in Haskell: Parsing Text
The Parsing Problem
Parsing is the mechanism we use to make sense of structured information, e.g. written or spoken language. In the case of written language, it involves several steps:
- recognizing the characters of the writing system,
- identifying words,
- identifying sentences,
- identifying paragraphs etc.
To be able to do so we need to know the writing system, spelling and grammar of the language in which the document is written.
For parsing structured text such as program source code, HTML or JSON, the problem is similar.
Some handy functional machinery
Returning functions as values
- So far, we have seen functions that take functions as arguments
- Functions can also return functions as values
- For example, partial application of a function:
sum = foldl (+) 0
Here (sum) is the result returned by partial application of (foldl).
-
- More explicitly, we can write this as:
sum = xs -> foldl (+) 0 xs
Here (sum) is a function resulting from partial application of (foldl).
-
- Both are of course the same thing, just different interpretations.
Function generators
-
- We can use this concept to generate parameterised functions
For example, the following function generates functions that add a constant number to their argument:
gen_add_n = n ->
x -> x+n
add_3 = gen_add_n 3
add_7 = gen_add_n 7
add_3 5 --> 8
add_7 4 --> 11
-
- This is of course not limited to numeric constants
For example, the following function generates functions that perform a given arithmetic operation on a constant number and their argument:
gen_op_n = op n ->
x -> x `op` n
add_3 = gen_op_n (+) 3
mult_7 = gen_op_n (*) 7
add_3 5 --> 8
mult_7 4 --> 28
Practical parsing
Cooking soba noodles
To make the parsing problem more concrete, suppose you have to parse the following recipe and identify the different steps required in the preparation.
Bring a large pot of water up to a boil. Unlike Italian pasta, you do not need to salt the water. Once it’s boiling, hold the noodles over the water and sprinkle them in strand by strand. Once all the noodles are in, stir gently so that they are all immersed in the water. Bring the water back up to a gentle boil, then lower the heat so that the water is just simmering. (This differs from the ’rolling boil’ that’s recommended for pasta.) If the water threatens to boil over, add about 1/2 cup of cold water (but if you lower the heat to the gentle simmer, and have a big enough pot, this shouldn’t be necessary). Cook for about 7 to 8 minutes, or following the package directions (for thinner noodles 5 to 6 minutes may be enough. Test by eating a strand – it should be cooked through, not al dente, but not mushy either).
Parsing text
-
- Typically, a functional program is organised around a tree-like data structure with an algebraic data type that represents the core data
-
- A parser reads text input and generates the tree
-
- Functions perform transformations or traversals on the tree
-
- Pretty-printer functions output the tree (original or transformed)
Alternative approaches to parsing
-
- Don’t bother with parsing, just make the user provide input in an awkward form. A common approach, but please don’t do this!
-
- Write the parser by hand, with just ordinary list processing functions. Possible, but hard and not reusable. Don’t do it.
-
- Write the parser using regular expressions. Tempting but limiting and not reusable. Don’t do it, unless your input text format is very simple.
-
- Use parser combinators. For most purposes, this is the recommended approach, for everything from basic input formats to medium size programming languages (e.g. Pascal) or subsets of languages like C and Fortran.
-
- Use a parser generator, e.g. yacc, bison, antlr, happy. The best approach for heavy-weight parsers, for very large programming languages.
Parser combinators
-
- Parser combinators are functions that allow you to combine smaller parsers into bigger ones.
-
- They are higher-order functions that take functions as arguments and return functions
-
- A parser combinator library provides both basic parsers (for words, numbers etc.) and combinators.
Parsec: monadic parsing combinators
-
- There are many parsing libraries for Haskell.
-
- One of the most widely used is Parsec, which is robust, flexible, expressive, and efficient.
-
- Parsec operates in a monad.
A Quick Primer on Monads
You may have heard the term monad before, and we will discuss the concept in detail in a later session. Haskell uses monads to structure computations. You have already encountered the IO monad, which you need to use to perform IO in a Haskell program. A typical example is
hello :: String -> IO String
hello x =
do
putStrLn ("Hello, " ++ x)
putStrLn "What's your name?"
name <- getLine
return name
This illustrates the key syntactic features of a monad: the do
keyword, the sequence of commands, the way to extract information from a monadic computation using the left arrow <-
and the return
keyword. In fact, using the do-notation is quite similar to imperative programming.
Also note the return value of our hello function: not just String but IO String. A computation done in a monad returns a “monadic” type, we say that the string is returned inside the monad.
Form of a parser
-
- For example, suppose we want to parse a string of the form (< tag>), where (tag) must be a word, and return the tag as a type (Tag).
To run the code below you need to import some libraries and add some additional code. All this is explained in detail in the tutorial; the full source code is on GitHub.
data Tag = MkTag String
parseTag :: Parser Tag
parseTag =
do char '<'
x <- identifier
char '>'
return (MkTag x)
As you can see, the parser consists of a number of functions (e.g. char and identifier) that are called sequentially. Also, the return value is of type Parser Tag, not simply Tag.
This is because parseTag
is not returning a value, instead it returns a parser. We can combine this parser with other parsers, and then we can execute the final parser on our data. We will cover this approach in more detail in the tutorial.
Testing a parser
To test your parser, start ghci
:
[wim@fp4 ~]$ ghci
GHCi, version 7.4.1: http://www.haskell.org/ghc/ :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Then, import Parsec:
Prelude> import Text.ParserCombinators.Parsec
Parsec provides the handy parseTest
function, which takes a parser and a string and runs it. Let’s try and run the parser char 'b'
on the string "cons"
:
Prelude Text.ParserCombinators.Parsec> parseTest (char 'b') "cons"
Loading package bytestring-0.9.2.1 ... linking ... done.
Loading package transformers-0.2.2.0 ... linking ... done.
Loading package mtl-2.0.1.0 ... linking ... done.
Loading package array-0.4.0.0 ... linking ... done.
Loading package deepseq-1.3.0.0 ... linking ... done.
Loading package text-0.11.2.0 ... linking ... done.
Loading package parsec-3.1.2 ... linking ... done.
Because the string “cons” does not contain the character ‘b’, we get a parse error:
parse error at (line 1, column 1):
unexpected 'c'
expecting 'b'
Let’s try with char 'c'
:
Prelude Text.ParserCombinators.Parsec> parseTest (char 'c') "cons"
'c'
Prelude Text.ParserCombinators.Parsec>
This time the parse succeeded.
Running the parser
The actual code for the example parseTag
requires some extra modules and definitions, this is covered in the tutorial; the full source code is on GitHub.
As a simple example, let’s define parseDiv
as:
-- the "deriving Show" is needed to let `ghci` print the result
data Tag = MkTag String deriving Show
parseDiv = do
string "<div>"
return (MkTag "div")
To define this function in ghci
you can write this on one line as follows:
let parseDiv = do { string "<div>";return $ MkTag "div" }
Now we can run this parser using the parseTest
function:
Prelude Text.ParserCombinators.Parsec> parseTest parseDiv "<div>"
Loading package parsec-2.1.0.1 ... linking ... done.
MkTag "div"
Prelude Text.ParserCombinators.Parsec> parseTest parseDiv "div"
parse error at (line 1, column 1):
unexpected "d"
expecting "< "
Prelude Text.ParserCombinators.Parsec>
Anatomy of a basic parser
-
- All parser combinators are functions that return functions.
-
- It is the returned function that operates on the string, not the parser combinator function.
-
- The basic parsers ((identifier),(natural),(char)) take either no arguments (e.g. (identifier)) or one or more strings for parametrisation (e.g. (char)).
char = ch -> str ->
-- try to match the character ch
-- return the result
If the match succeeds, the matching string is removed from the input string; otherwise, the original string is returned, e.g.
char "c" "cons" -->
"c"
char "b" "cons" -->
parse error at (line 1, column 1):
unexpected "c"
expecting "b"
Anatomy of a parser combinator
-
- Parser combinators such as
<|>
andparens
take other parsers as arguments.
- Parser combinators such as
parens = p ->
str ->
-- first match "("
-- perform the parse of p if "(" was found
-- then match ")"
-- return the result
Parsing alternatives
-
- Often we want to try one parser; if that fails, then try another one instead. The choice combinator
<|>
provides this functionality.
- Often we want to try one parser; if that fails, then try another one instead. The choice combinator
-
- Example: (letter_digit) will match either a letter or a digit.
letter_digit :: Parser Char
letter_digit =
do x <- letter <|> digit
return x
Running alternative parsers
Prelude Text.ParserCombinators.Parsec> parseTest letter_digit "b2"
"b"
Prelude Text.ParserCombinators.Parsec> parseTest letter_digit "2b"
"2"
Prelude Text.ParserCombinators.Parsec> parseTest letter_digit "*2"
parse error at (line 1, column 1):
unexpected "*"
expecting letter or digit
Parsing alternative strings
Suppose we want to match either bag or bog, but nothing else.
bag_bog :: Parser String
bag_bog =
do xs <- string "bag" <|> string "bog"
return xs
Failed alternative consumes input
So far so good:
Prelude Text.ParserCombinators.Parsec> parseTest bag_bog "bag"
"bag"
And a non-matching string fails, as expected.
Prelude Text.ParserCombinators.Parsec> parseTest bag_bog "bug"
parse error at (line 1, column 1):
unexpected "u"
expecting "bag"
But there’s a problem!
Prelude Text.ParserCombinators.Parsec> parseTest bag_bog "bog"
parse error at (line 1, column 1):
unexpected "o"
expecting "bag"
The first parser string “bag” matched the b
but then failed on the a
. It has now consumed the b
. The second parser string “bog” now tries to match b
against o
, which of course fails.
try — don’t consume input on failed parse
To allow you to parse tentatively without consuming any input, Parsec provide the try
function:
bag_bog_try :: Parser String
bag_bog_try =
do xs <- try (string "bag") <|> string "bog"
return xs
Trying a parse without consuming input
Prelude Text.ParserCombinators.Parsec> parseTest bag_bog_try "bag"
"bag"
Prelude Text.ParserCombinators.Parsec> parseTest bag_bog_try "bug"
parse error at (line 1, column 1):
unexpected "u"
expecting "bog"
Prelude Text.ParserCombinators.Parsec> parseTest bag_bog_try "bog"
"bog"
Some parsers from the library
The Parsec library provides some small parsers that are useful for defining bigger ones:
-
- (char; “?”) — (char) is applied to a character, and it gives a parser that matches that character
-
- (letter) — matches any letter
-
- (digit) — matches any digit
-
- (string) — matches a string of characters
-
- (stringLiteral; “xyz*”) — matches the string argument
-
- (many; p) — matches 0 or more occurrences of parser (p)
-
- (many1; p) — matches 1 or more occurrences of parser (p)
Variable names
varname :: Parser String
varname =
do x <- letter
xs <- many (letter <|> digit)
return (x:xs)
Prelude Text.ParserCombinators.Parsec> parseTest varname "a4cc7*5"
"a4cc7"
Prelude Text.ParserCombinators.Parsec> parseTest varname "34a"
parse error at (line 1, column 1):
unexpected "3"
expecting letter
Expression parsers
-
- Arithmetic expressions are complex to parse because of the rules of precedence and the arity of the operators.
-
- Parsec provides support for expression parsing, so you don’t have to write your own expression parser.
expr_parser :: Parser Expr
expr_parser = buildExpressionParser optable term <?> "expression"
optable =
let
op name assoc =
Infix ( do { reservedOp name;
return (x y ->(Op (MkOpExpr name x y))) } ) assoc
prefix name =
Prefix (
reservedOp name >>
return (x->(Pref (MkPrefixOpExpr name x))) )
in
[ [ op "*" AssocLeft, op "/" AssocLeft, op "%" AssocLeft ]
, [ op "+" AssocLeft, op "-" AssocLeft ], [ prefix "-" ] ]
This example uses some additional monad syntax: you can use braces and semicolons instead of indentation; and the >>
operator is also a shorter way of writing the do notation:
do
expr1
expr2
can be written as
expr1 >> expr2
Also note the use of the <?>
operator, this is used to define a custom error message in case a parse fails without consuming any input. This is a very useful debugging feature.
-
- Parsec also has support for programming languages with a mechanism to define the syntax and keywords through makeTokenParser.
-
- For simple cases, you can use emptyDef.
import Text.ParserCombinators.Parsec.Expr
import qualified Text.ParserCombinators.Parsec.Token as P
lexer = P.makeTokenParser emptyDef
parens = P.parens lexer
commaSep = P.commaSep lexer
-- and many more
Functional Programming in Haskell: Supercharge Your Coding
Functional Programming in Haskell: Supercharge Your Coding
Reach your personal and professional goals
Unlock access to hundreds of expert online courses and degrees from top universities and educators to gain accredited qualifications and professional CV-building certificates.
Join over 18 million learners to launch, switch or build upon your career, all at your own pace, across a wide range of topic areas.
Register to receive updates
-
Create an account to receive our newsletter, course recommendations and promotions.
Register for free