4.6

## The University of Glasgow

The parsing problem

# Parsing Text Using Higher-Order Functions

## The Parsing Problem

Parsing is the mechanism we use to make sense of structured information, e.g. written or spoken language. In the case of written language, it involves several steps:

• recognizing the characters of the writing system,
• identifying words,
• identifying sentences,
• identifying paragraphs etc.

To be able to do so we need to know the writing system, spelling and grammar of the language in which the document is written.

For parsing structured text such as program source code, HTML or JSON, the problem is similar.

## Some handy functional machinery

### Returning functions as values

• So far, we have seen functions that take functions as arguments
• Functions can also return functions as values
• For example, partial application of a function:
sum = foldl (+) 0


Here (sum) is the result returned by partial application of (foldl).

• More explicitly, we can write this as:
sum = \xs -> foldl (+) 0 xs


Here (sum) is a function resulting from partial application of (foldl).

• Both are of course the same thing, just different interpretations.

### Function generators

• We can use this concept to generate parameterised functions

For example, the following function generates functions that add a constant number to their argument:

gen_add_n = \n ->
\x -> x+n


• This is of course not limited to numeric constants

For example, the following function generates functions that perform a given arithmetic operation on a constant number and their argument:

gen_op_n = \op n ->
\x -> x op n

mult_7 = gen_op_n (*) 7

mult_7 4 --> 28


## Practical parsing

### Cooking soba noodles

To make the parsing problem more concrete, suppose you have to parse the following recipe and identify the different steps required in the preparation.

Bring a large pot of water up to a boil. Unlike Italian pasta, you do not need to salt the water. Once it’s boiling, hold the noodles over the water and sprinkle them in strand by strand. Once all the noodles are in, stir gently so that they are all immersed in the water. Bring the water back up to a gentle boil, then lower the heat so that the water is just simmering. (This differs from the ’rolling boil’ that’s recommended for pasta.) If the water threatens to boil over, add about 1/2 cup of cold water (but if you lower the heat to the gentle simmer, and have a big enough pot, this shouldn’t be necessary). Cook for about 7 to 8 minutes, or following the package directions (for thinner noodles 5 to 6 minutes may be enough. Test by eating a strand - it should be cooked through, not al dente, but not mushy either).

### Parsing text

1. Typically, a functional program is organised around a tree-like data structure with an algebraic data type that represents the core data
2. A parser reads text input and generates the tree
3. Functions perform transformations or traversals on the tree
4. Pretty-printer functions output the tree (original or transformed)

### Alternative approaches to parsing

• Don’t bother with parsing, just make the user provide input in an awkward form. A common approach, but please don’t do this!
• Write the parser by hand, with just ordinary list processing functions. Possible, but hard and not reusable. Don’t do it.
• Write the parser using regular expressions. Tempting but limiting and not reusable. Don’t do it, unless your input text format is very simple.
• Use parser combinators. For most purposes, this is the recommended approach, for everything from basic input formats to medium size programming languages (e.g. Pascal) or subsets of languages like C and Fortran.
• Use a parser generator, e.g. yacc, bison, antlr, happy. The best approach for heavy-weight parsers, for very large programming languages.

### Parser combinators

• Parser combinators are functions that allow you to combine smaller parsers into bigger ones.
• They are higher-order functions that take functions as arguments and return functions
• A parser combinator library provides both basic parsers (for words, numbers etc.) and combinators.

• There are many parsing libraries for Haskell.
• One of the most widely used is Parsec, which is robust, flexible, expressive, and efficient.
• Parsec operates in a monad.

### A Quick Primer on Monads

You may have heard the term monad before, and we will discuss the concept in detail in a later session. Haskell uses monads to structure computations. You have already encountered the IO monad, which you need to use to perform IO in a Haskell program. A typical example is


hello :: String -> IO String
hello x =
do
putStrLn ("Hello, " ++ x)
name <- getLine
return name


This illustrates the key syntactic features of a monad: the do keyword, the sequence of commands, the way to extract information from a monadic computation using the left arrow <- and the return keyword. In fact, using the do-notation is quite similar to imperative programming.

Also note the return value of our hello function: not just String but IO String. A computation done in a monad returns a “monadic” type, we say that the string is returned inside the monad.

### Form of a parser

• For example, suppose we want to parse a string of the form (< tag>), where (tag) must be a word, and return the tag as a type (Tag).

To run the code below you need to import some libraries and add some additional code. All this is explained in detail in the tutorial; the full source code is on GitHub.

data Tag = MkTag String

parseTag :: Parser Tag
parseTag =
do  char '<'
x <- identifier
char '>'
return (MkTag x)


As you can see, the parser consists of a number of functions (e.g. char and identifier) that are called sequentially. Also, the return value is of type Parser Tag, not simply Tag. This is because parseTag is not returning a value, instead it returns a parser. We can combine this parser with other parsers, and then we can execute the final parser on our data. We will cover this approach in more detail in the tutorial.

### Testing a parser

To test your parser, start ghci:

[wim@fp4 ~]$ghci GHCi, version 7.4.1: http://www.haskell.org/ghc/ :? for help Loading package ghc-prim ... linking ... done. Loading package integer-gmp ... linking ... done. Loading package base ... linking ... done.  Then, import Parsec: Prelude> import Text.ParserCombinators.Parsec  Parsec provides the handy parseTest function, which takes a parser and a string and runs it. Let’s try and run the parser char 'b' on the string "cons": Prelude Text.ParserCombinators.Parsec> parseTest (char 'b') "cons" Loading package bytestring-0.9.2.1 ... linking ... done. Loading package transformers-0.2.2.0 ... linking ... done. Loading package mtl-2.0.1.0 ... linking ... done. Loading package array-0.4.0.0 ... linking ... done. Loading package deepseq-1.3.0.0 ... linking ... done. Loading package text-0.11.2.0 ... linking ... done. Loading package parsec-3.1.2 ... linking ... done.  Because the string “cons” does not contain the character ‘b’, we get a parse error: parse error at (line 1, column 1): unexpected 'c' expecting 'b'  Let’s try with char 'c': Prelude Text.ParserCombinators.Parsec> parseTest (char 'c') "cons" 'c' Prelude Text.ParserCombinators.Parsec>  This time the parse succeeded. ### Running the parser The actual code for the example parseTag requires some extra modules and definitions, this is covered in the tutorial; the full source code is on GitHub. As a simple example, let’s define parseDiv as: -- the "deriving Show" is needed to let ghci print the result data Tag = MkTag String deriving Show parseDiv = do string "<div>" return (MkTag "div")  To define this function in ghci you can write this on one line as follows: let parseDiv = do { string "<div>";return$ MkTag "div" }


Now we can run this parser using the parseTest function:

Prelude Text.ParserCombinators.Parsec> parseTest parseDiv "<div>"
MkTag "div"

Prelude Text.ParserCombinators.Parsec> parseTest parseDiv "div"
parse error at (line 1, column 1):
unexpected "d"
expecting "< "
Prelude Text.ParserCombinators.Parsec>


### Anatomy of a basic parser

• All parser combinators are functions that return functions.
• It is the returned function that operates on the string, not the parser combinator function.
• The basic parsers ((identifier),(natural),(char)) take either no arguments (e.g. (identifier)) or one or more strings for parametrisation (e.g. (char)).
char = \ch -> \str ->
-- try to match the character ch
-- return the result


If the match succeeds, the matching string is removed from the input string; otherwise, the original string is returned, e.g.

char "c" "cons" -->
"c"
char "b" "cons" -->
parse error at (line 1, column 1):
unexpected "c"
expecting "b"


### Anatomy of a parser combinator

• Parser combinators such as <|> and parens take other parsers as arguments.
parens = \p ->
\str ->
-- first match "("
-- perform the parse of p if "(" was found
-- then match ")"
-- return the result


### Parsing alternatives

• Often we want to try one parser; if that fails, then try another one instead. The choice combinator <|> provides this functionality.
• Example: (letter_digit) will match either a letter or a digit.
letter_digit :: Parser Char
letter_digit =
do  x <- letter <|> digit
return x


#### Running alternative parsers

Prelude Text.ParserCombinators.Parsec> parseTest letter_digit "b2"
"b"

Prelude Text.ParserCombinators.Parsec> parseTest letter_digit "2b"
"2"

Prelude Text.ParserCombinators.Parsec> parseTest letter_digit "*2"
parse error at (line 1, column 1):
unexpected "*"
expecting letter or digit


### Parsing alternative strings

Suppose we want to match either bag or bog, but nothing else.

bag_bog :: Parser String
bag_bog =
do  xs <- string "bag" <|> string "bog"
return xs


#### Failed alternative consumes input

So far so good:

Prelude Text.ParserCombinators.Parsec> parseTest bag_bog "bag"
"bag"


And a non-matching string fails, as expected.

Prelude Text.ParserCombinators.Parsec> parseTest bag_bog "bug"
parse error at (line 1, column 1):
unexpected "u"
expecting "bag"


But there’s a problem!

Prelude Text.ParserCombinators.Parsec> parseTest bag_bog "bog"
parse error at (line 1, column 1):
unexpected "o"
expecting "bag"


The first parser string “bag” matched the b but then failed on the a. It has now consumed the b. The second parser string “bog” now tries to match b against o, which of course fails.

### try — don’t consume input on failed parse

To allow you to parse tentatively without consuming any input, Parsec provide the try function:

bag_bog_try :: Parser String
bag_bog_try =
do  xs <- try (string "bag") <|> string "bog"
return xs


#### Trying a parse without consuming input

Prelude Text.ParserCombinators.Parsec> parseTest bag_bog_try "bag"
"bag"

Prelude Text.ParserCombinators.Parsec> parseTest bag_bog_try "bug"
parse error at (line 1, column 1):
unexpected "u"
expecting "bog"

Prelude Text.ParserCombinators.Parsec> parseTest bag_bog_try "bog"
"bog"


### Some parsers from the library

The Parsec library provides some small parsers that are useful for defining bigger ones:

• (char\; “?”) — (char) is applied to a character, and it gives a parser that matches that character
• (letter) — matches any letter
• (digit) — matches any digit
• (string) — matches a string of characters
• (stringLiteral\; “xyz*”) — matches the string argument
• (many\; p) — matches 0 or more occurrences of parser (p)
• (many1\; p) — matches 1 or more occurrences of parser (p)

### Variable names

varname :: Parser String
varname =
do  x <- letter
xs <- many (letter <|> digit)
return (x:xs)

Prelude Text.ParserCombinators.Parsec> parseTest varname "a4cc7*5"
"a4cc7"
Prelude Text.ParserCombinators.Parsec> parseTest varname "34a"
parse error at (line 1, column 1):
unexpected "3"
expecting letter


### Expression parsers

• Arithmetic expressions are complex to parse because of the rules of precedence and the arity of the operators.
• Parsec provides support for expression parsing, so you don’t have to write your own expression parser.
expr_parser :: Parser Expr
expr_parser = buildExpressionParser optable term <?> "expression"

optable =
let
op name assoc   =
Infix ( do {  reservedOp name;
return (\x y ->(Op (MkOpExpr name x y))) } ) assoc
prefix name =
Prefix  (
reservedOp name >>
return (\x->(Pref (MkPrefixOpExpr name x))) )
in
[ [ op "*"  AssocLeft, op "/"  AssocLeft, op "%" AssocLeft ]
, [ op "+"  AssocLeft, op "-"  AssocLeft ], [ prefix "-" ] ]


This example uses some additional monad syntax: you can use braces and semicolons instead of indentation; and the >> operator is also a shorter way of writing the do notation:

do
expr1
expr2


can be written as

expr1 >> expr2


Also note the use of the <?> operator, this is used to define a custom error message in case a parse fails without consuming any input. This is a very useful debugging feature.

• Parsec also has support for programming languages with a mechanism to define the syntax and keywords through makeTokenParser.
• For simple cases, you can use emptyDef.
import Text.ParserCombinators.Parsec.Expr
import qualified Text.ParserCombinators.Parsec.Token as P
lexer       = P.makeTokenParser emptyDef

parens          = P.parens lexer
commaSep        = P.commaSep lexer
-- and many more