Skip main navigation

New offer! Get 30% off one whole year of Unlimited learning. Subscribe for just £249.99 £174.99. New subscribers only. T&Cs apply

Find out more

Functional programming in Haskell: Parsing Text

Parsing is the mechanism we use to make sense of structured information. You will learn how Haskell makes it easy to parse text.
A sign with Japanese text
© Wim Vanderbauwhede

The Parsing Problem

Parsing is the mechanism we use to make sense of structured information, e.g. written or spoken language. In the case of written language, it involves several steps:

  • recognizing the characters of the writing system,
  • identifying words,
  • identifying sentences,
  • identifying paragraphs etc.

To be able to do so we need to know the writing system, spelling and grammar of the language in which the document is written.

For parsing structured text such as program source code, HTML or JSON, the problem is similar.

Some handy functional machinery

Returning functions as values

  • So far, we have seen functions that take functions as arguments
  • Functions can also return functions as values
  • For example, partial application of a function:
sum = foldl (+) 0

 

Here (sum) is the result returned by partial application of (foldl).

 

 

    • More explicitly, we can write this as:

 

 

sum = xs -> foldl (+) 0 xs

 

Here (sum) is a function resulting from partial application of (foldl).

 

 

    • Both are of course the same thing, just different interpretations.

 

 

Function generators

 

 

    • We can use this concept to generate parameterised functions

 

 

For example, the following function generates functions that add a constant number to their argument:

 

gen_add_n = n ->
 x -> x+n

add_3 = gen_add_n 3
add_7 = gen_add_n 7

add_3 5 --> 8
add_7 4 --> 11

 

 

    • This is of course not limited to numeric constants

 

 

For example, the following function generates functions that perform a given arithmetic operation on a constant number and their argument:

 

gen_op_n = op n ->
 x -> x `op` n

add_3 = gen_op_n (+) 3
mult_7 = gen_op_n (*) 7

add_3 5 --> 8
mult_7 4 --> 28

 

Practical parsing

 

Cooking soba noodles

 

To make the parsing problem more concrete, suppose you have to parse the following recipe and identify the different steps required in the preparation.

 

Bring a large pot of water up to a boil. Unlike Italian pasta, you do not need to salt the water. Once it’s boiling, hold the noodles over the water and sprinkle them in strand by strand. Once all the noodles are in, stir gently so that they are all immersed in the water. Bring the water back up to a gentle boil, then lower the heat so that the water is just simmering. (This differs from the ’rolling boil’ that’s recommended for pasta.) If the water threatens to boil over, add about 1/2 cup of cold water (but if you lower the heat to the gentle simmer, and have a big enough pot, this shouldn’t be necessary). Cook for about 7 to 8 minutes, or following the package directions (for thinner noodles 5 to 6 minutes may be enough. Test by eating a strand – it should be cooked through, not al dente, but not mushy either).

 

Parsing text

 

 

    1. Typically, a functional program is organised around a tree-like data structure with an algebraic data type that represents the core data

 

    1. A parser reads text input and generates the tree

 

    1. Functions perform transformations or traversals on the tree

 

    1. Pretty-printer functions output the tree (original or transformed)

 

 

Alternative approaches to parsing

 

 

    • Don’t bother with parsing, just make the user provide input in an awkward form. A common approach, but please don’t do this!

 

    • Write the parser by hand, with just ordinary list processing functions. Possible, but hard and not reusable. Don’t do it.

 

    • Write the parser using regular expressions. Tempting but limiting and not reusable. Don’t do it, unless your input text format is very simple.

 

    • Use parser combinators. For most purposes, this is the recommended approach, for everything from basic input formats to medium size programming languages (e.g. Pascal) or subsets of languages like C and Fortran.

 

    • Use a parser generator, e.g. yacc, bison, antlr, happy. The best approach for heavy-weight parsers, for very large programming languages.

 

 

Parser combinators

 

 

    • Parser combinators are functions that allow you to combine smaller parsers into bigger ones.

 

    • They are higher-order functions that take functions as arguments and return functions

 

    • A parser combinator library provides both basic parsers (for words, numbers etc.) and combinators.

 

 

Parsec: monadic parsing combinators

 

 

    • There are many parsing libraries for Haskell.

 

    • One of the most widely used is Parsec, which is robust, flexible, expressive, and efficient.

 

    • Parsec operates in a monad.

 

 

A Quick Primer on Monads

 

You may have heard the term monad before, and we will discuss the concept in detail in a later session. Haskell uses monads to structure computations. You have already encountered the IO monad, which you need to use to perform IO in a Haskell program. A typical example is

 


hello :: String -> IO String
hello x =
 do
 putStrLn ("Hello, " ++ x)
 putStrLn "What's your name?"
 name <- getLine
 return name

 

This illustrates the key syntactic features of a monad: the do keyword, the sequence of commands, the way to extract information from a monadic computation using the left arrow <- and the return keyword. In fact, using the do-notation is quite similar to imperative programming.

 

Also note the return value of our hello function: not just String but IO String. A computation done in a monad returns a “monadic” type, we say that the string is returned inside the monad.

 

Form of a parser

 

 

    • For example, suppose we want to parse a string of the form (< tag>), where (tag) must be a word, and return the tag as a type (Tag).

 

 

To run the code below you need to import some libraries and add some additional code. All this is explained in detail in the tutorial; the full source code is on GitHub.

 

data Tag = MkTag String

parseTag :: Parser Tag
parseTag =
 do char '<'
 x <- identifier
 char '>'
 return (MkTag x)

 

As you can see, the parser consists of a number of functions (e.g. char and identifier) that are called sequentially. Also, the return value is of type Parser Tag, not simply Tag.
This is because parseTag is not returning a value, instead it returns a parser. We can combine this parser with other parsers, and then we can execute the final parser on our data. We will cover this approach in more detail in the tutorial.

 

Testing a parser

 

To test your parser, start ghci:

 

[wim@fp4 ~]$ ghci
GHCi, version 7.4.1: http://www.haskell.org/ghc/ :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.

 

Then, import Parsec:

 

Prelude> import Text.ParserCombinators.Parsec

 

Parsec provides the handy parseTest function, which takes a parser and a string and runs it. Let’s try and run the parser char 'b' on the string "cons":

 

Prelude Text.ParserCombinators.Parsec> parseTest (char 'b') "cons"
Loading package bytestring-0.9.2.1 ... linking ... done.
Loading package transformers-0.2.2.0 ... linking ... done.
Loading package mtl-2.0.1.0 ... linking ... done.
Loading package array-0.4.0.0 ... linking ... done.
Loading package deepseq-1.3.0.0 ... linking ... done.
Loading package text-0.11.2.0 ... linking ... done.
Loading package parsec-3.1.2 ... linking ... done.

 

Because the string “cons” does not contain the character ‘b’, we get a parse error:

 

parse error at (line 1, column 1):
unexpected 'c'
expecting 'b'

 

Let’s try with char 'c':

 

Prelude Text.ParserCombinators.Parsec> parseTest (char 'c') "cons"
'c'
Prelude Text.ParserCombinators.Parsec>

 

This time the parse succeeded.

 

Running the parser

 

The actual code for the example parseTag requires some extra modules and definitions, this is covered in the tutorial; the full source code is on GitHub.

 

As a simple example, let’s define parseDiv as:

 

-- the "deriving Show" is needed to let `ghci` print the result
data Tag = MkTag String deriving Show 

parseDiv = do 
 string "<div>" 
 return (MkTag "div")

 

To define this function in ghci you can write this on one line as follows:

 

let parseDiv = do { string "<div>";return $ MkTag "div" }

 

Now we can run this parser using the parseTest function:

 

Prelude Text.ParserCombinators.Parsec> parseTest parseDiv "<div>"
Loading package parsec-2.1.0.1 ... linking ... done.
MkTag "div"

Prelude Text.ParserCombinators.Parsec> parseTest parseDiv "div"
parse error at (line 1, column 1):
unexpected "d"
expecting "< "
Prelude Text.ParserCombinators.Parsec>

 

Anatomy of a basic parser

 

 

    • All parser combinators are functions that return functions.

 

    • It is the returned function that operates on the string, not the parser combinator function.

 

    • The basic parsers ((identifier),(natural),(char)) take either no arguments (e.g. (identifier)) or one or more strings for parametrisation (e.g. (char)).

 

 

char = ch -> str ->
 -- try to match the character ch
 -- return the result

 

If the match succeeds, the matching string is removed from the input string; otherwise, the original string is returned, e.g.

 

char "c" "cons" -->
"c"
char "b" "cons" -->
parse error at (line 1, column 1):
unexpected "c"
expecting "b"

 

Anatomy of a parser combinator

 

 

    • Parser combinators such as <|> and parens take other parsers as arguments.

 

 

parens = p ->
 str ->
 -- first match "("
 -- perform the parse of p if "(" was found
 -- then match ")"
 -- return the result

 

Parsing alternatives

 

 

    • Often we want to try one parser; if that fails, then try another one instead. The choice combinator <|> provides this functionality.

 

    • Example: (letter_digit) will match either a letter or a digit.

 

 

letter_digit :: Parser Char
letter_digit =
 do x <- letter <|> digit
 return x

 

Running alternative parsers

 

Prelude Text.ParserCombinators.Parsec> parseTest letter_digit "b2"
"b"

Prelude Text.ParserCombinators.Parsec> parseTest letter_digit "2b"
"2"

Prelude Text.ParserCombinators.Parsec> parseTest letter_digit "*2"
parse error at (line 1, column 1):
unexpected "*"
expecting letter or digit

 

Parsing alternative strings

 

Suppose we want to match either bag or bog, but nothing else.

 

bag_bog :: Parser String
bag_bog =
 do xs <- string "bag" <|> string "bog"
 return xs

 

Failed alternative consumes input

 

So far so good:

 

Prelude Text.ParserCombinators.Parsec> parseTest bag_bog "bag"
"bag"

 

And a non-matching string fails, as expected.

 

Prelude Text.ParserCombinators.Parsec> parseTest bag_bog "bug"
parse error at (line 1, column 1):
unexpected "u"
expecting "bag"

 

But there’s a problem!

 

Prelude Text.ParserCombinators.Parsec> parseTest bag_bog "bog"
parse error at (line 1, column 1):
unexpected "o"
expecting "bag"

 

The first parser string “bag” matched the b but then failed on the a. It has now consumed the b. The second parser string “bog” now tries to match b against o, which of course fails.

 

try — don’t consume input on failed parse

 

To allow you to parse tentatively without consuming any input, Parsec provide the try function:

 

bag_bog_try :: Parser String
bag_bog_try =
 do xs <- try (string "bag") <|> string "bog"
 return xs

 

Trying a parse without consuming input

 

Prelude Text.ParserCombinators.Parsec> parseTest bag_bog_try "bag"
"bag"

 

Prelude Text.ParserCombinators.Parsec> parseTest bag_bog_try "bug"
parse error at (line 1, column 1):
unexpected "u"
expecting "bog"

 

Prelude Text.ParserCombinators.Parsec> parseTest bag_bog_try "bog"
"bog"

 

Some parsers from the library

 

The Parsec library provides some small parsers that are useful for defining bigger ones:

 

 

    • (char; “?”) — (char) is applied to a character, and it gives a parser that matches that character

 

    • (letter) — matches any letter

 

    • (digit) — matches any digit

 

    • (string) — matches a string of characters

 

    • (stringLiteral; “xyz*”) — matches the string argument

 

    • (many; p) — matches 0 or more occurrences of parser (p)

 

    • (many1; p) — matches 1 or more occurrences of parser (p)

 

 

Variable names

 

varname :: Parser String
varname =
 do x <- letter
 xs <- many (letter <|> digit)
 return (x:xs)

 

Prelude Text.ParserCombinators.Parsec> parseTest varname "a4cc7*5"
"a4cc7"
Prelude Text.ParserCombinators.Parsec> parseTest varname "34a"
parse error at (line 1, column 1):
unexpected "3"
expecting letter

 

Expression parsers

 

 

    • Arithmetic expressions are complex to parse because of the rules of precedence and the arity of the operators.

 

    • Parsec provides support for expression parsing, so you don’t have to write your own expression parser.

 

 

expr_parser :: Parser Expr
expr_parser = buildExpressionParser optable term <?> "expression"

optable =
 let
 op name assoc =
 Infix ( do { reservedOp name;
 return (x y ->(Op (MkOpExpr name x y))) } ) assoc
 prefix name =
 Prefix (
 reservedOp name >>
 return (x->(Pref (MkPrefixOpExpr name x))) )
 in
 [ [ op "*" AssocLeft, op "/" AssocLeft, op "%" AssocLeft ]
 , [ op "+" AssocLeft, op "-" AssocLeft ], [ prefix "-" ] ]

 

This example uses some additional monad syntax: you can use braces and semicolons instead of indentation; and the >> operator is also a shorter way of writing the do notation:

 

do
 expr1
 expr2

 

can be written as

 

expr1 >> expr2

 

Also note the use of the <?> operator, this is used to define a custom error message in case a parse fails without consuming any input. This is a very useful debugging feature.

 

 

    • Parsec also has support for programming languages with a mechanism to define the syntax and keywords through makeTokenParser.

 

    • For simple cases, you can use emptyDef.

 

 

import Text.ParserCombinators.Parsec.Expr
import qualified Text.ParserCombinators.Parsec.Token as P
lexer = P.makeTokenParser emptyDef

parens = P.parens lexer
commaSep = P.commaSep lexer
-- and many more
© University of Glasgow
This article is from the free online

Functional Programming in Haskell: Supercharge Your Coding

Created by
FutureLearn - Learning For Life

Reach your personal and professional goals

Unlock access to hundreds of expert online courses and degrees from top universities and educators to gain accredited qualifications and professional CV-building certificates.

Join over 18 million learners to launch, switch or build upon your career, all at your own pace, across a wide range of topic areas.

Start Learning now