Contact FutureLearn for Support
Skip main navigation
We use cookies to give you a better experience, if that’s ok you can close this message and carry on browsing. For more info read our cookies policy.
We use cookies to give you a better experience. Carry on browsing if you're happy with this, or read our cookies policy for more information.

Skip to 0 minutes and 5 secondsWIM: In this tutorial, I want to explain how you can use the Parsec parser combinator library for practical parsing tasks.

Skip to 0 minutes and 16 secondsYou have learned before how to use the Show typeclass and the show function to print out Haskell data structures. So in this tutorial, I want to show you how you can create a parser for the output from the Show function and then turn it into XML, for example.

Skip to 0 minutes and 36 secondsSo let's start by creating a few data structures.

Skip to 0 minutes and 43 secondsSo we create a data structure PersonRecord just as an example, where we have the name of the person, the address, some integer identifier, and a few labels. So the Address is a data structure in its own right and so are the labels. They are an algebraic data structure like this.

Skip to 1 minute and 8 secondsNote that we have used deriving Show so that you don't have to write your own Show function. By using this deriving clause, GHC will automatically create the show function for these data structures. And it is the output of this automatically created show function that we are going to parse. So let's create a few instances of this data type that we have here. So let's say record 1 is, and record 2. And now we can use show on these two records, and this is the output that show gives us. So it's this output that we now want to transform into XML using the Parsec library. So let's go and build a parser using the Parsec library.

Skip to 2 minutes and 17 secondsI'm creating a module showParser, which exports a single function ParseShow, which is going to parse the output from Show and turn it into XML. To use Parsec in practice, there are a number of modules that you usually need. First, we have the actual parser combinators, then we have a library of token parsers, and then we have a handy support library called Parsec.Language, which defines the basics of the programming language by defining what is an identifier, what is a comment, and things like that. But the Parsec.Token library exports a function makeTokenParser and this function takes a language definition.

Skip to 3 minutes and 1 secondNow, because we're not really having a programming language where we use emptyDef, if you want to know more about the language definitions, I suggest you read the Parsec manual. So then, given this lexer, so basically, the definition of what are identifiers and so on in our system, we can now build a number of parsers very conveniently like this. So the library finds a number of parsers for a given lexer, and now you can actually parse parentheses, brackets, braces, comma-separated lists, white spaces, symbols being predefined strings, identifiers being whatever is an identifier in your language, integers, and so on. There are a lot more of them, and again, you can look at the Parsec library.

Skip to 3 minutes and 48 secondsSo with this boilerplate, we can start making the actual parser, and this will be a parser that takes a string, being the output from the Show function, and it produces a new string, which is the XML representation of this input string. So we have a function with signature, ParseShow, string to string. And the way this works is that we have a function that creates-- that is our actual parser, and this parser is a monadic parser, which means, basically, that it consists of functions that you put together. So you actually produce one big combined function, and then you apply that function to the string. And to apply this function to the string, we use another function, run_parser, that looks as follows.

Skip to 4 minutes and 43 secondsSo run_parser is independent of the actual input of the parser. So you have a parser of a certain data type and a string, and it produces this output data type that you desire. So in our case, the data type that we desire is a string, so see here is a string. And what it does is it applies our parser to an input string, and then if there's an error, it reports an error. Or if there's no error, it returns the result. So this function, run_parser, is kind of generic. So you can always reuse it. In our case, the parser that we want to build, we will call showParser, and that's a parser of string. So it returns a string.

Skip to 5 minutes and 28 secondsSo with this showParser and run_parser, we can now build our actual ParseShow function. So we say, ParseShow equals simply like this. So maybe we want to make this a little bit more complete. To emit valid XML, we can tag this with an actual XML header string. So let's say, we define the XML header string as, and then we can actually change this function a little bit.

Skip to 6 minutes and 7 secondsIn the opening tag, we must actually create, not just the name of the tag, but also the list of attributes. So the list of attributes is a list of tuples for the name of the attribute and the value of the attribute. And so what we do is we apply a map of a function that turns this tuple into a string, where we have the key, and then equals quote, , value, quote. Now, so this approach here for string concatenation is very convenient, rather than using the ++ in terms of readability. So concat will simply concatenate all of the elements of a list. With these tag manipulation functions, we are almost ready to create the actual parses.

Skip to 6 minutes and 52 secondsThere's another function that is very handy to use. It's a function that will join elements of a list using newline. So let's build that function. We have join. And I use the function intercalate. So intercalate is a function that takes all the elements of a list and joins them together, interspersed with the particular character that is the first argument of this function. But this function intercalate is not part of the standard prelude, so we have to add it to our import list here. With all this machinery ready, we can now create the rest of our parsers.

Skip to 7 minutes and 32 secondsFor instance, if we want to parse a list, so we parse brackets, and then a comma-separated value of whatever can be inside the list till we call showParser again. And the result is the list of the elements that we parsed. What we do then is we tag this list with the type list, and then we join them with newlines, and each element is surrounded by a tag list element. The other parsers are quite similar. So I'll just put them all here. We have a tuple_parser, which parses parentheses and then comma-separated. So that's a tuple, and then drops it into tuple tags.

Skip to 8 minutes and 15 secondsWe have the record_parser, which first parses the name of the record, and then it has braces, and then it has a comma-separated list of key values. So for key values, I have an additional parser, kvparser, here, which parses an identifier, equal sign, and then whatever. And then it returns this wrapped in a tag element, and then it has keys and values. And finally, we parse the type_identifier. So type_identifiers are defined as strings that start with an uppercase letter and then one or more alphanumeric letters. This is our Parser that we use to parse the labels. So with all these things together, we can actually parse a show output and turn it into XML.

Skip to 9 minutes and 0 secondsSo of course, we need to change our main program a little bit to make that happen. First of all, we must import the parser that we just wrote. And then, obviously, we must use it, but that's very simple. So we change the main program just a little bit.

Skip to 9 minutes and 22 secondsSo we call show on the records that we created and put the result in the record string. We call ParseShow on the record string and that's our main program.

Skip to 9 minutes and 35 secondsSo now we can try this out. And, indeed, we have created an XML document that represents the output of show called on our custom data structure. So this is an example of how you can use Parsec to easily transform strings into different strings, for instance, XML documents.

Parsing using Parsec: a practical example

The aim of this tutorial it to explain step by step how to build a simple parser using the Parsec library.

The source code can be found on GitHub

Parsing the output of derived Show

The polymorphic function show returns a string representation of any data type that is an instance of the type class Show. The easiest way to make a data type an instance of type class is using the deriving clause.

    show :: Show a => a -> String
    data D = D ... deriving (Show)
    d :: D
    d = D ...
    str :: String
    str = show d -- string representation of the instance of the data type

In this tutorial we will show how to create a parser that will parse the output of a derived show and return it in XML format.

    parseShow :: String -> String
    xml = parseShow $ show res

Example data type

First we create a data type PersonRecord

data PersonRecord  = MkPersonRecord {
    name :: String,
    address :: Address,
    id :: Integer,
    labels :: [Label]    
} deriving (Show)

The types Address and Label are defined as follows:

data Address = MkAddress {
    line1 :: String,
    number :: Integer,
    street :: String,
    town :: String,
    postcode :: String
} deriving (Show)

data Label = Green | Red | Blue | Yellow deriving (Show)

We derive Show using the deriving clause. The compiler will automatically create the show function for this data type. Our parser will parse the output of this automatically derived show function.

Then we create some instances of PersonRecord:

rec1 = MkPersonRecord 
    "Wim Vanderbauwhede" 
    (MkAddress "School of Computing Science" 17 "Lilybank Gdns" "Glasgow" "G12 8QQ")
    557188
    [Green, Red]

rec2 = MkPersonRecord 
    "Jeremy Singer" 
    (MkAddress "School of Computing Science" 17 "Lilybank Gdns" "Glasgow" "G12 8QQ")
    42
    [Blue, Yellow]

We can test this very easily:

main = putStrLn $ show [rec1,rec2]    

This program produces the following output:

    [wim@workai HaskellParsecTutorial]$ runhaskell test_ShowParser_1.hs 
    [MkPersonRecord {name = "Wim Vanderbauwhede", address = MkAddress {line1 = "School of Computing Science", number = 17, street = "Lilybank Gdns", town = "Glasgow", postcode = "G12 8QQ"}, id = 557188, labels = [Green,Red]},MkPersonRecord {name = "Jeremy Singer", address = MkAddress {line1 = "School of Computing Science", number = 17, street = "Lilybank Gdns", town = "Glasgow", postcode = "G12 8QQ"}, id = 42, labels = [Blue,Yellow]}]

The derived Show format can be summarized as follows:

  • Lists: [… comma-separated items …]
  • Records: { ... comma-separated key-value pairs ...}
  • Strings: “…”
  • Algebraic data types: variant type name

Building the parser

We create a module ShowParser which exports a single function parseShow:

module ShowParser ( parseShow ) where

Some boilerplate:

    import Text.ParserCombinators.Parsec
    import qualified Text.ParserCombinators.Parsec.Token as P
    import Text.ParserCombinators.Parsec.Language

The Parsec.Token module provides a number of basic parsers. Each of these takes as argument a lexer, generated by makeTokenParser using a language definition. Here we use emptyDef from the Language module.

It is convenient to create a shorter name for the predefined parsers you want to use, e.g.

    parens = P.parens lexer
    -- and similar

The parser

The function parseShow takes the output from show (a String) and produces the corresponding XML (also a String). It is composed of the actual parser showParser and the function run_parser which applies the parser to a string.

    parseShow :: String -> String
    parseShow = run_parser showParser

    showParser :: Parser String

    run_parser :: Parser a -> String -> a
    run_parser p str =  case parse p "" str of
        Left err -> error $ "parse error at " ++ (show err)
        Right val  -> val  

The XML format

We define an XML format for a generic Haskell data structure. We use some helper functions to create XML tags with and without attributes.

<?xml version="1.0" encoding="utf-8"?>
    xml_header =  "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"

Tags:

<tag> ... </tag>
    otag t = "<"++t++">"
    ctag t = "</"++t++">"
    tag t v = concat [otag t,v,ctag t]

Attributes:

<tag attr1="..." attr2="...">
    tagAttrs :: String -> [(String,String)] -> String -> String 
    tagAttrs t attrs v = 
        concat [
            otag (unwords $ [t]++(map (\(k,v) -> concat [k,"=\"",v,"\""]) attrs))
            ,v
            ,ctag t
            ]

We also use some functions to join strings together. From the Prelude we take:

    concat :: [[a]] -> [a] -- join lists
    unwords :: [String] -> String -- join words using spaces

We also define a function to join strings with newline characters:

    joinNL :: [String] -> String -- join lines using "\"

This is identical to unlines from the Prelude, just to illustrate the use of intercalate and the Data.List module.

Parsers for the derived Show format

Lists

[ ..., ..., ... ]

XML :

<list>
<list-elt>...</list-elt>
...
</list>
    list_parser = do
        ls <- brackets $ commaSep showParser 
        return $ tag "list" $ joinNL $ map (tag "list-elt") ls

Tuples

: ( ..., ..., ... )

XML:

<tuple>
<tuple-elt>...</tuple-elt>
...
</tuple>
tuple_parser = do
    ls <- parens $ commaSep showParser 
    return $ tag "tuple" $ unwords $ map (tag "tuple-elt") ls

Record types

Rec { k=v, ... }

XML:

<record>
<elt key="k">v</elt>
...
</record>

key-value pairs: k = v -- v can be anything
record_parser = do
    ti <- type_identifier
    ls <- braces $ commaSep kvparser
    return $ tagAttrs "record" [("name",ti)] (joinNL ls)

kvparser = do
    k <- identifier
    symbol "="
    t <- showParser
    return $ tagAttrs "elt" [("key",k)] t
    
type_identifier = do
    fst <- oneOf ['A' .. 'Z']
    rest <- many alphaNum
    whiteSpace
    return $ fst:rest    

Algebraic data types

e.g. Label

XML:

<adt>Label</adt>
adt_parser = do
    ti <- type_identifier 
    return $ tag "adt" ti

Quoted strings and numbers

quoted_string = do
    s <- stringLiteral
    return $ "\""++s++"\""

number = do
    n <- integer
    return $ show n

Complete parser

Combine all parsers using the choice combinator <|>.

    showParser :: Parser String 
    showParser =
        list_parser <|> -- [ ... ]
        tuple_parser <|> -- ( ... )
        try record_parser <|> -- MkRec { ... }
        adt_parser <|> -- MkADT ...
        number <|>    -- signed integer
        quoted_string <?> "Parse error"

Parsec will try all choices in order of occurrence. Remember that try is used to avoid consuming the input.

Main program

Import the parser module

    import ShowParser (parseShow)

Use the parser

    rec_str =  show [rec1,rec2]    
    main = putStrLn $ parseShow rec_str

Test it:

[wim@workai HaskellParsecTutorial]$ runhaskell test_ShowParser.hs 
<?xml version="1.0" encoding="UTF-8"?>
<list><list-elt><record name="MkPersonRecord"><elt key="name">"Wim Vanderbauwhede"</elt>
<elt key="address"><record name="MkAddress"><elt key="line1">"School of Computing Science"</elt>
<elt key="number">17</elt>
<elt key="street">"Lilybank Gdns"</elt>
<elt key="town">"Glasgow"</elt>
<elt key="postcode">"G12 8QQ"</elt></record></elt>
<elt key="id">557188</elt>
<elt key="labels"><list><list-elt><adt>Green</adt></list-elt>
<list-elt><adt>Red</adt></list-elt></list></elt></record></list-elt>
<list-elt><record name="MkPersonRecord"><elt key="name">"Jeremy Singer"</elt>
<elt key="address"><record name="MkAddress"><elt key="line1">"School of Computing Science"</elt>
<elt key="number">17</elt>
<elt key="street">"Lilybank Gdns"</elt>
<elt key="town">"Glasgow"</elt>
<elt key="postcode">"G12 8QQ"</elt></record></elt>
<elt key="id">42</elt>
<elt key="labels"><list><list-elt><adt>Blue</adt></list-elt>
<list-elt><adt>Yellow</adt></list-elt></list></elt></record></list-elt></list>

Summary

  • Parsec makes it easy to build powerful text parsers from building blocks using predefined parsers and parser combinators.
  • The basic structure of a Parsec parser is quite generic and reusable
  • The example shows how to parse structured text (output from Show) and generate an XML document containing the same information.

Share this video:

This video is from the free online course:

Functional Programming in Haskell: Supercharge Your Coding

University of Glasgow

Course highlights Get a taste of this course before you join:

  • Haskell history poster
    Brief History of Haskell
    article

    What are the origins of the Haskell programming language? In this article, Dr Jeremy Singer explores the history of Haskell.