Parsec Tutorial: How to Build a Text Parser in Haskell

This screencast by Wim Vanderbauwhede shows step by step how to build a text parser in Haskell using the Parsec library and how to create XML output.
5.2
WIM: In this tutorial, I want to explain how you can use the Parsec parser combinator library for practical parsing tasks.
16.1
You have learned before how to use the Show typeclass and the show function to print out Haskell data structures. So in this tutorial, I want to show you how you can create a parser for the output from the Show function and then turn it into XML, for example.
35.6
So let’s start by creating a few data structures.
43.2
So we create a data structure PersonRecord just as an example, where we have the name of the person, the address, some integer identifier, and a few labels. So the Address is a data structure in its own right and so are the labels. They are an algebraic data structure like this.
67.5
Note that we have used deriving Show so that you don’t have to write your own Show function. By using this deriving clause, GHC will automatically create the show function for these data structures. And it is the output of this automatically created show function that we are going to parse. So let’s create a few instances of this data type that we have here. So let’s say record 1 is, and record 2. And now we can use show on these two records, and this is the output that show gives us. So it’s this output that we now want to transform into XML using the Parsec library. So let’s go and build a parser using the Parsec library.
136.5
I’m creating a module showParser, which exports a single function ParseShow, which is going to parse the output from Show and turn it into XML. To use Parsec in practice, there are a number of modules that you usually need. First, we have the actual parser combinators, then we have a library of token parsers, and then we have a handy support library called Parsec.Language, which defines the basics of the programming language by defining what is an identifier, what is a comment, and things like that. But the Parsec.Token library exports a function makeTokenParser and this function takes a language definition.
180.6
Now, because we’re not really having a programming language where we use emptyDef, if you want to know more about the language definitions, I suggest you read the Parsec manual. So then, given this lexer, so basically, the definition of what are identifiers and so on in our system, we can now build a number of parsers very conveniently like this. So the library finds a number of parsers for a given lexer, and now you can actually parse parentheses, brackets, braces, comma-separated lists, white spaces, symbols being predefined strings, identifiers being whatever is an identifier in your language, integers, and so on. There are a lot more of them, and again, you can look at the Parsec library.
228.1
So with this boilerplate, we can start making the actual parser, and this will be a parser that takes a string, being the output from the Show function, and it produces a new string, which is the XML representation of this input string. So we have a function with signature, ParseShow, string to string. And the way this works is that we have a function that creates– that is our actual parser, and this parser is a monadic parser, which means, basically, that it consists of functions that you put together. So you actually produce one big combined function, and then you apply that function to the string. And to apply this function to the string, we use another function, run_parser, that looks as follows.
283.2
So run_parser is independent of the actual input of the parser. So you have a parser of a certain data type and a string, and it produces this output data type that you desire. So in our case, the data type that we desire is a string, so see here is a string. And what it does is it applies our parser to an input string, and then if there’s an error, it reports an error. Or if there’s no error, it returns the result. So this function, run_parser, is kind of generic. So you can always reuse it. In our case, the parser that we want to build, we will call showParser, and that’s a parser of string. So it returns a string.
328.3
So with this showParser and run_parser, we can now build our actual ParseShow function. So we say, ParseShow equals simply like this. So maybe we want to make this a little bit more complete. To emit valid XML, we can tag this with an actual XML header string. So let’s say, we define the XML header string as, and then we can actually change this function a little bit.
366.5
In the opening tag, we must actually create, not just the name of the tag, but also the list of attributes. So the list of attributes is a list of tuples for the name of the attribute and the value of the attribute. And so what we do is we apply a map of a function that turns this tuple into a string, where we have the key, and then equals quote, , value, quote. Now, so this approach here for string concatenation is very convenient, rather than using the ++ in terms of readability. So concat will simply concatenate all of the elements of a list. With these tag manipulation functions, we are almost ready to create the actual parses.
412.2
There’s another function that is very handy to use. It’s a function that will join elements of a list using newline. So let’s build that function. We have join. And I use the function intercalate. So intercalate is a function that takes all the elements of a list and joins them together, interspersed with the particular character that is the first argument of this function. But this function intercalate is not part of the standard prelude, so we have to add it to our import list here. With all this machinery ready, we can now create the rest of our parsers.
452.1
For instance, if we want to parse a list, so we parse brackets, and then a comma-separated value of whatever can be inside the list till we call showParser again. And the result is the list of the elements that we parsed. What we do then is we tag this list with the type list, and then we join them with newlines, and each element is surrounded by a tag list element. The other parsers are quite similar. So I’ll just put them all here. We have a tuple_parser, which parses parentheses and then comma-separated. So that’s a tuple, and then drops it into tuple tags.
495.4
We have the record_parser, which first parses the name of the record, and then it has braces, and then it has a comma-separated list of key values. So for key values, I have an additional parser, kvparser, here, which parses an identifier, equal sign, and then whatever. And then it returns this wrapped in a tag element, and then it has keys and values. And finally, we parse the type_identifier. So type_identifiers are defined as strings that start with an uppercase letter and then one or more alphanumeric letters. This is our Parser that we use to parse the labels. So with all these things together, we can actually parse a show output and turn it into XML.
539.7
So of course, we need to change our main program a little bit to make that happen. First of all, we must import the parser that we just wrote. And then, obviously, we must use it, but that’s very simple. So we change the main program just a little bit.
561.7
So we call show on the records that we created and put the result in the record string. We call ParseShow on the record string and that’s our main program.
575.2
So now we can try this out. And, indeed, we have created an XML document that represents the output of show called on our custom data structure. So this is an example of how you can use Parsec to easily transform strings into different strings, for instance, XML documents.

4.8

28 Reviews
The aim of this tutorial is to explain step by step how to build a simple parser using the Parsec library.
The source code can be found on GitHub

Parsing the output of derived Show

The polymorphic function show returns a string representation of any data type that is an instance of the type class Show. The easiest way to make a data type an instance of type class is using the deriving clause.

 show :: Show a => a -> String
data D = D ... deriving (Show)
d :: D
d = D ...
str :: String
str = show d -- string representation of the instance of the data type


In this tutorial we will show how to create a parser that will parse the output of a derived show and return it in XML format.

 parseShow :: String -> String
xml = parseShow $show res  Example data type First we create a data type PersonRecord data PersonRecord = MkPersonRecord { name :: String, address :: Address, id :: Integer, labels :: [Label] } deriving (Show)  The types Address and Label are defined as follows: data Address = MkAddress { line1 :: String, number :: Integer, street :: String, town :: String, postcode :: String } deriving (Show) data Label = Green | Red | Blue | Yellow deriving (Show)  We derive Show using the deriving clause. The compiler will automatically create the show function for this data type. Our parser will parse the output of this automatically derived show function. Then we create some instances of PersonRecord: rec1 = MkPersonRecord "Wim Vanderbauwhede" (MkAddress "School of Computing Science" 17 "Lilybank Gdns" "Glasgow" "G12 8QQ") 557188 [Green, Red] rec2 = MkPersonRecord "Jeremy Singer" (MkAddress "School of Computing Science" 17 "Lilybank Gdns" "Glasgow" "G12 8QQ") 42 [Blue, Yellow]  We can test this very easily: main = putStrLn$ show [rec1,rec2]


This program produces the following output:

 [wim@workai HaskellParsecTutorial]$runhaskell test_ShowParser_1.hs [MkPersonRecord {name = "Wim Vanderbauwhede", address = MkAddress {line1 = "School of Computing Science", number = 17, street = "Lilybank Gdns", town = "Glasgow", postcode = "G12 8QQ"}, id = 557188, labels = [Green,Red]},MkPersonRecord {name = "Jeremy Singer", address = MkAddress {line1 = "School of Computing Science", number = 17, street = "Lilybank Gdns", town = "Glasgow", postcode = "G12 8QQ"}, id = 42, labels = [Blue,Yellow]}]  The derived Show format can be summarized as follows: • Lists: [… comma-separated items …] • Records: { … comma-separated key-value pairs …} • Strings: “…” • Algebraic data types: variant type name Building the parser We create a module ShowParser which exports a single function parseShow: module ShowParser ( parseShow ) where  Some boilerplate:  import Text.ParserCombinators.Parsec import qualified Text.ParserCombinators.Parsec.Token as P import Text.ParserCombinators.Parsec.Language  The Parsec.Token module provides a number of basic parsers. Each of these takes as argument a lexer, generated by makeTokenParser using a language definition. Here we use emptyDef from the Language module. It is convenient to create a shorter name for the predefined parsers you want to use, e.g.  parens = P.parens lexer -- and similar  The parser The function parseShow takes the output from show (a String) and produces the corresponding XML (also a String). It is composed of the actual parser showParser and the function run_parser which applies the parser to a string.  parseShow :: String -> String parseShow = run_parser showParser showParser :: Parser String run_parser :: Parser a -> String -> a run_parser p str = case parse p "" str of Left err -> error$ "parse error at " ++ (show err)
Right val -> val


The XML format

We define an XML format for a generic Haskell data structure. We use some helper functions to create XML tags with and without attributes.

<?xml version="1.0" encoding="utf-8"?>


 xml_header = "<?xml version="1.0" encoding="UTF-8"?>n"


Tags:

<tag> ... </tag>


 otag t = "<"++t++">"
ctag t = "</"++t++">"
tag t v = concat [otag t,v,ctag t]


Attributes:

<tag attr1="..." attr2="...">


 tagAttrs :: String -> [(String,String)] -> String -> String
tagAttrs t attrs v =
concat [
return $tag "tuple"$ unwords $map (tag "tuple-elt") ls  Record types Rec { k=v, ... }  XML: <record> <elt key="k">v</elt> ... </record> key-value pairs: k = v -- v can be anything  record_parser = do ti <- type_identifier ls <- braces$ commaSep kvparser
return $tagAttrs "record" [("name",ti)] (joinNL ls) kvparser = do k <- identifier symbol "=" t <- showParser return$ tagAttrs "elt" [("key",k)] t

type_identifier = do
fst <- oneOf ['A' .. 'Z']
rest <- many alphaNum
whiteSpace
return $fst:rest  Algebraic data types e.g. Label  XML: <adt>Label</adt>  adt_parser = do ti <- type_identifier return$ tag "adt" ti


Quoted strings and numbers

quoted_string = do
s <- stringLiteral
return $"""++s++""" number = do n <- integer return$ show n


Complete parser

Combine all parsers using the choice combinator <|>.

 showParser :: Parser String
showParser =
list_parser <|> -- [ ... ]
tuple_parser <|> -- ( ... )
try record_parser <|> -- MkRec { ... }
number <|> -- signed integer
quoted_string <?> "Parse error"


Parsec will try all choices in order of occurrence.
Remember that try is used to avoid consuming the input.

Main program

Import the parser module

 import ShowParser (parseShow)


Use the parser

 rec_str = show [rec1,rec2]
main = putStrLn $parseShow rec_str  Test it: [wim@workai HaskellParsecTutorial]$ runhaskell test_ShowParser.hs
<?xml version="1.0" encoding="UTF-8"?>
<list><list-elt><record name="MkPersonRecord"><elt key="name">"Wim Vanderbauwhede"</elt>
<elt key="number">17</elt>
<elt key="street">"Lilybank Gdns"</elt>
<elt key="town">"Glasgow"</elt>
<elt key="postcode">"G12 8QQ"</elt></record></elt>
<elt key="id">557188</elt>
<list-elt><record name="MkPersonRecord"><elt key="name">"Jeremy Singer"</elt>
<elt key="number">17</elt>
<elt key="street">"Lilybank Gdns"</elt>
<elt key="town">"Glasgow"</elt>
<elt key="postcode">"G12 8QQ"</elt></record></elt>
<elt key="id">42</elt>


Summary

• Parsec makes it easy to build powerful text parsers from building blocks using predefined parsers and parser combinators.
• The basic structure of a Parsec parser is quite generic and reusable
• The example shows how to parse structured text (output from Show) and generate an XML document containing the same information.