Parsing

Haskell and XML

Submitted by mrd on Sun, 08/05/2007 - 8:59am.

I installed HXT this week to use in a small project which dealt with some XML data. It is a slick library, and it has introduced me to the joy of Arrows. I scavenged for tutorial information, and there are some useful pages out there -- The Haskellwiki page on HXT, and Neil Bartlett's recent blog entry.

However, when it came down to getting some practical work done, I found myself scouring the HXT Arrow API documentation. While it is pretty neat to figure out which combinator to use solely based on the examination of types, I would have preferred a selection of code examples.

That is what Practical Examples of HXT in Action is attempting to be. I've concocted 3 articles so far, and I'm hoping to add more. The current examples demonstrate: arrow syntax, basic useful combinators, a little development methodology, optional and list data, and simple integration with Network.HTTP.

Simple Parsec Example: HTMangL

Submitted by mrd on Sun, 01/28/2007 - 11:01pm.
::

Writing bird-style Literate Haskell is working out pretty well for me. Actually, I prefer latex-style, but bird-style works well for short articles with little to no math. Bird-style is also fairly easy to integrate with HTML markup; except for the fact that it uses '>' to designate code. Browsers vary, but most seem to handle the non-XHTML compliant markup with few snags. However, I got sick of dealing with those snags as they can be difficult to spot on occasion.

I put together a small program to deal with this mess, and also as a small Parsec usage tutorial. This program processes bird-style Literate Haskell input and outputs a converted version with the code-blocks surrounded by <code> tags while also converting >, <, and & to entity designators.

> module Main where > import Text.ParserCombinators.Parsec

A few combinators are built up from smaller ones. eol and tilEOL parse the remaining characters in a line up to and including the newline character (or EOF).

> eol = newline <|> (eof >> return '\n') > tilEOL = manyTill (noneOf "\n") eol

A line of code begins with "> " and continues til EOL.

> codeLine = do > string "> " > code <- tilEOL > return $ "> " ++ code

A non-blank literate line can begin with any character but newline, and if it begins with '>' then it cannot be followed by a space. To those coming from imperative backgrounds, the return () does not return from the function but rather returns () to the monad; here it is used as a no-op. The rest of the line is treated as above.

> litLine = do > ch <- noneOf "\n" > if ch == '>' then > notFollowedBy space > else > return () > text <- tilEOL > return $ ch:text

A blank line is one which begins with a newline.

> blankLine = char '\n' >> return ""

Blocks of code and literate lines (or blanks) are simply multiple consecutive lines (at least 1).

> code = many1 (try codeLine) > lit = many1 (try litLine <|> blankLine)
> data LiterateCode = Literate [String] > | Code [String] > deriving (Show, Eq)

A literate Haskell file is composed of many Code and Literate blocks. These are unified in one disjoint type, LiterateCode, and the combinator below ensures that the appropriate tag is applied to the results of parsing.

> literateCode = many (Code `fmap` code <|> Literate `fmap` lit)

A block of literate text is printed literally, but code must be processed slightly.

> printBlock (Literate ls) = mapM_ putStrLn ls > printBlock (Code cs) = do > putStrLn "<code>\n" > mapM_ (putStrLn . subEntities) cs > putStrLn "\n</code><br/>"

In case you were wondering how this works: it maps the function over each character in the input string and concatenates the resulting list of strings.

> subEntities = (>>= \c -> > case c of > '>' -> "&gt;" > '<' -> "&lt;" > '&' -> "&amp;" > c -> [c])

Really simple: work on stdin, print to stdout.

> main = do > s <- getContents > case parse literateCode "stdin" s of > Left err -> putStr "Error: " >> print err > Right cs -> mapM_ printBlock cs

Naturally, the first candidate code to run this program on is this program itself: I call it HTMangL.

Parser with Writer monad

Submitted by jmuk on Wed, 11/29/2006 - 1:07am.
::

I tried to write a parser for bencode format. Bencode is normally used in BitTorrent only, but can be used for generally purpose.

In order to write the parser with parser combinators libraries such as Parsec, I encounter a problem. I applied Packrat Parsing to JSON parser, but I may use a sledge hummer to kill a fly. JSON is a simple format, it is very easy to parse with normal string-related functions. Packrat Parsing is too strong mechanism.

However, my point is not power of parser library, but memory copying. Normally, the parse result is copied from the original text by parser library. On the other hand, ByteString head, tail, init, last, take, drop, and similar functions do not copy any text data, but just modifies the offset and length of the text. If the parsed data has many text data, the parser must do the memory copy that is not potentially required.

The parser combinator library that does this is a challengeable task -- but, here I applies a small approach to parse bencode format. See http://darcs.haskell.org/SoC/haskellnet/Text/Bencode.hs

Here, all parsers are Writer monad.
Writer monad does `Computations which produce a stream of data in addition to the computed values' (came from All about monads). In parser context, `a stream' corresponds to the parsed result, and the computed value to rest of text.

We can easily combines the parser for `list of nodes' or `dictionary of nodes', using a function untilM :: (a -> Bool) -> (a -> m a) -> a -> m a.

It is true that this simplest approach lacks some important features. For example, this parser do not contains line and character information, which means that the parse error cannot print appropriate debug info. In despite of such fault, this parser works well for our purpose. Debug information will be the next step.
BTW, debug info is not a serious problem for bencode parser, because this format is not human-readable. In such case, the required information is whether the format is valid or not.

packrat parsing, and idea of `ParserT'

Submitted by jmuk on Wed, 08/16/2006 - 1:05am.

I use Packrat Parsing for the parser of HaskellNet.
see, http://darcs.haskell.org/SoC/haskellnet/Text/
and, more detailed information of Packrat Parsing are seen at http://pdos.csail.mit.edu/~baford/packrat/

Packrat Parsing is easy to use and develop.

Of course, there already exists Parsec. However, Parsec cannot be applied with ByteString. Parsec are intended for lists of tokens. With Packrat Parsing, we must define dvChar -- a function to calculate the `next' character of a stream -- by ourselves, so it is easier to use with ByteString. It's great.

Then, I want a kind of `ParserT'. For example, IMAP server response may includes `status updates' informations. Server may respond as follows:

* 22 EXPUNGE
* 23 EXISTS
* 3 RECENT
* 14 FETCH (FLAGS (\Seen \Deleted))
* CAPABILITY IMAP4rev1 STARTTLS AUTH=GSSAPI LOGINDISABLED
abcd OK CAPABILITY completed

This response is primarily for `capability' of the server, but this response also includes the `new' information of current mailbox.

Currently, I split these responses as (ServerResponse, MailboxUpdate, ResponseData). ServerResponse is OK, BAD, NO and so on. ResponseData is [String] in this case. MailboxUpdate is Recent 3 and so on. After parsed, the connection data updates its mailbox information.

If we have MonadTrans of Parser, we can update the mailbox information at the time of parsing `3 Recent'. Then, there are no need to prepare MailboxUpdate type and such confusing structure.
In other cases, `ParserT' seems to be useful.

I thought this problem a little but it seems difficult. Are there any ideas?
I'd like to think later...

instance Read a, given a Parsec parser

Submitted by dtlin on Fri, 10/07/2005 - 1:46pm.

Given any type a, for which there exists an parsecRead :: Parser a, the instance Read a can be defined really easily.
But I didn't find it in the Parsec docs.  Maybe I wasn't looking hard enough.

{- This code is dedicated to the public domain. -}
import Text.ParserCombinators.Parsec
class ParsecRead a where
  parsecRead :: Parser a
instance (ParsecRead a) => Read a where
  readsPrec _ = either (const []) id . parse parsecRead' "" where
    parsecRead' = do a <- parsecRead
                     rest <- getInput
                     return [(a, rest)]

It could be used like this:
data Foo = Foo
instance ParsecRead Foo where
  parsecRead = string "foo" >> return Foo
-- instance Read Foo

Defining instance Read a requires GHC's -fallow-undecidable-instances, though.
Well, you could always just repeat the readsPrec definition for everything you want, but it seems to me that this should exist just because it's convenient and common enough.

All About Monads, A comprehensive guide to the theory and practice of monadic programming in Haskell.

Submitted by shapr on Wed, 02/23/2005 - 12:44pm.

Jeff Newbern's All About Monads is the best monad tutorial I've seen yet!

This tutorial starts with the most basic definition of a monad, and why you might want one. It covers most of the monad instances in the standard libraries, and also includes monad transformers. It wraps up nicely with links to Parsec, category theory, and arrows. You can read it online, or download as a zip file or tarball.

If you've been looking for a good monads tutorial, try this one first!