Parser with Writer monad

Submitted by jmuk on Wed, 11/29/2006 - 1:07am.
::

I tried to write a parser for bencode format. Bencode is normally used in BitTorrent only, but can be used for generally purpose.

In order to write the parser with parser combinators libraries such as Parsec, I encounter a problem. I applied Packrat Parsing to JSON parser, but I may use a sledge hummer to kill a fly. JSON is a simple format, it is very easy to parse with normal string-related functions. Packrat Parsing is too strong mechanism.

However, my point is not power of parser library, but memory copying. Normally, the parse result is copied from the original text by parser library. On the other hand, ByteString head, tail, init, last, take, drop, and similar functions do not copy any text data, but just modifies the offset and length of the text. If the parsed data has many text data, the parser must do the memory copy that is not potentially required.

The parser combinator library that does this is a challengeable task -- but, here I applies a small approach to parse bencode format. See http://darcs.haskell.org/SoC/haskellnet/Text/Bencode.hs

Here, all parsers are Writer monad.
Writer monad does `Computations which produce a stream of data in addition to the computed values' (came from All about monads). In parser context, `a stream' corresponds to the parsed result, and the computed value to rest of text.

We can easily combines the parser for `list of nodes' or `dictionary of nodes', using a function untilM :: (a -> Bool) -> (a -> m a) -> a -> m a.

It is true that this simplest approach lacks some important features. For example, this parser do not contains line and character information, which means that the parse error cannot print appropriate debug info. In despite of such fault, this parser works well for our purpose. Debug information will be the next step.
BTW, debug info is not a serious problem for bencode parser, because this format is not human-readable. In such case, the required information is whether the format is valid or not.