Simple Parsec Example: HTMangL

Submitted by mrd on Sun, 01/28/2007 - 11:01pm.

Writing bird-style Literate Haskell is working out pretty well for me. Actually, I prefer LaTeX-style, but bird-style works well for short articles with little to no math. Bird-style is also fairly easy to integrate with HTML markup, except for the fact that it uses '>' to designate code. Browsers vary, but most seem to handle the non-XHTML-compliant markup with few snags. However, I got sick of dealing with those snags, as they can be difficult to spot.

I put together a small program to deal with this mess, which also serves as a small Parsec usage tutorial. The program reads bird-style Literate Haskell and outputs a converted version with the code blocks surrounded by <code> tags and with >, <, and & converted to HTML entities.

> module Main where
> import Text.ParserCombinators.Parsec

A few combinators are built up from smaller ones. eol matches the end of a line (a newline character or EOF), and tilEOL parses the remaining characters in a line up to and including that terminator.

> eol = newline <|> (eof >> return '\n')
> tilEOL = manyTill (noneOf "\n") eol
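As a quick illustration (not part of the original program), the sketch below runs tilEOL on a few inputs. The helper `restOfLine` and the sample strings are hypothetical, and the explicit type signatures are an addition; it assumes the parsec package is available under the same legacy module name used above.

```haskell
import Text.ParserCombinators.Parsec

-- Same definitions as in the article, with type signatures added.
eol :: Parser Char
eol = newline <|> (eof >> return '\n')

tilEOL :: Parser String
tilEOL = manyTill (noneOf "\n") eol

-- Hypothetical helper: run tilEOL from the start of the input and return
-- the characters it collected, or Nothing on a parse error.
restOfLine :: String -> Maybe String
restOfLine = either (const Nothing) Just . parse tilEOL "example"

-- restOfLine "abc\ndef"  gives Just "abc"  (stops at the newline)
-- restOfLine "abc"       gives Just "abc"  (EOF also ends the line)
-- restOfLine "\nabc"     gives Just ""     (an empty line)
```

Note that the terminator is consumed but not included in the result, which is why the printing code later adds the newline back with putStrLn.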

A line of code begins with "> " and continues til EOL.

> codeLine = do
>   string "> "
>   code <- tilEOL
>   return $ "> " ++ code

A non-blank literate line can begin with any character but newline, and if it begins with '>' then it cannot be followed by a space. To those coming from imperative backgrounds, the return () does not return from the function but rather returns () to the monad; here it is used as a no-op. The rest of the line is treated as above.

> litLine = do
>   ch <- noneOf "\n"
>   if ch == '>' then
>       notFollowedBy space
>     else
>       return ()
>   text <- tilEOL
>   return $ ch:text
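The point about return () can be seen outside of Parsec as well. The following is a minimal sketch in the Maybe monad, where Nothing stands in for a parse failure; the names `checkLead`, `checkLead'`, and `rejectSpace` are made up for this illustration. It also shows `when` from Control.Monad, which is one common way to write the same conditional no-op.

```haskell
import Control.Monad (when)

-- Hypothetical stand-in for `notFollowedBy space`: fail on a space.
rejectSpace :: Char -> Maybe ()
rejectSpace c = if c == ' ' then Nothing else Just ()

-- Mirrors the shape of litLine: the else branch is a no-op, and the
-- statement after the conditional still runs.
checkLead :: Char -> Char -> Maybe String
checkLead ch next = do
  if ch == '>' then rejectSpace next else return ()
  return [ch, next]

-- The same logic written with `when`, which supplies the `return ()` branch.
checkLead' :: Char -> Char -> Maybe String
checkLead' ch next = do
  when (ch == '>') (rejectSpace next)
  return [ch, next]

-- checkLead '>' ' '  gives Nothing      (looks like a code line)
-- checkLead '>' 'x'  gives Just ">x"
-- checkLead 'a' ' '  gives Just "a "    (return () did not end the block)
```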

A blank line is one which begins with a newline.

> blankLine = char '\n' >> return ""

Blocks of code and literate lines (or blanks) are simply multiple consecutive lines (at least 1).

> code = many1 (try codeLine)
> lit  = many1 (try litLine <|> blankLine)
> data LiterateCode = Literate [String]
>                   | Code [String]
>                   deriving (Show, Eq)

A literate Haskell file is composed of many Code and Literate blocks. These are unified in one disjoint union type, LiterateCode, and the combinator below ensures that the appropriate tag is applied to the results of parsing.

> literateCode = many (Code `fmap` code <|> Literate `fmap` lit)

A block of literate text is printed literally, but code must be processed slightly.

> printBlock (Literate ls) = mapM_ putStrLn ls
> printBlock (Code cs) = do
>   putStrLn "<code>\n"
>   mapM_ (putStrLn . subEntities) cs
>   putStrLn "\n</code><br/>"

In case you were wondering how this works: >>= here is the list monad's bind, so it maps the function over each character in the input string and concatenates the resulting list of strings.

> subEntities = (>>= \c ->
>                 case c of
>                   '>' -> "&gt;"
>                   '<' -> "&lt;"
>                   '&' -> "&amp;"
>                   c   -> [c])
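To make the map-then-concatenate reading concrete, here is a small sketch that splits the operation into its two steps. The names `escapeChar`, `pieces`, `viaBind`, and `viaConcat` are hypothetical and only serve the illustration.

```haskell
-- The per-character function, pulled out of the lambda.
escapeChar :: Char -> String
escapeChar '>' = "&gt;"
escapeChar '<' = "&lt;"
escapeChar '&' = "&amp;"
escapeChar c   = [c]

-- Step 1: map the function over each character, giving a list of strings.
pieces :: String -> [String]
pieces = map escapeChar

-- Step 2: concatenate the pieces. This should agree with viaBind below.
viaConcat :: String -> String
viaConcat = concat . pieces

-- The list monad's bind does both steps at once.
viaBind :: String -> String
viaBind s = s >>= escapeChar

-- pieces  "a<b"  gives ["a", "&lt;", "b"]
-- viaBind "a<b"  gives "a&lt;b"
```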

Really simple: work on stdin, print to stdout.

> main = do
>   s <- getContents
>   case parse literateCode "stdin" s of
>     Left err -> putStr "Error: " >> print err
>     Right cs -> mapM_ printBlock cs

Naturally, the first candidate to run this program on is the program itself. I call it HTMangL.

Submitted by Alan Falloon on Mon, 01/29/2007 - 9:18am.

This is a simpler version of your mangler. I don't think you need to have
the parser or all the monadic code (I formatted this comment with my
version):


> module Main where
> import Data.List (groupBy)

The first thing we need is a function to detect a Bird-style code line:


> isCode :: String -> Bool
> isCode ('>':' ':_) = True
> isCode _ = False

Now, a function to detect if two adjacent lines are both code, both text,
or different.


> contentEqual :: String -> String -> Bool
> contentEqual x y = isCode x == isCode y
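As a small illustration of what this predicate does when handed to groupBy (imported above), the sketch below groups a made-up list of lines. The `sample` input and the name `grouped` are hypothetical; isCode and contentEqual are repeated so the snippet stands alone.

```haskell
import Data.List (groupBy)

isCode :: String -> Bool
isCode ('>':' ':_) = True
isCode _           = False

contentEqual :: String -> String -> Bool
contentEqual x y = isCode x == isCode y

-- Hypothetical input: text, two code lines, then a blank line and more text.
sample :: [String]
sample = ["Some text", "> a = 1", "> b = 2", "", "More text"]

-- Adjacent lines of the same kind end up in the same group.
grouped :: [[String]]
grouped = groupBy contentEqual sample
-- [["Some text"], ["> a = 1", "> b = 2"], ["", "More text"]]
```

A blank line is not a code line, so it is grouped with the surrounding text.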

Now we need a function that converts special characters to their HTML
escapes. This function is lifted directly from your version, but it's
converted into a simple string filter instead of a monadic action.


> subEntities :: String -> String
> subEntities x = concatMap tr x
>   where
>     tr '>' = "&gt;"
>     tr '<' = "&lt;"
>     tr '&' = "&amp;"
>     tr c   = [c]

From this we can make a function that takes the lines in a block of code
and returns the new list of lines HTMLized.


> fixCode :: [String] -> [String]
> fixCode l = ["<code>"] ++ map subEntities l ++ ["</code>"]

Now, a function that takes in the lines, and spits out the mangled lines.


> processLines :: [String] -> [String]
> processLines inp = outp
>   where
>     blocks = groupBy contentEqual inp
>     fix block | isCode $ head block = fixCode block
>               | otherwise           = block
>     outp = concatMap fix blocks

Finally, a main function to read in stdin, split it into its lines, process
them and write out their result.


> main :: IO ()
> main = do c <- getContents
>           mapM_ putStrLn $ processLines $ lines c

Submitted by mrd on Mon, 01/29/2007 - 9:59am.

Very nice. The parser is, in fact, unnecessary, which I should have clarified: since the language is regular, an FSM would have sufficed. Really, I felt I could kill two birds with one stone here: a Parsec example, and something to assist with HTML/literate code.
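For the curious, here is one rough sketch of what such a state machine might look like. It is an assumption about the shape of the idea, not code from either version above: it works line by line rather than character by character, and the names `State`, `step`, `run`, and `escape` are invented for this example.

```haskell
-- The two states of the machine: inside literate text or inside a code block.
data State = InText | InCode deriving (Eq, Show)

isCode :: String -> Bool
isCode ('>':' ':_) = True
isCode _           = False

escape :: String -> String
escape = concatMap tr
  where
    tr '>' = "&gt;"
    tr '<' = "&lt;"
    tr '&' = "&amp;"
    tr c   = [c]

-- One transition: from the current state and an input line, compute the
-- next state and the output lines for that step.
step :: State -> String -> (State, [String])
step InText l
  | isCode l  = (InCode, ["<code>", escape l])
  | otherwise = (InText, [l])
step InCode l
  | isCode l  = (InCode, [escape l])
  | otherwise = (InText, ["</code>", l])

-- Run the machine over all lines, closing an open code block at the end.
run :: [String] -> [String]
run = go InText
  where
    go InText []     = []
    go InCode []     = ["</code>"]
    go s      (l:ls) = let (s', out) = step s l in out ++ go s' ls
```

For instance, `run ["text", "> a < b", "more"]` yields `["text", "<code>", "&gt; a &lt; b", "</code>", "more"]`.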
