Eager streaming of XML/HTML data!

Submitted by metaperl on Fri, 04/08/2005 - 1:29pm.

HTML::Mason (www.masonhq.com) is a Perl module that would make most Haskellers cringe. It is about as confused a domain-specific language as you will ever see: Perl embedded in HTML.

But Mason is in wide use at amazon.com. In addition, I just interviewed for a Perl position at the largest Oracle database installation in the world, and they are planning to move to Mason for HTML/XML/SOAP delivery. Why would they do this? Because Mason can stream HTML data: instead of building up an entire XML document and then sending it, it can stream parts. When the document is _huge_ this becomes crucially important: render some of the XML, ship it off, reclaim the memory, and keep on truckin'...

How could a lazy functional language achieve the design goal of eagerly streaming data, as opposed to lazily producing it on demand?

Submitted by Cale Gibbard on Sun, 04/10/2005 - 2:58pm.

Lazily producing data on demand is streaming. Try printing an infinite list if you don't believe me. :) It's being too strict that would be the problem. If requesting part of the string causes the whole string to be constructed, you would lose any effect of streaming the data.
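To make that concrete, here is a minimal sketch (names are my own) showing that a consumer of an infinite list only forces the part it demands:

```haskell
-- 'nats' is an infinite list. Printing it outright would stream forever,
-- but taking a prefix terminates, because only the demanded elements
-- are ever constructed.
nats :: [Integer]
nats = [0 ..]

main :: IO ()
main = print (take 5 nats)  -- prints [0,1,2,3,4]
```

If the consumer were strict in the whole list, this program would diverge; laziness is exactly what makes the prefix available immediately.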

This problem arises in strictly evaluated languages, as innermost functions are applied first. This means that in the naive way of writing the code, by the time you even see the procedure which writes to the socket, all of the data has already been generated. In order to get around this, various parts of the processing must be interleaved so that small fragments of the output can be generated and then written to the socket (or file/tty/etc.).

In Haskell, the IO action which writes to the socket fires, and it generates demand for the evaluation of the string to print, collecting enough to fill the buffer, and writing it out. (With no buffering, this would be a single character.) If the written part of the string is no longer referenced, it can then be garbage collected. If only part of the data is ever needed, only part of it is generated in the first place.

So you would just write the code which generates the XML in such a way that prefixes of the string can be computed immediately so that while writing the whole string to the socket, the content is generated as the write demands more of the input string. This is actually fairly natural, as opening tags can generally be written right away without knowing anything about the children.
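A rough sketch of what that looks like (the types and names here are my own invention, not any particular XML library's API): the renderer emits each opening tag as a prefix of the output string before any child has been forced, so even an unboundedly deep tree yields output immediately.

```haskell
-- A toy XML tree and a lazy renderer. The opening tag is produced
-- before the children are evaluated, so prefixes of the output are
-- available right away.
data XML = Elem String [XML]  -- element name and children
         | Text String

render :: XML -> String
render (Text s)    = s
render (Elem n cs) = "<" ++ n ++ ">" ++ concatMap render cs ++ "</" ++ n ++ ">"

-- Even an infinitely deep tree streams a finite prefix without looping:
deepTree :: XML
deepTree = Elem "item" [deepTree]

main :: IO ()
main = putStrLn (take 30 (render deepTree))
-- prints "<item><item><item><item><item>"
```

Writing the whole `render` result to a socket would demand it chunk by chunk in exactly the same way.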

So really, you want lazy evaluation for streaming things. This is considered somewhat tricky in a strict language, but you don't have to work too hard at it in a lazily evaluated one -- in fact, just write the code in the first way that comes to mind (and works!) and you'll probably get this feature.

Submitted by Anonymous (not verified) on Thu, 07/07/2005 - 4:50am.

But isn't that the problem with processing XML?

In order to validate it, you have to read the last tag to see if it matches the first.

You have to read the open tag before you can read the close tag, which sequences that operation, and in between you have to recurse on the contents.

Doesn't that defeat lazy reading?

Submitted by Cale Gibbard on Mon, 04/16/2007 - 9:27pm.

Normally it would, supposing that you care about actually ensuring that the XML file isn't broken (which might be seen as a flaw in XML), but if you break the thing up into a stream of open and close tags (like Neil Mitchell's Tag Soup: http://www-users.cs.york.ac.uk/~ndm/tagsoup/), it's possible to do better.
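The tag-soup idea can be sketched as follows (these types are my own illustration, not TagSoup's actual API): rather than a tree whose constructors can only be completed once the matching close tag is seen, parsing yields a flat lazy list of open/close/text events, each available as soon as the corresponding input is.

```haskell
-- A flat stream of tag events instead of a nested tree.
data Tag = Open String | Close String | TagText String
  deriving (Eq, Show)

data XML = Elem String [XML] | Text String

-- Flatten a tree into its event stream. Each 'Open' event is emitted
-- before any child is forced, so consumers never block waiting for a
-- matching close tag.
toTags :: XML -> [Tag]
toTags (Text s)    = [TagText s]
toTags (Elem n cs) = Open n : concatMap toTags cs ++ [Close n]

-- Even a tree whose close tag never arrives yields a usable prefix:
deepTree :: XML
deepTree = Elem "item" [deepTree]

main :: IO ()
main = print (take 3 (toTags deepTree))
-- prints [Open "item",Open "item",Open "item"]
```

Well-formedness checking can then be layered on the stream (matching opens against closes as they arrive) without giving up streaming for the well-formed prefix.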

On producing output, you can certainly construct a text stream from an XML tree lazily with little problem.

Submitted by jgoerzen on Mon, 04/11/2005 - 7:27am.

Yes, I'd echo these comments. In Haskell, the default behavior should be what you want. You'd have to go to special effort to make it anything else.
