News aggregator

FP Complete: QuickCheck and Magic of Testing

Planet Haskell - 10 hours 51 min ago

Haskell is an amazing language. With its extremely powerful type system and a pure functional paradigm, it prevents programmers from introducing many kinds of bugs that are notorious in other languages. Despite those powers, code is still written by humans, and bugs are inevitable, so writing a quality test suite is just as important as writing the application itself.

Over the course of history, buggy software has cost the industry billions of dollars in damage and has even cost human lives, so I cannot stress enough how essential testing is for any project.

One of the ways to test software is to write unit tests, but since it is not feasible to exhaustively test all possible inputs for most functions, we usually check some corner cases and occasionally test with other arbitrary values. Systematic generation of random input that is biased towards corner cases can be very helpful in that scenario, and that is where QuickCheck comes into play. This state-of-the-art property testing library was originally invented in Haskell and, because it turned out to be so powerful, was later ported to other languages. However, the real power of random testing is unleashed when it is combined with the purity of Haskell.


Let's start by looking at these example properties of the reverse function:

reverse (reverse xs) == xs
reverse (xs ++ ys) == reverse ys ++ reverse xs

We know that they will hold for all finite lists with total values. Naturally, there are ways to prove them manually, and there are even tools for Haskell, such as LiquidHaskell, that can help automate proving some properties. A formal proof of correctness of a program is not always possible, though: some properties are either too hard or impossible to prove. Regardless of our ability to prove a property of a function, we at least need to check that the function works correctly on some finite set of inputs.

import Test.QuickCheck

prop_RevRev :: Eq a => [a] -> Bool
prop_RevRev xs = reverse (reverse xs) == xs

prop_RevApp :: [Int] -> [Int] -> Bool
prop_RevApp xs ys = reverse (xs ++ ys) == reverse ys ++ reverse xs

We can load those properties into GHCi and run quickCheck on them. Here is a quick way to do it from a terminal; there is also a detailed guide on how to get started with stack.

$ stack --resolver lts-7.16 ghci --package QuickCheck
Configuring GHCi with the following packages:
GHCi, version 8.0.1: :? for help
Loaded GHCi configuration from /tmp/ghci3260/ghci-script
Prelude> :load examples.hs
[1 of 1] Compiling Main             ( examples.hs, interpreted )
Ok, modules loaded: Main.
*Main> quickCheck prop_RevRev
+++ OK, passed 100 tests.
*Main> quickCheck prop_RevApp
+++ OK, passed 100 tests.
*Main>

What just happened? QuickCheck called prop_RevRev and prop_RevApp 100 times each, with random lists as arguments, and declared those tests as passing because every call resulted in True. That is far beyond what a common unit test could have done.

It is worth noting that, in reality, both of those properties are polymorphic, and quickCheck is happy to work with such functions even when their type signatures are inferred; they run just fine in GHCi. On the other hand, when writing a test suite we have to restrict the type signature of every property to a concrete type, such as [Int] or Char, otherwise the type checker will get confused. For example, this program will not compile:

import Test.QuickCheck

main :: IO ()
main = quickCheck (const True)
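A minimal fix, if we take this route, is to pin the property to a concrete type with an annotation (the Int domain here is an arbitrary choice for illustration):

```haskell
import Test.QuickCheck

main :: IO ()
main = quickCheck (const True :: Int -> Bool)
```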

For the sake of example, let's write a couple more self-explanatory properties:

import Data.List (isPrefixOf, isSuffixOf)

prop_PrefixSuffix :: [Int] -> Int -> Bool
prop_PrefixSuffix xs n = isPrefixOf prefix xs &&
                         isSuffixOf (reverse prefix) (reverse xs)
  where prefix = take n xs

prop_Sqrt :: Double -> Bool
prop_Sqrt x
  | x < 0            = isNaN sqrtX
  | x == 0 || x == 1 = sqrtX == x
  | x < 1            = sqrtX > x
  | x > 1            = sqrtX > 0 && sqrtX < x
  where sqrtX = sqrt x

Now, this is great, but how did we just pass functions with different numbers of arguments of different types to quickCheck, and how did it know what to do with them? Let's look at its type signature:

λ> :t quickCheck
quickCheck :: Testable prop => prop -> IO ()

Testable

So it seems that QuickCheck can test anything that is Testable:

λ> :i Testable
class Testable prop where
  property :: prop -> Property
  exhaustive :: prop -> Bool
instance [safe] Testable Property
instance [safe] Testable prop => Testable (Gen prop)
instance [safe] Testable Discard
instance [safe] Testable Bool
instance [safe] (Arbitrary a, Show a, Testable prop) => Testable (a -> prop)

The last instance is for a function (a -> prop) that returns a prop, which in turn must also be an instance of Testable. This magic trick of a recursive constraint in the instance definition allows quickCheck to test a function with any number of arguments, as long as each one of them is an instance of Arbitrary and Show. So here is a checklist of requirements for writing a testable property:

  • It takes zero or more arguments, each of which has an instance of Arbitrary, which is used for generating random input. More on that later.
  • Arguments must also be instances of Show, so that if a test fails, the offending values can be displayed back to the programmer.
  • Its return value is either:

    • True/False - to indicate pass/fail of a test case.
    • Discard - to skip the test case (e.g. when a precondition fails).
    • Result - to customize pass/fail/discard behavior, collect extra information about the test outcome, provide callbacks, and use other advanced features.
    • Property - for much finer control of the test logic. Such properties can be used as combinators to construct more complex test cases.
    • Prop - used to implement Property.
  • Its name starts with prop_ or prop, followed by the usual camelCase, but that is just a convention, not a requirement.
  • It has no side effects. This is also not a requirement, but it is strongly suggested, since referential transparency is lost with IO and test results can become inconsistent between runs. At the same time, there are capabilities for testing monadic code, which we will not go into here.
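To illustrate the Discard option above, here is a small sketch (not from the original article) of a property that skips zero denominators by returning discard, a value from Test.QuickCheck that inhabits any type:

```haskell
import Test.QuickCheck

-- div/mod round-trip; y == 0 is outside the domain, so discard that case
prop_DivMod :: Int -> Int -> Bool
prop_DivMod x y
  | y == 0    = discard
  | otherwise = (x `div` y) * y + (x `mod` y) == x
```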

Here is another very simple property of lists, xs !! n == head (drop n xs), so let's define it as is:

prop_Index_v1 :: [Integer] -> Int -> Bool
prop_Index_v1 xs n = xs !! n == head (drop n xs)

Naturally, you can see a problem with that function: it cannot accept just any random Int as an index. quickCheck quickly finds that problem for us and prints the violating input along with an error:

λ> quickCheck prop_Index_v1
*** Failed! Exception: 'Prelude.!!: index too large' (after 1 test):
[]
0

Interestingly, if you run this example on any computer, there is a very good chance that it will give exactly the same output, so it seems that the input to properties is not completely random. In fact, thanks to the function sized, the first input to our property will always be an empty list and the integer 0, which tend to be really good corner cases to test. In our case, though, !! and head are undefined for empty lists and negative numbers. We could add some guards, but there are facilities provided for such common cases:

prop_Index_v2 :: (NonEmptyList Integer) -> NonNegative Int -> Bool
prop_Index_v2 (NonEmpty xs) (NonNegative n) = xs !! n == head (drop n xs)

This version is still not quite right, since we have another precondition: n < length xs. However, it would be a bit complicated to describe this relation through the type system, so we will specify the precondition at runtime using the implication operator (==>). Note that the return type has changed too:

prop_Index_v3 :: (NonEmptyList Integer) -> NonNegative Int -> Property
prop_Index_v3 (NonEmpty xs) (NonNegative n) =
  n < length xs ==> xs !! n == head (drop n xs)

Test cases with values that do not satisfy the precondition will simply get discarded, but not to worry: it will still run the 100 tests. In fact, it will generate up to 1000 test cases before giving up. An alternative way to achieve a similar effect is to generate a valid index within the property itself:

prop_Index_v4 :: (NonEmptyList Integer) -> Property
prop_Index_v4 (NonEmpty xs) =
  forAll (choose (0, length xs - 1)) $ \n ->
    xs !! n == head (drop n xs)

λ> quickCheck prop_Index_v3 >> quickCheck prop_Index_v4
+++ OK, passed 100 tests.
+++ OK, passed 100 tests.

Just in case, let's quickly dissect this forAll (∀) business. It takes a random value generator, which choose happens to produce, and a property that operates on its values, and returns a new property; i.e. it applies values from a specific generator to the supplied property.

λ> :t forAll
forAll :: (Show a, Testable prop) => Gen a -> (a -> prop) -> Property
λ> sample' $ choose (0, 3)
[0,2,2,3,3,3,0,1,0,1,3]

There is a very subtle difference between the last two versions: _v3 will discard tests that do not satisfy the precondition, while _v4 will always generate a value for n that is safe to pass to the indexing function. This is not important for this example, which is good, but that is not always the case. Whenever a precondition is too strict, QuickCheck might give up early while looking for valid values, but more importantly, it can give a false sense of validity, since most of the values it does find could be trivial ones.


For this section we will use prime numbers in our examples, but rather than reinventing the wheel and writing the prime number functions ourselves, we will use the primes package. Just for fun, let's write a property for primeFactors, based on the Fundamental Theorem of Arithmetic:

prop_PrimeFactors :: (Positive Int) -> Bool
prop_PrimeFactors (Positive n) = isPrime n || all isPrime (primeFactors n)

That was incredibly easy and is almost a direct translation of the theorem itself. Now consider the fact that every prime number larger than 2 is odd; from it, we can easily derive the property that the sum of any two prime numbers greater than 2 is even. Here is a naive way to test that property:

prop_PrimeSum_v1 :: Int -> Int -> Property
prop_PrimeSum_v1 p q =
  p > 2 && q > 2 && isPrime p && isPrime q ==> even (p + q)

As you can imagine, it is not too often that a random number will be prime, and this certainly affects the quality of the test:

λ> quickCheck prop_PrimeSum_v1
*** Gave up! Passed only 26 tests.

It found only 26 satisfying test cases out of the 1000 generated; that's bad. There is even more to it: in order to convince ourselves that we are testing functions with data that resembles what we expect in real life, we should always try to inspect the values being generated for a property. An easy way to do that is to classify them by some shared trait:

prop_PrimeSum_v1' :: Int -> Int -> Property
prop_PrimeSum_v1' p q =
  p > 2 && q > 2 && isPrime p && isPrime q ==>
    classify (p < 20 && q < 20) "trivial" $ even (p + q)

λ> quickCheck prop_PrimeSum_v1'
*** Gave up! Passed only 29 tests (96% trivial).
λ> quickCheckWith stdArgs { maxSuccess = 500 } prop_PrimeSum_v1'
*** Gave up! Passed only 94 tests (44% trivial).

Almost all of the values this property was tested on are in fact trivial ones. Increasing the number of tests was not much help because, by default, the values generated for integers are pretty small. We could try to fix that with appropriate types, but this time we will also generate a histogram of the unique pairs of discovered prime numbers:

prop_PrimeSum_v2 :: (Positive (Large Int)) -> (Positive (Large Int)) -> Property
prop_PrimeSum_v2 (Positive (Large p)) (Positive (Large q)) =
  p > 2 && q > 2 && isPrime p && isPrime q ==>
    collect (if p < q then (p, q) else (q, p)) $ even (p + q)

λ> quickCheck prop_PrimeSum_v2
*** Gave up! Passed only 24 tests:
16% (3,3)
 8% (11,41)
 4% (9413,24019)
 4% (93479,129917)
...

This is better: there are fewer trivial values. Still, the number of tests is far from satisfactory. It is also extremely inefficient to look for prime values this way, and for any really large value passed to the property, it will take forever to check its primality. A much better approach is to choose from a list of prime values, which we have readily available:

prop_PrimeSum_v3 :: Property
prop_PrimeSum_v3 =
  forAll (choose (1, 1000)) $ \i ->
    forAll (choose (1, 1000)) $ \j ->
      let (p, q) = (primes !! i, primes !! j)
      in  collect (if p < q then (p, q) else (q, p)) $ even (p + q)

λ> quickCheck prop_PrimeSum_v3
+++ OK, passed 100 tests:
 1% (983,6473)
 1% (953,5059)
 1% (911,5471)
...

Arbitrary

There could be a scenario where we need prime values for many tests; then it would be a burden to generate them this way for each property. In such cases, the solution is to write an instance of Arbitrary:

newtype Prime a = Prime a deriving Show

instance (Integral a, Arbitrary a) => Arbitrary (Prime a) where
  arbitrary = do
    x <- frequency [ (10, choose (0, 1000))
                   , (5,  choose (1001, 10000))
                   , (1,  choose (10001, 50000))
                   ]
    return $ Prime (primes !! x)

Calculating large prime numbers is pretty expensive, so we could simply use something like choose (0, 1000), similarly to how it was done in prop_PrimeSum_v3. But there is no reason to exclude large prime numbers completely; instead, we can reduce their likelihood by describing a custom distribution with the frequency function.

Now writing prop_PrimeSum is a piece of cake:

prop_PrimeSum_v4 :: Prime Int -> Prime Int -> Property
prop_PrimeSum_v4 (Prime p) (Prime q) =
  p > 2 && q > 2 ==>
    classify (p < 1000 || q < 1000) "has small prime" $ even (p + q)

λ> quickCheck prop_PrimeSum_v4
+++ OK, passed 100 tests (21% has small prime).

CoArbitrary

There are quite a few instances of Arbitrary (many common data types from base have one), but the most peculiar is the instance for functions:

λ> :i Arbitrary
class Arbitrary a where
  arbitrary :: Gen a
  shrink :: a -> [a]
  ...
instance [safe] (CoArbitrary a, Arbitrary b) => Arbitrary (a -> b)
...

That's right, QuickCheck can even generate functions for us! One restriction is that the argument of such a function must be an instance of CoArbitrary, which itself also has an instance for functions; consequently, functions of any arity can be generated. Another caveat is that we need a Show instance for functions, which is not standard practice in Haskell, and wrapping a function in a newtype would be more appropriate. For clarity, we will opt out of this suggestion and instead demonstrate this cool feature in action. One huge benefit is that it allows us to easily write properties for higher order functions:

instance Show (Int -> Char) where
  show _ = "Function: (Int -> Char)"

instance Show (Char -> Maybe Double) where
  show _ = "Function: (Char -> Maybe Double)"

prop_MapMap :: (Int -> Char) -> (Char -> Maybe Double) -> [Int] -> Bool
prop_MapMap f g xs = map g (map f xs) == map (g . f) xs

HSpec

One of the first concerns programmers usually raise when coming to Haskell from other languages is that there are situations where unit tests are invaluable, but QuickCheck does not seem to provide an easy way to write them. Bear in mind that QuickCheck's random testing is not a limitation, but rather a priceless feature of the testing paradigm in Haskell. Regular-style unit tests and other QA functionality (code coverage, continuous integration, etc.) can be done just as easily as in any other modern language using specialized libraries. In fact, those libraries play beautifully together and complement each other in many ways.

Here is an example of how we can use hspec to create a test suite containing all the properties we have discussed so far, plus a few extra unit tests to complete the picture.

#!/usr/bin/env stack
-- stack --resolver lts-7.16 runghc --package QuickCheck --package hspec --package primes
module Main where

import Test.Hspec
import Test.QuickCheck
...

main :: IO ()
main = hspec $ do
  describe "Reverse Properties" $ do
    it "prop_RevRev" $ property prop_RevRev
    it "prop_RevApp" $ property prop_RevApp
    it "prop_PrefixSuffix" $ property prop_PrefixSuffix
  describe "Number Properties" $ do
    it "prop_Sqrt" $ property prop_Sqrt
  describe "Index Properties" $ do
    it "prop_Index_v3" $ property prop_Index_v3
    it "prop_Index_v4" $ property prop_Index_v4
    it "unit_negativeIndex" $
      shouldThrow (return $! ([1,2,3] !! (-1))) anyException
    it "unit_emptyIndex" $
      shouldThrow (return $! ([] !! 0)) anyException
    it "unit_properIndex" $ shouldBe ([1,2,3] !! 1) 2
  describe "Prime Numbers" $ do
    it "prop_PrimeFactors" $ property prop_PrimeFactors
    it "prop_PrimeSum_v3" $ property prop_PrimeSum_v3
    it "prop_PrimeSum_v4" $ property prop_PrimeSum_v4
  describe "High Order" $ do
    it "prop_MapMap" $ property prop_MapMap

Conclusion

Random testing can be mistakenly regarded as an inferior way of testing software, but many studies have shown that this is not the case. To quote D. Hamlet:

By taking 20% more points in a random test, any advantage a partition test might have had is wiped out.

It is very easy to start using QuickCheck to test properties of pure functions. A very similar toolbox is included in the library for testing monadic functions, thus allowing a straightforward way to test properties of functions that do mutation, depend on state, run concurrently, or even perform I/O. Most importantly, this library provides yet another technique for making Haskell programs even safer.
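For the curious, here is a minimal sketch (not from the original article) of what that monadic toolbox looks like, using Test.QuickCheck.Monadic with an IORef round-trip standing in for real side effects:

```haskell
import Data.IORef
import Test.QuickCheck
import Test.QuickCheck.Monadic (monadicIO, run, assert)

-- Writing a value to an IORef and reading it back should be the identity
prop_IORefRoundTrip :: Int -> Property
prop_IORefRoundTrip x = monadicIO $ do
  y <- run (newIORef x >>= readIORef)
  assert (x == y)
```

run lifts IO actions into the PropertyM monad, and assert plays the role of the Bool result in a pure property.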

Writing tests doesn't have to be a chore; it can be fun. We certainly find it fun at FP Complete and will be happy to provide training, consulting, or development work.


Further reading
Categories: Offsite Blogs

Michael Snoyman: Stackage design choices: making Haskell curated package sets

Planet Haskell - Sun, 01/22/2017 - 6:00pm

This post is going to talk about some of the design choices made over the years around the Stackage project, a curated package set for Haskell. While many of these points will be Haskell- and Stackage-specific, I think the ideas would translate well to other languages interested in creating curated package sets. This blog post was inspired by a short discussion on Twitter, which made it clear that I'd never really shared my design thoughts on the Stackage project:

@snoyberg @iElectric sounds like you've put already a lot of thought into that. would luv to learn more about!

— Haskell Dev (@haskdev) January 16, 2017

In understanding why Stackage is the way it is today, it will be important to take into account:

  • The goals of the project
  • The historical circumstances when decisions were made
  • Social pressures in the community agitating for specific decisions
  • Inertia in the project making significant changes difficult

Apologies in advance, this turned out longer than I'd intended.


Before Stackage, the most common way to find a set of libraries to use in a Haskell project was using cabal-install's dependency solver, based on bounds information specified by authors. There were certainly some efforts at creating curated package sets previously (Haskell Platform provided a limited set; the Yesod Platform provided a full set of packages for the Yesod Web Framework; various Linux distros had binary packages). But I think it's fair to say that the vast majority of people writing Haskell code were using dependency solving.

I'm not going to get into the argument of dependency solving vs curation here. I will simply say that for many people - myself included - having official combinations of packages which are known to compile together, and which can be given to end users and teammates on a project, was very appealing. This was the motivation for my initial Stackage call for participation.

While the primary goal - create curated package sets - is obvious, the secondary goals are not. In fact, many of them only really became clear to me in 20-20 hindsight:

  • Require as little maintenance as possible. Stackage should be as much an automated process as can be created, since human time is a valuable, scarce resource. In other words: I'm lazy :).

  • Require as little change in behavior from package authors as possible. In my opinion, the only reasonable way to bootstrap a project is to make it trivial for people to participate. The barrier to entry for Stackage had to be minimal.

    • Even past the "bootstrapping" phase, a nice quality of any system is requiring little effort on the part of users. Therefore, even today, where Stackage is arguably successful and well-established, this goal still applies.
  • It needed to work well with existing tooling. In 2012, the Stack project hadn't even been dreamt up yet, so figuring out a way to work with cabal-install (via the cabal.config file) was vital. Compatibility with cabal-install is still a nice thing today, but not nearly as vital as it was then.

  • We need to maximize the number of packages that can be included in a single snapshot. The two ways in which two packages can be incompatible are:

    • There is an actual incompatibility in the API, such as a function being removed or its type signature changed in a new release of a dependency.

    • There is a stated upper or lower bound in a package which precludes a build plan, but the code itself would actually compile. (This is the case --allow-newer is designed for.)

Initial choices

Based on these goals, I created the initial version of Stackage. While many decisions came into play (e.g., what file format should we use to let package authors submit packages?), I'm going to focus on the interesting choices that fell out of the goals above, and which today may be noteworthy.

  • As I'd learnt from maintaining Yesod, many Windows users in particular were using the Haskell Platform (HP), and trying to specify different versions of packages from what HP provided could cause problems. Therefore, it was important to keep compatibility with the Haskell Platform set of packages. This resulted in multiple builds of Stackage: a "current GHC", "previous GHC", and "Haskell Platform superset."

  • We should always try to take the latest available version of a package, as it may include bug fixes, feature enhancements, and generally because the Haskell community loves the bleeding edge :). However, there would be cases where a new version of a package caused enough breakage to warrant holding it back, so some concept of enforced upper bounds was necessary too.

  • It was theoretically possible to ignore version bound information in cabal files, and instead ensure compatibility based on compiling and running test suites. However, this would have some serious downsides:

    • Users would have regularly needed to run builds with --allow-newer
    • If there were non-API-breaking semantic changes in a package, a version bound was present to avoid those changes, and there was no test suite to cover that behavior, ignoring bounds would cause those semantic changes to slip in (in my experience, this is an exceedingly rare case, but it can happen)
    • It's arguably very confusing behavior that a package set specifies versions of packages which claim to be incompatible with each other

    Therefore, version bounds needed to be respected. However...

  • Due to the frequency of overly restrictive version bounds and trivial compatibility patches which were slow to make it upstream, Stackage allowed for locally modified packages. That means that, for example, Stackage snapshot foo could have a different set of code associated with mtl-2.2.1 than what Hackage reports. Note that this feature was more aggressive than Hackage cabal file revisions, in that it allowed the code itself to change, not just the cabal file.

These decisions lasted for (IIRC) about a year, and were overall successful at letting Stackage become a thriving project. I was soon able to shut down the Yesod Platform initiative in favor of Stackage, which was a huge relief for me. At this point, outside of the Yesod community, I think Stackage was viewed mostly as a "ecosystem-wide CI system" than something for end users. It wasn't until Stack defaulted to Stackage snapshots that end users en masse started using Stackage.

Changes over time

Stackage today is quite a bit different from the above decisions:

  • I eventually dropped the Haskell Platform superset. There was a time when that package set wasn't updated, and the complication of trying to find a compatible set of packages on top of it was simply too high. In addition, HP included a version of aeson with a significant security hole (DoS attack with small inputs), and continuing to supply such a package set was not something I felt comfortable doing.

  • Due to the burden of maintaining bleeding-edge Stackages for multiple GHC versions - both on myself as the curator and on package authors - I also dropped support for older GHC releases. Instead, I introduced LTS Haskell, which keeps compatibility with older GHCs without adding (significant) package author burden.

  • When working on the GPS Haskell collaboration, I removed support for locally modified packages. This was done due to requests from the Hackage and Haskell Platform maintainers, who wanted a single definition of a package. With this change, unresponsive package maintainers can really hold things up in Stackage. However, this overall led to a number of simplifications in code, and ultimately allowed for better binary cache support in Stack. So despite the initial pain, I think this was a good change.

  • Hackage revisions make it possible for a package set to contain packages which are no longer compatible by their latest cabal files. Therefore, we needed to add support to Stackage to track which version of a cabal file was included in a snapshot, not just the version of the package itself. I only mention this here because it weakens our previous decision to respect cabal file constraints due to avoiding user confusion.

  • We have an expanded team! I'm happy to say that I am now one of five Stackage curators, and no longer have to either handle all the work myself, or make unilateral decisions. In other words, I get to share the blame with others :). Many thanks to Adam Bergmark, Dan Burton, Jens Petersen, and our newest member, Luke Murphy.

Changes to consider today

Alright, this post has turned out way longer than I'd expected, apologies for this. I guess there was more decision making that occurred than I'd realized. Anyway, I hope that gives some context for where things are at today. Which brings us to the original discussion that brought this whole blog post into existence: should we be changing anything about Stackage? Here are some changes either proposed by others or that I've thought of, and some remarks.

  • The curator team overall has been pretty lax about booting packages that block newer versions of dependencies. There have definitely been calls for us to be more proactive about that, and aggressively kick out packages that are holding back dependencies.

    • Pros: Stackage Nightly will live up to its bleeding edge mission statement more effectively, we'll overall have less incidental pain on package authors who are staying up to date with their dependencies.

    • Cons: it will decrease the number of packages in Stackage Nightly for end users, and adds extra burden on package authors to be more quick to respond to requests.

  • As a relaxed version of the above: be stricter with package authors, but only in the case of cabal file upper bounds. The argument here is stronger, since the work required is fairly minimal, and - at least in my experience - waiting for relaxed upper bounds is what takes up a lot of the time when curating. An extreme version of this is demanding that upper bounds just be removed.

  • Or an interesting alternative to that: should Stackage simply ignore constraints in cabal files entirely? It would be fairly easy to extend Stack to recognize a flag in snapshots to say "ignore the constraints when building," or even make that the default behavior.

    • Pros: less time spent on bounds issues, Stackage doesn't get held back by trivial version bounds issues, and for PVP bounds enthusiasts it could encourage people to add bounds during upload more often (not sure of that).

    • Cons: cabal users with Stackage snapshots wouldn't have as nice a time, it could be confusing for users, and if the upper bounds are in place due to semantic changes we won't catch it.

  • Since GPS Haskell isn't happening, we could add back the ability for the Stackage curator team to modify packages (both cabal files and source files). I think the pros and cons of this were pretty well established above, I'm not going to repeat it here.

  • People have asked for running multiple nightly lines with different GHC versions.

    • Pros: instead of having slightly outdated LTS versions for older GHCs, we'd have bleeding edge all over again.

    • Cons: we'd need new naming schemes for snapshots, a lot more work for the curator team, and potentially a lot more work for package authors who would need to maintain further GHC compatibility with their most recent releases.

  • I've had some private discussions around this, and thought I should share the idea here. Right now, Stackage requires that any package added must be available on Hackage. A number of newer build systems have been going the route of allowing packages to be present only in a Git repository. Stack has built-in support for specifying such locations, but snapshots do not support it. Should we add support to Stackage to allow packages to be pulled from places besides Hackage?

    • Pros: knocks down another barrier to entry for publishing packages.

    • Cons: Stackage snapshots will not automatically work with cabal-install anymore, extra work to be done to make this functional, and some issues around determining who owns a package name need to be worked out.

There are likely other changes that I haven't mentioned, feel free to raise them in the comments below. Also, if anyone really wants to follow up on these topics, the best place to do that is the Stackage mailing list.

Categories: Offsite Blogs

JP Moresmau: So long Haskell, and thanks for all the functional fish

Planet Haskell - Sat, 01/21/2017 - 5:05am
I've realized I haven't written or read a line of Haskell in the past 6 months. After roughly ten years of tinkering with it, it seems that I've given up on it. There is a big obvious reason, and other smaller ones.

Professionally, my job got a lot more challenging (I'm officially an **architect** now, which means I still write lots of code but I have to draw pretty diagrams too (-: ) and involves a lot of research and learning new stuff, things like microservices, docker, messaging, Angular2, mobile apps, etc. So a lot of my time is dedicated to work or to learning for work, so I don't have the time to play around with something quite unrelated like Haskell.

I suppose to be honest there also was a bit of lassitude with Haskell. I got tired of the less than optimal IDEs, I realized Haskell was not going to get me a great job, and there were a few little skirmishes on the web that got ugly and made me see the community in a less favorable light.

This was fun, though, and I certainly learned a lot, and I hope it has made me a better programmer. A lot of my coding now is done in Java 8, and it's good to be able to apply some functional idioms practiced in Haskell, and more general ideas like data immutability, small pure functions do help make better - more testable, easier to understand - code.

So maybe I'll come back to Haskell some day, but not for now. To all the open source projects I've contributed, I wish you the best!

Happy Haskell (or any other language) Hacking!

Categories: Offsite Blogs

Joachim Breitner: Global almost-constants for Haskell

Planet Haskell - Fri, 01/20/2017 - 12:03pm

More than five years ago I blogged about the “configuration problem” and a proposed solution for Haskell, which turned into some Template Haskell hacks in the seal-module package.

With the new GHC proposal process in place, I am suddenly much more inclined to write up my weird wishes for the Haskell language in proposal form, to make them more coherent, get feedback, and maybe (maybe) actually get them implemented. But even if the proposal is rejected it is still a nice forum to discuss these ideas.

So I turned my Template Haskell hack into a proposed new syntactic feature. The idea is shamelessly stolen from Isabelle, including some of the keywords, and would allow you to write

context fixes progName in

foo :: Maybe Int -> Either String Int
foo Nothing = Left $ progName ++ ": no number given"
foo (Just i) = bar i

bar :: Int -> Either String Int
bar 0 = Left $ progName ++ ": zero no good"
bar n = Right $ n + 1

instead of

foo :: String -> Maybe Int -> Either String Int
foo progName Nothing = Left $ progName ++ ": no number given"
foo progName (Just i) = bar progName i

bar :: String -> Int -> Either String Int
bar progName 0 = Left $ progName ++ ": zero no good"
bar progName n = Right $ n + 1

when you want to have an “almost constant” parameter.

I am happy to get feedback at the GitHub pull request.

Categories: Offsite Blogs

Douglas M. Auclair (geophf): December 2016 1HaskellADay 1Liners

Planet Haskell - Thu, 01/19/2017 - 6:13pm
  • December 22nd, 2016:  f :: (Either a b, c) -> Either (a, c) (b, c), define f, snaps for elegance, e.g.: f (Left 4, "Hi") = Left (4, "Hi")
    • bazzargh @bazzargh uncurry (flip (join bimap . (,) ))
      • Denis Stoyanov @xgrommx need (Left 4, "Hi") = Left (4, "Hi") but your version Left ("Hi", 4)
    • Thomas D @tthomasdd Do tuple sections count? do I have access to Data.Bifunctor?
      • f (eab,c) = bimap (,c) (,c) eab
    • SocialJusticeCleric @walkstherain uncurry $ either ((Left .).(,)) ((Right .).(,))
    • Denis Stoyanov @xgrommx or f (e, a) = (join bimap (\x -> (x, a))) e
    • Nickolay Kudasov @crazy_fizruk most elegant IMO:
      f (Left a, c) = Left (a, c)
      f (Right b, c) = Right (b, c)
  • December 22nd, 2016: define a function that writes out an infinite, alternating stream of 1's and 0's as below. 
    • Philipp Maier @AkiiZedd mapM putStrLn $ join $ repeat ["0","1"]
      • Eyal Lotem @EyalL join . repeat = cycle?
    • mavant @mavant f = putStr "10" >> f
    • Eyal Lotem @EyalL mapM putStrLn $ cycle ["0","1"]
  • December 10th, 2016:
    startsWith :: [String] -> String
    points-free so that:
    startsWith ["ΜΗΛΟΝ", "ΗΔΟΝΗ"] = "ΛΟ"
    That is: (length list)+1 Char of each word
    • SocialJusticeCleric @walkstherain 
      • I prefer `uncurry (!!) . (Data.List.transpose &&& length)`
      • but `map . flip (!!) . length =<< id` only uses the Prelude
    • Nick @crazy_fizruk zipWith (!!) <*> repeat . length
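As a quick check, @crazy_fizruk's last solution can be run directly; with the reader Applicative, (f <*> g) ws = f ws (g ws), so each word is indexed at the length of the input list:

```haskell
-- @crazy_fizruk's points-free solution, restated. For a list of
-- length n, each word is indexed at position n (0-based), i.e. its
-- (n+1)-th character.
startsWith :: [String] -> String
startsWith = zipWith (!!) <*> repeat . length
```

For ["ΜΗΛΟΝ", "ΗΔΟΝΗ"] the list has length 2, so we take the character at index 2 of each word, giving "ΛΟ".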
Categories: Offsite Blogs

Michael Snoyman: Follow up on mapM_

Planet Haskell - Wed, 01/18/2017 - 6:00pm

This is a short follow-up to my blog post about mapM_ and Maybe. Roman Cheplyaka started a discussion on that post, and ultimately we came up with the following implementation of mapM_ which works for all Foldables and avoids the non-tail-recursive case for Maybe as desired:

mapM_ :: (Applicative m, Foldable f) => (a -> m ()) -> f a -> m ()
mapM_ f a = go (toList a)
  where
    go [] = pure ()
    go [x] = f x -- here's the magic
    go (x:xs) = f x *> go xs

Why is this useful? If you implement mapM_ directly in terms of foldr or foldMap, there is no way to tell that you are currently looking at the last element in the structure, and therefore you will always end up with the equivalent of f x *> pure () in your expanded code. By contrast, with explicit pattern matching on the list-ified version, we can easily pattern match with go [x] and avoid the *> pure () bit, thereby making tail recursion possible.
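To see it in action, here is a self-contained restatement of the implementation (renamed mapM_' to avoid clashing with the Prelude) plus a small demonstration over both a list and a Maybe; the IORef accumulation is just for the demo:

```haskell
import Data.Foldable (toList)
import Data.IORef

-- The implementation from the post, restated so it can be exercised.
mapM_' :: (Applicative m, Foldable f) => (a -> m ()) -> f a -> m ()
mapM_' f a = go (toList a)
  where
    go []       = pure ()
    go [x]      = f x            -- last element: no trailing *> pure ()
    go (x : xs) = f x *> go xs

-- Accumulate visited elements; works uniformly over any Foldable.
demo :: IO [Int]
demo = do
  ref <- newIORef []
  mapM_' (\x -> modifyIORef ref (x :)) [1, 2, 3 :: Int]
  mapM_' (\x -> modifyIORef ref (x :)) (Just 4)
  mapM_' (\x -> modifyIORef ref (x :)) (Nothing :: Maybe Int)
  reverse <$> readIORef ref  -- [1,2,3,4]
```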

Some interesting things to note:

  • Using () <$ f x instead of f x *> pure () or f x >> return () seemed to make no difference for tail recursion purposes.
  • As a result of that, we still need to have the ()-specialized type signature I describe in the previous blog post; there doesn't seem to be a way around that.
  • As you can see from the benchmark which I unceremoniously ripped off from Roman, there do not appear to be cases where this version has more memory residency than mapM_ from base. Roman had raised the concern that the intermediate list may involve extra allocations, though it appears that GHC is smart enough to avoid them.

Here are the results. Notice the significantly higher residency numbers for base:

5000     roman    36,064 bytes
5000     michael  36,064 bytes
5000     base     36,064 bytes
50000    roman    36,064 bytes
50000    michael  36,064 bytes
50000    base     133,200 bytes
500000   roman    44,384 bytes
500000   michael  44,384 bytes
500000   base     2,354,216 bytes
5000000  roman    44,384 bytes
5000000  michael  44,384 bytes
5000000  base     38,235,176 bytes

My takeaway from all of this: it's probably too late to change the type signature of mapM_ and forM_ in base, but this alternative implementation is a good fit for mono-traversable. Perhaps there are some rewrite rules that could be applied in base to get the benefits of this implementation as well.

Completely tangential, but: as long as I'm linking to pull requests based on blog posts, I've put together a PR for classy-prelude and conduit-combinators that gets rid of generalized I/O operations, based on my readFile blog post.

Categories: Offsite Blogs

FP Complete: Speeding up a distributed computation in Haskell

Planet Haskell - Wed, 01/18/2017 - 9:10am

While helping a client ship a medical device we were tasked with making its response time bearable. This was no easy feat, given that each request to this device requires running a simulation that takes hours if run on a single CPU. This long response time would make it impossible for doctors to use this device interactively, which in turn would make the device much less desirable -- think of a doctor having to wait hours between inputting the patient data and getting results, as opposed to getting results immediately as the data becomes available.

Luckily the simulations in question are embarrassingly parallel, and thus one obvious path to reducing the response time is to run them on multiple CPUs.

At the core of this device sits a Haskell program that performs the simulation. Thus the first step was to exploit Haskell built-in multi-core parallelism to achieve the parallelization. However the results were unsatisfactory, since we were unable to scale decently beyond 7 to 10 CPUs. Thus we created a custom distribution algorithm where separate Haskell runtimes communicate with TCP sockets, similar to what happens in Erlang. This also allowed us to scale beyond a single machine. We've described this effort in the past, see the report Scaling Up a Scientific Computation and the talk Parallelizing and distributing scientific software in Haskell.

This first effort allowed us to run simulations in a much shorter time, but it still did not allow us to scale nicely to hundreds of CPUs. This article describes how we fixed that by bypassing one of the high level facilities that Haskell provides.

High level languages are all about offering such facilities, to be able to write correct programs quicker. Haskell offers a great number of abstractions to help in this regard, such as garbage collection and laziness, and GHC also is full of tools on top of the language itself to write an ever greater number of programs in a more comfortable way.

One of the features that makes GHC stand out is the sophistication of the runtime it provides. Apart from being an impressive piece of work even just for implementing Haskell efficiently, it also offers features that are very useful for the kind of systems programming that writing a distributed application requires. Specifically, green threads and the GHC event manager make writing a fast multi-threaded server much easier than in other languages. For example the first versions of Warp, Haskell's most popular web server, outperformed most web servers in just 500 lines of code, largely thanks to these facilities -- you can find more info about this effort in the report Warp: A Haskell Web Server. Warp has since grown in code size to add new features, but the core is still using the same facilities and performing well.

Since the core of the software that we built is a server coordinating the work of many slaves, for our first version we reached for these facilities to write it. The server was reasonably fast and served us for a while, but we hit a ceiling pretty quickly beyond which we were unable to scale.

However, a nice thing about GHC Haskell is that it's very easy to drop down to a lower-level programming style when needed. This can be accomplished through the excellent foreign function interface to C paired with the low-level utilities in base. By doing so we were able to scale to hundreds of cores and run simulations up to 5 times faster than the best time we achieved with the previous version.

The program

As mentioned, the server in question is the master process in a distributed computing application. The application is essentially a particle filter, distributed across many processes which might be on different machines. Since we want multi-machine distribution, we use TCP sockets to communicate between the processes doing the computation.

At the core of the program logic we have a function taking some State and some Input, and generating some new states and an output associated with each one:

type Evolve = State -> Input -> [(State, Output)]

Note that a single state and input pair generates multiple states and output. The multiple outputs are due to the fact that in a particle filter each state (or rather each "particle") can be sampled 0 or multiple times. We need to run one such function on thousands of inputs:

-- Apply the `Evolve` to every given `State`, return
-- the new states and output.
evolveMany :: Evolve -> [State] -> [Input] -> [[(State, Output)]]
evolveMany f = zipWith f
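To make the shape of Evolve concrete, here is a toy instantiation (all types specialized to Int purely for illustration; the real State, Input, and Output are opaque domain types) in which an even particle is resampled twice and an odd one dies:

```haskell
-- Purely illustrative: the real State/Input/Output are opaque domain
-- types; here they are Ints so the example is runnable.
type ToyState  = Int
type ToyInput  = Int
type ToyOutput = Int

-- An even particle is sampled twice, an odd one zero times,
-- demonstrating the "0 or multiple children" behaviour of `Evolve`.
evolveToy :: ToyState -> ToyInput -> [(ToyState, ToyOutput)]
evolveToy s i
  | even s    = [(s + i, s * i), (s - i, s + i)]
  | otherwise = []

-- The same zipWith shape as the evolveMany specification above.
evolveManyToy :: [ToyState] -> [ToyInput] -> [[(ToyState, ToyOutput)]]
evolveManyToy = zipWith evolveToy
```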

Given this initial specification, there are a couple of adjustments we need to make if we want to be able to distribute the computation. First, the function will have to live in IO, since communication will happen through Sockets. Second, we won't refer to the states directly, but rather refer to them using tokens provided by the system. At the beginning we'll provide the initial states and get back tokens in result, and at each call to evolveMany we'll get -- instead of new States -- new tokens.

We can do this because we do not care about the content of the states (while we do care about the outputs), and by referring to them with tokens rather than directly we can avoid transferring them to other processes each time we need to operate on them, saving a lot of bandwidth and speeding up the computation greatly.

Thus, we'll also need to book-keep which slave processes are holding which state.

Finally, we'll need Sockets to communicate with the slave processes.

This gives us a new API:

-- We use `Map` and `Set` from `containers` for illustrative purposes; `HashMap`
-- from `unordered-containers` or a mutable hash table from `hashtables`
-- will most likely be more performant.
import Data.Map (Map)
import qualified Data.Map as Map
import Data.Set (Set)
import qualified Data.Set as Set

-- Some token representing a `State` on some slave.
data StateId

-- Some token representing a slave.
data SlaveId

-- Reset the states in the system to the given ones, returns a
-- 'StateId' for each state.
resetStates
  :: Map SlaveId Socket -- Connection to the slaves
  -> [State]
  -> IO (Map SlaveId (Set StateId), [StateId])
     -- Returns how the states have been repartitioned on the slaves
     -- and a list to know which `StateId` corresponds to which `State`.

-- Evolves the states with the given inputs, returns the outputs and
-- the new 'StateId's resulting from the evolution.
evolveMany
  :: Map SlaveId Socket -- Connections to the slaves
  -> Map SlaveId (Set StateId) -- Which states are on which slave
  -> Map StateId Input -- Inputs to each state
  -> IO (Map SlaveId (Set StateId), Map StateId [(StateId, Output)])
     -- Returns the new mapping from slaves to states, and
     -- the outputs.

When using this API, the usual pattern is to call resetStates at the beginning with the initial states and then a series of evolveMany calls afterwards, each using the StateIds returned by the previous call (by resetStates for the first call, by the previous evolveMany afterwards).

The challenge is to implement evolveMany as efficiently as possible.

To give an idea of the time involved, we usually have around 2000 states, a few tens of calls to evolveMany, and each call to Evolve takes a few hundredths of a second to complete, giving a single-threaded run time of a few hours, e.g.

2000 *  -- Number of states
  80 *  -- Number of calls to evolveMany
0.03s = -- Evolve time
1h 20m  -- Total running time

High level overview of the implementation

resetStates just assigns a unique StateId to each state, and then splits up and uploads the states evenly between the slaves.

All the complexity lies in evolveMany: the goal is to utilize the slaves as efficiently as possible.

We found pretty early that naively evolving the states present on each slave would not work, because:

  • Each call to Evolve results in many (possibly 0) children states (since the return type is a list), and we cannot predict how many we'll get in advance. This would cause different slaves to have a different number of states after a few calls to evolveMany, which in turn would cause the slaves to not be used efficiently, since some would end up being idle;
  • The runtime of an individual Evolve depends on the state and on the input, and we cannot predict it. This also can cause some slaves to finish earlier than others, causing inefficiencies.

More concretely, imagine a situation with 10 states, where 9 of the states take 1 second while there is an odd state that takes 10 seconds. If we have 2 slaves at our disposal, the most efficient distribution is to assign the slow state to one slave, and all the others to another slave, with one slave taking 10 seconds and the other taking 9. If we just distribute the states evenly between the slaves, 1 slave will take 14 seconds and one 5. Since the total runtime will be constrained by the slowest slave, we must be careful to avoid such long tails.
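The arithmetic in this example can be captured in a couple of lines (a toy model of the tail effect, not the actual scheduler): the runtime of a distribution is the largest per-slave total.

```haskell
-- Toy model of the example above: the total runtime is bounded by
-- the slowest slave, i.e. the maximum over slaves of their summed work.
makespan :: [[Double]] -> Double
makespan = maximum . map sum

-- Nine 1-second states plus one 10-second state, split over two slaves:
evenSplit, greedySplit :: [[Double]]
evenSplit   = [[10, 1, 1, 1, 1], [1, 1, 1, 1, 1]]   -- 14 s vs 5 s
greedySplit = [[10], [1, 1, 1, 1, 1, 1, 1, 1, 1]]   -- 10 s vs 9 s
```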

So we switched to a simple but effective method to utilize the slaves efficiently. The master process keeps track of the states present on each slave, and asks the slaves to process them in batches, say of 5. When a slave finishes its batch, it sends the output back to the master and waits for further instructions. If the slave still has states to evolve, the master sends a request for a new batch to be evolved. If the slave does not have states to update the master will search for a slave with states to spare, and request them. When a slave receives such a request it sends back the states to the master, which will forward them to the needy slave. When there are no more states to update, evolveMany is done.

The algorithm can be summed up as two state machines, one for the master and one for the slave:

-- This is what the master sends to the slave.
data Request
  -- Evolve the selected states
  = EvolveStates [StateId]
  -- Add the given states
  | AddStates [(StateId, State)]
  -- Remove the requested states, and return them to the master
  | RemoveStates [StateId]

-- This is what the slaves reply to the master.
data Response
  = StatesEvolved [(StateId, [(StateId, Output)])]
  | StatesAdded
  | StatesRemoved [(StateId, State)]

-- The slave has a set of `State`s indexed by `StateId`, and it updates
-- it at each request from the master.
slaveStateMachine :: Map StateId State -> Request -> (Map StateId State, Response)

-- Some type to refer to slaves uniquely.
data SlaveId

-- The master keeps track of which states each slave has, and will update
-- it. It also records the outputs we have received from the slaves so far.
data MasterState = MasterState
  { msSlavesStates :: Map SlaveId (Set StateId)
  , msStatesToEvolve :: Map StateId Input
  , msEvolvedStates :: Map StateId [(StateId, Output)]
  }

-- At each response from a slave the master updates its state and then
-- might reply with a new `Request`. Note that the `Request` might not
-- be directed at the same slave that sent the `Response`, since sometimes
-- we need to steal states from other slaves when the slave at hand does
-- not have states to update.
masterStateMachine :: MasterState -> SlaveId -> Response -> (MasterState, Maybe (SlaveId, Request))

The most common pattern of interaction between slave and master will be of a loop of EvolveStates and StatesEvolved:

This interaction between slave and master will continue until one slave runs out of states to evolve. In that case, the master will have to reach out to some other slave to be able to provide the needy slave with something to evolve. For example, this is what will happen if slave 3 runs out of states and the master decides to ship some states to it from slave 2:

The exact implementation of the state machines is not relevant, but given their types what's to note is that:

  • The slave will be a very simple loop that just waits for a request, processes it, and then replies to the master.
  • The master, on the other hand, is a bit more complicated: it needs to wait for responses from any slave, which means that we'll have to multiplex over multiple channels; and then it can reply to any slave.
First attempt, and performance

Now that we have abstracted out the logic of the master and the slaves in self-contained state machines, we can describe the slave and master processes. We'll assume IO functions to send and receive messages.

The slave implementation is trivial and won't change:

-- These functions will use `recv`/`send` to work with the `Socket`s,
-- and the `store` library to efficiently deserialize and serialize
-- the requests and responses.
receiveFromMaster :: Socket -> IO Request
sendToMaster :: Socket -> Response -> IO ()

slave :: Socket -- Connection to master
      -> IO a
slave sock = loop mempty -- No states at the beginning
  where
    loop states = do
      req <- receiveFromMaster sock
      let (states', resp) = slaveStateMachine states req
      sendToMaster sock resp
      loop states'

Note that a slave process is not bound to a single call to evolveMany, it just takes requests from a master.

The master on the other hand is essentially the implementation of evolveMany, and we have a lot more options to implement it. Our first version is a pretty idiomatic Haskell program, using one thread per slave so that we can wait on all of them at once, with the master state stored in an MVar that can be accessed from all the slave threads:

Each slave thread will run code waiting on a slave, modifying the shared state using the master state machine:

import Control.Concurrent.MVar
import Control.Monad (unless)

receiveResponse :: Socket -> IO Response
sendRequest :: Socket -> Request -> IO ()

-- Terminates when there is nothing left to do.
slaveThread :: Map SlaveId Socket -> MVar MasterState -> SlaveId -> IO ()
slaveThread slaveSockets masterStateVar slaveId = do
  resp <- receiveResponse (slaveSockets Map.! slaveId)
  (masterState, mbReq) <- modifyMVar masterStateVar $ \masterState ->
    let (masterState', mbReq) = masterStateMachine masterState slaveId resp
    in return (masterState', (masterState', mbReq))
  -- Send the request if needed
  mapM_ (\(targetId, req) -> sendRequest (slaveSockets Map.! targetId) req) mbReq
  -- Continue if there are still states to evolve
  unless (Map.null (msStatesToEvolve masterState)) $
    slaveThread slaveSockets masterStateVar slaveId

-- Runs the provided actions in separate threads, returns as
-- soon as any exits.
raceMany_ :: [IO ()] -> IO ()
raceMany_ xs0 = case xs0 of
  -- `race_` is from the `async` package.
  [] -> return ()
  [x] -> x
  x : xs -> race_ x (raceMany_ xs)

evolveMany
  :: Map SlaveId Socket
  -> Map SlaveId (Set StateId)
  -> Map StateId Input
  -> IO (Map SlaveId (Set StateId), Map StateId [(StateId, Output)])
evolveMany slaveSockets slaveStates inputs = do
  masterStateVar <- newMVar MasterState
    { msSlavesStates = slaveStates
    , msStatesToEvolve = inputs
    , msEvolvedStates = mempty
    }
  -- Run one thread per slave until one receives the response
  -- after which there are no states to evolve
  raceMany_ (map (slaveThread slaveSockets masterStateVar) (Map.keys slaveStates))
  -- Return the results in the `MasterState`
  masterState <- readMVar masterStateVar
  return (msSlavesStates masterState, msEvolvedStates masterState)

This implementation is simple and quite obviously correct, and it's also pretty fast. In fact, we were able to scale up to around 20 slaves quite well with it:

Note that both axes for this and every other plot in this article are logarithmic: if we scaled perfectly we'd get a straight line, which we're pretty close to.

However, things go downhill if we try to scale beyond 20 slaves. Here is a sample of the runtime with up to 450 slaves for six different scenarios:

These measurements were all taken on clusters of c4.8xlarge AWS instances with 18 physical cores, with up to 30 machines running at once. The benchmarking was automated using terraform, which was invaluable when evaluating the improvements.

It's evident that the distribution does not scale beyond around 40 slaves, and stalls completely between 50 and 100 slaves, after which adding slaves is detrimental to the runtime. Note that for the scenarios taking more time the scaling is better: this is because for those scenarios each individual call to the Evolve function takes longer, and thus the overhead of the distribution is less substantial. This is the case for scenario D, which starts out being the slowest with 17 slaves, taking more than 4000 seconds rather than 800-1000 seconds, but scaling much better.

From this data it was clear that if we wanted to be able to leverage a large number of machines to run our simulations in a minute or less we had to improve the performance of evolveMany.

Small aside: note how these plots contain a line "with taskset" and one without, with the one without performing noticeably worse. The line with taskset indicates measurements taken where each Haskell process is pinned to a physical CPU core: this improves performance substantially compared to letting the kernel schedule them. After finding this out we ran all subsequent tests pinning slave processes to physical cores. Hyperthreading was also detrimental to the runtime, since the increased distribution overhead far outweighed the gained CPU time; so we used only one process per physical CPU core and avoided hyperthreading. Keep in mind that since we're distributing the work manually using TCP sockets, each slave is a separate OS process that runs a dedicated Haskell runtime, which is why it makes sense to pin it to a single core.

Second attempt

By measuring how much time each slave spent working and how much time it spent waiting for instructions from the master, it became clear that the program was getting slower because the slaves spent more and more time waiting for instructions, rather than actually working. Thus, if we wanted proper scaling, we needed to lower the latency between the time a response reached the master and the time the slave received the next request.

Now, we tried to gain conclusive evidence of why our first version of evolveMany was slow, but profiling this sort of application is quite hard unless you're intimately familiar with the Haskell runtime -- which is almost like saying "if you are Simon Marlow".

We had, however, some hypotheses about why our program was slow. One possibility is that the event manager simply cannot handle hundreds of connections at the same time efficiently, at least in our use case.

Another suspicion is that the multi-threadedness of the first version worked to our disadvantage, since there would be a lot of pointless context switches while one thread was already modifying the MVar MasterState. In other words, any context switch between slave threads while one slave thread is already holding the MVar MasterState is (almost) wasted: the newly scheduled thread will block on the MVar MasterState right after receiving a slave response and will yield, delaying the completion of the loop body in the thread that was already processing the MasterState.

While our second version was based on these hypotheses, we were quite short on time and did not want to take the risk of rewriting the program only to find that we still could not scale as we desired. Thus, we set out to write the fastest possible version of evolveMany that we could think of.

The main change we wanted was to turn the server from a multi-threaded server multiplexing through the event manager to a single-threaded application multiplexing the sockets directly.

In Linux, the epoll set of syscalls exist for this exact reason: you can register multiple sockets to wait on with epoll_ctl, and then wait for any of them to be ready using epoll_wait.

However, in Haskell epoll is abstracted over by the GHC event manager, so there is no library to use these facilities directly. The GHC event manager does offer an interface to it in the form of GHC.Event.registerFd. However, all these functions are callback based -- they take a function that will be called in a green thread when the socket is ready. Thus we cannot easily write a single-threaded program directly using it. If we want to write a single-threaded loop we're forced to go through an additional synchronization primitive, such as an MVar, to signal in the callback provided to registerFd that a socket is ready to be read from. Note that the normal blocking read for Haskell sockets is implemented using threadWaitRead, which uses registerFd in exactly this way, by having the callback fill in an MVar that threadWaitRead will wait on. We tried this approach and got no performance improvement.

Thus we decided to just write the loop using epoll directly, which proved very painless given that the GHC codebase already contains bindings to the epoll functions, as part of the event manager. We released a simple library for people that need to do the same, simple-poll. Right now it only supports epoll, and is thus limited to Linux, but it should be easy to extend to other platforms by copy-pasting other bits of code from the GHC event manager.

Updating the old loop to an explicit multiplexing style, we have:

-- `System.Poll.EPoll` comes from the `simple-poll` package
import System.Poll.EPoll (EPoll)
import qualified System.Poll.EPoll as EPoll
import Network.Socket (Socket(MkSocket))
import System.Posix.Types (Fd(Fd))
import Control.Monad (forM_)

-- Receives the first responses to arrive from any of the slaves.
-- This amounts to calling `EPoll.wait` to get back a list of
-- sockets to read from, and then draining them in turn to
-- decode the `Response`.
--
-- Note that draining them might still not give us a response,
-- since the full response might not be available all at once,
-- and thus in the full version of the software this function will have
-- to hold some state containing partially read messages.
--
-- Also note that in the real software it's much better to return
-- a list of `(SlaveId, Response)` pairs. We have it return only
-- one for simplicity.
receiveFromAnySlave
  :: EPoll
  -> Map Fd SlaveId
     -- Reverse lookup table from `Fd`s to `SlaveId`s. We need it
     -- since `EPoll.wait` gives us the `Fd`s which are ready to
     -- be read from, and from that we need to get back which
     -- `SlaveId` it corresponds to, to return it.
  -> IO (SlaveId, Response)

-- Utility to get a file descriptor out of a `Socket`
socketFd :: Socket -> Fd
socketFd (MkSocket fd _ _ _ _) = Fd fd

evolveMany
  :: Map SlaveId Socket -- All the connections to the slaves
  -> Map SlaveId (Set StateId) -- The states held by each slave
  -> Map StateId Input -- The inputs to each state
  -> IO (Map SlaveId (Set StateId), Map StateId [(StateId, Output)])
evolveMany slaveSockets slaveStates inputs = EPoll.with 256 $ \epoll -> do
  -- First register all the sockets with `epoll_ctl`. `epollIn` is to
  -- indicate that we want to be notified when a socket can be read from.
  forM_ slaveSockets $ \socket ->
    EPoll.control epoll EPoll.controlOpAdd (socketFd socket) EPoll.epollIn
  -- Then start the event loop
  masterState <- loop epoll MasterState
    { msSlavesStates = slaveStates
    , msStatesToEvolve = inputs
    , msEvolvedStates = mempty
    }
  return (msSlavesStates masterState, msEvolvedStates masterState)
  where
    fdToSlaveIds :: Map Fd SlaveId
    fdToSlaveIds = Map.fromList
      [(socketFd sock, slaveId) | (slaveId, sock) <- Map.toList slaveSockets]

    loop :: EPoll -> MasterState -> IO MasterState
    loop epoll masterState = do
      -- Get a response from some slave
      (slaveId, resp) <- receiveFromAnySlave epoll fdToSlaveIds
      -- Update the state accordingly
      let (masterState', mbReq) = masterStateMachine masterState slaveId resp
      -- Send the new requests
      mapM_ (\(targetId, req) -> sendRequest (slaveSockets Map.! targetId) req) mbReq
      -- Continue if we're not done
      if Map.null (msStatesToEvolve masterState')
        then return masterState'
        else loop epoll masterState'

Once we did this, the performance increased dramatically, fulfilling our current scaling needs and probably getting quite close to optimal scaling for our use case, although we have not researched what more margin for improvements we have since we do not need them for now.

Going back to the original set of plots, the blue line shows the improved performance with our second implementation:

The plots clearly show a much nicer scaling pattern as the number of slaves increases, and runtimes often of 100 seconds or less, which represents a 2x to 5x improvement compared to the first version.

We also integrated other micro-optimizations that yielded less substantial improvements (in the 5 to 10% range), such as:

  • Using mutable hashtables instead of unordered-containers for most of the bookkeeping.
  • Reading from the Socket directly into a ByteBuffer and deserializing directly from there rather than copying into intermediate ByteStrings, reducing allocations drastically to perform deserialization, since we allocate the buffer where the socket data is read into upfront.

Our biggest takeaway from this experience is that in Haskell we can have the confidence that we'll always be able to write the task at hand to be as fast as possible with relative ease. Writing the epoll based version took around a day, including factoring out the bindings from the GHC event manager into a library.

Moreover, it's important to remember that the normal facilities for fast IO in Haskell (green threads + transparent evented IO) are fast enough for the overwhelming majority of cases, and much easier to manage and think about than manual evented IO. Michael Snoyman recently compared green threads to garbage collection, an apt comparison. Our software is one of the cases where the abstraction prevents performance, and thus we need to work without it.

Finally, it would be great to gain hard evidence on why the first program was slow, rather than just hypotheses. We tried quite hard to understand it but could not reach conclusive evidence in the time we had. We hope to get to the bottom of this issue when we have the time, and maybe make profiling these kind of programs easier in the meantime.


The work described was performed with Philipp Kant and Niklas Hambüchen. Thanks to Michael Snoyman, Philipp Kant, and Niklas Hambüchen for reviewing drafts of this blog post.

Categories: Offsite Blogs

Edward Z. Yang: Try Backpack: Cabal packages

Planet Haskell - Tue, 01/17/2017 - 10:17pm

This post is part two of a series about how you can try out Backpack, a new mixin package system for Haskell. In the previous post, we described how to use a new ghc --backpack mode in GHC to quickly try out Backpack's new signature features. Unfortunately, there is no way to distribute the input files to this mode as packages on Hackage. So in this post, we walk through how to assemble equivalent Cabal packages which have the same functionality.

Download a cabal-install nightly

Along with the GHC nightly, you will need a cabal-install nightly to run these examples. Assuming that you have installed hvr's PPA already, just aptitude install cabal-install-head and you will get a Backpack-ready cabal-install in /opt/cabal/head/bin/.

Otherwise, you will need to build cabal-install from source. I recommend using a released version of GHC (e.g., your system GHC, not a nightly) to build cabal-install.

Where we are going

Here is an abridged copy of the code we developed in the last post, where I have removed all of the module/signature contents:

unit str-bytestring where
    module Str
unit str-string where
    module Str
unit regex-types where
    module Regex.Types
unit regex-indef where
    dependency regex-types
    signature Str
    module Regex
unit main where
    dependency regex-types
    dependency regex-indef[Str=str-string:Str] (Regex as Regex.String)
    dependency regex-indef[Str=str-bytestring:Str] (Regex as Regex.ByteString)
    module Main

One obvious way to translate this file into Cabal packages is to define a package per unit. However, we can also define a single package with many internal libraries—a new feature, independent of Backpack, which lets you define private helper libraries inside a single package. Since this approach involves less boilerplate, we'll describe it first, before "productionizing" the libraries into separate packages.

For all of these examples, we assume that the source code of the modules and signatures has been copy-pasted into appropriate hs and hsig files respectively. You can find these files in the source-only branch of backpack-regex-example.

Single package layout

In this section, we'll step through the Cabal file which defines each unit as an internal library. You can find all the files for this version at the single-package branch of backpack-regex-example. This package can be built with a conventional cabal configure -w ghc-head (replace ghc-head with the path to your copy of GHC HEAD) and then cabal build.

The header of the package file is fairly ordinary, but as Backpack uses new Cabal features, cabal-version must be set to >=1.25 (note that Backpack does NOT work with Custom setup):

name:          regex-example
version:
build-type:    Simple
cabal-version: >=1.25

Private libraries. str-bytestring, str-string and regex-types are completely conventional Cabal libraries that only have modules. In previous versions of Cabal, we would have to make a package for each of them. However, with private libraries, we can simply list multiple library stanzas annotated with the internal name of the library:

library str-bytestring
  build-depends:   base, bytestring
  exposed-modules: Str
  hs-source-dirs:  str-bytestring

library str-string
  build-depends:   base
  exposed-modules: Str
  hs-source-dirs:  str-string

library regex-types
  build-depends:   base
  exposed-modules: Regex.Types
  hs-source-dirs:  regex-types

To keep the modules for each of these internal libraries separate, we give each a distinct hs-source-dirs. These libraries can be depended upon inside this package, but are hidden from external clients; only the public library (denoted by a library stanza with no name) is publicly visible.

Indefinite libraries. regex-indef is slightly different, in that it has a signature. But writing a library for it is not too different: signatures go in the aptly named signatures field:

library regex-indef
  build-depends:   base, regex-types
  signatures:      Str
  exposed-modules: Regex
  hs-source-dirs:  regex-indef

Instantiating. How do we instantiate regex-indef? In our bkp file, we had to explicitly specify how the signatures of the package were to be filled:

dependency regex-indef[Str=str-string:Str] (Regex as Regex.String)
dependency regex-indef[Str=str-bytestring:Str] (Regex as Regex.ByteString)

With Cabal, these instantiations can be specified through a more indirect process of mix-in linking, whereby the dependencies of a package are "mixed together", with required signatures of one dependency being filled by exposed modules of another dependency. Before writing the regex-example executable, let's write a regex library, which is like regex-indef, except that it is specialized for String:

library regex
  build-depends:      regex-indef, str-string
  reexported-modules: Regex as Regex.String

Here, regex-indef and str-string are mix-in linked together: the Str module from str-string fills the Str requirement from regex-indef. This library then reexports Regex under a new name that makes it clear it's the String instantiation.

We can easily do the same for a ByteString instantiated version of regex-indef:

library regex-bytestring
  build-depends:      regex-indef, str-bytestring
  reexported-modules: Regex as Regex.ByteString

Tie it all together. It's simple enough to add the executable and then build the code:

executable regex-example
  main-is:        Main.hs
  build-depends:  base, regex, regex-bytestring, regex-types
  hs-source-dirs: regex-example

In the root directory of the package, you can cabal configure; cabal build the package (make sure you pass -w ghc-head!). Alternatively, you can use cabal new-build to the same effect.

There's more than one way to do it

In the previous code sample, we used reexported-modules to rename modules at declaration-time, so that they did not conflict with each other. However, this was possible only because we created extra regex and regex-bytestring libraries. In some situations (especially if we are actually creating new packages as opposed to internal libraries), this can be quite cumbersome, so Backpack offers a way to rename modules at use-time, using the mixins field. It works like this: any package declared in build-depends can be specified in mixins with an explicit renaming, specifying which modules should be brought into scope, with what name.

For example, str-string and str-bytestring both export a module named Str. To refer to both modules without using package-qualified imports, we can rename them as follows:

executable str-example
  main-is:        Main.hs
  build-depends:  base, str-string, str-bytestring
  mixins:         str-string (Str as Str.String),
                  str-bytestring (Str as Str.ByteString)
  hs-source-dirs: str-example

The semantics of the mixins field is that we bring only the modules explicitly listed in the import specification (Str as Str.String) into scope for import. If a package never occurs in mixins, then we default to bringing all modules into scope (giving us the traditional behavior of build-depends). This does mean that if you say mixins: str-string (), you can force a component to have a dependency on str-string, but NOT bring any of its module into scope.
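The empty-import-specification case described above can be written out as a stanza. This is a hypothetical sketch (the component and directory names are invented for illustration): the component depends on str-string, so it is built and linked, but the empty specification () brings none of its modules into scope.

```cabal
-- Hypothetical stanza illustrating "mixins: str-string ()":
-- str-string is a real dependency, but no module from it is importable.
executable str-hidden-example
  main-is:        Main.hs
  build-depends:  base, str-string
  mixins:         str-string ()
  hs-source-dirs: str-hidden-example
```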

It has been argued that package authors should avoid defining packages with conflicting module names. So suppose that we restructure str-string and str-bytestring to have unique module names:

library str-string
  build-depends:   base
  exposed-modules: Str.String
  hs-source-dirs:  str-string

library str-bytestring
  build-depends:   base, bytestring
  exposed-modules: Str.ByteString
  hs-source-dirs:  str-bytestring

We would then need to rewrite regex and regex-bytestring to rename Str.String and Str.ByteString to Str, so that they fill the hole of regex-indef:

library regex
  build-depends:      regex-indef, str-string
  mixins:             str-string (Str.String as Str)
  reexported-modules: Regex as Regex.String

library regex-bytestring
  build-depends:      regex-indef, str-bytestring
  mixins:             str-bytestring (Str.ByteString as Str)
  reexported-modules: Regex as Regex.ByteString

In fact, with the mixins field, we can avoid defining the regex and regex-bytestring shim libraries entirely. We can do this by declaring regex-indef twice in mixins, renaming the requirements of each separately:

executable regex-example
  main-is:        Main.hs
  build-depends:  base, regex-indef, str-string, str-bytestring, regex-types
  mixins:         regex-indef (Regex as Regex.String)
                      requires (Str as Str.String),
                  regex-indef (Regex as Regex.ByteString)
                      requires (Str as Str.ByteString)
  hs-source-dirs: regex-example

This particular example is given in its entirety at the better-single-package branch in backpack-regex-example.

Note that requirement renamings are syntactically preceded by the requires keyword.

The art of writing Backpack packages is still in its infancy, so it's unclear what conventions will win out in the end. But here is my suggestion: when defining a module intended to implement a signature, follow the existing no-conflicting-module-names convention, but additionally reexport your module under the name of the signature. This trick takes advantage of the fact that Cabal will not report that a module is redundant unless it is actually used. So, suppose we have:

library str-string
  build-depends:      base
  exposed-modules:    Str.String
  reexported-modules: Str.String as Str
  hs-source-dirs:     str-string

library str-bytestring
  build-depends:      base, bytestring
  exposed-modules:    Str.ByteString
  reexported-modules: Str.ByteString as Str
  hs-source-dirs:     str-bytestring

Now all of the following components work:

library regex
  build-depends:      regex-indef, str-string
  reexported-modules: Regex as Regex.String

library regex-bytestring
  build-depends:      regex-indef, str-bytestring
  reexported-modules: Regex as Regex.ByteString

-- "import Str.String" is unambiguous, even if "import Str" is
executable str-example
  main-is:        Main.hs
  build-depends:  base, str-string, str-bytestring
  hs-source-dirs: str-example

-- All requirements are renamed away from Str, so all the
-- instantiations are unambiguous
executable regex-example
  main-is:        Main.hs
  build-depends:  base, regex-indef, str-string, str-bytestring, regex-types
  mixins:         regex-indef (Regex as Regex.String)
                      requires (Str as Str.String),
                  regex-indef (Regex as Regex.ByteString)
                      requires (Str as Str.ByteString)
  hs-source-dirs: regex-example

Separate packages

OK, so how do we actually scale this up into an ecosystem of indefinite packages, each of which can be used individually and maintained by separate individuals? The library stanzas stay essentially the same as above; just create a separate package for each one. Rather than reproduce all of the boilerplate here, the full source code is available in the multiple-packages branch of backpack-regex-example.

There is one important gotcha: the package manager needs to know how to instantiate and build these Backpack packages (in the single package case, the smarts were encapsulated entirely inside the Cabal library). As of writing, the only command that knows how to do this is cabal new-build (I plan on adding support to stack eventually, but not until after I am done writing my thesis; and I do not plan on adding support to old-style cabal install ever.)

Fortunately, it's very easy to use cabal new-build to build regex-example; just say cabal new-build -w ghc-head regex-example. Done!


If you actually want to use Backpack for real, what can you do? There are a number of possibilities:

  1. If you are willing to use GHC 8.2 only, and you only need to parametrize code internally (where the public library looks like an ordinary, non-Backpack package), using Backpack with internal libraries is a good fit. The resulting package will be buildable with Stack and cabal-install, as long as you are using GHC 8.2. This is probably the most pragmatic way you can make use of Backpack; the primary problem is that Haddock doesn't know how to deal with reexported modules, but this should be fixable.
  2. If you are willing to use cabal new-build only, then you can also write packages which have requirements, and let clients decide how they want to implement them.

Probably the biggest "real-world" impediment to using Backpack, besides any lurking bugs, is subpar support for Haddock. But if you are willing to overlook this (for now, in any case), please give it a try!

Categories: Offsite Blogs

Jasper Van der Jeugt: Lazy I/O and graphs: Winterfell to King's Landing

Planet Haskell - Mon, 01/16/2017 - 6:00pm

This post is about Haskell, and lazy I/O in particular. It is a bit longer than usual, so I will start with a high-level overview of what you can expect:

  • We talk about how we can represent graphs in a “shallow embedding”. This means we will not use a dedicated Graph type and rather represent edges by directly referencing other Haskell values.

  • This is a fairly good match when we want to encode infinite [1] graphs. When dealing with infinite graphs, there is no need to “reify” the graph and enumerate all the nodes and edges – this would be futile anyway.

  • We discuss a Haskell implementation of shortest path search in a weighted graph that works on these infinite graphs and that has good performance characteristics.

  • We show how we can implement lazy I/O to model infinite graphs as pure values in Haskell, in a way that only the “necessary” parts of the graph are loaded from a database. This is done using the unsafeInterleaveIO primitive.

  • Finally, we discuss the disadvantages of this approach as well, and we review some of the common problems associated with lazy I/O.

Let’s get to it!

As usual, this is a literate Haskell file, which means that you can just load this blogpost into GHCi and play with it. You can find the raw .lhs file here.

> {-# LANGUAGE OverloadedStrings #-}
> {-# LANGUAGE ScopedTypeVariables #-}
> import Control.Concurrent.MVar (MVar, modifyMVar, newMVar)
> import Control.Monad (forM_, unless)
> import Control.Monad.State (State, gets, modify, runState)
> import Data.Hashable (Hashable)
> import qualified Data.HashMap.Strict as HMS
> import qualified Data.HashPSQ as HashPSQ
> import Data.Monoid ((<>))
> import qualified Data.Text as T
> import qualified Data.Text.IO as T
> import qualified Database.SQLite.Simple as SQLite
> import qualified System.IO.Unsafe as IO

The problem at hand

As an example problem, we will look at finding the shortest path between cities in Westeros, the fictional location where the A Song of Ice and Fire novels (and HBO’s Game of Thrones) take place.

We model the different cities in a straightforward way. In addition to a unique ID used to identify them, they also have a name, a position (X,Y coordinates) and a list of reachable cities, with an associated time (in days) it takes to travel there. This travel time, also referred to as the cost, is not necessarily deducible from the two sets of X,Y coordinates: some roads are faster than others.

> type CityId = T.Text
>
> data City = City
>     { cityId         :: CityId
>     , cityName       :: T.Text
>     , cityPos        :: (Double, Double)
>     , cityNeighbours :: [(Double, City)]
>     }

Having direct access to the neighbouring cities, instead of having to go through CityIds, has both advantages and disadvantages.

On one hand, updating these values becomes cumbersome at best, and impossible at worst. If we wanted to change a city’s name, we would have to traverse all other cities to update possible references to the changed city.

On the other hand, it makes access more convenient (and faster!). Since we want a read-only view on the data, it works well in this case.

Getting the data

We will be using data extracted from, conveniently licensed under a Creative Commons license. You can find the complete SQL dump here. The schema of the database should not be too surprising:

CREATE TABLE cities (
    id   text PRIMARY KEY NOT NULL,
    name text NOT NULL,
    x    float NOT NULL,
    y    float NOT NULL
);
CREATE TABLE roads (
    origin      text NOT NULL,
    destination text NOT NULL,
    cost        float NOT NULL,
    PRIMARY KEY (origin, destination)
);
CREATE INDEX roads_origin ON roads (origin);

The road costs have been generated by multiplying the actual distances with a random number uniformly chosen between 0.6 and 1.4. Cities have been (bidirectionally) connected to at least four closest neighbours. This ensures that every city is reachable.

We will use SQLite in our example because there is almost no setup involved. You can load this database by issuing:

curl -L | sqlite3 got.db

But instead of considering the whole database (which we’ll get to later), let’s construct a simple example in Haskell so we can demonstrate the interface a bit. We can use a let to create bindings that refer to one another easily.

> test01 :: IO ()
> test01 = do
>     let winterfell = City "wtf" "Winterfell" (-105, 78)
>             [(13, moatCailin), (12, whiteHarbor)]
>         whiteHarbor = City "wih" "White Harbor" (-96, 74)
>             [(15, braavos), (12, winterfell)]
>         moatCailin = City "mtc" "Moat Cailin" (-104, 72)
>             [(20, crossroads), (13, winterfell)]
>         braavos = City "brv" "Braavos" (-43, 67)
>             [(17, kingsLanding), (15, whiteHarbor)]
>         crossroads = City "crs" "Crossroads Inn" (-94, 58)
>             [(7, kingsLanding), (20, moatCailin)]
>         kingsLanding = City "kgl" "King's Landing" (-84, 45)
>             [(7, crossroads), (17, braavos)]
>
>     printSolution $
>         shortestPath cityId cityNeighbours winterfell kingsLanding

Illustration of test01

printSolution is defined as:

> printSolution :: Maybe (Double, [City]) -> IO ()
> printSolution Nothing = T.putStrLn "No solution found"
> printSolution (Just (cost, path)) = T.putStrLn $
>     "cost: " <> T.pack (show cost) <>
>     ", path: " <> T.intercalate " -> " (map cityName path)

We get exactly what we expect in GHCi:

*Main> test01
cost: 40.0, path: Winterfell -> Moat Cailin -> Crossroads Inn -> King's Landing

So far so good! Now let’s dig in to how shortestPath works.

The Shortest Path algorithm

The following algorithm is known as Uniform Cost Search. It is a variant of Dijkstra’s graph search algorithm that is able to work with infinite graphs (or graphs that do not fit in memory anyway). It returns the shortest path between a known start and goal in a weighted directed graph.

Because this algorithm attempts to solve the problem the right way, including keeping back references, it is not simple. Therefore, if you are only interested in the part about lazy I/O, feel free to skip to this section and return to the algorithm later.

We have two auxiliary datatypes.

BackRef is a wrapper around a node and the previous node on the shortest path to the former node. Keeping these references around is necessary to iterate a list describing the entire path at the end.

> data BackRef node = BackRef {brNode :: node, brPrev :: node}

We will be using a State monad to implement the shortest path algorithm. This is our state:

> data SearchState node key cost = SearchState
>     { ssQueue    :: HashPSQ.HashPSQ key cost (BackRef node)
>     , ssBackRefs :: HMS.HashMap key node
>     }

In our state, we have:

  • A priority queue of nodes we will visit next in ssQueue, including back references. Using a priority queue will let us grab the next node with the lowest associated cost in a trivial way.

  • Secondly, we have the ssBackRefs map. That one serves two purposes: to keep track of which nodes we have already explored (the keys in the map), and to keep the back references of those locations (the values in the map).

These two datatypes are only used internally in the shortestPath function. Ideally, we would be able to put them in the where clause, but that is not possible in Haskell.

Instead of declaring a Node typeclass (possibly with associated types for the key and cost types), I decided to go with simple higher-order functions. We only need two of those function arguments after all: a function to give you a node’s key (nodeKey) and a function to get the node’s neighbours and associated costs (nodeNeighbours).

> shortestPath
>     :: forall node key cost.
>        (Ord key, Hashable key, Ord cost, Num cost)
>     => (node -> key)
>     -> (node -> [(cost, node)])
>     -> node
>     -> node
>     -> Maybe (cost, [node])
> shortestPath nodeKey nodeNeighbours start goal =

We start by creating an initial SearchState for our algorithm. Our initial queue holds one item (implying that we need to explore the start) and our initial back references map is empty (we haven't explored anything yet).

>     let startbr      = BackRef start start
>         queue0       = HashPSQ.singleton (nodeKey start) 0 startbr
>         backRefs0    = HMS.empty
>         searchState0 = SearchState queue0 backRefs0

walk is the main body of the shortest path search. We call that and if we found a shortest path, we return its cost together with the path which we can reconstruct from the back references (followBackRefs).

>         (mbCost, searchState1) = runState walk searchState0 in
>     case mbCost of
>         Nothing   -> Nothing
>         Just cost -> Just
>             (cost, followBackRefs (ssBackRefs searchState1))
>   where

Now, we have a bunch of functions that are used within the algorithm. The first one, walk, is the main body. We start by exploring the next node in the queue. By construction, this is always a node we haven’t explored before. If this node is the goal, we’re done. Otherwise, we check the node’s neighbours and update the queue with those neighbours. Then, we recursively call walk.

>     walk :: State (SearchState node key cost) (Maybe cost)
>     walk = do
>         mbNode <- exploreNextNode
>         case mbNode of
>             Nothing -> return Nothing
>             Just (cost, curr)
>                 | nodeKey curr == nodeKey goal ->
>                     return (Just cost)
>                 | otherwise -> do
>                     forM_ (nodeNeighbours curr) $ \(c, next) ->
>                         updateQueue (cost + c) (BackRef next curr)
>                     walk

Exploring the next node is fairly easy to implement using a priority queue: we simply need to pop the element with the minimal priority (cost) using minView. We also need to indicate that we reached this node and save the back reference by inserting that info into ssBackRefs.

>     exploreNextNode
>         :: State (SearchState node key cost) (Maybe (cost, node))
>     exploreNextNode = do
>         queue0 <- gets ssQueue
>         case HashPSQ.minView queue0 of
>             Nothing -> return Nothing
>             Just (_, cost, BackRef curr prev, queue1) -> do
>                 modify $ \ss -> ss
>                     { ssQueue    = queue1
>                     , ssBackRefs =
>                         HMS.insert (nodeKey curr) prev (ssBackRefs ss)
>                     }
>                 return $ Just (cost, curr)

updateQueue is called as new neighbours are discovered. We are careful about adding new nodes to the queue:

  1. If we have already explored this neighbour, we don’t need to add it. This is done by checking if the neighbour key is in ssBackRefs.
  2. If the neighbour is already present in the queue with a lower priority (cost), we don’t need to add it, since we want the shortest path. This is taken care of by the utility insertIfLowerPrio, which is defined below.
>     updateQueue
>         :: cost -> BackRef node -> State (SearchState node key cost) ()
>     updateQueue cost backRef = do
>         let node = brNode backRef
>         explored <- gets ssBackRefs
>         unless (nodeKey node `HMS.member` explored) $ modify $ \ss -> ss
>             { ssQueue = insertIfLowerPrio
>                 (nodeKey node) cost backRef (ssQueue ss)
>             }

If the algorithm finishes, we have found the lowest cost from the start to the goal, but we don’t have the path ready. We need to reconstruct this by following the back references we saved earlier. followBackRefs does that for us. It recursively looks up nodes in the map, constructing the path in the accumulator acc on the way, until we reach the start.

>     followBackRefs :: HMS.HashMap key node -> [node]
>     followBackRefs paths = go [goal] goal
>       where
>         go acc node0 = case HMS.lookup (nodeKey node0) paths of
>             Nothing    -> acc
>             Just node1 ->
>                 if nodeKey node1 == nodeKey start
>                     then start : acc
>                     else go (node1 : acc) node1
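To see the back-reference walk in isolation, here is a standalone sketch of the same idea, simplified to plain String keys and Data.Map (instead of the post's HashMap and node values), so it can be run on its own:

```haskell
import qualified Data.Map.Strict as M

-- Walk a map of back references (node -> previous node on the shortest
-- path) from the goal back to the start, accumulating the path.
followBackRefs :: String -> String -> M.Map String String -> [String]
followBackRefs start goal paths = go [goal] goal
  where
    go acc node0 = case M.lookup node0 paths of
        Nothing    -> acc
        Just node1
            | node1 == start -> start : acc
            | otherwise      -> go (node1 : acc) node1

main :: IO ()
main = do
    -- King's Landing was reached via Crossroads, which was reached via
    -- Moat Cailin, which was reached from Winterfell.
    let backRefs = M.fromList
            [("kgl", "crs"), ("crs", "mtc"), ("mtc", "wtf")]
    print (followBackRefs "wtf" "kgl" backRefs)
    -- ["wtf","mtc","crs","kgl"]
```

The accumulator means the path comes out in start-to-goal order without a final reverse.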

That’s it! The only utility left is the insertIfLowerPrio function. Fortunately, we can easily define this using the alter function from the psqueues package. That function allows us to change a key’s associated value and priority. It also allows us to return an additional result, but we don’t need that, so we just use () there.

> insertIfLowerPrio
>     :: (Hashable k, Ord p, Ord k)
>     => k -> p -> v -> HashPSQ.HashPSQ k p v -> HashPSQ.HashPSQ k p v
> insertIfLowerPrio key prio val = snd . HashPSQ.alter
>     (\mbOldVal -> case mbOldVal of
>         Just (oldPrio, _)
>             | prio < oldPrio -> ((), Just (prio, val))
>             | otherwise      -> ((), mbOldVal)
>         Nothing -> ((), Just (prio, val)))
>     key

Interlude: A (very) simple cache

Lazy I/O will guarantee that we only load the nodes in the graph when necessary.

However, since we know that the nodes in the graph do not change over time, we can build an additional cache around it. That way, we can also guarantee that we only load every node once.

Implementing such a cache is very simple in Haskell. We can simply use an MVar, which will even take care of blocking [2] when we have concurrent access to the cache (assuming that is what we want).

> type Cache k v = MVar (HMS.HashMap k v)
>
> newCache :: IO (Cache k v)
> newCache = newMVar HMS.empty
>
> cached :: (Hashable k, Ord k) => Cache k v -> k -> IO v -> IO v
> cached mvar k iov = modifyMVar mvar $ \cache -> do
>     case HMS.lookup k cache of
>         Just v  -> return (cache, v)
>         Nothing -> do
>             v <- iov
>             return (HMS.insert k v cache, v)

Note that we don’t really delete things from the cache. In order to keep things simple, we can assume that we will use a new cache for every shortest path we want to find, and that we throw away that cache afterwards.
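We can check the "load every value once" property with a small standalone sketch. This version swaps the post's HashMap for Data.Map (so only base and containers are needed) and counts how often the underlying IO action actually runs:

```haskell
import Control.Concurrent.MVar (MVar, modifyMVar, newMVar)
import Data.IORef (modifyIORef, newIORef, readIORef)
import qualified Data.Map.Strict as M

type Cache k v = MVar (M.Map k v)

-- Same pattern as in the post: run the action only on a cache miss.
cached :: Ord k => Cache k v -> k -> IO v -> IO v
cached mvar k iov = modifyMVar mvar $ \cache ->
    case M.lookup k cache of
        Just v  -> return (cache, v)
        Nothing -> do
            v <- iov
            return (M.insert k v cache, v)

main :: IO ()
main = do
    counter <- newIORef (0 :: Int)
    cache   <- newMVar M.empty
    let load = modifyIORef counter (+ 1) >> return "loaded"
    _ <- cached cache "k" load
    _ <- cached cache "k" load  -- second call is served from the cache
    n <- readIORef counter
    print n  -- 1: the underlying IO action ran only once
```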

Loading the graph using Lazy I/O

Now, we get to the main focus of the blogpost: how to use lazy I/O primitives to ensure resources are only loaded when they are needed. Since we are only concerned about one datatype (City) our loading code is fairly easy.

The most important loading function takes the SQLite connection, the cache we wrote up previously, and a city ID. We immediately use the cached combinator in the implementation, to make sure we load every CityId only once.

> getCityById
>     :: SQLite.Connection -> Cache CityId City -> CityId
>     -> IO City
> getCityById conn cache id' = cached cache id' $ do

Now, we get some information from the database. We play it a bit loose here and assume a singleton list will be returned from the query.

>     [(name, x, y)] <- SQLite.query conn
>         "SELECT name, x, y FROM cities WHERE id = ?" [id']

The neighbours are stored in a different table because we have a properly normalised database. We can write a simple query to obtain all roads starting from the current city:

>     roads <- SQLite.query conn
>         "SELECT cost, destination FROM roads WHERE origin = ?"
>         [id'] :: IO [(Double, CityId)]

This leads us to the crux of the matter. The roads variable contains something of the type [(Double, CityId)], and what we really want is [(Double, City)]. We need to recursively call getCityById to load what we want. However, doing this “the normal way” would cause problems:

  1. Since the IO monad is strict, we would end up in an infinite loop if there is a cycle in the graph (which is almost always the case for roads and cities).
  2. Even if there was no cycle, we would run into trouble with our usage of MVar in the Cache. We block access to the Cache while we are in the cached combinator, so calling getCityById again would cause a deadlock.

This is where Lazy I/O shines. We can implement lazy I/O by using the unsafeInterleaveIO primitive. Its type is very simple and doesn’t look as threatening as unsafePerformIO.

unsafeInterleaveIO :: IO a -> IO a

It takes an IO action and defers it. This means that the IO action is not executed right now, but only when the value is demanded. That is exactly what we want!
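This deferral can be observed directly with a tiny standalone experiment (separate from the City code; the IORef here is just a hypothetical probe to detect when the action runs):

```haskell
import Data.IORef (newIORef, readIORef, writeIORef)
import System.IO.Unsafe (unsafeInterleaveIO)

main :: IO ()
main = do
    ref <- newIORef False
    -- The wrapped action is NOT run here; x is a thunk.
    x <- unsafeInterleaveIO $ do
        writeIORef ref True
        return (42 :: Int)
    readIORef ref >>= print  -- False: the action has not run yet
    print x                  -- demanding x runs the deferred action
    readIORef ref >>= print  -- True: running it was a side effect of print
```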

We can simply wrap the recursive calls to getCityById using unsafeInterleaveIO:

>     neighbours <- IO.unsafeInterleaveIO $
>         mapM (traverse (getCityById conn cache)) roads

And then return the City we constructed:

> return $ City id' name (x, y) neighbours

Lastly, we will add a quick-and-dirty wrapper around getCityById so that we are also able to load cities by name. Its implementation is trivial:

> getCityByName
>     :: SQLite.Connection -> Cache CityId City -> T.Text
>     -> IO City
> getCityByName conn cache name = do
>     [[id']] <- SQLite.query conn
>         "SELECT id FROM cities WHERE name = ?" [name]
>     getCityById conn cache id'

Now we can neatly wrap things up in our main function:

> main :: IO ()
> main = do
>     cache <- newCache
>     conn  <- SQLite.open "got.db"
>     winterfell <- getCityByName conn cache "Winterfell"
>     kings      <- getCityByName conn cache "King's Landing"
>     printSolution $
>         shortestPath cityId cityNeighbours winterfell kings

This works as expected:

*Main> :main
cost: 40.23610549037591, path: Winterfell -> Moat Cailin -> Greywater Watch -> Inn of the Kneeling Man -> Fairmarket -> Brotherhood Without Banners Hideout -> Crossroads Inn -> Darry -> Saltpans -> QuietIsle -> Antlers -> Sow's Horn -> Brindlewood -> Hayford -> King's Landing

Disadvantages of Lazy I/O

Lazy I/O also has many disadvantages, which have been widely discussed. Among those are:

  1. Code becomes harder to reason about. In a setting without lazy I/O, you can casually reason about an Int as either an integer that’s already computed, or as something that will do some (pure) computation and then yield an Int.

    When lazy I/O enters the picture, things become more complicated. That Int you wanted to print? Yeah, it fired a bunch of missiles and returned the bodycount.

    This is why I would not seriously consider using lazy I/O when working with a team or on a large project – it can be easy to forget what is lazily loaded and what is not, and there’s no easy way to tell.

  2. Scarce resources can easily become a problem if you are not careful. If we keep a reference to a City in our heap, that means we also keep a reference to the cache and the SQLite connection.

    We must ensure that we fully evaluate the solution to something that doesn’t refer to these resources (to e.g. a printed string) so that the references can be garbage collected and the connections can be closed.

    Closing the connections is a problem in itself – if we cannot guarantee that e.g. streams will be fully read, we need to rely on finalizers, which are pretty unreliable…

  3. If we go a step further and add concurrency to our application, it becomes even trickier. Deadlocks are not easy to reason about – so how about reasoning about deadlocks when you’re not sure when the IO is going to be executed at all?

Despite all these shortcomings, I believe lazy I/O is a powerful and elegant tool that belongs in every Haskeller’s toolbox. Like pretty much anything, you need to be aware of what you are doing and understand the advantages as well as the disadvantages.

For example, the above downsides do not really apply if lazy I/O is only used within a module. For this blogpost, that means we could safely export the following interface:

> shortestPathBetweenCities
>     :: FilePath                      -- ^ Database name
>     -> CityId                        -- ^ Start city ID
>     -> CityId                        -- ^ Goal city ID
>     -> IO (Maybe (Double, [CityId])) -- ^ Cost and path
> shortestPathBetweenCities dbFilePath startId goalId = do
>     cache <- newCache
>     conn  <- SQLite.open dbFilePath
>     start <- getCityById conn cache startId
>     goal  <- getCityById conn cache goalId
>     case shortestPath cityId cityNeighbours start goal of
>         Nothing -> return Nothing
>         Just (cost, path) ->
>             let ids = map cityId path in
>             cost `seq` foldr seq () ids `seq`
>             return (Just (cost, ids))

Thanks for reading – and I hope I was able to offer you a nuanced view on lazy I/O. Special thanks to Jared Tobin for proofreading.

  1. In this blogpost, I frequently talk about “infinite graphs”. Of course most of these examples are not truly infinite, but we can consider examples that do not fit in memory completely, and in that way we can regard them as “infinite for practical purposes”.

  2. While blocking is good in this case, it might hurt performance when running in a concurrent environment. A good solution to that would be to stripe the MVars based on the keys, but that is beyond the scope of this blogpost. If you are interested in the subject, I talk about it a bit here.

Categories: Offsite Blogs

Michael Snoyman: safe-prelude: a thought experiment

Planet Haskell - Sun, 01/15/2017 - 6:00pm

This blog post is to share a very rough first stab at a new prelude I played around with earlier this month. I haven't used it in any significant way, and haven't spent more than a few hours on it total. I wrote it because I knew it was the only way to get the idea out of my head, and am sharing it in case anyone finds the idea intriguing or useful.

The project is available on Github at snoyberg/safe-prelude, and I've uploaded the Haddocks for easier reading (though, be warned, they aren't well organized at all). The rest of this post is just a copy of the file for the project.

This is a thought experiment in a different point in the alternative prelude design space. After my blog post on readFile, I realized I was unhappy with the polymorphic nature of readFile in classy-prelude. Combine that with Haskell Pitfalls, and I've been itching to try something else. I have a lot of hope for the foundation project, but wanted to play with this in the short term.

  • No partial functions, period. If a function can fail, its return type must express that. (And for our purposes: IO functions with runtime exceptions are not partial.)
  • Choose best in class libraries and promote them. bytestring and text fit that bill, as an example. Full listing below.
  • Regardless of the versions of underlying libraries, this package will always export a consistent API, so that CPP usage should be constrained to just inside this package.
  • Use generalizations (via type classes) when they are well established. For example: Foldable and Traversable yes, MonoFoldable no.

    • Controversial: Avoid providing list-specific functions. This connects to the parent point. Most of the time, I'd argue that lists are not the correct choice, and instead a Vector should be used. There is no standard for sequence-like typeclasses (though many exist), so we're not going to generalize. But we're also not going to use a less efficient representation.

      I was torn on this, but decided in favor of leaving out functions initially, on the basis that it's easier to add something in later rather than remove it.

  • Encourage qualified imports with a consistent naming scheme. This is a strong departure from classy-prelude, which tried to make it unnecessary to use qualified imports. I'll save my feelings about qualified imports for another time; this is just a pragmatic choice given the other constraints.
  • Export any non-conflicting and not-discouraged names from this module that make sense, e.g. ByteString, Text, or readIORef.
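The first point, banning partial functions, is the familiar trade of read for readMaybe: failure shows up in the return type instead of as a runtime exception. A minimal illustration (parsePort is a made-up name, not part of safe-prelude):

```haskell
import Text.Read (readMaybe)

-- Total parsing: bad input yields Nothing rather than the
-- runtime crash the partial `read` would produce.
parsePort :: String -> Maybe Int
parsePort = readMaybe
```

Here parsePort "8080" is Just 8080 while parsePort "oops" is Nothing, and the caller is forced by the type to handle both cases.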

This list may fall out of date, so check the .cabal file for a current and complete listing. I'm keeping this here to include reasoning for some libraries:

  • bytestring and text, despite some complaints, are clearly the most popular representation for binary and textual data, respectively
  • containers and unordered-containers are both commonly used. Due to lack of generalization, this library doesn't expose any functions for working with their types, but they are common enough that adding the dependency just for exposing the type name is worth it
  • safe-exceptions hides the complexity of asynchronous exceptions, and should be used in place of Control.Exception
  • transformers and mtl are clear winners in the monad transformer space, at least for now
  • While young, say has been very useful for me in avoiding interleaved output issues
  • Others without real competitors: deepseq, semigroups

Packages I considered but have not included yet:

  • stm is an obvious winner, and while I use it constantly, I'm not convinced everyone else uses it as much as I do. Also, there are some questions around generalizing its functions (e.g., atomically could be in MonadIO), and I don't want to make that decision yet.

    • stm-chans falls into this category too
  • async is an amazing library, and in particular the race, concurrently, and Concurrently bits are an easy win. I've left it out for now due to questions of generalizing to MonadBaseControl (see lifted-async and its .Safe module)

  • Similar argument applies to monad-unlift

  • I didn't bother with exposing the Vector type... because which one would I expose? The Vector typeclass? Boxed Vector? Unboxed? I could do the classy-prelude thing and define type UVector = Data.Vector.Unboxed.Vector, but I'd rather not do such renamings.

Qualified imports

Here are the recommended qualified imports when working with safe-prelude.

import qualified "bytestring" Data.ByteString as B
import qualified "bytestring" Data.ByteString.Lazy as BL
import qualified "text" Data.Text as T
import qualified "text" Data.Text.Lazy as TL
import qualified "containers" Data.Map.Strict as Map
import qualified "containers" Data.Set as Set
import qualified "unordered-containers" Data.HashMap.Strict as HashMap
import qualified "unordered-containers" Data.HashSet as HashSet
Categories: Offsite Blogs

Dominic Steinitz: Calling Haskell from C

Planet Haskell - Sat, 01/14/2017 - 7:39am

As part of improving the random number generation story for Haskell, I want to be able to use the testu01 library with the minimal amount of Haskell wrapping. testu01 assumes that there is a C function which returns the random number. The ghc manual gives an example but does not give all the specifics. These are my notes on how to get the example working under OS X (El Capitan 10.11.5 to be precise).

The Haskell:

{-# OPTIONS_GHC -Wall #-}
{-# LANGUAGE ForeignFunctionInterface #-}

module Foo where

foreign export ccall foo :: Int -> IO Int

foo :: Int -> IO Int
foo n = return (length (f n))

f :: Int -> [Int]
f 0 = []
f n = n:(f (n-1))

The .cabal:

name:                test-via-c
version:
homepage:            TBD
license:             MIT
author:              Dominic Steinitz
maintainer:
category:            System
build-type:          Simple
cabal-version:       >=1.10

executable Foo.dylib
  main-is:             Foo.hs
  other-extensions:    ForeignFunctionInterface
  build-depends:       base >=4.7 && =0.6 && <0.7
  hs-source-dirs:      src
  default-language:    Haskell2010
  include-dirs:        src
  ghc-options:         -O2 -shared -fPIC -dynamic
  extra-libraries:     HSrts-ghc8.0.1

On my computer running

cabal install

places the library in


The C:

#include <stdio.h>
#include "HsFFI.h"
#include "../dist/build/Foo.dylib/Foo.dylib-tmp/Foo_stub.h"

int main(int argc, char *argv[])
{
  int i;

  hs_init(&argc, &argv);

  for (i = 0; i < 5; i++) {
    printf("%d\n", foo(2500));
  }

  hs_exit();
  return 0;
}

On my computer this can be compiled with

gcc-6 Bar.c ~/Library/Haskell/ghc-8.0.1/lib/test-via-c- -I/Library/Frameworks/GHC.framework/Versions/8.0.1-x86_64/usr/lib/ghc-8.0.1/include -L/Library/Frameworks/GHC.framework/Versions/8.0.1-x86_64/usr/lib/ghc-8.0.1/rts -lHSrts-ghc8.0.1

and can be run with

DYLD_LIBRARY_PATH= ~/Library/Haskell/ghc-8.0.1/lib/test-via-c- /Library/Frameworks/GHC.framework/Versions/8.0.1-x86_64/usr/lib/ghc-8.0.1/rts

N.B. setting DYLD_LIBRARY_PATH like this is not recommended as it is a good way of breaking things. I have tried setting DYLD_FALLBACK_LIBRARY_PATH but only to get an error message. Hopefully, at some point I will be able to post a robust way of getting the executable to pick up the required dynamic libraries.

Categories: Offsite Blogs

Brent Yorgey: My new programming languages course

Planet Haskell - Fri, 01/13/2017 - 3:55pm

tl;dr: my new PL course is now finished, and all the course materials are freely available. Working through all the exercises should be a great option for anyone wishing to learn some basics of programming language design and implementation.

Last May, I wrote about my ideas for designing a new PL course, and got a lot of great comments and feedback. Well, somehow I survived the semester, and the course is now over. In the end I’m pretty happy with how it went (though of course there are always things that can be improved next time).

I decided to use class time in an unconventional way: for each class meeting I created a “module”, consisting of a literate Haskell file with some example code, explanatory text, and lots of holes where students needed to write answers to exercises or fill in code. I split the students into groups, and they spent class time just working through the module. Instead of standing at the front lecturing, I just wandered around watching them work and answering questions. It took a bit of getting used to—for the first few classes I couldn’t shake the feeling that I wasn’t really doing my job—but it quickly became clear that the students were really learning and engaging with the material in a way that they would not have been able to if I had just lectured.

A happy byproduct of this approach is that the modules are fairly self-contained and can now be used by anyone to learn the material. Reading through all the modules and working through the exercises should be a great option for anyone wishing to learn some basics of programming language design and implementation. For example, I know I will probably reuse it to get summer research students up to speed. Note that the course assumes no knowledge of Haskell (so those familiar with Haskell can safely skip the first few modules), but introduces just enough to get where I want to go.

I don’t plan to release any solutions, so don’t ask. But other than that, questions, comments, bug reports, etc. are welcome!

Categories: Offsite Blogs

FP Complete: Containerizing a legacy application: an overview

Planet Haskell - Thu, 01/12/2017 - 9:45am

An overview of what containerization is, the reasons to consider running a legacy application in Docker containers, the process to get it there, the issues you may run into, and next steps once you are deploying with containers. You'll reduce the stress of deployments, and take your first steps on the path toward no downtime and horizontal scaling.


Note: This post focuses on simplifying deployment of the application. It does not cover topics that may require re-architecting parts of the application, such as high-availability and horizontal scaling.

Concepts

What is a "Legacy" App?

There's no one set of attributes that typifies all legacy apps, but common attributes include:

  • Using the local filesystem for persistent storage, with data files intermingled with application files.
  • Running many services on one server, such as a MySQL database, Redis server, Nginx web server, a Ruby on Rails application, and a bunch of cron jobs.
  • Installation and upgrades use a hodgepodge of scripts and manual processes (often poorly documented).
  • Configuration is stored in files, often in multiple places and intermingled with application files.
  • Inter-process communication uses the local filesystem (e.g. dropping files in one place for another process to pick up) rather than TCP/IP.
  • Designed assuming one instance of the application would run on a single server.
Disadvantages of the legacy approach
  • Automating deployments is difficult
  • If you need multiple customized instances of the application, it's hard to "share" a single server between multiple instances.
  • If the server goes down, can take a while to replace due to manual processes.
  • Deploying new versions is a fraught manual or semi-manual process which is hard to roll back.
  • It's possible for test and production environments to drift apart, which leads to problems in production that were not detected during testing.
  • You cannot easily scale horizontally by adding more instances of the application.
What is "Containerization"?

"Containerizing" an application is the process of making it able to run and deploy under Docker containers and similar technologies that encapsulate an application with its operating system environment (a full system image). Since containers provide the application with an environment very similar to having full control of a system, this is a way to begin modernizing the deployment of the application while making minimal or no changes to the application itself. This provides a basis for incrementally making the application's architecture more "cloud-friendly."

Benefits of Containerization
  • Deployment becomes much easier: replacing the whole container image with a new one.
  • It's relatively easy to automate deployments, even having them driven completely from a CI (continuous integration) system.
  • Rolling back a bad deployment is just a matter of switching back to the previous image.
  • It's very easy to automate application updates since there are no "intermediate state" steps that can fail (either the whole deployment succeeds, or it all fails).
  • The same container image can be tested in a separate test environment, and then deployed to the production environment. You can be sure that what you tested is exactly the same as what is running in production.
  • Recovering a failed system is much easier, since a new container with exactly the same application can be automatically spun up on new hardware and attached to the same data stores.
  • Developers can also run containers locally to test their work in progress in a realistic environment.
  • Hardware can be used more efficiently, by running multiple containerized applications on a single host that ordinarily could not easily share a single system.
  • Containerizing is a good first step toward supporting no-downtime upgrades, canary deployments, high availability, and horizontal scaling.
Alternatives to containerization
  • Configuration management tools like Puppet and Chef help with some of the "legacy" issues such as keeping environments consistent, but they do not support the "atomic" deployment or rollback of the entire environment and application at once. This can still go wrong partway through a deployment with no easy way to roll everything back.

  • Virtual machine images are another way to achieve many of the same goals, and there are cases where it makes more sense to do the "atomic" deployment operations using entire VMs rather than containers running on a host. The main disadvantage is that hardware utilization may be less efficient, since VMs need dedicated resources (CPU, RAM, disk), whereas containers can share a single host's resources between them.

How to containerize

Preparation

Identify filesystem locations where persistent data is written

Since deploying a new version of the application is performed by replacing the Docker image, any persistent data must be stored outside of the container. If you're lucky, the application already writes all its data to a specific path, but many legacy applications spread their data all over the filesystem and intermingle it with the application itself. Either way, Docker's volume mounts let us expose the host's filesystem to specific locations in the container filesystem so that data survives between containers, so we must identify the locations to persist.

You may at this stage consider modifying the application to support writing all data within a single tree in the filesystem, as that will simplify deployment of the containerized version. However, this is not necessary if modifying the application is impractical.

Identify configuration files and values that will vary by environment

Since a single image should be usable in multiple environments (e.g. test and production) to ensure consistency, any configuration values that will vary by environment must be identified so that the container can be configured at startup time. These could take the form of environment variables, or of values within one or more configuration files.

You may at this stage want to consider modifying the application to support reading all configuration from environment variables, as that will simplify containerizing it. However, this is not necessary if modifying the application is impractical.

Identify services that can be easily externalized

The application may use some services running on the local machine that are easy to externalize due to being highly independent and supporting communication by TCP/IP. For example, if you run a database such as MySQL or PostgreSQL or a cache such as Redis on the local system, that should be easy to run externally. You may need to adjust configuration to support specifying a hostname and port rather than assuming the service can be reached on localhost.

Creating the image

Create a Dockerfile that installs the application

If you already have the installation process automated via scripts or using a configuration management tool such as Chef or Puppet, this should be relatively easy. Start with an image of your preferred operating system, install any prerequisites, and then run the scripts.

If the current setup process is more manual, this will involve some new scripting. But since the exact state of the image is known, it's easier to script the process than it would be when you have to deal with the potentially inconsistent state of a raw system.

If you identified externalizable services earlier, you should modify the scripts to not install them.

A simple example Dockerfile:

# Start with an official Ubuntu 16.04 Docker image
FROM ubuntu:16.04

# Install prerequisite Ubuntu packages
RUN apt-get update \
    && apt-get install -y <REQUIRED UBUNTU PACKAGES> \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Copy the application into the image
ADD . /app

# Run the app setup script
RUN /app/

# Switch to the application directory
WORKDIR /app

# Specify the application startup script
CMD /app/start.sh

Startup script for configuration

If the application takes all its configuration as environment variables already, then you don't need to do anything. However, if you have environment-dependent configuration values in configuration files, you will need to create an application startup script that reads these values from environment variables and then updates the configuration files.

A simple example startup script:

#!/usr/bin/env bash
set -e

# Append to the config file using the $MYAPPCONFIG environment variable.
cat >>/app/config.txt <<END
my_app_config = "${MYAPPCONFIG}"
END

# Run the application using the $MYAPPARG environment variable for an argument.
/app/bin/my-app --my-arg="${MYAPPARG}"

Push the image

After building the image (using docker build), it must be pushed to a Docker Registry so that it can be pulled on the machine where it will deployed (if you are running on the same machine as the image was built on, then this is not necessary).

You can use Docker Hub for images (a paid account lets you create private image repositories), or most cloud providers also provide their own container registries (e.g. Amazon ECR).

Give the image a tag (e.g. docker tag myimage mycompany/myimage:mytag) and then push it (e.g. docker push mycompany/myimage:mytag). Each image for a version of the application should have a unique tag, so that you always know which version you're using and so that images for older versions are available to roll back to.

How to deploy

Deploying containers is a big topic, and this section just focuses on directly running containers using docker commands. Tools like docker-compose (for simple cases where all containers run on a single server) and Kubernetes (for container orchestration across a cluster) should be considered in real-world usage.

Externalized services

Services you identified for externalization earlier can be run in separate Docker containers that will be linked to the main application. Alternatively, it is often easiest to outsource to managed services. For example, if you are using AWS, using RDS for a database or Elasticache for a cache significantly simplifies your life since they take care of maintenance, high availability, and backups for you.

An example of running a Postgres database container:

docker run \
    -d \
    --name db \
    -v /usr/local/var/docker/volumes/postgresql/data:/var/lib/postgresql/data \
    postgres

The application

To run the application in a Docker container, you use a command-line such as this:

docker run \
    -d \
    -p 8080:80 \
    --name myapp \
    -v /usr/local/var/docker/volumes/myappdata:/var/lib/myappdata \
    -e MYAPPCONFIG=myvalue \
    -e MYAPPARG=myarg \
    --link db:db \
    myappimage:mytag

The -p argument exposes the container's port 80 on the host's port 8080, the -v argument sets up the volume mount for persistent data (in the hostpath:containerpath format), the -e argument sets a configuration environment variable (these may both be repeated for additional volumes and variables), and the --link argument links the database container so the application can communicate with it. The container will be started with the startup script you specified in the Dockerfile's CMD.


To upgrade to a new version of the application, stop the old container (e.g., docker rm -f myapp) and start a new one with the new image tag (this will require a brief down time). Rolling back is similar, except that you use the old image tag.

Additional considerations

"init" process (PID 1)

Legacy applications often run multiple processes, and it's not uncommon for orphan processes to accumulate if there is no "init" (PID 1) daemon to clean them up. Docker does not, by default, provide such a daemon, so it's recommended to add one as the ENTRYPOINT in your Dockerfile. dumb-init is an example lightweight init daemon, among others. phusion/baseimage is a fully-featured base image that includes an init daemon in addition to other services.

See our blog post dedicated to this topic: Docker demons: PID-1, orphans, zombies, and signals.
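As a sketch, adding an init daemon to the earlier Dockerfile could look like this (the dumb-init version and download URL are assumptions for illustration; check the project's releases page):

```dockerfile
FROM ubuntu:16.04

# Fetch a static dumb-init binary (version pinned here only for illustration)
ADD https://github.com/Yelp/dumb-init/releases/download/v1.2.0/dumb-init_1.2.0_amd64 /usr/local/bin/dumb-init
RUN chmod +x /usr/local/bin/dumb-init

# dumb-init runs as PID 1: it forwards signals to the app and reaps orphans
ENTRYPOINT ["/usr/local/bin/dumb-init", "--"]
CMD ["/app/start.sh"]
```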

Daemons and cron jobs

The usual way to use Docker containers is to have a single process per container. Ideally, any cron jobs and daemons can be externalized into separate containers, but this is not always possible in legacy applications without re-architecting them. There is no intrinsic reason why containers cannot run many processes, but it does require some extra setup since standard base images do not include process managers and schedulers. Minimal process supervisors, such as runit, are more appropriate to use in containers than full-fledged systems like systemd. phusion/baseimage is a fully-featured base image that includes runit and cron, in addition to other services.

Volume-mount permissions

It's common (though not necessarily recommended) to run all processes in containers as the root user. Legacy applications often have more complex user requirements, and may need to run as a different user (or multiple processes as multiple users). This can present a challenge when using volume mounts, because Docker makes the mount points owned by root by default, which means non-root processes will not be able to write to them. There are two ways to deal with this.

The first approach is to create the directories on the host first, owned by the correct UID/GID, before starting the container. Note that since the container and host's users don't match up, you have to be careful to use the same UID/GID as the container, and not merely the same usernames.
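For example, the host-side preparation might look like the following (the path and the UID/GID 1000:1000 are assumptions; use whatever numeric IDs the container's application user actually has):

```shell
# Pre-create the directory that will back the volume mount, owned by
# the container user's *numeric* IDs -- usernames inside and outside
# the container generally don't line up.
DATA_DIR="${DATA_DIR:-/tmp/myappdata}"
mkdir -p "$DATA_DIR"
chown 1000:1000 "$DATA_DIR" 2>/dev/null || echo "re-run as root to change ownership"
ls -ldn "$DATA_DIR"
```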

The other approach is for the container itself to adjust the ownership of the mount points during its startup. This has to happen while running as root, before switching to a non-root user to start the application.

Database migrations

Database schema migrations always present a challenge for deployments, because the database schema can be very tightly coupled with the application, and that makes controlling the timing of the migration important, as well as making rolling back to an older version of the application more difficult since database migrations can't always be rolled back easily.

A way to mitigate this is to take a staged approach to migrations: when you need to make an incompatible schema change, split that change over two application deployments. For example, if you want to move a piece of data from one location to another, these would be the phases:

  1. Write the data to both the old and new locations, and read it from the new location. This means that if you roll the application back to the previous version, any new data is still where it expects to find it.

  2. Stop writing it to the old location.

Note that if you want to have deployments with no downtime, that means running multiple versions of the application at the same time, which makes this even more of a challenge.

Backing up data

Backing up a containerized application is usually easier than backing up a non-containerized deployment. Data files can be backed up from the host, and you don't risk intermingling data files with application files because they are strictly separated. If you've moved databases to managed services such as RDS, those can take care of backups for you (at least if your needs are relatively simple).
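As a sketch, with the data in a plain host directory, a backup can be as simple as archiving that directory from the host (the path is illustrative):

```shell
# Archive the host-side data directory that the container mounts;
# the container itself is disposable and needs no backup.
DATA_DIR="${DATA_DIR:-/tmp/myappdata}"
BACKUP="/tmp/myappdata-backup.tar.gz"
mkdir -p "$DATA_DIR"
tar -czf "$BACKUP" -C "$(dirname "$DATA_DIR")" "$(basename "$DATA_DIR")"
ls -l "$BACKUP"
```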

Migrating existing data

To transition the production application to the new containerized version, you will need to migrate the old deployment's data. How to do this will vary, but usually the simplest is to stop the old deployment, back up all the data, and restore it to the new deployment. This should be practiced in advance, and will necessitate some down time.


While it requires some up-front work, containerizing a legacy application will help you get control of, automate, and minimize the stress of deploying it. It sets you on a path toward modernizing your application and supporting no-downtime deployments, high availability, and horizontal scaling.

FP Complete has undertaken this process many times in addition to building containerized applications from the ground up. If you'd like to get on the path to modern and stress-free deployment of your applications, you can learn more about our Devops and Consulting services, or contact us straight away!


Categories: Offsite Blogs

The complexity of abstract machines

Lambda the Ultimate - Wed, 01/11/2017 - 7:09pm

I previously wrote about a strand of research by Guy Blelloch on the Cost semantics for functional languages, which lets us make precise claims about the complexity of functional programs without leaving their usual and comfortable programming models (beta-reduction).

While the complexity behavior of weak reduction strategies, such as call-by-value and call-by-name, is by now relatively well-understood, the lambda-calculus has a much richer range of reduction strategies, in particular those that can reduce under lambda-abstractions, whose complexity behavior is noticeably more subtle and was, until recently, not very well understood. (This has become a practical concern since the rise in usage of proof assistants that must implement reduction under binders and are very concerned about the complexity of their reduction strategy, which consumes a lot of time during type/proof-checking.)

Beniamino Accatoli, who has been co-authoring a lot of work in that area, recently published on arXiv a new paper that has survey quality, and is a good introduction to this area of work and other pointers from the literature.

The Complexity of Abstract Machines

Beniamino Accatoli, 2017

The lambda-calculus is a peculiar computational model whose definition does not come with a notion of machine. Unsurprisingly, implementations of the lambda-calculus have been studied for decades. Abstract machines are implementations schema for fixed evaluation strategies that are a compromise between theory and practice: they are concrete enough to provide a notion of machine and abstract enough to avoid the many intricacies of actual implementations. There is an extensive literature about abstract machines for the lambda-calculus, and yet -- quite mysteriously -- the efficiency of these machines with respect to the strategy that they implement has almost never been studied.

This paper provides an unusual introduction to abstract machines, based on the complexity of their overhead with respect to the length of the implemented strategies. It is conceived to be a tutorial, focusing on the case study of implementing the weak head (call-by-name) strategy, and yet it is an original re-elaboration of known results. Moreover, some of the observations contained here never appeared in print before.

Categories: Offsite Discussion

Toby Goodwin: Artificial Superintelligence

Planet Haskell - Wed, 01/11/2017 - 3:50pm

I like Tim Urban's Wait But Why? site (tagline: new post every sometimes). But I thought his article on The AI Revolution - The Road to Superintelligence was dead wrong.

Vaguely at the back of my mind was that I ought to try to write some kind of rebuttal, but it would have taken a lot of time (which I don't have) to research the topic properly, and write it up.

So I was delighted to come across Superintelligence - The Idea That Eats Smart People which does a far better job than I ever could have done.

I recommend reading both of them.

Categories: Offsite Blogs

The GHC Team: GHC 8.0.2 is available!

Planet Haskell - Wed, 01/11/2017 - 12:40pm

The GHC team is happy to at last announce the 8.0.2 release of the Glasgow Haskell Compiler. Source and binary distributions are available at


This is the second release of the 8.0 series and fixes nearly two-hundred bugs. These include,

  • Interface file build determinism (#4012).
  • Compatibility with macOS Sierra and GCC compilers which compile position-independent executables by default
  • Compatibility with systems which use the gold linker
  • Runtime linker fixes on Windows (see #12797)
  • A compiler bug which resulted in undefined reference errors while compiling some packages (see #12076)
  • A number of memory consistency bugs in the runtime system
  • A number of efficiency issues in the threaded runtime which manifest on larger core counts and large numbers of bound threads.
  • A typechecker bug which caused some programs using -XDefaultSignatures to be incorrectly accepted.
  • More than two-hundred other bugs. See ​Trac for a complete listing.
  • #12757, which led to broken runtime behavior and even crashes in the presence of primitive strings.
  • #12844, a type inference issue affecting partial type signatures.
  • A bump of the directory library, fixing buggy path canonicalization behavior (#12894). Unfortunately this required a major version bump in directory and minor bumps in several other libraries.
  • #12912, where use of the select system call would lead to runtime system failures with large numbers of open file handles.
  • #10635, wherein -Wredundant-constraints was included in the -Wall warning set

A more detailed list of the changes included in this release can be found in the ​release notes.

Please note that this release breaks with our usual tendency to avoid major version bumps of core libraries in minor GHC releases by including an upgrade of the directory library to

Also note that, due to a rather serious bug (#13100) affecting Windows noticed late in the release cycle, the Windows binary distributions were produced using a slightly ​patched source tree. Users compiling from source for Windows should be certain to include this patch in their build.

This release is the result of six months of effort by the GHC development community. We'd like to thank everyone who has contributed code, bug reports, and feedback to this release. It's only due to their efforts that GHC remains a vibrant and exciting project.

How to get it

Both the source tarball and binary distributions for a wide variety of platforms are available here.


Haskell is a standardized lazy functional programming language.

The Glasgow Haskell Compiler (GHC) is a state-of-the-art programming suite for Haskell. Included is an optimising compiler generating efficient code for a variety of platforms, together with an interactive system for convenient, quick development. The distribution includes space and time profiling facilities, a large collection of libraries, and support for various language extensions, including concurrency, exceptions, and foreign language interfaces. GHC is distributed under a BSD-style open source license.

Supported Platforms

The list of platforms we support, and the people responsible for them, can be found on the GHC wiki

Ports to other platforms are possible with varying degrees of difficulty. The Building Guide describes how to go about porting to a new platform.


We welcome new contributors. Instructions on getting started with hacking on GHC are available from GHC's ​developer site.

Community Resources

There are mailing lists for GHC users, developers, and monitoring bug tracker activity; to subscribe, use the Mailman web interface.

There are several other Haskell and GHC-related mailing lists; for the full list, see the lists page.

Some GHC developers hang out on the #ghc and #haskell channels of the Freenode IRC network, too. See the Haskell wiki for details.

Please report bugs using our bug tracking system. Instructions on reporting bugs can be found here.

Categories: Offsite Blogs

Christopher Done: Fast Haskell: Competing with C at parsing XML

Planet Haskell - Tue, 01/10/2017 - 6:00pm

In this post we’re going to look at parsing XML in Haskell, how it compares with an efficient C parser, and steps you can take in Haskell to build a fast library from the ground up. We’re going to get fairly detailed and get our hands dirty.

A new kid on the block

A few weeks ago Neil Mitchell posted a blog post about a new XML library that he’d written. The parser is written in C, and the API is written in Haskell which uses the C library. He writes that it’s very fast:

Hexml has been designed for speed. In the very limited benchmarks I’ve done it is typically just over 2x faster at parsing than Pugixml, where Pugixml is the gold standard for fast XML DOM parsers. In my uses it has turned XML parsing from a bottleneck to an irrelevance, so it works for me.

In order to achieve that speed, he cheats by not performing operations he doesn’t care about:

To gain that speed, Hexml cheats. Primarily it doesn’t do entity expansion, so &amp; remains as &amp; in the output. It also doesn’t handle CData sections (but that’s because I’m lazy) and comment locations are not remembered. It also doesn’t deal with most of the XML standard, ignoring the DOCTYPE stuff. [..] I only work on UTF8, which for the bits of UTF8 I care about, is the same as ASCII - I don’t need to do any character decoding.

Cheating is fine when you describe in detail how you cheat. That’s just changing the rules of the game!

But C has problems

This post caught my attention because it seemed to me a pity to use C. Whether you use Haskell, Python, or whatever, there are a few problems with dropping down to C from your high-level language:

  • The program is more likely to segfault. I’ll take an exception over a segfault any day!
  • The program opens itself up to possible exploitation due to lack of memory safety.
  • If people want to extend your software, they have to use C, and not your high-level language.
  • Portability (i.e. Windows) is a pain in the arse with C.

Sure enough, it wasn’t long before Austin Seipp posted a rundown of bugs in the C code:

At the moment, sorry to say – I wouldn’t use this library to parse any arbitrary XML, since it could be considered hostile, and get me owned. Using American Fuzzy Lop, just after a few minutes, I’ve already found around ~30 unique crashes.

But C is really fast right? Like 100s of times faster than Haskell! It’s worth the risk.

But-but C is fast!

Let’s benchmark it. We’re going to parse a 4KB, a 31KB and a 211KB XML file.

Using the Criterion benchmarking package, we can compare Hexml against the pretty old Haskell xml package…

File    hexml     xml
4KB     6.26 μs   1.94 ms (1940 μs)
31KB    9.41 μs   13.6 ms (13600 μs)
211KB   260 μs    25.9 ms (25900 μs)

Ouch! Those numbers don’t look good. The xml package is 100-300x slower.

Okay, I’m being unfair. The xml package isn’t known for speed. Its package description is simply A simple XML library. Let’s compare with the hexpat package. That one has this in its description:

The design goals are speed, speed, speed, interface simplicity and modularity.

So that’s probably more representing the best in Haskell XML parsers. It’s also based on the C expat library, which is supposed to be fast.

File    hexml      hexpat
4KB     6.395 μs   320.3 μs
31KB    9.474 μs   378.3 μs
211KB   256.2 μs   25.68 ms

That’s a bit better. We’re now between 40-100x slower than Hexml. I’d prefer 10x slower, but it’s a more reasonable outcome. The hexpat package handles: keeping location information, reasonable parse errors, the complete XML standard. Hexml doesn’t do any of that.

Let’s set us a challenge. Can we match or beat the Hexml package in plain old Haskell? This is an itch that got under my skin. I emailed Neil and he was fine with it:

I don’t think it’s unfair or attacky to use Hexml as the baseline - I’d welcome it!

I’ll walk you through my approach. I called my library Xeno (for obvious reasons).

Start with the simplest thing possible

…and make sure it’s fast. Here’s the first thing I wrote, to see how fast it was to walk across a file compared with Hexml.

module Xeno (parse) where

import           Data.ByteString (ByteString)
import qualified Data.ByteString as S
import           Data.Word

-- | Parse an XML document.
parse :: ByteString -> ()
parse str = parseTags 0
  where
    parseTags index =
      case elemIndexFrom 60 str index of
        Nothing -> ()
        Just fromLt ->
          case elemIndexFrom 62 str fromLt of
            Nothing -> ()
            Just fromGt -> parseTags (fromGt + 1)

-- | Get index of an element starting from offset.
elemIndexFrom :: Word8 -> ByteString -> Int -> Maybe Int
elemIndexFrom c str offset = fmap (+ offset) (S.elemIndex c (S.drop offset str))
{-# INLINE elemIndexFrom #-}

The numbers 60 and 62 are < and >. In XML the only characters that matter are < and > (if you don’t care about entities). < and > can’t appear inside speech marks (attributes). They are the only important things to search for. Results:

File    hexml      xeno
4KB     6.395 μs   2.630 μs
42KB    37.55 μs   7.814 μs

So the baseline performance of walking across the file in jumps is quite fast! Why is it fast? Let’s look at that for a minute:

  • The ByteString data type is a safe wrapper around a vector of bytes. It’s underneath equivalent to char* in C.
  • With that in mind, the S.elemIndex function is implemented using the standard C function memchr(3). As we all know, memchr jumps across your file in large word boundaries or even using SIMD operations, meaning it’s bloody fast. But the elemIndex function itself is safe.

So we’re effectively doing a for(..) { s=memchr(s,..) } loop over the file.
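To make that loop shape concrete, here is a self-contained sketch (names like countOpenAngles are mine, not from the post): it counts '<' bytes by repeatedly jumping with S.elemIndex on the remaining suffix, which is exactly the memchr-hopping pattern described above.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Data.ByteString (ByteString)
import qualified Data.ByteString as S
import           Data.Word (Word8)

-- Offset-preserving elemIndex, as in the parser skeleton above.
elemIndexFrom :: Word8 -> ByteString -> Int -> Maybe Int
elemIndexFrom c str offset = fmap (+ offset) (S.elemIndex c (S.drop offset str))

-- Count occurrences of '<' (byte 60) by jumping from hit to hit;
-- each S.elemIndex call bottoms out in memchr(3).
countOpenAngles :: ByteString -> Int
countOpenAngles str = go 0 0
  where
    go acc i =
      case elemIndexFrom 60 str i of
        Nothing -> acc
        Just j  -> go (acc + 1) (j + 1)
```

For example, countOpenAngles "&lt;a&gt;&lt;b&gt;&lt;/b&gt;&lt;/a&gt;" counts four '<' bytes without ever materialising an intermediate string.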

Keep an eye on the allocations

Using the weigh package for memory allocation tracking, we can also look at allocations of our code right now:

Case         Bytes  GCs  Check
4kb parse    1,168  0    OK
42kb parse   1,560  0    OK
52kb parse   1,168  0    OK
182kb parse  1,168  0    OK

We see that it’s constant. Okay, it varies by a few bytes, but it doesn’t increase linearly or anything. That’s good! One thing stood out to me, though: didn’t we pay for allocation of the Maybe values? For 1000 < and > characters, we should have 1000 allocations of Just/Nothing. Let’s go down that rabbit hole for a second.

Looking at the Core

Well, if you compile the source like this

stack ghc -- -O2 -ddump-simpl Xeno.hs

You’ll see a dump of the real Core code that is generated after the Haskell code is desugared, and before it’s compiled to machine code. At this stage you can already see optimizations based on inlining, common-sub-expression elimination, deforestation, and other things.

The output is rather large. Core is verbose, and fast code tends to be longer. Here is the output, but you don’t have to understand it. Just note that there’s no mention of Maybe, Just or Nothing in there. It skips that altogether. See here specifically. There is a call to memchr, then there is an eqAddr comparison with NULL, to see whether the memchr is done or not. But we’re still doing safety checks so that the resulting code is safe.

Inlining counts

The curious reader might have noticed that INLINE line in my first code sample.

{-# INLINE elemIndexFrom #-}

Without the INLINE, the whole function is twice as slow and has linear allocation.

With INLINE:

Case        Bytes  GCs  Check
4kb parse   1,472  0    OK
42kb parse  1,160  0    OK
52kb parse  1,160  0    OK

benchmarking 4KB/xeno
time 2.512 μs (2.477 μs .. 2.545 μs)
benchmarking 211KB/xeno
time 129.9 μs (128.7 μs .. 131.2 μs)
benchmarking 31KB/xeno
time 1.930 μs (1.909 μs .. 1.958 μs)


Without INLINE:

Case        Bytes   GCs  Check
4kb parse   12,416  0    OK
42kb parse  30,080  0    OK
52kb parse  46,208  0    OK

benchmarking 4KB/xeno
time 5.258 μs (5.249 μs .. 5.266 μs)
benchmarking 211KB/xeno
time 265.9 μs (262.4 μs .. 271.4 μs)
benchmarking 31KB/xeno
time 3.212 μs (3.209 μs .. 3.218 μs)

Always pay attention to things like this. You don’t want to put INLINE on everything: sometimes it causes a slowdown, and most of the time it makes no difference. So check with your benchmark suite.

Loop unrolling manually

Some things need to be done manually. I added comment parsing to our little function:

+        Just fromLt -> checkOpenComment (fromLt + 1)
+    checkOpenComment index =
+      if S.isPrefixOf "!--" (S.drop index str)
+        then findCommentEnd (index + 3)
+        else findLt index
+    findCommentEnd index =
+      case elemIndexFrom commentChar str index of
+        Nothing -> () -- error!
+        Just fromDash ->
+          if S.isPrefixOf "->" (S.drop (fromDash + 1) str)
+            then findGt (fromDash + 2)
+            else findCommentEnd (fromDash + 1)

And it became 2x slower:

benchmarking 4KB/xeno time 2.512 μs (2.477 μs .. 2.545 μs)


benchmarking 4KB/xeno time 4.296 μs (4.240 μs .. 4.348 μs)

So I changed the S.isPrefixOf to be unrolled to S.index calls, like this:

-      if S.isPrefixOf "!--" (S.drop index str)
-        then findCommentEnd (index + 3)
-        else findLt index
+      if S.index this 0 == bangChar &&
+         S.index this 1 == commentChar &&
+         S.index this 2 == commentChar
+        then findCommentEnd (index + 3)
+        else findLt index
+      where
+        this = S.drop index str

And it dropped back down to our base speed again.
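As a sanity check that the unrolled form really is equivalent, here is a minimal sketch putting the two checks side by side (prefixCheck/unrolledCheck are my names; 33 and 45 are the byte values of '!' and '-', matching the post's bangChar and commentChar):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Data.ByteString (ByteString)
import qualified Data.ByteString as S

prefixCheck, unrolledCheck :: ByteString -> Int -> Bool

-- The original: does the string at this offset start with "!--"?
prefixCheck str index = S.isPrefixOf "!--" (S.drop index str)

-- The manually unrolled version: three direct byte reads.
-- NB: S.index is partial; this assumes at least the first byte is in
-- bounds, and relies on (&&) short-circuiting for the rest.
unrolledCheck str index =
  S.index this 0 == 33 &&   -- '!'
  S.index this 1 == 45 &&   -- '-'
  S.index this 2 == 45      -- '-'
  where
    this = S.drop index str
```

Both agree on comment openers and non-comments; the difference is purely that the unrolled form avoids constructing a prefix comparison loop.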

Finding tag names

I implemented finding tag names like this:

+    findTagName index0 =
+      case S.findIndex (not . isTagName) (S.drop index str) of
+        Nothing -> error "Couldn't find end of tag name."
+        Just ((+ index) -> spaceOrCloseTag) ->
+          if S.head this == closeTagChar
+            then findGt spaceOrCloseTag
+            else if S.head this == spaceChar
+                   then findLt spaceOrCloseTag
+                   else error
+                          ("Expecting space or closing '>' after tag name, but got: " ++
+                           show this)
+          where this = S.drop spaceOrCloseTag str
+      where
+        index =
+          if S.head (S.drop index0 str) == questionChar ||
+             S.head (S.drop index0 str) == slashChar
+            then index0 + 1
+            else index0

And immediately noticed a big slow down. From

Case        Bytes  GCs  Check
4kb parse   1,160  0    OK
42kb parse  1,472  0    OK
52kb parse  1,160  0    OK
Benchmark xeno-memory-bench: FINISH
Benchmark xeno-speed-bench: RUNNING...
benchmarking 4KB/hexml
time 6.149 μs (6.125 μs .. 6.183 μs)
benchmarking 4KB/xeno
time 2.691 μs (2.665 μs .. 2.712 μs)


Case        Bytes    GCs  Check
4kb parse   26,096   0    OK
42kb parse  65,696   0    OK
52kb parse  102,128  0    OK
Benchmark xeno-memory-bench: FINISH
Benchmark xeno-speed-bench: RUNNING...
benchmarking 4KB/hexml
time 6.225 μs (6.178 μs .. 6.269 μs)
benchmarking 4KB/xeno
time 10.34 μs (10.06 μs .. 10.59 μs)

The first thing that should jump out at you is the allocations. What’s going on there? I looked in the profiler output, by running stack bench --profile to see a profile output.

Wed Jan 11 17:41 2017 Time and Allocation Profiling Report (Final)

   xeno-speed-bench +RTS -N -p -RTS 4KB/xeno

total time  = 8.09 secs (8085 ticks @ 1000 us, 1 processor)
total alloc = 6,075,628,752 bytes (excludes profiling overheads)

COST CENTRE             MODULE                             %time %alloc
parse.findTagName       Xeno                                35.8   72.7
getOverhead             Criterion.Monad                     13.6    0.0
parse.checkOpenComment  Xeno                                 9.9    0.0
parse.findLT            Xeno                                 8.9    0.0
parse                   Xeno                                 8.4    0.0
>>=                     Data.Vector.Fusion.Util              4.6    7.7
getGCStats              Criterion.Measurement                2.8    0.0
basicUnsafeIndexM       Data.Vector.Primitive                1.6    2.0
fmap                    Data.Vector.Fusion.Stream.Monadic    1.3    2.2
rSquare.p               Statistics.Regression                1.3    1.5
basicUnsafeWrite        Data.Vector.Primitive.Mutable        1.2    1.4
innerProduct.\          Statistics.Matrix.Algorithms         1.0    1.6
qr.\.\                  Statistics.Matrix.Algorithms         0.8    1.2
basicUnsafeSlice        Data.Vector.Primitive.Mutable        0.5    1.1
transpose               Statistics.Matrix                    0.5    1.3

Right at the top, we have findTagName, doing all the allocations. So I looked at the code, and found that the only possible thing that could be allocating, is S.drop. This function skips n elements at the start of a ByteString. It turns out that S.head (S.drop index0 str) was allocating an intermediate string, just to get the first character of that string. It wasn’t copying the whole string, but it was making a new pointer to it.

So I realised that I could just replace S.head (S.drop n s) with S.index s n:

-          if S.head this == closeTagChar
+          if S.index str spaceOrCloseTag == closeTagChar
             then findLT spaceOrCloseTag
-            else if S.head this == spaceChar
+            else if S.index str spaceOrCloseTag == spaceChar
                    then findGT spaceOrCloseTag
                    else error "Expecting space or closing '>' after tag name."
-          where this = S.drop spaceOrCloseTag str
       where
         index =
-          if S.head (S.drop index0 str) == questionChar ||
-             S.head (S.drop index0 str) == slashChar
+          if S.index str index0 == questionChar ||
+             S.index str index0 == slashChar

And sure enough, the allocations disappeared:

Case        Bytes  GCs  Check
4kb parse   1,160  0    OK
42kb parse  1,160  0    OK
52kb parse  1,472  0    OK
Benchmark xeno-memory-bench: FINISH
Benchmark xeno-speed-bench: RUNNING...
benchmarking 4KB/hexml
time 6.190 μs (6.159 μs .. 6.230 μs)
benchmarking 4KB/xeno
time 4.215 μs (4.175 μs .. 4.247 μs)

Down to 4.215 μs. That’s not as fast as our pre-name-parsing 2.691 μs. But we had to pay something for the extra operations per tag. We’re just not allocating anymore, which is great.
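The rewrite is safe because the two expressions are observably identical for any in-bounds index; the only difference is the intermediate ByteString that S.drop allocates. A minimal sketch of the identity (headDrop/indexAt are my names):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Data.ByteString (ByteString)
import qualified Data.ByteString as S
import           Data.Word (Word8)

headDrop, indexAt :: ByteString -> Int -> Word8

-- The original form: allocates a fresh ByteString header pointing into
-- the same buffer, just to read its first byte.
headDrop s n = S.head (S.drop n s)

-- The replacement: reads the byte at offset n directly.
indexAt s n = S.index s n
```

The two agree byte-for-byte over any valid offset, so the substitution changes allocation behaviour without changing results.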

SAX for free

Eventually I ended up with a function called process that parses XML and triggers events in a SAX style:

process
  :: Monad m
  => (ByteString -> m ())               -- ^ Open tag.
  -> (ByteString -> ByteString -> m ()) -- ^ Tag attribute.
  -> (ByteString -> m ())               -- ^ End open tag.
  -> (ByteString -> m ())               -- ^ Text.
  -> (ByteString -> m ())               -- ^ Close tag.
  -> ByteString
  -> m ()

Thanks again to GHC’s optimizations, calling this function purely and doing nothing is exactly equal to the function before SAX-ization:

-- | Parse the XML but return no result, process no events.
validate :: ByteString -> Bool
validate s =
  case spork
         (runIdentity
            (process
               (\_ -> pure ())
               (\_ _ -> pure ())
               (\_ -> pure ())
               (\_ -> pure ())
               (\_ -> pure ())
               s)) of
    Left (_ :: XenoException) -> False
    Right _ -> True

Case        Bytes  GCs  Check
4kb parse   1,472  0    OK
42kb parse  1,160  0    OK
52kb parse  1,472  0    OK

benchmarking 4KB/xeno
time 4.320 μs (4.282 μs .. 4.361 μs)

This function performs at the same speed as process before it accepted any callback arguments. This means that the only overhead to SAX’ing will be the activities that the callback functions themselves do.
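To illustrate the SAX shape without pulling in Xeno itself, here is a drastically simplified, hypothetical stand-in: it takes a single callback (rather than process’s five), finds '<'..'>' pairs the same way the earlier skeleton did, and fires the callback with whatever sits between them. It does none of the real parser’s comment or attribute handling.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Data.ByteString (ByteString)
import qualified Data.ByteString as S
import           Data.IORef
import           Data.Word (Word8)

-- Toy stand-in for the SAX idea (NOT Xeno's real process): scan for
-- '<'..'>' pairs and fire a callback with the tag text between them.
processTags :: Monad m => (ByteString -> m ()) -> ByteString -> m ()
processTags tagF str = go 0
  where
    go i =
      case elemIndexFrom 60 str i of
        Nothing -> pure ()
        Just lt ->
          case elemIndexFrom 62 str (lt + 1) of
            Nothing -> pure ()
            Just gt -> do
              -- take the name up to a space or '>'
              tagF (S.takeWhile (\w -> w /= 32 && w /= 62) (S.drop (lt + 1) str))
              go (gt + 1)
    elemIndexFrom :: Word8 -> ByteString -> Int -> Maybe Int
    elemIndexFrom c s off = fmap (+ off) (S.elemIndex c (S.drop off s))

-- One possible consumer: collect every tag name via an IORef.
collectTags :: ByteString -> IO [ByteString]
collectTags str = do
  ref <- newIORef []
  processTags (\t -> modifyIORef' ref (t :)) str
  reverse <$> readIORef ref
```

Because processTags is polymorphic in the monad, the same scanning code serves a do-nothing Identity consumer (as in validate) or an effectful one, which is the point being made above: the callbacks are the only overhead.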

Specialization is for insects (and, as it happens, optimized programs)

One point of interest is that adding a SPECIALIZE pragma for the process function increases speed by roughly 1 μs. Specialization means that for a given function which is generic (type-class polymorphic), which means it will accept a dictionary argument at runtime for the particular instance, instead we will generate a separate piece of code that is specialized on that exact instance. Below is the Identity monad’s (i.e. just pure, does nothing) specialized type for process.

{-# SPECIALISE process ::
                 (ByteString -> Identity ())
              -> (ByteString -> ByteString -> Identity ())
              -> (ByteString -> Identity ())
              -> (ByteString -> Identity ())
              -> (ByteString -> Identity ())
              -> ByteString
              -> Identity () #-}


Without SPECIALISE:

benchmarking 4KB/xeno-sax
time 5.877 μs (5.837 μs .. 5.926 μs)
benchmarking 211KB/xeno-sax
time 285.8 μs (284.7 μs .. 287.4 μs)


With SPECIALISE:

benchmarking 4KB/xeno-sax
time 5.046 μs (5.036 μs .. 5.056 μs)
benchmarking 211KB/xeno-sax
time 240.6 μs (240.0 μs .. 241.5 μs)

In the 4KB case it’s only 800 ns, but as we say in Britain, take care of the pennies and the pounds will look after themselves. The 240->285 difference isn’t big in practical terms, but when we’re playing the speed game, we pay attention to things like that.

Where we stand: Xeno vs Hexml

Currently the SAX interface in Xeno outperforms Hexml in space and time. Hurrah! We’re as fast as C!

File    hexml-dom  xeno-sax
4KB     6.134 μs   5.147 μs
31KB    9.299 μs   2.879 μs
211KB   257.3 μs   241.0 μs

It’s also worth noting that Haskell does this all safely. All the functions I’m using are standard ByteString functions which do bounds checking and throw an exception if so. We don’t accidentally access memory that we shouldn’t, and we don’t segfault. The server keeps running.

If you’re interested, if we switch to unsafe functions (unsafeTake, unsafeIndex from the Data.ByteString.Unsafe module), we get a notable speed increase:

File    hexml-dom  xeno-sax
4KB     6.134 μs   4.344 μs
31KB    9.299 μs   2.570 μs
211KB   257.3 μs   206.9 μs

We don’t need to show off, though. We’ve already made our point. We’re Haskellers, we like safety. I’ll keep my safe functions.

But Hexml does more!

I’d be remiss if I didn’t address the fact that Hexml does more useful things than we’ve done here. Hexml allocates a DOM for random access. Oh no! Allocation: Haskell’s worst enemy!

We’ve seen that Haskell allocates a lot normally. Actually, have we looked at that properly?

Case               Bytes       GCs  Check
4kb/hexpat-sax     444,176     0    OK
31kb/hexpat-sax    492,576     0    OK
211kb/hexpat-sax   21,112,392  40   OK
4kb/hexpat-dom     519,128     0    OK
31kb/hexpat-dom    575,232     0    OK
211kb/hexpat-dom   23,182,560  44   OK


Implementing a DOM parser for Xeno

All isn’t lost. Hexml isn’t a dumb parser that’s fast because it’s in C, it’s also a decent algorithm. Rather than allocating a tree, it allocates a big flat vector of nodes and attributes, which contain offsets into the original string. We can do that in Haskell too!

Here’s my design of a data structure contained in a vector. We want to store just integers in the vector. Integers that point to offsets in the original string. Here’s what I came up with.

We have three kinds of payloads. Elements, text and attributes:

Element:
1. 00 # Type tag: element
2. 00 # Parent index (within this array)
3. 01 # Start of the tag name in the original string
4. 01 # Length of the tag name
5. 05 # End index of the tag (within this array)

Attribute:
1. 02 # Type tag: attribute
2. 01 # Start of the key
3. 05 # Length of the key
4. 06 # Start of the value
5. 03 # Length of the value

Text:
1. 01 # Type tag: text
2. 01 # Start of the text
3. 10 # Length of the text

That’s all the detail I’m going to go into. You can read the code if you want to know more. It’s not a highly optimized format. Once we have such a vector, it’s possible to define a DOM API on top of it which can let you navigate the tree as usual, which we’ll see later.
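To show how such an offset-based encoding is read back, here is a hypothetical miniature: a plain Int list stands in for the unboxed vector, and the element’s name is sliced out of the original bytes on demand, never copied. The layout follows the element payload sketched above; source, nodes, and elementName are my names.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Data.ByteString (ByteString)
import qualified Data.ByteString as S

-- The original document; nothing in it is ever copied.
source :: ByteString
source = "<foo k='123'/>"

-- Flat element payload, as in the layout above:
-- [type tag 0, parent index, name start, name length, end index]
-- "foo" starts at byte 1 of the source and is 3 bytes long.
nodes :: [Int]
nodes = [0, 0, 1, 3, 5]

-- Decode an element's name by slicing the original string.
elementName :: ByteString -> [Int] -> ByteString
elementName str (0 : _parent : start : len : _) = S.take len (S.drop start str)
elementName _ _ = error "not an element payload"
```

A real implementation indexes into an unboxed Int vector rather than pattern-matching a list, but the essential trick is the same: the DOM is just offsets into the input.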

We’re going to use our SAX parser (the process function) and implement a function that writes to a big array. This is a very imperative algorithm. Haskellers don’t like imperative algorithms much, but Haskell’s fine with them.

The function ends up looking something like this:

runST
  (do nil <- UMV.new 1000
      vecRef <- newSTRef nil
      sizeRef <- fmap asURef (newRef 0)
      parentRef <- fmap asURef (newRef 0)
      process
        (\(PS _ name_start name_len) -> <write the open tag elements>)
        (\(PS _ key_start key_len) (PS _ value_start value_len) ->
           <write an attribute into the vector>)
        (\_ -> <ignore>)
        (\(PS _ text_start text_len) -> <write a text entry into the vector>)
        (\_ -> <set the end position of the parent>
               <set the current element to the parent>)
        str
      wet <- readSTRef vecRef
      arr <- UV.unsafeFreeze wet
      size <- readRef sizeRef
      return (UV.unsafeSlice 0 size arr))

The function runs in the ST monad which lets us locally read and write to mutable variables and vectors, while staying pure on the outside.

I allocate an array of 1000 64-bit Ints (on 64-bit arch), I keep a variable of the current size, and the current parent (if any). The current parent variable lets us, upon seeing a </close> tag, assign the position in the vector of where the parent is closed.

Whenever we get an event and the array is too small, I grow the array by doubling its size. This strategy is copied from the Hexml package.
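The doubling strategy can be sketched in a few lines. This is a hypothetical miniature, not Xeno’s code: it uses STUArray from the boot array package in place of the vector package’s unboxed mutable vectors, and all names (newIntArray, pushAll) are mine.

```haskell
import Control.Monad.ST
import Data.Array.ST
import Data.STRef

-- Fresh zero-filled unboxed Int array with bounds (0, hi).
newIntArray :: Int -> ST s (STUArray s Int Int)
newIntArray hi = newArray (0, hi) 0

-- Push a list of Ints into a growable array, doubling capacity on
-- overflow, then read back the used prefix.
pushAll :: [Int] -> [Int]
pushAll xs = runST $ do
  arrRef  <- newSTRef =<< newIntArray 0  -- start with capacity 1
  sizeRef <- newSTRef 0
  let push x = do
        size <- readSTRef sizeRef
        arr  <- readSTRef arrRef
        (_, hi) <- getBounds arr
        arr' <-
          if size > hi  -- full: double the capacity and copy the old cells
            then do
              bigger <- newIntArray (2 * (hi + 1) - 1)
              mapM_ (\i -> readArray arr i >>= writeArray bigger i) [0 .. hi]
              writeSTRef arrRef bigger
              pure bigger
            else pure arr
        writeArray arr' size x
        writeSTRef sizeRef (size + 1)
  mapM_ push xs
  size <- readSTRef sizeRef
  arr  <- readSTRef arrRef
  mapM (readArray arr) [0 .. size - 1]
```

The amortised cost per push is constant, which is why both Hexml and the Haskell port can afford to grow the node vector on demand.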

Finally, when we’re done, we get the mutable vector, “freeze” it (this means making an immutable version of it), and then return that copy. We use unsafeFreeze to re-use the array without copying, which includes a promise that we don’t use the mutable vector afterwards, which we don’t.

The DOM speed

Let’s take a look at the speeds:

File    hexml-dom  xeno-sax  xeno-dom
4KB     6.123 μs   5.038 μs  10.35 μs
31KB    9.417 μs   2.875 μs  5.714 μs
211KB   256.3 μs   240.4 μs  514.2 μs

Not bad! The DOM parser is only <2x slower than Hexml (except in the 31KB where it’s faster. shrug). Here is where I stopped optimizing and decided it was good enough. But we can review some of the decisions made along the way.

In the code we’re using unboxed mutable references for the current size and parent, the mutable references are provided by the mutable-containers package. See these two lines here:

sizeRef <- fmap asURef (newRef 0)
parentRef <- fmap asURef (newRef 0)

Originally, I had tried STRef’s, which are boxed. Boxed just means it’s a pointer to an integer instead of an actual integer. An unboxed Int is a proper machine register. Using an STRef, we get worse speeds:

File    xeno-dom
4KB     12.18 μs
31KB    6.412 μs
211KB   631.1 μs

Which is a noticeable speed loss.

Another thing to take into consideration is the array type. I’m using the unboxed mutable vectors from the vector package. When using atomic types like Int, it can be a leg-up to use unboxed vectors. If I use the regular boxed vectors from Data.Vector, the speed regresses to:

File    xeno-dom
4KB     11.95 μs (from 10.35 μs)
31KB    6.430 μs (from 5.714 μs)
211KB   1.402 ms (from 514.2 μs)

Aside from taking a bit more time to do writes, it also allocates 1.5x more stuff:

Case            Bytes      GCs  Check
4kb/xeno/dom    11,240     0    OK
31kb/xeno/dom   10,232     0    OK
211kb/xeno/dom  1,082,696  0    OK


Case            Bytes      GCs  Check
4kb/xeno/dom    22,816     0    OK
31kb/xeno/dom   14,968     0    OK
211kb/xeno/dom  1,638,392  1    OK

See that GC there? We shouldn’t need it.

Finally, one more remark for the DOM parser. If we forsake safety and use the unsafeWrite and unsafeRead methods from the vector package, we do see a small increase:

File    xeno-dom
4KB     9.827 μs
31KB    5.545 μs
211KB   490.1 μs

But it’s nothing to write home about. I’ll prefer memory safety over a few microseconds this time.


I wrote some functions to access our vector and provide a DOM-like API:

> let Right node = parse "<foo k='123'><p>hi</p>ok</foo>"
> node
(Node "foo" [("k","123")] [Element (Node "p" [] [Text "hi"]),Text "ok"])
> name node
"foo"
> children node
[(Node "p" [] [Text "hi"])]
> attributes node
[("k","123")]
> contents node
[Element (Node "p" [] [Text "hi"]),Text "ok"]

So that works.


The final results are in:

And just to check that a 1MB file doesn’t give wildly different results:

benchmarking 1MB/hexml-dom
time     1.225 ms (1.221 ms .. 1.229 ms)
         1.000 R² (1.000 R² .. 1.000 R²)
mean     1.239 ms (1.234 ms .. 1.249 ms)
std dev  25.23 μs (12.28 μs .. 40.84 μs)
benchmarking 1MB/xeno-sax
time     1.206 ms (1.203 ms .. 1.211 ms)
         1.000 R² (1.000 R² .. 1.000 R²)
mean     1.213 ms (1.210 ms .. 1.218 ms)
std dev  14.58 μs (10.18 μs .. 21.34 μs)
benchmarking 1MB/xeno-dom
time     2.768 ms (2.756 ms .. 2.779 ms)
         1.000 R² (1.000 R² .. 1.000 R²)
mean     2.801 ms (2.791 ms .. 2.816 ms)
std dev  41.10 μs (30.14 μs .. 62.60 μs)

Tada! We matched Hexml, in pure Haskell, using safe accessor functions. We provided a SAX API which is very fast, and a simple demonstration DOM parser with a familiar API which is also quite fast. We use reasonably little memory in doing so.

This package is an experiment for educational purposes, to show what Haskell can do and what it can’t, for a very specific domain problem. If you would like to use this package, consider adopting it and giving it a good home. I’m not looking for more packages to maintain.

Categories: Offsite Blogs

Roman Cheplyaka: Nested monadic loops may cause space leaks

Planet Haskell - Tue, 01/10/2017 - 2:00pm

Consider the following trivial Haskell program:

main :: IO ()
main = worker

{-# NOINLINE worker #-}
worker :: (Monad m) => m ()
worker = let loop = poll >> loop in loop

poll :: (Monad m) => m a
poll = return () >> poll

It doesn’t do much — except, as it turns out, eat a lot of memory!

% ./test +RTS -s & sleep 1s && kill -SIGINT %1
     751,551,192 bytes allocated in the heap
   1,359,059,768 bytes copied during GC
     450,901,152 bytes maximum residency (11 sample(s))
       7,166,816 bytes maximum slop
             888 MB total memory in use (0 MB lost due to fragmentation)

                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0   1429 colls,  0 par    0.265s   0.265s   0.0002s   0.0005s
  Gen  1     11 colls,  0 par    0.701s   0.703s   0.0639s   0.3266s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.218s  (  0.218s elapsed)
  GC      time    0.966s  (  0.968s elapsed)
  EXIT    time    0.036s  (  0.036s elapsed)
  Total   time    1.223s  (  1.222s elapsed)

  %GC     time      79.0%  (79.2% elapsed)

  Alloc rate    3,450,267,071 bytes per MUT second

  Productivity  21.0% of total user, 21.0% of total elapsed

These nested loops happen often in server-side programming. About a year ago, when I worked for Signal Vine, this happened to my code: the inner loop was a big streaming computation; the outer loop was something that would restart the inner loop should it fail.

Later that year, Edsko de Vries blogged about a very similar issue.

Recently, Sean Clark Hess observed something similar. In his case, the inner loop waits for a particular AMQP message, and the outer loop calls the inner loop repeatedly to extract all such messages.

So why would such an innocent-looking piece of code consume unbounded amounts of memory? To find out, let’s trace the program execution on the STG level.

Background: STG and IO

The runtime model of ghc-compiled programs is described in the paper Making a Fast Curry: Push/Enter vs. Eval/Apply for Higher-order Languages. Here is the grammar and the reduction rules for the quick reference.



It is going to be important that the IO type in GHC is a function type:

newtype IO a = IO (State# RealWorld -> (# State# RealWorld, a #))

Here are a few good introductions to the internals of IO: from Edsko de Vries, Edward Z. Yang, and Michael Snoyman.

Our program in STG

Let’s see now how our program translates to STG. This is a translation done by ghc 8.0.1 with -O -ddump-stg -dsuppress-all:

poll_rnN =
    sat-only \r srt:SRT:[] [$dMonad_s312]
        let { sat_s314 = \u srt:SRT:[] [] poll_rnN $dMonad_s312; } in
        let { sat_s313 = \u srt:SRT:[] [] return $dMonad_s312 (); } in
        >> $dMonad_s312 sat_s313 sat_s314;

worker =
    \r srt:SRT:[] [$dMonad_s315]
        let {
          loop_s316 =
              \u srt:SRT:[] []
                  let { sat_s317 = \u srt:SRT:[] [] poll_rnN $dMonad_s315; } in
                  >> $dMonad_s315 sat_s317 loop_s316;
        } in loop_s316;

main = \u srt:SRT:[r2 :-> $fMonadIO] [] worker $fMonadIO;

This is the STG as understood by ghc itself. In the notation of the fast curry paper introduced above, this (roughly) translates to:

main = THUNK(worker monadIO realWorld);

worker = FUN(monad ->
  let {
    loop = THUNK(
      let { worker_poll_thunk = THUNK(poll monad); } in
      then monad worker_poll_thunk loop);
  } in loop
);

poll = FUN(monad ->
  let {
    ret_thunk = THUNK(return monad unit);
    poll_poll_thunk = THUNK(poll monad);
  } in
  then monad ret_thunk poll_poll_thunk
);

monadIO is the record (“dictionary”) that contains the Monad methods >>=, >>, and return for the IO type. We will need return and >> (called then here) in particular; here is how they are defined:

returnIO = FUN(x s -> (# s, x #));

thenIO = FUN(m k s ->
  case m s of {
    (# new_s, result #) -> k new_s
  }
);

monadIO = CON(Monad returnIO thenIO);

return = FUN(monad ->
  case monad of { Monad return then -> return }
);

then = FUN(monad ->
  case monad of { Monad return then -> then }
);

STG interpreters

We could run our STG program by hand following the reduction rules listed above. If you have never done it, I highly recommend performing several reductions by hand as an exercise. But it is a bit tedious and error-prone. That’s why we will use Bernie Pope’s Ministg interpreter. My fork of Ministg adds support for unboxed tuples and recursive let bindings necessary to run our program.

There is another STG interpreter, stgi, by David Luposchainsky. It is more recent and looks nicer, but it doesn’t support the eval/apply execution model used by ghc, which is a deal breaker for our purposes.

We run Ministg like this:

ministg --noprelude --trace --maxsteps=100 --style=EA --tracedir leak.trace leak.stg

Ministg will print an error message saying that the program hasn’t finished running in 100 steps — as we would expect, — and it will also generate a directory leak.trace containing html files. Each html file shows the state of the STG machine after a single evaluation step. You can browse these files here.

Tracing the program

Steps 0 through 16 take us from main to poll monadIO, which is where things get interesting, because from this point on, only code inside poll will be executing. Remember, poll is an infinite loop, so it won’t give a chance for worker to run ever again.

Each iteration of the poll loop consists of two phases. During the first phase, poll monadIO is evaluated. This is the “pure” part. No IO gets done during this part; we are just figuring out what is going to be executed. The first phase runs up until step 24.

On step 25, we grab the RealWorld token from the stack, and the second phase — the IO phase — begins. It ends on step 42, when the next iteration of the loop begins with poll monadIO.

Let’s look at the first phase in more detail. In steps 18 and 19, the let-expression

let {
  ret_thunk = THUNK(return monad unit);
  poll_poll_thunk = THUNK(poll monad);
} in
then monad ret_thunk poll_poll_thunk

is evaluated. The thunks ret_thunk and poll_poll_thunk are allocated on the heap at addresses $3 and $4, respectively.

Later these thunks will be evaluated/updated to partial applications: $3=PAP(returnIO unit) on step 35 and $4=PAP(thenIO $7 $8) on step 50.

We would hope that these partial applications will eventually be garbage-collected. Unfortunately, not. The partial application $1=PAP(thenIO $3 $4) is defined in terms of $3 and $4. $1 is the worker_poll_thunk, the “next” instance of the poll loop invoked by worker.

This is why the leak doesn’t occur if there’s no outer loop. Nothing would reference $3 and $4, and they would be executed and gc’d.

IO that doesn’t leak

The memory leak is a combination of two reasons. As we discussed above, the first reason is the outer loop that holds on to the reference to the inner loop.

The second reason is that IO happens here in two phases: the pure phase, during which we “compute” the IO action, and the second phase, during which we run the computed action. If there was no first phase, there would be nothing to remember.

Consider this version of the nested loop. Here, I moved NOINLINE to poll. (NOINLINE is needed because otherwise ghc would realize that our program doesn’t do anything and would simplify it down to a single infinite loop.)

main :: IO ()
main = worker

worker :: (Monad m) => m ()
worker = let loop = poll >> loop in loop

{-# NOINLINE poll #-}
poll :: (Monad m) => m a
poll = return () >> poll

In this version, ghc would inline worker into main and specialize it to IO. Here is the ghc’s STG code:

poll_rqk =
    sat-only \r srt:SRT:[] [$dMonad_s322]
        let { sat_s324 = \u srt:SRT:[] [] poll_rqk $dMonad_s322; } in
        let { sat_s323 = \u srt:SRT:[] [] return $dMonad_s322 (); } in
        >> $dMonad_s322 sat_s323 sat_s324;

main1 =
    \r srt:SRT:[r3 :-> main1, r54 :-> $fMonadIO] [s_s325]
        case poll_rqk $fMonadIO s_s325 of _ {
          (#,#) ipv_s327 _ -> main1 ipv_s327;
        };

Here, poll still runs in two phases, but main1 (the outer loop) doesn’t. This program still allocates and doesn’t run as efficiently as it could, but at least it runs in constant memory. This is because the compiler realizes that poll_rqk $fMonadIO doesn’t compute anything useful, so there’s no point in caching that value. (I am actually curious what exactly ghc’s logic is here.)

What if we push NOINLINE even further down?

main :: IO ()
main = worker

worker :: (Monad m) => m ()
worker = let loop = poll >> loop in loop

poll :: (Monad m) => m a
poll = do_stuff >> poll

{-# NOINLINE do_stuff #-}
do_stuff :: Monad m => m ()
do_stuff = return ()


do_stuff_rql =
    sat-only \r srt:SRT:[] [$dMonad_s32i] return $dMonad_s32i ();

$spoll_r2SR =
    sat-only \r srt:SRT:[r54 :-> $fMonadIO, r2SR :-> $spoll_r2SR] [s_s32j]
        case do_stuff_rql $fMonadIO s_s32j of _ {
          (#,#) ipv_s32l _ -> $spoll_r2SR ipv_s32l;
        };

main1 =
    \r srt:SRT:[r3 :-> main1, r2SR :-> $spoll_r2SR] [s_s32n]
        case $spoll_r2SR s_s32n of _ {
          (#,#) ipv_s32p _ -> main1 ipv_s32p;
        };

This code runs very efficiently, in a single phase, and doesn’t allocate at all.

Of course, in practice we wouldn’t deliberately put these NOINLINEs in our code just to make it inefficient. Instead, the inlining or specialization will fail to happen because the function is too big and/or resides in a different module, or for some other reason.


Arities provide an important perspective on the two-phase computation issue. The arity of then is 1: it is just a record selector. The arity of thenIO is 3: it takes the two monadic values and the RealWorld state token.

Arities influence what happens at runtime, as can be seen from the STG reduction rules. Because thenIO has arity 3, a partial application is created for thenIO ret_thunk poll_poll_thunk. Let’s change the arity of thenIO to 2, so that no PAPs get created:

thenIO = FUN(m k -> case m realWorld of { (# new_s, result #) -> k } );

(this is similar to how unsafePerformIO works). Now we no longer have PAPs, but our heap is filled with the same exact number of BLACKHOLEs.

More importantly, arities also influence what happens during compile time: what shape the generated STG code has. Because then has arity 1, ghc decides to create a chain of thens before passing the RealWorld token. Let’s change (“eta-expand”) the poll code as if then had arity 4, without actually changing then or thenIO or their runtime arities:

# added a dummy argument s
poll = FUN(monad s ->
    let { ret_thunk = THUNK(return monad unit);
          poll_poll_thunk = THUNK(poll monad); } in
    then monad ret_thunk poll_poll_thunk s
);

# no change in then or thenIO
then = FUN(monad -> case monad of { Monad return then -> then } );

thenIO = FUN(m k s -> case m s of { (# new_s, result #) -> k new_s } );

This code now runs in constant memory!

Therefore, what inlining/specialization does is let the compiler see the true arity of a function such as then. (Of course, it would also allow the compiler to replace then with thenIO.)


Let me tell you how you can avoid any such space leaks in your code by following a simple rule:

I don’t know.

In some cases, -fno-full-laziness or -fno-state-hack help. In this case, they don’t.

In 2012, I wrote why reasoning about space usage in Haskell is hard. I don’t think anything has changed since then. It is a hard problem to solve. I filed a ghc bug #13080 just in case the ghc developers might figure out a way how to address this particular issue.

Most of the time everything works great, but once in a while you stumble upon something like this. Such is life.

Thanks to Reid Barton for pointing out that my original theory regarding this leak was incomplete at best.

Categories: Offsite Blogs

Michael Snoyman: Foldable.mapM_, Maybe, and recursive functions

Planet Haskell - Mon, 01/09/2017 - 6:00pm

NOTE This content originally appeared on School of Haskell.

I've run into this issue myself, and seen others hit it too. Let's start off with some very simple code:

#!/usr/bin/env stack
-- stack --resolver lts-7.14 --install-ghc runghc

sayHi :: Maybe String -> IO ()
sayHi mname =
    case mname of
        Nothing -> return ()
        Just name -> putStrLn $ "Hello, " ++ name

main :: IO ()
main = sayHi $ Just "Alice"

There's nothing amazing about this code; it's pretty straightforward pattern-matching Haskell. And at some point, many Haskellers end up deciding that they don't like the explicit pattern matching, and instead want to use a combinator. So the code above might get turned into one of the following:

#!/usr/bin/env stack
-- stack --resolver lts-7.14 --install-ghc runghc
import Data.Foldable (forM_)

hiHelper :: String -> IO ()
hiHelper name = putStrLn $ "Hello, " ++ name

sayHi1 :: Maybe String -> IO ()
sayHi1 = maybe (return ()) hiHelper

sayHi2 :: Maybe String -> IO ()
sayHi2 = mapM_ hiHelper

main :: IO ()
main = do
    sayHi1 $ Just "Alice"
    sayHi2 $ Just "Bob"
    -- or often times this:
    forM_ (Just "Charlie") hiHelper

The theory is that all three approaches (maybe, mapM_, and forM_) will end up being identical. We can fairly conclusively state that forM_ will be the exact same thing as mapM_, since it's just mapM_ flipped. So the question is: will the maybe and mapM_ approaches do the same thing? In this case, the answer is yes, but let's spice it up a bit more. First, the maybe version:

#!/usr/bin/env stack
-- stack --resolver lts-7.14 --install-ghc exec -- ghc -with-rtsopts -s
import Control.Monad (when)

uncons :: [a] -> Maybe (a, [a])
uncons [] = Nothing
uncons (x:xs) = Just (x, xs)

printChars :: Int -> [Char] -> IO ()
printChars idx str = maybe
    (return ())
    (\(c, str') -> do
        when (idx `mod` 100000 == 0)
            $ putStrLn
            $ "Character #" ++ show idx ++ ": " ++ show c
        printChars (idx + 1) str')
    (uncons str)

main :: IO ()
main = printChars 1 $ replicate 5000000 'x'

You can compile and run this by saving to a Main.hs file and running stack Main.hs && ./Main. On my system, it prints out the following memory statistics, which from the maximum residency you can see runs in constant space:

   2,200,270,200 bytes allocated in the heap
         788,296 bytes copied during GC
          44,384 bytes maximum residency (2 sample(s))
          24,528 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

While constant space is good, the usage of maybe makes this a bit ugly. This is a common time to use forM_ to syntactically clean things up. So let's give that a shot:

#!/usr/bin/env stack
-- stack --resolver lts-7.14 --install-ghc exec -- ghc -with-rtsopts -s
import Control.Monad (when, forM_)

uncons :: [a] -> Maybe (a, [a])
uncons [] = Nothing
uncons (x:xs) = Just (x, xs)

printChars :: Int -> [Char] -> IO ()
printChars idx str = forM_ (uncons str) $ \(c, str') -> do
    when (idx `mod` 100000 == 0)
        $ putStrLn
        $ "Character #" ++ show idx ++ ": " ++ show c
    printChars (idx + 1) str'

main :: IO ()
main = printChars 1 $ replicate 5000000 'x'

The code is arguably cleaner and easier to follow. However, when I run it, I get the following memory stats:

   3,443,468,248 bytes allocated in the heap
     632,375,152 bytes copied during GC
     132,575,648 bytes maximum residency (11 sample(s))
       2,348,288 bytes maximum slop
             331 MB total memory in use (0 MB lost due to fragmentation)

Notice how max residency has ballooned from 44kb to 132mb! And if you increase the size of the generated list, that number grows. In other words: we have linear memory usage instead of constant, clearly something we want to avoid.

The issue is that the implementation of mapM_ in Data.Foldable is not tail recursive, at least for the case of Maybe. As a result, each recursive call ends up accumulating a bunch of "do nothing" actions to perform after completing the recursive call, which all remain resident in memory until the entire list is traversed.
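To see where those "do nothing" actions come from, it helps to look at how mapM_ is built. This is a simplified sketch of the foldr-based definition in Data.Foldable (the definition in base is essentially equivalent):

```haskell
-- Simplified sketch of Data.Foldable.mapM_ from base:
-- fold the structure, sequencing each action before the accumulator.
mapM_' :: (Foldable t, Monad m) => (a -> m b) -> t a -> m ()
mapM_' f = foldr (\x k -> f x >> k) (return ())

-- Specialised to Maybe, mapM_' f (Just x) reduces to f x >> return ().
-- That trailing ">> return ()" is the "do nothing" action: a recursive
-- call made inside f cannot discard it, so one accumulates per level.
```

In the leaky program above, the function passed in is the recursive continuation, so every level of recursion pins one of these return () frames until the whole list has been traversed.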

Fortunately, solving this issue is pretty easy: write a tail-recursive version of forM_ for Maybe:

#!/usr/bin/env stack
-- stack --resolver lts-7.14 --install-ghc exec -- ghc -with-rtsopts -s
import Control.Monad (when)

uncons :: [a] -> Maybe (a, [a])
uncons [] = Nothing
uncons (x:xs) = Just (x, xs)

forM_Maybe :: Monad m => Maybe a -> (a -> m ()) -> m ()
forM_Maybe Nothing _ = return ()
forM_Maybe (Just x) f = f x

printChars :: Int -> [Char] -> IO ()
printChars idx str = forM_Maybe (uncons str) $ \(c, str') -> do
    when (idx `mod` 100000 == 0)
        $ putStrLn
        $ "Character #" ++ show idx ++ ": " ++ show c
    printChars (idx + 1) str'

main :: IO ()
main = printChars 1 $ replicate 5000000 'x'

This implementation once again runs in constant memory.

There's one slight difference in the type of forM_Maybe and forM_ specialized to Maybe. The former takes a second argument of type a -> m (), while the latter takes a second argument of type a -> m b. This difference is unfortunately necessary; if we try to get back the original type signature, we have to add an extra action to wipe out the return value, which again reintroduces the memory leak:

forM_Maybe :: Monad m => Maybe a -> (a -> m b) -> m ()
forM_Maybe Nothing _ = return ()
forM_Maybe (Just x) f = f x >> return ()

Try swapping in this implementation into the above program, and once again you'll get your memory leak.


Back in 2014, I raised this same issue about the mono-traversable library, and ultimately decided to change the type signature of the omapM_ function to the non-leaking version demonstrated above. You can see that this in fact works:

#!/usr/bin/env stack
-- stack --resolver lts-7.14 --install-ghc exec --package mono-traversable -- ghc -with-rtsopts -s
import Control.Monad (when)
import Data.MonoTraversable (oforM_)

uncons :: [a] -> Maybe (a, [a])
uncons [] = Nothing
uncons (x:xs) = Just (x, xs)

printChars :: Int -> [Char] -> IO ()
printChars idx str = oforM_ (uncons str) $ \(c, str') -> do
    when (idx `mod` 100000 == 0)
        $ putStrLn
        $ "Character #" ++ show idx ++ ": " ++ show c
    printChars (idx + 1) str'

main :: IO ()
main = printChars 1 $ replicate 5000000 'x'

As we'd hope, this runs in constant memory.

Categories: Offsite Blogs

Dan Piponi (sigfpe): Building free arrows from components

Planet Haskell - Mon, 01/09/2017 - 10:33am


Gabriel Gonzalez has written quite a bit about the practical applications of free monads. And "haoformayor" wrote a great stackoverflow post on how arrows are related to strong profunctors. So I thought I'd combine these and apply them to arrows built from profunctors: free arrows. What you get is a way to use arrow notation to build programs, but defer the interpretation of those programs until later.


Using the notation here I'm going to call an element of a type P a b, where P is a profunctor, a heteromorphism.

A product that isn't much of a product

As I described a while back you can compose profunctors. Take a look at the code I used, and also Data.Functor.Composition.

data Compose f g d c = forall a. Compose (f d a) (g a c)

An element of Compose f g d c is just a pair of heteromorphisms, one from each of the profunctors, f and g, with the proviso that the "output" type of one is compatible with the "input" type of the other. As products go it's pretty weak in the sense that no composition happens beyond the two objects being stored with each other. And that's the basis of what I'm going to talk about. The Compose type is just a placeholder for pairs of heteromorphisms whose actual "multiplication" is being deferred until later. This is similar to the situation with the free monoid, otherwise known as a list. We can "multiply" two lists together using mappend but all that really does is combine the elements into a bigger list. The elements themselves aren't touched in any way. That suggests the idea of using profunctor composition in the same way that (:) is used to pair elements and lists.
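To make the "no composition happens" point concrete, here is one way Compose can be given a Profunctor instance. To keep the sketch self-contained I inline a minimal Profunctor class instead of importing Data.Profunctor; note that lmap and rmap each touch only one end of the pair, and the hidden middle type is never consulted:

```haskell
{-# LANGUAGE ExistentialQuantification #-}

-- Minimal stand-in for Data.Profunctor, so the sketch is self-contained.
class Profunctor p where
  lmap :: (a -> b) -> p b c -> p a c
  rmap :: (b -> c) -> p a b -> p a c

instance Profunctor (->) where
  lmap f g = g . f
  rmap f g = f . g

data Compose f g d c = forall a. Compose (f d a) (g a c)

-- Mapping over the input touches only the first heteromorphism,
-- mapping over the output only the second; the pair itself is inert.
instance (Profunctor f, Profunctor g) => Profunctor (Compose f g) where
  lmap h (Compose f g) = Compose (lmap h f) g
  rmap h (Compose f g) = Compose f (rmap h g)
```

As a sanity check, for f = g = (->) an element Compose f g behaves exactly like the composite function g . f.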

Free Arrows

Here's some code:

> {-# OPTIONS -W #-}
> {-# LANGUAGE ExistentialQuantification #-}
> {-# LANGUAGE Arrows #-}
> {-# LANGUAGE RankNTypes #-}
> {-# LANGUAGE TypeOperators #-}
> {-# LANGUAGE FlexibleInstances #-}

> import Prelude hiding ((.), id)
> import Control.Arrow
> import Control.Category
> import Data.Profunctor
> import Data.Monoid

> infixr :-

> data FreeA p a b = PureP (a -> b)
> | forall x. p a x :- FreeA p x b

First look at the second line of the definition of FreeA. It says that a FreeA p a b might be a pair consisting of a head heteromorphism whose output matches the input of another FreeA. There's also the PureP case which is acting like the empty list []. The reason we use this is that for our composition, (->) acts a lot like the identity. In particular Compose (->) p a b is isomorphic to p a b (modulo all the usual stuff about non-terminating computations and so on). This is because an element of this type is a pair consisting of a function a -> x and a heteromorphism p x b for some type x we don't get to see. We can't project back out either of these items without information about the type of x escaping. So the only thing we can possibly do is use lmap to apply the function to the heteromorphism, giving us an element of p a b.
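That argument can be written out as the two directions of the isomorphism. The names toP and fromP are mine, and I again inline a minimal Profunctor class to keep the sketch self-contained:

```haskell
{-# LANGUAGE ExistentialQuantification #-}

-- Minimal stand-in for Data.Profunctor, to keep the sketch self-contained.
class Profunctor p where
  lmap :: (a -> b) -> p b c -> p a c

instance Profunctor (->) where
  lmap f g = g . f

data Compose f g d c = forall a. Compose (f d a) (g a c)

-- The only move available: absorb the function into the heteromorphism.
toP :: Profunctor p => Compose (->) p a b -> p a b
toP (Compose f g) = lmap f g

-- The other direction pairs the heteromorphism with the identity function.
fromP :: p a b -> Compose (->) p a b
fromP = Compose id
```

The round trip toP . fromP is the identity by the profunctor law lmap id = id; the other direction holds only up to the hidden existential type, which is exactly the point made above.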

Here is a special case of PureP we'll use later:

> nil :: Profunctor p => FreeA p a a
> nil = PureP id

So an element of FreeA is a sequence of heteromorphisms. If heteromorphisms are thought of as operations of some sort, then an element of FreeA is a sequence of operations waiting to be composed together into a program that does something. And that's just like the situation with free monads. Once we've build a free monad structure we apply an interpreter to it to evaluate it. This allows us to separate the "pure" structure representing what we want to do from the code that actually does it.

The first thing to note is our new type is also a profunctor. We can apply lmap and rmap to a PureP function straightforwardly. We apply lmap directly to the head of the list and we use recursion to apply rmap to the PureP at the end:

> instance Profunctor b => Profunctor (FreeA b) where
> lmap f (PureP g) = PureP (g . f)
> lmap f (g :- h) = (lmap f g) :- h
> rmap f (PureP g) = PureP (f . g)
> rmap f (g :- h) = g :- (rmap f h)

We also get a strong profunctor by applying first' all the way down the list:

> instance Strong p => Strong (FreeA p) where
> first' (PureP f) = PureP (first' f)
> first' (f :- g) = (first' f) :- (first' g)

We can now concatenate our lists of heteromorphisms using code that looks a lot like the typical implementation of (++):

> instance Profunctor p => Category (FreeA p) where
> id = PureP id
> g . PureP f = lmap f g
> k . (g :- h) = g :- (k . h)

Note that it's slightly different to what you might have expected compared to (++) because we tend to write composition of functions "backwards". Additionally, there is another definition of FreeA we could have used that's analogous to using snoc lists instead of cons lists.

And now we have an arrow. I'll leave the proofs that the arrow laws are obeyed as an exercise :-)

> instance (Profunctor p, Strong p) => Arrow (FreeA p) where
> arr = PureP
> first = first'

The important thing about free things is that we can apply interpreters to them. For lists we have folds:

foldr :: (a -> b -> b) -> b -> [a] -> b

In foldr f e we can think of f as saying how (:) should be interpreted and e as saying how [] should be interpreted.

Analogously, in Control.Monad.Free in the free package we have:

foldFree :: Monad m => (forall x . f x -> m x) -> Free f a -> m a
foldFree _ (Pure a) = return a
foldFree f (Free as) = f as >>= foldFree f

Given a natural transformation from f to m, foldFree extends it to all of Free f.

Now we need a fold for free arrows:

> foldFreeA :: (Profunctor p, Arrow a) =>
> (forall b c.p b c -> a b c) -> FreeA p b c -> a b c
> foldFreeA _ (PureP g) = arr g
> foldFreeA f (g :- h) = foldFreeA f h . f g

It's a lot like an ordinary fold but uses the arrow composition law to combine the interpretation of the head with the interpretation of the tail.

"Electronic" components

Let me revisit the example from my previous article. I'm going to remove things I won't need so my definition of Circuit is less general here. Free arrows are going to allow us to define individual components for a circuit, but defer exactly how those components are interpreted until later.

I'll use four components this time: a register we can read from, one we can write from and a register incrementer, as well as a "pure" component. But before that, let's revisit Gabriel's article that gives some clues about how components should be built. In particular, look at the definition of TeletypeF:

data TeletypeF x
= PutStrLn String x
| GetLine (String -> x)
| ExitSuccess

We use GetLine to read a string, and yet the type of GetLine k could be TeletypeF a for any a. The reason is that free monads work with continuations. Instead of GetLine returning a string to us, it's a holder for a function that says what we'd like to do with the string once we have it. That means we can leave open the question of where the string comes from. The function foldFree can be used to provide the actual string getter.

Free arrows are like "two-sided" free monads. We don't just provide a continuation saying what we'd like to do to our output. We also get to say how we prepare our data for input.

There's also some burden put on us. Free arrows need strong profunctors. Strong profunctors need to be able to convey extra data alongside the data we care about - that's what first' is all about. This means that even though Load is functionally similar to GetLine, it can't simply ignore its input. So we don't have Load (Int -> b), and instead have Load ((a, Int) -> b). Here is our component type:

> data Component a b = Load ((a, Int) -> b)
> | Store (a -> (b, Int))
> | Inc (a -> b)

The Component only knows about the data passing through, of type a and b. It doesn't know anything about how the data in the registers is stored. That's the part that will be deferred to later. We intend for Inc to increment a register. But as it doesn't know anything about registers nothing in the type of Inc refers to that. (It took a bit of experimentation for me to figure this out and there may be other ways of doing things. Often with code guided by category theory you can just "follow your nose" as there's one way that works and type checks. Here I found a certain amount of flexibility in how much you store in the Component and how much is deferred to the interpreter.)

I could implement the strong profunctor instances using various combinators but I think it might be easiest to understand when written explicitly with lambdas:

> instance Profunctor Component where
> lmap f (Load g) = Load $ \(a, s) -> g (f a, s)
> lmap f (Store g) = Store (g . f)
> lmap f (Inc g) = Inc (g . f)

> rmap f (Load g) = Load (f . g)
> rmap f (Store g) = Store $ \a -> let (b, t) = g a
> in (f b, t)
> rmap f (Inc g) = Inc (f . g)

> instance Strong Component where
> first' (Load g) = Load $ \((a, x), s) -> (g (a, s), x)
> first' (Store g) = Store $ \(a, x) -> let (b, t) = g a
> in ((b, x), t)
> first' (Inc g) = Inc (first' g)

And now we can implement individual components. First a completely "pure" component:

> add :: Num a => FreeA Component (a, a) a
> add = PureP $ uncurry (+)

And now the load and store operations.

> load :: FreeA Component () Int
> load = Load (\(_, a) -> a) :- nil

> store :: FreeA Component Int ()
> store = Store (\a -> ((), a)) :- nil

> inc :: FreeA Component a a
> inc = Inc id :- nil

Finally we can tie it all together in a complete function using arrow notation:

> test = proc () -> do
> () <- inc -< ()
> a <- load -< ()
> b <- load -< ()
> c <- add -< (a, b)
> () <- store -< c

> returnA -< ()

At this point, the test object is just a list of operations waiting to be executed. Now I'll give three examples of semantics we could provide. The first uses a state arrow type similar to the previous article:

> newtype Circuit s a b = C { runC :: (a, s) -> (b, s) }

> instance Category (Circuit s) where
> id = C id
> C f . C g = C (f . g)

> instance Arrow (Circuit s) where
> arr f = C $ \(a, s) -> (f a, s)
> first (C g) = C $ \((a, x), s) -> let (b, t) = g (a, s)
> in ((b, x), t)

Here is an interpreter that interprets each of our components as an arrow. Note that this is where, among other things, we provide the meaning of the Inc operation:

> exec :: Component a b -> Circuit Int a b
> exec (Load g) = C $ \(a, s) -> (g (a, s), s)
> exec (Store g) = C $ \(a, _) -> g a
> exec (Inc g) = C $ \(a, s) -> (g a, s+1)

Here's a completely different interpreter that is going to make you do the work of maintaining the state used by the registers. You'll be told what to do! We'll use the Kleisli IO arrow to do the I/O.

> exec' :: Component a b -> Kleisli IO a b
> exec' (Load g) = Kleisli $ \a -> do
> putStrLn "What is your number now?"
> s <- fmap read getLine
> return $ g (a, s)
> exec' (Store g) = Kleisli $ \a -> do
> let (b, t) = g a
> putStrLn $ "Your number is now " ++ show t ++ "."
> return b
> exec' (Inc g) = Kleisli $ \a -> do
> putStrLn "Increment your number."
> return $ g a

The last interpreter is simply going to sum values associated to various components. They could be costs in dollars, time to execute, or even strings representing some kind of simple execution trace.

> newtype Labelled m a b = Labelled { unLabelled :: m }

> instance Monoid m => Category (Labelled m) where
> id = Labelled mempty
> Labelled a . Labelled b = Labelled (a `mappend` b)

> instance Monoid m => Arrow (Labelled m) where
> arr _ = Labelled mempty
> first (Labelled m) = Labelled m

> exec'' (Load _) = Labelled (Sum 1)
> exec'' (Store _) = Labelled (Sum 1)
> exec'' (Inc _) = Labelled (Sum 2)

Note that we can't assign non-trivial values to "pure" operations.

And now we execute all three:

> main = do
> print $ runC (foldFreeA exec test) ((), 10)
> putStrLn "Your number is 10." >> runKleisli (foldFreeA exec' test) ()
> print $ getSum $ unLabelled $ foldFreeA exec'' test

Various thoughts

I don't know if free arrows are anywhere near as useful as free monads, but I hope I've successfully illustrated one application. Note that because arrow composition is essentially list concatenation it may be more efficient to use a version of Hughes lists. This is what the Cayley representation is about in the monoid notions paper. But it's easier to see the naive list version first. Something missing from here that is essential for electronics simulation is the possibility of using loops. I haven't yet thought too much about what it means to build instances of ArrowLoop freely.
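For readers who haven't met them, the Hughes-list trick represents a list by "the function that prepends it", so that append becomes function composition and left-nested appends stop re-traversing their left argument. A standard sketch (this is background, not code from the post):

```haskell
-- Hughes list (difference list): a list as the function that prepends it.
newtype DList a = DList ([a] -> [a])

fromList :: [a] -> DList a
fromList xs = DList (xs ++)

toList :: DList a -> [a]
toList (DList f) = f []

-- Append is function composition: O(1), regardless of the left list's length.
append :: DList a -> DList a -> DList a
append (DList f) (DList g) = DList (f . g)
```

The Cayley representation mentioned above plays an analogous game with arrow composition in place of (++).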

Profunctors have been described as decategorised matrices in the sense that p a b, with p a profunctor, is similar to a matrix element P_ab. Or, if you're working in a context where you distinguish between co- and contravariant vectors, it's similar to P_a^b. The Compose operation is a lot like the definition of matrix product. From this perspective, the FreeA operation is a lot like the function on matrices that takes M to (1 - M)^(-1) = 1 + M + M^2 + ... To work with ArrowLoop we need a trace-like operation.

One nice application of free monads is in writing plugin APIs. Users can write plugins that link to a small library based on a free monad. These can then be dynamically loaded and interpreted by an application at runtime, completely insulating the plugin-writer from the details of the application. You can think of it as a Haskell version of the PIMPL idiom. Free arrows might give a nice way to write plugins for dataflow applications.

People typically think of functors as containers. So in a free monad, each element is a container of possible futures. In a free arrow the relationship between the current heteromorphism and its "future" (and "past") is a bit more symmetrical. For example, for some definitions of P, a heteromorphism P a b can act on some as to give us some bs. But some definitions of P can run "backwards" and act on elements of b -> r to give us elements of a -> r. So when I use the words "input" and "output" above, you might not want to take them too literally.

Categories: Offsite Blogs