# News aggregator

### gist.github.com

### Ulisses Costa: Pointfree Transform

### Edward Z. Yang: So you want to add a new concurrency primitive to GHC…

One of the appealing things about GHC is that the compiler is surprisingly hackable, even when you don’t want to patch the compiler itself. This hackability comes from compiler plugins, which let you write custom optimization passes on Core, as well as foreign primops, which let you embed low-level C-- to manipulate the low-level representation of various primitives. These hooks let people implement and distribute features that would otherwise be to unstable or speculative to put into the compiler proper.

A particular use-case that has garnered some amount of interest recently is that of concurrency primitives. We engineers like to joke that, in the name of performance, we are willing to take on nearly unbounded levels of complexity: but this is almost certainly true when it comes to concurrency primitives, where the use of ever more exotic memory barriers and concurrent data structures can lead to significant performance boosts (just ask the Linux kernel developers). It’s very tempting to look at this situation and think, “Hey, we could implement this stuff in GHC too, using the provided compiler hooks!” But there are a lot of caveats involved here.

After answering a few questions related to this subject on the ghc-devs list and noticing that many of the other responses were a bit garbled, I figured I ought to expand on my responses a bit in a proper blog post. I want to answer the following questions:

- What does it mean to have a memory model for a high-level language like Haskell? (You can safely skip this section if you know what a memory model is.)
- What is (GHC) Haskell’s memory model?
- How would I go about implementing a (fast) memory barrier in GHC Haskell?

What is a memory model? If you ask a hardware person, they might tell you, “A memory model is a description of how a multi-processor CPU interacts with its memory, e.g. under what circumstances a write by one processor is guaranteed to be visible by another.” If you ask a compiler person, they might tell you, “A memory model says what kind of compiler optimizations I’m allowed to do on operations which modify shared variables.” A memory model must fulfill both purposes (a common misconception is that it is only one or the other). To be explicit, we define a memory model as follows (adapted from Adve-Boehm):

A**memory model**is a

*semantics*for shared variables, i.e. the set of values that a read in a program is allowed to return.

That’s right: a memory model defines the behavior of one the most basic operations in your programming language. Without it, you can’t really say what your program is supposed to do.

Why, then, are memory models so rarely discussed, even in a language community that is so crazy about semantics? In the absence of concurrency, the memory model is irrelevant: the obvious semantics apply. In the absence of data races, the memory model can be described quite simply. For example, a Haskell program which utilizes only MVars for inter-thread communication can have its behavior described completely using a relatively simple nondeterministic operational semantics (see Concurrent Haskell paper (PS)); software transactional memory offers high-level guarantees of atomicity with respect to reads of transactional variables. Where a memory model becomes essential is when programs contain data races: when you have multiple threads writing and reading IORefs without any synchronization, a memory model is responsible for defining the behavior of this program. With modern processors, this behavior can be quite complex: we refer to these models as *relaxed memory models*. Sophisticated synchronization primitives will often take advantage of a relaxed memory model to avoid expensive synchronizations and squeeze out extra performance.

One might say the Haskell tradition is one that emphasizes the importance of semantics... except for a number of notable blind spots. The memory model is one of those blind spots. The original Haskell98 specification did not contain any specification of concurrency. Concurrent Haskell paper (PS) gave a description of semantics for how concurrency might be added to the language, but the paper posits only the existence of MVars, and is silent on how MVars ought to interact with IORefs.

One of the very first discussions that took place on the haskell-prime committee when it was inaugurated in 2006 was whether or not Concurrent Haskell should be standardized. In the discussion, it was quickly discovered that a memory model for IORefs would be needed (continued here). As of writing, no decision has been made as to whether or not IORefs should have a strong or weak memory model.

The upshot is that, as far as Haskell the standardized language goes, the behavior here is completely undefined. To really be able to say anything, we’ll have to pick an implementation (GHC Haskell), and we’ll have to infer which aspects of the implementation are specified behavior, as opposed to things that just accidentally happen to hold. Notably, memory models have implications for *all* levels of your stack (it is a common misconception that a memory barrier can be used without any cooperation from your compiler), so to do this analysis we’ll need to look at all of the phases of the GHC compilation chain. Furthermore, we’ll restrict ourselves to monadic reads/writes, to avoid having to wrangle with the can of worms that is laziness.

Here’s GHC’s compilation pipeline in a nutshell:

At the very top of the compiler pipeline lie the intermediate languages Core and STG. These will preserve sequential consistency with no trouble, as the ordering of reads and writes is fixed by the use of monads, and preserved throughout the desugaring and optimization passes: as far as the optimizer is concerned, the primitive operations which implement read/write are complete black boxes. In fact, monads will over-sequentialize in many cases! (It is worth remarking that rewrite rules and GHC plugins could apply optimizations which do not preserve the ordering imposed by monads. Of course, both of these facilities can be used to also change the meaning of your program entirely; when considering a memory model, these rules merely have a higher burden of correctness.)

The next step of the pipeline is a translation into C--, a high-level assembly language. Here, calls to primitive operations like readMutVar# and writeMutVar# are translated into actual memory reads and writes in C--. Importantly, the monadic structure that was present in Core and STG is now eliminated, and GHC may now apply optimizations which reorder reads and writes. What actually occurs is highly dependent on the C-- that is generated, as well as the optimizations that GHC applies, and C-- has no memory model, so we cannot appeal to even that.

This being said, a few things can be inferred from a study of the optimization passes that GHC does implement:

- GHC reserves the right to reorder stores: the WriteBarrier mach-op (NB: not available from Haskell!) is defined to prevent future stores from occurring before preceding stores. In practice, GHC has not implemented any C-- optimizations which reorder stores, so if you have a story for dealing with the proceeding stages of the pipeline, you can
**dangerously**assume that stores will not be reordered in this phase. - GHC reserves the right to reorder loads, and does so extensively. One of the most important optimizations we perform is a sinking pass, where assignments to local variables are floated as close to their use-sites as possible. As of writing, there is no support for read barrier, which would prevent this floating from occurring.

There are a few situations where we happen to avoid read reordering (which may be **dangerously** assumed):

- Reads don’t seem to be reordered across
*foreign primops*(primops defined using the foreign prim keywords). This is because foreign primops are implemented as a jump to another procedure (the primop), and there are no inter-procedural C-- optimizations at present. - Heap reads don’t seem to be reordered across heap writes. This is because we currently don’t do any aliasing analysis and conservatively assume the write would have clobbered the read. (This is especially dangerous to assume, since you could easily imagine getting some aliasing information from the frontend.)

Finally, the C-- is translated into either assembly (via the NCG—N for native) or to LLVM. During translation, we convert the write-barrier mach-op into an appropriate assembly instruction (no-op on x86) or LLVM intrinsic (sequential consistency barrier); at this point, the behavior is up to the memory model defined by the processor and/or by LLVM.

It is worth summarizing the discussion here by comparing it to the documentation at Data.IORef, which gives an informal description of the IORef memory model:

In a concurrent program, IORef operations may appear out-of-order to another thread, depending on the memory model of the underlying processor architecture...The implementation is required to ensure that reordering of memory operations cannot cause type-correct code to go wrong. In particular, when inspecting the value read from an IORef, the memory writes that created that value must have occurred from the point of view of the current thread.In other words, “We give no guarantees about reordering, except that you will not have any type-safety violations.” This behavior can easily occur as a result of reordering stores or loads. However, the type-safety guarantee is an interesting one: the last sentence remarks that an IORef is not allowed to point to uninitialized memory; that is, we’re not allowed to reorder the write to the IORef with the write that initializes a value. This holds easily on x86, due to the fact that C-- does not reorder stores; I am honestly skeptical that we are doing the right thing on the new code generator for ARM (but no one has submitted a bug yet!)

What does it all mean?This dive into the gory internals of GHC is all fine and nice, but what does it mean for you, the prospective implementor of a snazzy new concurrent data structure? There are three main points:

- Without inline foreign primops, you will not be able to convince GHC to emit the fast-path assembly code you are looking for. As we mentioned earlier, foreign primops currently always compile into out-of-line jumps, which will result in a bit of extra cost if the branch predictor is unable to figure out the control flow. On the plus side, any foreign primop call will accidentally enforce the compiler-side write/read barrier you are looking for.
- With inline foreign primops, you will still need make modifications to GHC in order to ensure that optimization passes respect your snazzy new memory barriers. For example, John Lato’s desire for a load-load barrier (the email which kicked off this post) will be fulfilled with no compiler changes by a out-of-line foreign primop, but not by the hypothetical inline foreign primop.
- This stuff is really subtle; see the position paper Relaxed memory models must be rigorous, which argues that informal descriptions of memory models (like this blog post!) are far too vague to be useful: if you want to have any hope of being correct, you must formalize it! Which suggests an immediate first step: give C-- a memory model. (This should be a modest innovation over the memory models that C and C++ have recently received.)

For the rest of us, we’ll use STM instead, and be in a slow but compositional and dead-lock free nirvana.

### Standard way to translate Haskell to other languages?

### Get expected type of expression using GHC API?

### Yesod Web Framework: Lens generator in Persistent

I've just released version 1.3.1 of persistent-template, which adds a new options: mpsGenerateLenses. When enabled, this option will cause lenses to be generated instead of normal field accessors. An example is worth a thousand words:

{-# LANGUAGE EmptyDataDecls #-} {-# LANGUAGE FlexibleContexts #-} {-# LANGUAGE GADTs #-} {-# LANGUAGE OverloadedStrings #-} {-# LANGUAGE QuasiQuotes #-} {-# LANGUAGE TemplateHaskell #-} {-# LANGUAGE TypeFamilies #-} import Control.Monad.IO.Class (liftIO) import Database.Persist import Database.Persist.Sqlite import Database.Persist.TH import Control.Lens share [ mkPersist sqlSettings { mpsGenerateLenses = True } , mkMigrate "migrateAll" ] [persistLowerCase| Person name String age Int Maybe deriving Show |] main :: IO () main = runSqlite ":memory:" $ do runMigration migrateAll johnId <- insert $ Person "John Doe" $ Just 35 Just john <- get johnId liftIO $ putStrLn $ john ^. personName janeId <- insert $ john & personName .~ "Jane Joe" Just jane <- get janeId liftIO $ print janeNot a major feature, but convenient. It can also be combined with the relatively new mpsPrefixFields option if your field names are unique, e.g.:

share [ mkPersist sqlSettings { mpsGenerateLenses = True, mpsPrefixFields = False } , mkMigrate "migrateAll" ] [persistLowerCase| Person name String age Int Maybe deriving Show |] main :: IO () main = runSqlite ":memory:" $ do runMigration migrateAll johnId <- insert $ Person "John Doe" $ Just 35 Just john <- get johnId liftIO $ putStrLn $ john ^. name janeId <- insert $ john & name .~ "Jane Joe" Just jane <- get janeId liftIO $ print janeNote that this isn't the first use of lenses in Persistent. While this implementation is fully compatible with the lens package, it introduces no dependency on that package, due to the beautiful way in which lenses work.

### Jeremy Gibbons: Upwards and downwards accumulations

Continuing my work in regress, this post revisits—with the benefit of much hindsight—what I was working on for my DPhil thesis (which was summarized in a paper at MPC 1992) and in subsequent papers at MPC 1998 and in SCP in 2000. This is the topic of *accumulations* on data structures, which distribute information across the data structure. List instances are familiar from the Haskell standard libraries (and, to those with a long memory, from APL); my thesis presented instances for a variety of tree datatypes; and the later work was about making it datatype-generic. I now have a much better way of doing it, using Conor McBride’s *derivatives*.

*Accumulations* or *scans* distribute information contained in a data structure across that data structure in a given direction. The paradigmatic example is computing the running totals of a list of numbers, which can be thought of as distributing the numbers rightwards across the list, summing them as you go. In Haskell, this is an instance of the operator:

A special case of this pattern is to distribute the elements of a list rightwards across the list, simply collecting them as you go, rather than summing them. That’s the function, and it too is an instance of :

It’s particularly special, in the sense that it is the most basic ; any other instance can be expressed in terms of it:

This is called the *Scan Lemma* for . Roughly speaking, it states that a replaces every node of a list with a applied to that node’s predecessors. Read from right to left, the scan lemma is an efficiency-improving transformation, eliminating duplicate computations; but note that this only works on expressions where is a , because only then are there duplicate computations to eliminate. It’s an important result, because it relates a clear and simple specification on the right to a more efficient implementation on the left.

However, the left-to-right operators , , and are a little awkward in Haskell, because they go against the grain of the cons-based (ie, right-to-left) structure of lists. I leave as a simple exercise for the reader the task of writing the more natural , , and , and identifying the relationships between them. Conversely, one can view etc as the natural operators for snoc-based lists, which are constructed from nil and snoc rather than from nil and cons.

Upwards and downwards accumulations on binary treesWhat would , , , etc look like on different—and in particular, non-linear—datatypes? Let’s consider a simple instance, for homogeneous binary trees; that is, trees with a label at both internal and external nodes.

for which the obvious fold operator is

I’m taking the view that the appropriate generalization is to distribute data “upwards” and “downwards” through such a tree—from the leaves towards the root, and vice versa. This does indeed specialize to the definitions we had on lists when you view them vertically in terms of their “cons” structure: they’re long thin trees, in which every parent has exactly one child. (An alternative view would be to look at distributing data horizontally through a tree, from left to right and vice versa. Perhaps I’ll come back to that another time.)

The upwards direction is the easier one to deal with. An upwards accumulation labels every node of the tree with some function of its *descendants*; moreover, the descendants of a node themselves form a tree, so can be easily represented, and folded. So we can quite straightforwardly define:

where yields the root of a tree:

As with lists, the most basic upwards scan uses the constructors themselves as arguments:

and any other scan can be expressed, albeit less efficiently, in terms of this:

The downwards direction is more difficult, though. A downwards accumulation should label every node with some function of its *ancestors*; but these do not form another tree. For example, in the homogeneous binary tree

the ancestors of the node labelled are the nodes labelled . One could represent those ancestors simply as a list, ; but that rules out the possibility of a downwards accumulation treating left children differently from right children, which is essential in a number of algorithms (such as the parallel prefix and tree drawing algorithms in my thesis). A more faithful rendering is to define a new datatype of *paths* that captures the left and right turns—a kind of non-empty cons list, but with both a “left cons” and a “right cons” constructor:

(I called them “threads” in my thesis.) Then we can capture the data structure representing the ancestors of the node labelled

by the expression . I leave it as an exercise for the more energetic reader to work out a definition for

to compute the tree giving the ancestors of every node, and for a corresponding .

Generic upwards accumulations
Having seen ad-hoc constructions for a particular kind of binary tree, we should consider what the datatype-generic construction looks like. I discussed datatype-generic upwards accumulations already, in the post on Horner’s Rule; the construction was given in the paper Generic functional programming with types and relations by Richard Bird, Oege de Moor and Paul Hoogendijk. As with homogeneous binary trees, it’s still the case that the generic version of labels every node of a data structure of type with the descendants of that node, and still the case that the descendants form a data structure also of type . However, in general, the datatype does not allow for a label at every node, so we need the *labelled variant* where . Then we can define

where returns the root label of a labelled data structure—by construction, every labelled data structure has a root label—and is the unique arrow to the unit type. Moreover, we get a datatype-generic operator, and a Scan Lemma:

Generic downwards accumulations, via linearization

The best part of a decade after my thesis work, inspired by the paper by Richard Bird & co, I set out to try to define datatype-generic versions of downward accumulations too. I wrote a paper about it for MPC 1998, and then came up with a new construction for the journal version of that paper in SCP in 2000. I now think these constructions are rather clunky, and I have a better one; if you don’t care to explore the culs-de-sac, skip this section and the next and go straight to the section on derivatives.

The MPC construction was based around a datatype-generic version of the datatype above, to represent the “ancestors” of a node in an inductive datatype. The tricky bit is that data structures in general are non-linear—a node may have many children—whereas paths are linear—every node has exactly one child, except the last which has none; how can we define a “linear version” of ? Technically, we might say that a functor is linear (actually, “affine” would be a better word) if it distributes over sum.

The construction in the paper assumed that was a sum of products of literals

where each is either , , or some constant type such as or . For example, for leaf-labelled binary trees

the shape functor is , so (there are two variants), (the first variant has a single literal, ) and (the second variant has two literals, and ), and:

Then for each we define a -ary functor , where is the “degree of branching” of variant (ie, the number of s occurring in , which is the number of for which ), in such a way that

and is linear in each argument except perhaps the first. It’s a bit messy explicitly to give a construction for , but roughly speaking,

where is “the next unused ” when , and just otherwise. For example, for leaf-labelled binary trees, we have:

Having defined the linear variant of , we can construct the datatype of paths, as the inductive datatype of shape where

That is, paths are a kind of non-empty cons list. The path ends at some node of the original data structure; so the last element of the path is of type , which records the “local content” of a node (its shape and labels, but without any of its children). Every other element of the path consists of the local content of a node together with an indication of which direction to go next; this amounts to the choice of a variant , followed by the choice of one of identical copies of the local contents of variant , where is the degree of branching of variant . We model this as a base constructor and a family of “cons” constructors for and .

For example, for leaf-labelled binary trees, the “local content” for the last element of the path is either a single label (for tips) or void (for bins), and for the other path elements, there are zero copies of the local content for a tip (because a tip has zero children), and two copies of the void local information for bins (because a bin has two children). Therefore, the path datatype for such trees is

which is isomorphic to the definition that you might have written yourself:

For homogeneous binary trees, the construction gives

which is almost the ad-hoc definition we had two sections ago, except that it distinguishes singleton paths that terminate at an external node from those that terminate at an internal one.

Now, analogous to the function which labels every node with its descendants, we can define a function to label every node with its ancestors, in the form of the path to that node. One definition is as a fold; informally, at each stage we construct a singleton path to the root, and map the appropriate “cons” over the paths to each node in each of the children (see the paper for a concrete definition). This is inefficient, because of the repeated maps; it’s analogous to defining by

A second definition is as an unfold, maintaining as an accumulating parameter of type the “path so far”; this avoids the maps, but it is still quadratic because there are no common subexpressions among the various paths. (This is analogous to an accumulating-parameter definition of :

Even with an accumulating “Hughes list” parameter, it still takes quadratic time.)

The downwards accumulation itself is defined as a path fold mapped over the paths, giving a Scan Lemma for downwards accumulations. With either the fold or the unfold definition of paths, this is still quadratic, again because of the lack of common subexpressions in a result of quadratic size. However, in some circumstances the path fold can be reassociated (analogous to turning a into a ), leading finally to a linear-time computation; see the paper for the details of how.

Generic downwards accumulations, via zipI was dissatisfied with the “…”s in the MPC construction of datatype-generic paths, but couldn’t see a good way of avoiding them. So in the subsequent SCP version of the paper, I presented an alternative construction of downwards accumulations, which does not go via a definition of paths; instead, it goes directly to the accumulation itself.

As with the efficient version of the MPC construction, it is coinductive, and uses an accumulating parameter to carry in to each node the seed from higher up in the tree; so the downwards accumulation is of type . It is defined as an unfold, with a body of type

The result of applying the body will be constructed from two components, of types and : the first gives the root label of the accumulation and the seeds for processing the children, and the second gives the children themselves.

These two components get combined to make the whole result via a function

This will be partial in general, defined only for pairs of -structures of the same shape.

The second component of is the easier to define; given input , it unpacks the to , and discards the and the (recall that is the labelled variant of , where ).

For the first component, we enforce the constraint that all output labels are dependent only on their ancestors by unpacking the and pruning off the children, giving input . We then suppose as a parameter to the accumulation a function of type to complete the construction of the first component. In order that the two components can be zipped together, we require that is shape-preserving in its second argument:

where is the unique function to the unit type. Then, although the built from these two components depends on the partial function , it will still itself be total.

The SCP construction gets rid of the “…”s in the MPC construction. It is also inherently efficient, in the sense that if the core operation takes constant time then the whole accumulation takes linear time. However, use of the partial function to define a total accumulation is a bit unsatisfactory, taking us outside the domain of sets and total functions. Moreover, there’s now only half an explanation in terms of paths: accumulations in which the label attached to each node depends only on the *list* of its ancestors, and not on the left-to-right ordering of siblings, can be factored into a list function (in fact, a ) mapped over the “paths”, which is now a tree of lists; but accumulations in which left children are treated differently from right children, such as the parallel prefix and tree drawing algorithms mentioned earlier, can not.

After another interlude of about a decade, and with the benefit of new results to exploit, I had a “eureka” moment: the linearization of a shape functor is closely related to the beautiful notion of the *derivative* of a datatype, as promoted by Conor McBride. The crucial observation Conor made is that the “one-hole contexts” of a datatype—that is, for a container datatype, the datatype of data structures with precisely one element missing—can be neatly formalized using an analogue of the rules of differential calculus. The one-hole contexts are precisely what you need to identify which particular child you’re talking about out of a collection of children. (If you’re going to follow along with some coding, I recommend that you also read Conor’s paper Clowns to the left of me, jokers to the right. This gives the more general construction of *dissecting* a datatype, identifying a unique hole, but also allowing the “clowns” to the left of the hole to have a different type from the “jokers” to the right. I think the explanation of the relationship with the differential calculus is much better explained here; the original notion of derivative can be retrieved by specializing the clowns and jokers to the same type.)

The essence of the construction is the notion of a *derivative* of a functor . For our purposes, we want the derivative in the second argument only of a bifunctor; informally, is like , but with precisely one missing. Given such a one-hole context, and an element with which to plug the hole, one can reconstruct the whole structure:

That’s how to consume one-hole contexts; how can we produce them? We could envisage some kind of inverse of , which breaks an -structure into an element and a context; but this requires us to invent a language for specifying which particular element we mean— is not injective, so needs an extra argument. A simpler approach is to provide an operator that annotates every position at once with the one-hole context for that position:

One property of is that it really is an annotation—if you throw away the annotations, you get back what you started with:

A second property relates it to —each of elements in a hole position plugs into its associated one-hole context to yield the same whole structure back again:

(I believe that those two properties completely determine and .)

Incidentally, the derivative of a bifunctor can be elegantly represented as an *associated type synonym* in Haskell, in a type class of bifunctors differentiable in their second argument, along with and :

Conor’s papers show how to define instances of for all polynomial functors —anything made out of constants, projections, sums, and products.

The path to a node in a data structure is simply a list of one-hole contexts—let’s say, innermost context first, although it doesn’t make much difference—but with all the data off the path (that is, the other children) stripped away:

This is a projection of Huet’s zipper, which preserves the off-path children, and records also the subtree in focus at the end of the path:

Since the contexts are listed innermost-first in the path, closing up a zipper to reconstruct a tree is a over the path:

Now, let’s develop the function , which turns a tree into a labelled tree of paths. We will write it with an accumulating parameter, representing the “path so far”:

Given the components of a tree and a path to its root, must construct the corresponding labelled tree of paths. Since and , this amounts to constructing a value of type . For the first component of this pair we will use , the path so far. The second component can be constructed from by identifying all children via , discarding some information with judicious s, consing each one-hole context onto to make a longer path, then making recursive calls on each child:

That is,

Downwards accumulations are then path functions mapped over the result of . However, we restrict ourselves to path functions that are instances of , because only then are there common subexpressions to be shared between a parent and its children (remember that paths are innermost-first, so related nodes share a tail of their ancestors).

Moreover, it is straightforward to fuse the with , to obtain

which takes time linear in the size of the tree, assuming that and take constant time.

Finally, in the case that the function being mapped over the paths is a as well as a , then we can apply the Third Homomorphism Theorem to conclude that it is also an associative fold over lists. From this (I believe) we get a very efficient parallel algorithm for computing the accumulation, taking time logarithmic in the size of the tree—even if the tree has greater than logarithmic depth.