News aggregator

Edward Z. Yang: Open type families are not modular

Planet Haskell - Thu, 09/04/2014 - 4:12pm

One of the major open problems for building a module system in Haskell is the treatment of type classes, which I have discussed previously on this blog. I've noted how the current mode of use in type classes in Haskell assume “global uniqueness”, which is inherently anti-modular; breaking this assumption risks violating the encapsulation of many existing data types.

As if we have a choice.

In fact, our hand is forced by the presence of open type families in Haskell, which are feature many similar properties to type classes, but with the added property that global uniqueness is required for type safety. We don't have a choice (unless we want type classes with associated types to behave differently from type classes): we have to figure out how to reconcile the inherent non-modularity of type families with the Backpack module system.

In this blog post, I want to carefully lay out why open type families are inherently unmodular and propose some solutions for managing this unmodularity. If you know what the problem is, you can skip the first two sections and go straight to the proposed solutions section.

Before we talk about open type family instances, it's first worth emphasizing the (intuitive) fact that a signature of a module is supposed to be able to hide information about its implementation. Here's a simple example:

module A where x :: Int module B where import A y = 0 z = x + y

Here, A is a signature, while B is a module which imports the signature. One of the points of a module system is that we should be able to type check B with respect to A, without knowing anything about what module we actually use as the implementation. Furthermore, if this type checking succeeds, then for any implementation which provides the interface of A, the combined program should also type check. This should hold even if the implementation of A defines other identifiers not mentioned in the signature:

module A where x = 1 y = 2

If B had directly imported this implementation, the identifier y would be ambiguous; but the signature filtered out the declarations so that B only sees the identifiers in the signature.

With this in mind, let's now consider the analogous situation with open type families. Assuming that we have some type family F defined in the prelude, we have the same example:

module A where type instance F Int f :: F Bool module B where import A type instance F Bool = Int -> Bool x = f 2

Now, should the following module A be a permissible implementation of the signature?

module A where type instance F Int = Int type instance F Bool = Int f = 42

If we view this example with the glasses off, we might conclude that it is a permissible implementation. After all, the implementation of A provides an extra type instance, yes, but when this happened previously with a (value-level) declaration, it was hidden by the signature.

But if put our glasses on and look at the example as a whole, something bad has happened: we're attempting to use the integer 42 as a function from integers to booleans. The trouble is that F Bool has been given different types in the module A and module B, and this is unsound... like, segfault unsound. And if we think about it some more, this should not be surprising: we already knew it was unsound to have overlapping type families (and eagerly check for this), and signature-style hiding is an easy way to allow overlap to sneak in.

The distressing conclusion: open type families are not modular.

So, what does this mean? Should we throw our hands up and give up giving Haskell a new module system? Obviously, we’re not going to go without a fight. Here are some ways to counter the problem.

The basic proposal: require all instances in the signature

The simplest and most straightforward way to solve the unsoundness is to require that a signature mention all of the family instances that are transitively exported by the module. So, in our previous example, the implementation of A does not satisfy the signature because it has an instance which is not mentioned in the signature, but would satisfy this signature:

module A where type instance F Int type instance F Bool

While at first glance this might not seem too onerous, it's important to note that this requirement is transitive. If A happens to import another module Internal, which itself has its own type family instances, those must be represented in the signature as well. (It's easy to imagine this spinning out of control for type classes, where any of the forty imports at the top of your file may be bringing in any manner of type classes into scope.) There are two major user-visible consequences:

  1. Module imports are not an implementation detail—you need to replicate this structure in the signature file, and
  2. Adding instances is always a backwards-incompatible change (there is no weakening).

Of course, as Richard pointed out to me, this is already the case for Haskell programs (and you just hoped that adding that one extra instance was "OK").

Despite its unfriendliness, this proposal serves as the basis for the rest of the proposals, which you can conceptualize as trying to characterize, “When can I avoid having to write all of the instances in my signature?”

Extension 1: The orphan restriction

Suppose that I write the following two modules:

module A where data T = T type instance F T = Bool module B where import A type instance F T = Int -> Int

While it is true that these two type instances are overlapping and rightly rejected, they are not equally at fault: in particular, the instance in module B is an orphan. An orphan instance is an instance for type class/family F and data type T (it just needs to occur anywhere on the left-hand side) which lives in a module that defines neither. (A is not an orphan since the instance lives in the same module as the definition of data type T).

What we might wonder is, “If we disallowed all orphan instances, could this rule out the possibility of overlap?” The answer is, “Yes! (...with some technicalities).” Here are the rules:

  1. The signature must mention all what we will call ragamuffin instances transitively exported by implementations being considered. An instance of a family F is a ragamuffin if it is not defined with the family definition, or with the type constructor at the head in the first parameter. (Or some specific parameter, decided on a per-family basis.) All orphan instances are ragamuffins, but not all ragamuffins are orphans.
  2. A signature exporting a type family must mention all instances which are defined in the same module as the definition of the type family.
  3. It is strictly optional to mention non-ragamuffin instances in a signature.

(Aside: I don't think this is the most flexible version of the rule that is safe, but I do believe it is the most straightforward.) The whole point of these rules is to make it impossible to write an overlapping instance, while only requiring local checking when an instance is being written. Why did we need to strengthen the orphan condition into a ragamuffin condition to get this non-overlap? The answer is that absence of orphans does not imply absence of overlap, as this simple example shows:

module A where data A = A type instance F A y = Int module B where data B = B type instance F x B = Bool -> Bool

Here, the two instances of F are overlapping, but neither are orphans (since their left-hand sides mention a data type which was defined in the module.) However, the B instance is a ragamuffin instance, because B is not mentioned in the first argument of F. (Of course, it doesn't really matter if you check the first argument or the second argument, as long as you're consistent.)

Another way to think about this rule is that open type family instances are not standalone instances but rather metadata that is associated with a type constructor when it is constructed. In this way, non-ragamuffin type family instances are modular!

A major downside of this technique, however, is that it doesn't really do anything for the legitimate uses of orphan instances in the Haskell ecosystem: when third-parties defined both the type family (or type class) and the data type, and you need the instance for your own purposes.

Extension 2: Orphan resolution

This proposal is based off of one that Edward Kmett has been floating around, but which I've refined. The motivation is to give a better story for offering the functionality of orphan instances without gunking up the module system. The gist of the proposal is to allow the package manager to selectively enable/disable orphan definitions; however, to properly explain it, I'd like to do first is describe a few situations involving orphan type class instances. (The examples use type classes rather than type families because the use-cases are more clear. If you imagine that the type classes in question have associated types, then the situation is the same as that for open type families.)

The story begins with a third-party library which defined a data type T but did not provide an instance that you needed:

module Data.Foo where data Foo = Foo module MyApp where import Data.Foo fooString = show Foo -- XXX no instance for Show

If you really need the instance, you might be tempted to just go ahead and define it:

module MyApp where import Data.Foo instance Show Foo where -- orphan show Foo = "Foo" fooString = show Foo

Later, you upgrade Data.Foo to version 1.0.0, which does define a Show instance, and now your overlapping instance error! Uh oh.

How do we get ourselves out of the mess? A clue is how many package authors currently “get out of jail” by using preprocessor macros:

{-# LANGUAGE CPP #-} module MyApp where import Data.Foo #if MIN_VERSION_foo(1,0,0) instance Show Foo where -- orphan show Foo = "Foo" #endif fooString = show Foo

Morally, we'd like to hide the orphan instance when the real instance is available: there are two variations of MyApp which we want to transparently switch between: one which defines the orphan instance, and one which does not and uses the non-orphan instance defined in the Data.Foo. The choice depends on which foo was chosen, a decision made by the package manager.

Let's mix things up a little. There is no reason the instance has to be a non-orphan coming from Data.Foo. Another library might have defined its own orphan instance:

module MyOtherApp where import Data.Foo instance Show Foo where ... -- orphan otherFooString = show Foo module MyApp where import Data.Foo instance Show Foo where ... -- orphan fooString = show Foo module Main where import MyOtherApp import MyApp main = print (fooString ++ otherFooString ++ show Foo)

It's a bit awful to get this to work with preprocessor macros, but there are two ways we can manually resolve the overlap: we can erase the orphan instance from MyOtherApp, or we can erase the orphan instance from MyApp. A priori, there is no reason to prefer one or the other. However, depending on which one is erased, Main may have to be compiled differently (if the code in the instances is different). Furthermore, we need to setup a new (instance-only) import between the module who defines the instance to the module whose instance was erased.

There are a few takeaways from these examples. First, the most natural way of resolving overlapping orphan instances is to simply “delete” the overlapping instances; however, which instance to delete is a global decision. Second, which overlapping orphan instances are enabled affects compilation: you may need to add module dependencies to be able to compile your modules. Thus, we might imagine that a solution allows us to do both of these, without modifying source code.

Here is the game plan: as before, packages can define orphan instances. However, the list of orphan instances a package defines is part of the metadata of the package, and the instance itself may or may not be used when we actually compile the package (or its dependencies). When we do dependency resolution on a set of packages, we have to consider the set of orphan instances being provided and only enable a set which is non-overlapping, the so called orphan resolution. Furthermore, we need to add an extra dependency from packages whose instances were disabled to the package who is the sole definer of an instance (this might constrain which orphan instance we can actually pick as the canonical instance).

The nice thing about this proposal is that it solves an already existing pain point for type class users, namely defining an orphan type class instance without breaking when upstream adds a proper instance. But you might also think of it as a big hack, and it requires cooperation from the package manager (or some other tool which manages the orphan resolution).

The extensions to the basic proposal are not mutually exclusive, but it's an open question whether or not the complexity they incur are worth the benefits they bring to existing uses of orphan instances. And of course, there may other ways of solving the problem which I have not described here, but this smorgasbord seems to be the most plausible at the moment.

At ICFP, I had an interesting conversation with Derek Dreyer, where he mentioned that when open type families were originally going into GHC, he had warned Simon that they were not going to be modular. With the recent addition of closed type families, many of the major use-cases for open type families stated in the original paper have been superseded. However, even if open type families had never been added to Haskell, we still might have needed to adopt these solutions: the global uniqueness of instances is deeply ingrained in the Haskell community, and even if in some cases we are lax about enforcing this constraint, it doesn't mean we should actively encourage people to break it.

I have a parting remark for the ML community, as type classes make their way in from Haskell: when you do get type classes in your language, don’t make the same mistake as the Haskell community and start using them to enforce invariants in APIs. This way leads to the global uniqueness of instances, and the loss of modularity may be too steep a price to pay.

Postscript. One natural thing to wonder, is if overlapping type family instances are OK if one of the instances “is not externally visible.” Of course, the devil is in the details; what do we mean by external visibility of type family instances of F?

For some definitions of visibility, we can find an equivalent, local transformation which has the same effect. For example, if we never use the instance at all, it certainly OK to have overlap. In that case, it would also have been fine to delete the instance altogether. As another example, we could require that there are no (transitive) mentions of the type family F in the signature of the module. However, eliminating the mention of the type family requires knowing enough parameters and equations to reduce: in which case the type family could have been replaced with a local, closed type family.

One definition that definitely does not work is if F can be mentioned with some unspecified type variables. Here is a function which coerces an Int into a function:

module A where type instance F Int = Int f :: Typeable a => a -> F a f x = case eqT of Just Refl -> x :: Int Nothing -> undefined module ASig where f :: Typeable a => a -> F a module B where import ASig type instance F Int = Bool -> Bool g :: Bool g = f 0 True -- oops

...the point being that, even if a signature doesn't directly mention the overlapping instance F Int, type refinement (usually by some GADT-like structure) can mean that an offending instance can be used internally.

Categories: Offsite Blogs

Roman Cheplyaka: Dependent Haskell

Planet Haskell - Thu, 09/04/2014 - 3:00pm

Emulating dependent types (and, more generally, advanced type-level programming) has been a hot topic in the Haskell community recently. Some incredible work has been done in this direction: GADTs, open and closed type families, singletons, etc. The plan is to copy even more features to the type level, like type classes and GADTs, and simplify the promotion of value-level functions.

On the other hand, there’s a good deal of scepticism around this idea. «If you want to program like in Agda, why don’t you program in Agda?»

First, libraries. It took us many years to arrive at the state of hackage that is suitable for industrial usage — and we still don’t have satisfactory answers to many problems. My guess is that it will take at least as long as that for the dependently typed community to arrive at this point — not only because of the smaller community, but also because they will look for even more type-safe solutions, which is naturally a harder problem.

Making your code extremely type-safe is quite hard (or else it would not be worth an ICFP talk). In a real-world application, you’d probably have just a few places where it’s worth the effort. But if you write the whole application in a dependently-typed language, you pay the price for your whole codebase. The price includes, for instance, the absence of type inference.

The compilation problem is not solved either. In particular, there are concerns about type erasure in both Agda and Idris. This is not unsolvable, just hasn’t been done yet.

So maybe you could write some parts in Agda/Idris and the rest in Haskell? Neither Agda nor Idris has a good story for that. For instance, Agda can generate Haskell code, but interfacing with Haskell won’t be very smooth. And the differences in languages mean that it probably won’t ever be effortless.

Don’t get me wrong, I am very enthusiastic about these (and future) languages. However, it doesn’t seem like they will be ready for production usage anytime soon. At the same time, you can satisfy the majority of your dependently typed needs in Haskell right now, no hidden charges. Isn’t that cool?


I am no expert in Agda/Idris. The above is my impression from talking to different people at ICFP and participating in tutorials on these two languages given by their corresponding authors. I’ll gladly accept rebuttals.


There’s a discussion happening on /r/haskell, where people dispute many of my points. I encourage you to read that discussion and make your own conclusions.

In particular, it has been pointed out that erasure is not as big of a problem as I thought initially. Still, it’ll take quite some time (and non-trivial systems built with dependent types) for us to be able to trust Idris/Agda compilers as much as we now trust GHC.

Categories: Offsite Blogs


haskell-cafe - Thu, 09/04/2014 - 2:47pm
Just to let you know that I created an heroku buildpack that install GHC and Haste in the running environment, so that web applications can invoke ghc, cabal, hastec haste-inst etc. I use it for the creation of an Haste IDE. But it may have other uses.
Categories: Offsite Discussion

Extensible Entities

Haskell on Reddit - Thu, 09/04/2014 - 1:54pm
Categories: Incoming News

Versioning modules, not packages

Haskell on Reddit - Thu, 09/04/2014 - 12:53pm

Let's say we have a big package, like lens.

Now, whenever there is a breaking change in any one of its modules, however small it may be, we have to bump the major version number (or else we're being evil). Bumping it means that packages that specify dependencies in the way recommended in the package versioning policy won't work with this new version, even if the change doesn't apply to them.

So my question/suggestion is the following: wouldn't it be cool if not only packages, but also modules were versioned?

Example/idea (backwards compatible, "opt-in"):

name: a-package version: 1.0 library exposed-modules: Package.M1, Package.M2

No new syntax yet, M1 and M2 inherit a-package's version.

Now we release a new major version, with changes that only affect M2.

name: a-package version: 1.1 library exposed-modules: Package.M1-1.0, Package.M2

M1 didn't change, so we specify that its version is still 1.0 (proposed new syntax). We didn't specify a version for M2, which means that it inherits the package's version, so it's at 1.1 now.

If there is a package that was using a-package's M1 like this (another sprinkle of new syntax ahead):

name: another-package version: 1.0 library build-depends: a-package (Package.M1 == 1.0.*)

then another-package will still work with the newly released a-package-1.1, as it doesn't use M2.


  • less dependency hell
  • updates that only change a couple numbers in a .cabal file are needed less often


  • now it's not enough to just look at a package's version when getting dependencies, .cabal files also need to be inspected
  • if there's not one, but multiple numbers, it's easier to make a mistake

Thoughts? I can't be the first person to get this idea... what are the downsides, why do no package managers work this way?

submitted by willIEverGraduate
[link] [14 comments]
Categories: Incoming News

call by need vs call by value

Haskell on Reddit - Thu, 09/04/2014 - 10:40am

given an AST, it is easy to see how using a call by value strategy we can split in chunks the AST and compute in parallel the dependencies. We simply evaluate all the arguments in parallel and past them to the corresponding functions. But how can we split and compute in parallel the dependencies of a AST using a call by need evaluation strategy? Can someone please point me to some references.


submitted by felipeZ
[link] [6 comments]
Categories: Incoming News

Ian Ross: Non-diffusive atmospheric flow #5: pre-processing

Planet Haskell - Thu, 09/04/2014 - 6:54am
Non-diffusive atmospheric flow #5: pre-processing September 4, 2014

Note: there are a couple of earlier articles that I didn’t tag as “haskell” so they didn’t appear in Planet Haskell. They don’t contain any Haskell code, but they cover some background material that’s useful to know (#3 talks about reanalysis data and what Z500 is, and #4 displays some of the characteristics of the data we’re going to be using). If you find terms here that are unfamiliar, they might be explained in one of these earlier articles.

The code for this post is available in a Gist.

Before we can get into the “main analysis”, we need to do some pre-processing of the Z500 data. In particular, we are interested in large-scale spatial structures, so we want to subsample the data spatially. We are also going to look only at the Northern Hemisphere winter, so we need to extract temporal subsets for each winter season. (The reason for this is that winter is the season where we see the most interesting changes between persistent flow regimes. And we look at the Northern Hemisphere because it’s where more people live, so it’s more familiar to more people.) Finally, we want to look at variability about the seasonal cycle, so we are going to calculate “anomalies” around the seasonal cycle.

We’ll do the spatial and temporal subsetting as one pre-processing step and then do the anomaly calculation seperately, just for simplicity.

Spatial and temporal subsetting

The title of the paper we’re trying to follow is “Observed Nondiffusive Dynamics in Large-Scale Atmospheric Flow”, so we need to decide what we mean by “large-scale” and to subset our data accordingly. The Z500 data from the NCEP reanalysis dataset is at 2.5° × 2.5° resolution, which turns out to be a little finer than we need, so we’re going to extract data on a 5° × 5° grid instead. We’ll also extract only the Northern Hemisphere data, since that’s what we’re going to work with.

For the temporal subsetting, we need to take 181 days of data for each year starting on 1 November each year. Since the data starts at the beginning of 1948 and goes on to August 2014 (which is when I’m writing this), we’ll have 66 years of data, from November 1948 until April 2014. As usual when handling dates, there’s some messing around because of leap years, but here it basically just comes down to which day of the year 1 November is in a given year, so it’s not complicated.

The daily NCEP reanalysis geopotential height data comes in one file per year, with all the pressure levels used in the dataset bundled up in each file. That means that the geopotential height variable in each file has coordinates: time, level, latitude, longitude, so we need to slice out the 500 mb level as we do the other subsetting.

All this is starting to sound kind of complicated, and this brings us to a regrettable point about dealing with this kind of data – it’s messy and there’s a lot of boilerplate code to read and manipulate coordinate metadata. This is true pretty much whatever language you use for processing these multi-dimensional datasets and it’s kind of unavoidable. The trick is to try to restrict this inconvenient stuff to the pre-processing phase by using a consistent organisation of your data for the later analyses. We’re going to do that here by storing all of our Z500 anomaly data in a single NetCDF file, with 181-day long winter seasons back-to-back for each year. This will make time and date processing trivial.

The code for the data subsetting is in the subset.hs program in the Gist. We’ll deal with it in a few bites.

Skipping the imports that we need, as well as a few “helper” type synonym definitions, the first thing that we need to do it open one of the input NCEP NetCDF files to extract the coordinate metadata information. This listing shows how we do this:

Right refnc <- openFile $ indir </> "" let Just nlat = ncDimLength <$> ncDim refnc "lat" Just nlon = ncDimLength <$> ncDim refnc "lon" Just nlev = ncDimLength <$> ncDim refnc "level" let (Just lonvar) = ncVar refnc "lon" (Just latvar) = ncVar refnc "lat" (Just levvar) = ncVar refnc "level" (Just timevar) = ncVar refnc "time" (Just zvar) = ncVar refnc "hgt" Right lon <- get refnc lonvar :: SVRet CFloat Right lat <- get refnc latvar :: SVRet CFloat Right lev <- get refnc levvar :: SVRet CFloat

We open the first of the NetCDF files (I’ve called the directory where I’ve stored these things indir) and use the hnetcdf ncDim and ncVar functions to get the dimension and variable metadata for the latitude, longitude, level and time dimensions; we then read the complete contents of the “coordinate variables” (for level, latitude and longitude) as Haskell values (here, SVRet is a type synonym for a storable vector wrapped up in the way that’s returned from the hnetcdf get functions).

Once we have the coordinate variable values, we need to find the index ranges to use for subsetting. For the spatial subsetting, we find the start and end ranges for the latitudes that we want (17.5°N-87.5°N) and for the level, we find the index of the 500 mb level:

let late = vectorIndex LT FromEnd lat 17.5 lats = vectorIndex GT FromStart lat 87.5 levi = vectorIndex GT FromStart lev 500.0

using a helper function to find the correspondence between coordinate values and indexes:

data IndexStart = FromStart | FromEnd vectorIndex :: (SV.Storable a, Ord a) => Ordering -> IndexStart -> SV.Vector a -> a -> Int vectorIndex o s v val = case (go o, s) of (Nothing, _) -> (-1) (Just i, FromStart) -> i (Just i, FromEnd) -> SV.length v - 1 - i where go LT = SV.findIndex (>= val) vord go GT = SV.findIndex (<= val) vord vord = case s of FromStart -> v FromEnd -> SV.reverse v

For the temporal subsetting, we just work out what day of the year 1 November is for leap and non-leap years – since November and December together are 61 days, for each winter season we need those months plus the first 120 days of the following year:

let inov1non = 305 - 1 -- ^ Index of 1 November in non-leap years. wintertsnon = [0..119] ++ [inov1non..365-1] -- ^ Indexes of all winter days for non-leap years. inov1leap = 305 + 1 - 1 -- ^ Index of 1 November in leap years. wintertsleap = [0..119] ++ [inov1leap..366-1] -- ^ Indexes of all winter days for leap years. winterts1948 = [inov1leap..366-1] winterts2014 = [0..119] -- ^ Indexes for winters in start and end years.

This is kind of hokey, and in some cases you do need to do more sophisticated date processing, but this does the job here.

Once we have all this stuff set up, the critical part of the subsetting is easy – for each input data file, we figure out what range of days we need to read, then use a single call the getS from hnetcdf (“get slice”):

forM_ winterts $ \it -> do -- Read a slice of a single time-step of data: Northern -- Hemisphere (17.5-87.5 degrees), 5 degree resolution, 500 -- mb level only. let start = [it, levi, lats, 0] count = [1, 1, (late - lats) `div` 2 + 1, nlon `div` 2] stride = [1, 1, 2, 2] Right slice <- getS nc zvar start count stride :: RepaRet2 CShort

Here, we have a set of start indexes, a set of counts and a set of strides, one for each dimension in our input variable. Since the input geopotential height files have dimensions of time, level, latitude and longitude, we have four entries in each of our start, count and stride lists. The start values are the current day from the list of days we need (called it in the code), the level we’re interested in (levi), the start latitude index (lats) and zero, since we’re going to get the whole range of longitude. The count list gets a single time step, a single level, and a number of latitude and longitude values based on taking every other entry in each direction (since we’re subsetting from a spatial resolution of 2.5° × 2.5° to a resolution of 5° × 5°). Finally, for the stride list, we use a stride of one for the time and level directions (which doesn’t really matter anyway, since we’re only reading a single entry in each of those directions) and a stride of two in the latitude and longitude directions (which gives us the “every other one” subsetting in those directions).

All of the other code in the subsetting program is involved in setting up the output file and writing the Z500 slices out. Setting up the output file is slightly tedious (this is very common when dealing with NetCDF files – there’s always lots of metadata to be managed), but it’s made a little simpler by copying attributes from one of the input files, something that doing this in Haskell makes quite a bit easier than in Fortran or C. The next listing shows how the NcInfo for the output file is created, which can then be passed to the hnetcdf withCreateFile function to actually create the output file:

let outlat = SV.fromList $ map (lat SV.!) [lats,lats+2..late] outlon = SV.fromList $ map (lon SV.!) [0,2..nlon-1] noutlat = SV.length outlat noutlon = SV.length outlon outlatdim = NcDim "lat" noutlat False outlatvar = NcVar "lat" NcFloat [outlatdim] (ncVarAttrs latvar) outlondim = NcDim "lon" noutlon False outlonvar = NcVar "lon" NcFloat [outlondim] (ncVarAttrs lonvar) outtimedim = NcDim "time" 0 True outtimeattrs = foldl' (flip M.delete) (ncVarAttrs timevar) ["actual_range"] outtimevar = NcVar "time" NcDouble [outtimedim] outtimeattrs outz500attrs = foldl' (flip M.delete) (ncVarAttrs zvar) ["actual_range", "level_desc", "valid_range"] outz500var = NcVar "z500" NcShort [outtimedim, outlatdim, outlondim] outz500attrs outncinfo = emptyNcInfo (outdir </> "") # addNcDim outlatdim # addNcDim outlondim # addNcDim outtimedim # addNcVar outlatvar # addNcVar outlonvar # addNcVar outtimevar # addNcVar outz500var

Although we can mostly just copy the coordinate variable attributes from one of the input files, we do need to do a little bit of editing of the attributes to remove some things that aren’t appropriate for the output file. Some of these things are just conventions, but there are some tools that may look at these attributes (actual_range, for example) and may complain if the data doesn’t match the attribute. It’s easier just to remove the suspect ones here.

This isn’t pretty Haskell by any means, and the hnetcdf library could definitely do with having some higher-level capabilities to help with this kind of file processing. I may add some based on the experimentation I’m doing here – I’m developing hnetcdf in parallel with writing this!

Anyway, the result of running the subsetting program is a single NetCDF file containing 11946 days (66 × 181) of Z500 data at a spatial resolution of 5° × 5°. We can then pass this on to the next step of our processing.

Seasonal cycle removal

In almost all investigations in climate science, the annual seasonal cycle stands out as the largest form of variability (not always true in the tropics, but in the mid-latitudes and polar regions that we’re looking at here, it’s more or less always true). The problem, of course, is that the seasonal cycle just isn’t all that interesting. We learn about the difference between summer and winter when we’re children, and although there is much more to say about seasonal variations and how they interact with other phenomena in the climate system, much of the time they just get in the way of seeing what’s going on with those other phenomena.

So what do we do? We “get rid” of the seasonal cycle by looking at what climate scientists call “anomalies”, which are basically just differences between the values of whatever variable we’re looking at and values from a “typical” year. For example, if we’re interested in daily temperatures in Innsbruck over the course of the twentieth century, we construct a “typical” year of daily temperatures, then for each day of our 20th century time series, we subtract the “typical” temperature value for that day of the year from the measured temperature value to give a daily anomaly. Then we do whatever analysis we want based on those anomalies, perhaps looking for inter-annual variability on longer time scales, or whatever.

This approach is very common, and it’s what we’re going to do for our Northern Hemisphere winter-time Z500 data here. To do this, we need to do two things: we need to calculate a “typical” annual cycle, and we need to subtract that typical annual cycle from each year of our data.

OK, so what’s a “typical” annual cycle? First, let’s say a word about what we mean by “annual cycle” in this case. We’re going to treat each spatial point in our data independently, trusting to the natural spatial correlation in the geopotential height to smooth out any shorter-term spatial inhomogeneities in the typical patterns. We then do some sort of averaging in time to generate a “typical” annual cycle at each spatial point. It’s quite common to use the mean annual cycle over a fixed period for this purpose (a period of 30 years is common: 1960-1990, say). In our case, we’re going to use the mean annual cycle across all 66 years of data that we have. Here’s how we calculate this mean annual cycle (this is from the seasonal-cycle.hs program in the Gist):

-- Each year has 181 days, so 72 x 15 x 181 = 195480 data values. let ndays = 181 nyears = ntime `div` ndays -- Use an Int array to accumulate values for calculating mean annual -- cycle. Since the data is stored as short integer values, this is -- enough to prevent overflow as we accumulate the 2014 - 1948 = 66 -- years of data. let sh = Repa.Z Repa.:. ndays Repa.:. nlat Repa.:. nlon slicecount = [ndays, nlat, nlon] zs = take (product slicecount) (repeat 0) init = Repa.fromList sh zs :: FArray3 Int -- Computation to accumulate a single year's data. let doone current y = do -- Read one year's data. let start = [(y - 1948) * ndays, 0, 0] Right slice <- getA innc z500var start slicecount :: RepaRet3 CShort return $! Repa.computeS $ current Repa.+^ ( fromIntegral slice) -- Accumulate all data. total <- foldM doone init [1948..2013] -- Calculate the final mean annual cycle. let mean = Repa.computeS $ (fromIntegral . (`div` nyears)) total :: FArray3 CShort

Since each year of data has 181 days, a full year’s data has 72 × 15 × 181 data values (72 longitude points, 15 latitude points, 181 days), i.e. 195,480 values. In the case here, since we have 66 years of data, there are 12,901,680 data values altogether. That’s a small enough number that we could probably slurp the whole data set into memory in one go for processing. However, we’re not going to do that, because there are plenty of cases in climate data analysis where the data sets are significantly larger than this, and you need to do “off-line” processing, i.e. to read data from disk piecemeal for processing.

We do a monadic fold (using the standard foldM function) over the list of years of data, and for each year we read a single three-dimensional slice of data representing the whole year and add it to the current accumulated sum of all the data. (This is what the doone function does: the only slight wrinkle here is that we need to deal with conversion from the short integer values stored in the data file to the Haskell Int values that we use in the accumulator. Otherwise, it’s just a question of applying Repa’s element-wise addition operator to the accumulator array and the year’s data.) Once we’ve accumulated the total values across all years, we divide by the number of years and convert back to short integer values, giving a short integer valued array containing the mean annual cycle – a three-dimensional array with one entry for each day in our 181-day winter season and for each latitude and longitude in the grid we’re using.

Once we have the mean annual cycle with which we want to calculate anomalies, determining the anomalies is simply a matter of subtracting the mean annual cycle from each year’s data, matching up longitude, latitude and day of year for each data point. The next listing shows the main loop that does this, reading a single day of data at a time, then subtracting the appropriate slice of the mean annual cycle to produce anomaly values (Repa’s slice function is handy here) and writing these to an output NetCDF file:

let count = [1, nlat, nlon] forM_ [0..ntime-1] $ \it -> do -- Read time slice. Right slice <- getA innc z500var [it, 0, 0] count :: RepaRet2 CShort -- Calculate anomalies and write out. let sl = Repa.Z Repa.:. (it `mod` 181) Repa.:. Repa.All Repa.:. Repa.All anom = Repa.computeS $ slice Repa.-^ (Repa.slice mean sl) :: FArray2 CShort putA outnc outz500var [it, 0, 0] count anom

The only thing we have to be a little bit careful about when we create the final anomaly output file is that we need to remove some of the attributes from the Z500 variable: because we’re now working with differences between actual values and our “typical” annual cycle, we no longer need the add_offset and scale_factor attributes that are used to convert from the stored short integer values to floating point geopotential height values. Instead, the values that we store in the anomaly file are the actual geopotential height anomaly values in metres.

After doing all this pre-processing, what we end up with is a single NetCDF file containing 66 winter seasons of daily Z500 anomalies for the region we’re interested in. The kind of rather boring data processing we’ve had to do to get to this point is pretty typical for climate data processing – you almost always need to do some sort of subsetting of your data, you often want to remove signals that aren’t interesting (like the annual cycle here, although things can get much more complicated than that). This kind of thing is unavoidable, and the best that you can really do is to try to organise things so that you do the pre-processing once and end up with data in a format that’s then easy to deal with for further processing. That’s definitely the case here, where we have fixed-length time series (181 days per winter) for each year, so we don’t need to do any messing around with dates.

In other applications, pre-processing can be a much bigger job. For functional brain imaging applications, for example, as well as extracting a three-dimensional image from the output of whatever (MRI or CT) scanner you’re using, you often need to do something about the low signal-to-noise ratios that you get, you need to compensate for subject motion in the scanner during the measurements, you need to compensate for time-varying physiological nuisance signals (breathing and heart beat), you need to spatially warp images to match an atlas image to enable inter-subject comparison, and so on. And all that is before you get to doing whatever statistical analysis you’re really interested in.

We will look at some of these “warts and all” pre-processing cases for other projects later on, but for now the small amount of pre-processing we’ve had to do here is enough. Now we can start with the “main analysis”.

Before we do that though, let’s take a quick look at what these anomaly values look like. The two figures below show anomaly plots for the same time periods for which the original Z500 data is shown in the plots in this earlier article. The “normal” anomaly plots are a bit more variable than the original Z500 plots, but the persistent pattern over the North Atlantic during the blocking episode in the second set of plots is quite clear. This gives us some hope that we’ll be able to pick out these persistent patterns in the data relatively easily.

First, the “normal” anomalies:

then the “blocking” anomalies:

data-analysis haskell <script src="" type="text/javascript"></script> <script type="text/javascript"> (function () { var articleId = fyre.conv.load.makeArticleId(null); fyre.conv.load({}, [{ el: 'livefyre-comments', network: "", siteId: "290329", articleId: articleId, signed: false, collectionMeta: { articleId: articleId, url: fyre.conv.load.makeCollectionUrl(), } }], function() {}); }()); </script>
Categories: Offsite Blogs

Haskell Weekly News: Issue 304

haskell-cafe - Thu, 09/04/2014 - 6:41am
Welcome to issue 304 of the HWN, an issue covering crowd-sourced bits of information about Haskell from around the web. This issue covers from August 24 to 30, 2014 Quotes of the Week * pjdelport: Haskell: Lazy evaluation, eager debugging. * trap_exit: isn't hoogle awesome? the first time I used it, I was like "for the first time in my life, google sucks" Top Reddit Stories * Hython - a Haskell-powered Python 3 interpreter Domain:, Score: 95, Comments: 22 Original: [1] On Reddit: [2] * Using Emacs for Haskell development Domain:, Score: 89, Comments: 47 Original: [3] On Reddit: [4] * Ivory Language Domain:, Score: 70, Comments: 36 Original: [5] On Reddit: [6] * A taste of Cabalized Backpack : Inside 206-105 Domain:, Score: 61, Comments: 73 Orig
Categories: Offsite Discussion

Suggestions on how to approach writing a specific type of server.

Haskell on Reddit - Thu, 09/04/2014 - 3:43am

We have a requirement at work for a task server. And we are thinking about writing it in haskell.

The server will do the following,

  • Wait for a connection
  • Get a json as payload.
  • The json payload will contain 3 fields. URL, TIME, CONTENT.
  • The job of the task server is to "POST" the "CONTENT" to the "URL" at the given "TIME"
  • This is the only job of this server. Nothing else. It will not be serving any html pages for now.

These are the constraints.

  • It should be scalable. When adding more servers, the load must be balance.
  • If the "POST" does not work the first time, it should retry a specified number of times before failing
  • Should be robust and reliable. A power failure should not corrupt the data.
  • It should be fast

So far, I'm leaning on using warp. But not sure if thats a good choice. The reason for this post is to get suggestions and opinions from more knowledgeable folks in the haskell community as to what libraries I can use and how I may proceed in writing this server. I wouldn't bother much if it was a personal project. Since this will be in production, I want to make sure I start on the right foot.


submitted by desijays
[link] [14 comments]
Categories: Incoming News

Proposal: Use uninterruptibleMask for cleanup actions in Control.Exception

libraries list - Wed, 09/03/2014 - 9:56pm
I'd like to propose a change in the behavior of Control.Exception to help guarantee cleanups are not accidentally lost. For example, bracket is defined as: bracket before after thing = mask $ \restore -> do a <- before r <- restore (thing a) `onException` after a _ <- after a return r This definition has a serious problem: "after a" (in either the exception handling case, or the ordinary case) can include interruptible actions which abort the cleanup. This means bracket does not in fact guarantee the cleanup occurs. For example: readMVar = bracket takeMVar putMVar -- If async exception occurs during putMVar, MVar is broken! withFile .. = bracket (openFile ..) hClose -- Async exception during hClose leaks the file handle! Interruptible actions during "before" are fine (as long as "before" handles them properly). Interruptible actions during "after" are virtually always a bug -- at best leaking a resource, and at worst breaking the program's invariants. I propose changing all the cleanup ha
Categories: Offsite Discussion

References on how to compile lazy functional languages

Haskell on Reddit - Wed, 09/03/2014 - 4:56pm

Hello to all!

I'm interested in implement a llvm compiler for a lazy version of simply typed lambda calculus, as an exercise. Which articles / books should I read in order to start this task? I know that there is the SPJ book "Implementation of Functional Programming Languages". Is this a good start or there is some more recent reference ?


submitted by rodrigogribeiro
[link] [3 comments]
Categories: Incoming News

Package bounds approximation

haskell-cafe - Wed, 09/03/2014 - 9:48am
Dear Cafe, I was wondering if there exists a tool to approximate what would be the minimal versions of the dependencies required by a package. I know that calculating this exactly is a bit intractable, but I was considering whether a tool that works as follows would work: 1) Compile the package with known to work dependencies 2) For each version of dependency as d: For each function in d used by My Package: * Perform unification of the resulting type when compiled with known to work dependencies and the type exported by the api of the version of the dependency in consideration. If all unifications succeeds, mark the version as compatible, incompatible otherwise This approximation is obviously not complete. Nevertheless, I would like to get opinions about whether this would be a good/useful/feasible approximation? Does the current GHC api export enough functionality for this package to be feasible? Are there alternatives? I was consider doing this as my `hack`
Categories: Offsite Discussion

ICFP "The Ghost of Church" -- collaborate?

Haskell on Reddit - Wed, 09/03/2014 - 9:00am

I am at ICFP right now (it's great!). There seems to be some kind of mystery going on called "The Ghost of Church". It involves some QR-codes that are put up at the conference venue here and there.

I have worked on it a bit, but I am utterly stuck! Anyone else here has taken a look at it? Has anyone solved it yet?

Shall we collaborate and win this thing with "team /r/haskell"? We could use this thread to share information.

submitted by icfpfan
[link] [3 comments]
Categories: Incoming News