News aggregator

vincenthz/hs-tls · GitHub

del.icio.us/haskell - Wed, 04/09/2014 - 2:31pm
Categories: Offsite Blogs

GHC 7.8.1 released | Hacker News

del.icio.us/haskell - Wed, 04/09/2014 - 11:56am
Categories: Offsite Blogs

Could not deduce (Eq a)

Haskell on Reddit - Wed, 04/09/2014 - 11:20am

Complete noob here. Just started learning Haskell so PLEASE bear with me.

While learning how exactly pattern matching works, I attempted to create a (useless) function that returns True if the first element of the list is the number 5, and False otherwise.

--startsWith5 :: Num a => [a] -> Bool
startsWith5 (5:xs) = True
startsWith5 xs     = False

This is simply part of my initial experimentation with Haskell and trying to understand how patterns are matched.

Now... The above works unless I uncomment the type signature

startsWith5 :: Num a => [a] -> Bool

in which case I get:

Could not deduce (Eq a) arising from the literal `5'
from the context (Num a)
  bound by the type signature for
             startsWith5 :: Num a => [a] -> Bool
  at /home/app/Main.hs:(26,1)-(27,22)
Possible fix:
  add (Eq a) to the context of
    the type signature for startsWith5 :: Num a => [a] -> Bool
In the pattern: 5
In the pattern: 5 : xs
In an equation for `startsWith5': startsWith5 (5 : xs) = True

I do not understand why it does not work. I am requiring "a" to be a Num, and the function will return a Bool. It seems to me that both patterns (5:xs) and (x:xs) operate on numbers and return a Bool.
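For what it's worth, the error arises because matching on the numeric literal 5 is compiled into an equality test, which needs Eq on top of Num. A minimal sketch of the fix GHC itself suggests (one option among several):

startsWith5 :: (Eq a, Num a) => [a] -> Bool
startsWith5 (5:_) = True   -- the literal pattern 5 needs Eq a for the comparison
startsWith5 _     = False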

Any insight is appreciated.

Thanks!

submitted by djogon
[link] [22 comments]
Categories: Incoming News

GHC 7.8.1 Released

Haskell on Reddit - Wed, 04/09/2014 - 8:11am
Categories: Incoming News

Dominic Steinitz: Gibbs Sampling in R, Haskell, Jags and Stan

Planet Haskell - Wed, 04/09/2014 - 5:43am
Introduction

It's possible to do Gibbs sampling in most languages, and since I am doing some work in R and some work in Haskell, I thought I'd present a simple example in both languages: estimating the mean from a normal distribution with unknown mean and variance. Although one can do Gibbs sampling directly in R, it is more common to use a specialised language such as JAGS or STAN for the actual sampling and to do the pre-processing and post-processing in R. This blog post presents implementations in native R, JAGS and STAN as well as Haskell.

Preamble

> {-# OPTIONS_GHC -Wall                     #-}
> {-# OPTIONS_GHC -fno-warn-name-shadowing  #-}
> {-# OPTIONS_GHC -fno-warn-type-defaults   #-}
> {-# OPTIONS_GHC -fno-warn-unused-do-bind  #-}
> {-# OPTIONS_GHC -fno-warn-missing-methods #-}
> {-# OPTIONS_GHC -fno-warn-orphans         #-}
> {-# LANGUAGE NoMonomorphismRestriction    #-}
>
> module Gibbs (
>     main
>   , m
>   , Moments(..)
>   ) where
>
> import qualified Data.Vector.Unboxed as V
> import qualified Control.Monad.Loops as ML
> import Data.Random.Source.PureMT
> import Data.Random
> import Control.Monad.State
> import Data.Histogram ( asList )
> import Data.Histogram.Fill
> import Data.Histogram.Generic ( Histogram )
> import Data.List
> import qualified Control.Foldl as L
>
> import Diagrams.Backend.Cairo.CmdLine
>
> import LinRegAux
>
> import Diagrams.Backend.CmdLine
> import Diagrams.Prelude hiding ( sample, render )

The length of our chain and the burn-in.

> nrep, nb :: Int
> nb   = 5000
> nrep = 105000

Data generated from a normal distribution.

> xs :: [Double]
> xs = [
>     11.0765808082301
>   , 10.918739177542
>   , 15.4302462747137
>   , 10.1435649220266
>   , 15.2112705014697
>   , 10.441327659703
>   , 2.95784054883142
>   , 10.2761068139607
>   , 9.64347295100318
>   , 11.8043359297675
>   , 10.9419989262713
>   , 7.21905367667346
>   , 10.4339807638017
>   , 6.79485294803006
>   , 11.817248658832
>   , 6.6126710570584
>   , 12.6640920214508
>   , 8.36604701073303
>   , 12.6048485320333
>   , 8.43143879537592
>   ]

A Bit of Theory

Gibbs Sampling

For a multi-parameter situation, Gibbs sampling is a special case of Metropolis-Hastings in which the proposal distributions are the posterior conditional distributions.

Referring back to the explanation of the Metropolis algorithm, describe the state by its parameters θ = (θ_1, ..., θ_n) and take as the proposal for the k-th component its full conditional posterior π(θ_k | θ_{-k}, x), where θ_{-k} denotes the components other than the k-th. Using the rules of conditional probability, and the fact that the proposal distribution and the conditional posterior coincide, the Metropolis-Hastings acceptance ratio simplifies as follows.
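Spelling the standard argument out (the proposal changes only the k-th component, so the factor π(θ_{-k} | x) is shared by the current and proposed states):

$$ \alpha \;=\; \min\!\left(1,\; \frac{\pi(\theta_k' \mid \theta_{-k}, x)\,\pi(\theta_{-k} \mid x)\;\pi(\theta_k \mid \theta_{-k}, x)}{\pi(\theta_k \mid \theta_{-k}, x)\,\pi(\theta_{-k} \mid x)\;\pi(\theta_k' \mid \theta_{-k}, x)}\right) \;=\; 1 $$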

Thus we always accept the proposed jump. Note that the chain is not in general reversible as the order in which the updates are done matters.

Normal Distribution with Unknown Mean and Variance

It is fairly standard to use the improper prior p(μ, τ) ∝ 1/τ, flat in μ and in log τ.

The likelihood is a product of normal densities; re-writing it in terms of the precision τ = 1/σ², it is proportional to τ^(n/2) exp(-(τ/2) Σ (x_i - μ)²).

Thus the posterior is proportional to the prior multiplied by this likelihood.

We can re-write the sum in terms of the sample mean and variance using Σ (x_i - μ)² = n(μ - x̄)² + (n - 1)s².

Thus the conditional posterior for μ given τ is proportional to exp(-(nτ/2)(μ - x̄)²),

which we recognise as a normal distribution with a mean of x̄ and a variance of 1/(nτ).

The conditional posterior for τ given μ is proportional to τ^(n/2 - 1) exp(-(τ/2) Σ (x_i - μ)²),

which we recognise as a gamma distribution with a shape of n/2 and a rate of (1/2) Σ (x_i - μ)².
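Putting the pieces together, here is a sketch of the joint posterior and the two full conditionals, consistent with the improper prior above and with the sampler code below:

$$ \pi(\mu, \tau \mid x) \;\propto\; \tau^{n/2 - 1} \exp\!\Big(-\frac{\tau}{2}\sum_{i=1}^n (x_i - \mu)^2\Big) $$

$$ \mu \mid \tau, x \;\sim\; \mathcal{N}\big(\bar{x},\, (n\tau)^{-1}\big), \qquad \tau \mid \mu, x \;\sim\; \mathrm{Gamma}\Big(\tfrac{n}{2},\; \mathrm{rate} = \tfrac{1}{2}\sum_{i=1}^n (x_i - \mu)^2\Big) $$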

In this particular case, we can calculate the marginal posterior of μ analytically. Integrating τ out of the joint posterior above, the marginal posterior of μ is a non-standardized Student's t-distribution centred at x̄ with scale s/√n and n - 1 degrees of freedom.

Equivalently, (μ - x̄)/(s/√n) follows t_{n-1}, the standard t distribution with n - 1 degrees of freedom.

The Model in Haskell

Following up on a comment on a previous blog post, let us try using the foldl package to calculate the length, the sum, and the sum of squares while traversing the list only once. This is an improvement on creating your own strict record and using foldl', although it may not be suitable for every method, e.g. calculating the skewness and kurtosis incrementally; see below.

> x2Sum, xSum, n :: Double
> (x2Sum, xSum, n) = L.fold stats xs
>   where
>     stats = (,,) <$>
>             (L.premap (\x -> x * x) L.sum) <*>
>             L.sum <*>
>             L.genericLength

And re-writing the sample variance using the following standard identity (consistent with the code below),
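$$ s^2 \;=\; \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2 \;=\; \frac{n}{n-1}\Big(\frac{1}{n}\sum_{i=1}^n x_i^2 \;-\; \bar{x}^2\Big) $$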

we can then calculate the sample mean and variance using the sums we have just calculated.

> xBar, varX :: Double
> xBar = xSum / n
> varX = n * (m2Xs - xBar * xBar) / (n - 1)
>   where m2Xs = x2Sum / n

In random-fu, the Gamma distribution is specified by the rate parameter, β.

> beta, initTau :: Double
> beta = 0.5 * n * varX
> initTau = evalState (sample (Gamma (n / 2) beta)) (pureMT 1)

Our sampler takes an old value of τ and creates new values of μ and τ.

> gibbsSampler :: MonadRandom m => Double -> m (Maybe ((Double, Double), Double))
> gibbsSampler oldTau = do
>   newMu <- sample (Normal xBar (recip (sqrt (n * oldTau))))
>   let shape = 0.5 * n
>       scale = 0.5 * (x2Sum + n * newMu^2 - 2 * n * newMu * xBar)
>   newTau <- sample (Gamma shape (recip scale))
>   return $ Just ((newMu, newTau), newTau)

From which we can create an infinite stream of samples.

> gibbsSamples :: [(Double, Double)]
> gibbsSamples = evalState (ML.unfoldrM gibbsSampler initTau) (pureMT 1)

As our chains might be very long, we calculate the mean, variance, skewness and kurtosis using an incremental method.

> data Moments = Moments { mN :: !Double
>                        , m1 :: !Double
>                        , m2 :: !Double
>                        , m3 :: !Double
>                        , m4 :: !Double
>                        }
>   deriving Show
>
> moments :: [Double] -> Moments
> moments xs = foldl' f (Moments 0.0 0.0 0.0 0.0 0.0) xs
>   where
>     f :: Moments -> Double -> Moments
>     f m x = Moments n' m1' m2' m3' m4'
>       where
>         n = mN m
>         n' = n + 1
>         delta = x - (m1 m)
>         delta_n = delta / n'
>         delta_n2 = delta_n * delta_n
>         term1 = delta * delta_n * n
>         m1' = m1 m + delta_n
>         m4' = m4 m +
>               term1 * delta_n2 * (n' * n' - 3 * n' + 3) +
>               6 * delta_n2 * m2 m - 4 * delta_n * m3 m
>         m3' = m3 m + term1 * delta_n * (n' - 2) - 3 * delta_n * m2 m
>         m2' = m2 m + term1

In order to examine the posterior, we create a histogram.

> numBins :: Int
> numBins = 400
>
> hb :: HBuilder Double (Data.Histogram.Generic.Histogram V.Vector BinD Double)
> hb = forceDouble -<< mkSimple (binD lower numBins upper)
>   where
>     lower = xBar - 2.0 * sqrt varX
>     upper = xBar + 2.0 * sqrt varX

And fill it with the specified number of samples preceded by a burn-in.

> hist :: Histogram V.Vector BinD Double
> hist = fillBuilder hb (take (nrep - nb) $ drop nb $ map fst gibbsSamples)

Now we can plot this.

And calculate the skewness and kurtosis.
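In terms of the count n (mN) and the accumulated central-moment sums M_k (the m2, m3, m4 fields of Moments), the two ghci expressions below compute the usual sample skewness and kurtosis:

$$ g_1 \;=\; \frac{\sqrt{n}\, M_3}{M_2^{3/2}}, \qquad g_2 \;=\; \frac{n\, M_4}{M_2^{2}} $$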

> m :: Moments
> m = moments (take (nrep - nb) $ drop nb $ map fst gibbsSamples)

ghci> import Gibbs
ghci> putStrLn $ show $ (sqrt (mN m)) * (m3 m) / (m2 m)**1.5
  8.733959917065126e-4

ghci> putStrLn $ show $ (mN m) * (m4 m) / (m2 m)**2
  3.451374739494607

We expect a skewness of 0 and a kurtosis of 3 + 6/(ν - 4) ≈ 3.4 for a Student's t with ν = 19 degrees of freedom. Not too bad.

The Model in JAGS

JAGS is a mature, declarative, domain specific language for building Bayesian statistical models using Gibbs sampling.

Here is our model as expressed in JAGS. Somewhat terse.

model {
  for (i in 1:N) {
    x[i] ~ dnorm(mu, tau)
  }
  mu ~ dnorm(0, 1.0E-6)
  tau <- pow(sigma, -2)
  sigma ~ dunif(0, 1000)
}

To run it and examine its results, we wrap it up in some R.

## Import the library that allows R to inter-work with jags.
library(rjags)

## Read the simulated data into a data frame.
fn <- read.table("example1.data", header=FALSE)

jags <- jags.model('example1.bug',
                   data = list('x' = fn[,1], 'N' = 20),
                   n.chains = 4,
                   n.adapt = 100)

## Burnin for 10000 samples
update(jags, 10000);

mcmc_samples <- coda.samples(jags, variable.names=c("mu", "sigma"), n.iter=20000)

png(file="diagrams/jags.png", width=400, height=350)
plot(mcmc_samples)
dev.off()

And now we can look at the posteriors for mu and sigma.

The Model in STAN

STAN is a domain specific language for building Bayesian statistical models, similar to JAGS but newer; it allows variables to be re-assigned and so cannot really be described as declarative.

Here is our model as expressed in STAN. Again, somewhat terse.

data {
  int<lower=0> N;
  real x[N];
}

parameters {
  real mu;
  real<lower=0,upper=1000> sigma;
}

model {
  x ~ normal(mu, sigma);
  mu ~ normal(0, 1000);
}

Just as with JAGS, to run it and examine its results, we wrap it up in some R.

library(rstan)

## Read the simulated data into a data frame.
fn <- read.table("example1.data", header=FALSE)

## Running the model
fit1 <- stan(file = 'Stan.stan',
             data = list('x' = fn[,1], 'N' = 20),
             pars = c("mu", "sigma"),
             chains = 3,
             iter = 30000,
             warmup = 10000)

png(file="diagrams/stan.png", width=400, height=350)
plot(fit1)
dev.off()

Again we can look at the posterior although we only seem to get medians and 80% intervals.

PostAmble

Write the histogram produced by the Haskell code to a file.

> displayHeader :: FilePath -> Diagram B R2 -> IO ()
> displayHeader fn =
>   mainRender ( DiagramOpts (Just 900) (Just 700) fn
>              , DiagramLoopOpts False Nothing 0
>              )
>
> main :: IO ()
> main = do
>   displayHeader "diagrams/DataScienceHaskPost.png"
>                 (barDiag
>                  (zip (map fst $ asList hist) (map snd $ asList hist)))

The code can be downloaded from github.


Categories: Offsite Blogs

Douglas M. Auclair (geophf): 'A' is for Aleph-null

Planet Haskell - Tue, 04/08/2014 - 10:50pm

'A' is for ℵ0
Read: "'A' is for Aleph-null"
ℵ0 is a symbol for infinity. The first one, that is, as there are many infinities in mathematics, and many kinds of infinities, depending on which number system you choose to use (there are ... more than several kinds of numbers, but that may be the entry for 'N,' as in: "'N' is for 'Number.' But I digress, ...
... as usual).
My introductory post to my A-to-Z blog-writing challenge, or my Α-to-Ω blog-writing challenge, but that may be all Greek to you.
An appropriate entry because mathematics, itself, is as big as you want to make of it, or as small as you want to focus in on it. For example, infinity, one of them, ℵ0, is countably infinite. You take the first number (which happens to be 0 (zero)), and add one to it, and you get:
0, 1, 2, 3, ...
... and you keep going until you (don't) reach ℵ0.
The thing is, you can count this infinity. You just did.
But there are other infinities, including one you can't even count, because a 'bigger' infinity (it actually is bigger) is C, the continuum, and there are numbers, such as:
π, τ, e, ...
Numbers that can't be written down exactly as a finite or repeating string of digits, nor captured as the root of any polynomial with rational coefficients. They are the transcendental numbers, and they are 'irrational.'
There is no way to rationalize the number π, for example; you just have to live with it, with all its quirkiness and all its irrationality.
So, how do you go along the numberline that includes irrational numbers? You can't. Why? Well, what's the 'next number' after π? There isn't one. It's not that we don't know what's the next number after π, it's just that there is no such thing, there is no 'very next number adjacent to π.'
Unless you invent a number system that counts along the continuum.
Good luck with that.
But, on the other hand, more than two thousand years after Euclid, with his celebrated (or infamous) fifth postulate, said something along the lines of 'if two lines are extended into infinity on a plane and they never cross, then they kinda hafta be parallel,' people were still nodding their heads to that.
(The Ancient Greeks said 'kinda hafta' when they were dead serious about stuff like that.)
Then along came conic sections and, lo and behold, on those surfaces it is possible, easily so, for two non-parallel lines extended into infinity never to cross.
And the conics opened up whole new vistas of mathematics for us to explore.
So, one could say mathematics is a confining view; limiting, but the constraints are artificial: they are there because we put them there, and we put them there for some set of reasons, even if we don't know it or even if we forgot those reasons. If a particular set of mathematics doesn't do what we want it to do, or does it in an unforgivably cumbersome way, then, ... simply invent a new set of mathematics, see that it does do what we need it to do, and that, importantly, it doesn't do what we don't want it to do, and then we're good, right?
It's just that ... sometimes mathematics, being not particularly tiny — like our tiny, little, rigid brains — does things we don't expect and never look for, and so we get into trouble if we aren't rigorous. That happened to people who thought they invented complete and consistent systems. Frege invented a mathematics based on pure (predicate) logic, except it allowed the paradox of the 'set that has all sets (including itself)' Russell showed him this error.
So Russell invented the mathematics described in the Principia Mathematica, which is what most of us think of when we think of mathematics, and he was rigorous about it, too. It took hundreds of pages of axioms and theorems to prove that
1 + 1 = 2
And that holds, meaning that '1' is actually 1, '2' is actually 2, and '+' and '=' are what we'd like to think they are.
When Russell did that, he chortled with glee. "Ha, we're good!" he shouted, "We have the big TOE! [theory of everything] Nothing that is inconsistent exists in this system."
Then, more than several years later, a little mathematician named Gödel did something amazing.
He said, "Really? Are you sure? Because ..."
And then he proved that the system could not be both complete and consistent.
How?
Well, it involves a little 50-page paper he published. Essentially what he did was, working entirely in Russell's system, he modelled mathematical formulae as numbers. He proved his modelling was consistent in that system, in that a Gödel-number mapped to a formula and that a formula mapped to that number, and that the numbers worked the way you expected the formulae to work.
Then he wrote a number that said: 'This formula cannot be proved in Russell's system.'
And he showed that this number exists in Russell's system, a system that Russell believed to be both complete and consistent.
Russell's system, using Russell's own axioms and theorems, was therefore provably incomplete, or else inconsistent.
And Russell never saw that one coming.
Nor did most anyone else.
But Gödel accidentally showed that a system is either incomplete or inconsistent, or: a system cannot be both complete and consistent.
And that proof, way back when, opened the door to the mathematics we are wrestling with today, giving us Quantum theory, allowing us to do neat things, like write blog entries on this little thing we call a laptop.
So, that's the caveat for us mathematicians: we're working with a model that we created. We can stand up and say: 'Ha! This proves everything!' And it may be good for what we wanted, but all somebody has to do is remove the limits we've set to explode our neat, little rigorous system.
The good news is the explosion may be a good thing.
'A' is for ℵ0, but that (infinity) is just the start ...
Categories: Offsite Blogs

Lazy Ruby · effluence

del.icio.us/haskell - Tue, 04/08/2014 - 2:35pm
Categories: Offsite Blogs

Ketil Malde: Some not-so-grand challenges in bioinformatics

Planet Haskell - Tue, 04/08/2014 - 11:00am

Bioinformatics is a field in the intersection of biology, statistics, and computer science. Broadly speaking, biologists will plan an experiment, bioinformaticians will run the analysis, and the biologists will interpret the results. Sometimes it can be rather frustrating to be the bioinformatician squeezed into the middle, realizing that results are far from as robust as they should be.

Here, I try to list a number of areas in bioinformatics that I think are problematic. They are not in any way “grand challenges”, like computing protein structure or identifying non-trivial factors from GWAS analysis. These are more your everyday challenges that bioinformaticians need to be aware of, but are too polite to mention.

Importance of bioinformatics

High throughput sequencing (HTS) has rapidly become an indispensable tool in medicine, biology, and ecology. As a technology, it is crucial in addressing many of our most important challenges: health and disease, food supply and food production, and ecology. Scientifically, it has allowed us to reveal the mechanisms of living organisms in unprecedented detail and scale.

New technology invariably poses new challenges to analysis, and in addition to a thorough understanding of relevant biology, robust study design and analysis based on HTS requires a solid understanding of statistics and bioinformatics. A lack of this background often results in study design that is overly optimistic and ambitious, and in analysis that fails to take the many sources of noise and error into account.

Specific examples

The typical bioinformatics pipeline is (as implied by the name) structured as a chain of operations. The tools used will of course not be perfect, and they will introduce errors, artifacts, and biases. Unfortunately, the next stage in the pipeline usually works from the assumption that the results from the previous stage are correct, and errors thus propagate and accumulate - and in the end, it is difficult to say whether results represent real, biological phenomena, or are simply artifacts of the analysis. What makes matters worse, is that the pipeline is often executed by scientists with a superficial knowledge of the errors that can occur. (A computer said it, so it must be true.)

Gene prediction and transcript assembly

One important task in genomics is identifying the genes of new organisms. This can be performed with a variety of tools, and in one project, we used both ab initio gene prediction (MAKER and Augustus) as well as RNAseq assembly (CLC, abyss). The results varied substantially, with predicted gene counts ranging from 16000 to 45000. To make things worse, clustering revealed that there was not a simple many-to-one relationship between the predicted genes. Clearly, results from analysis of e.g. GO-enrichment or expansion and contraction of gene families will be heavily dependent on which set of transcripts one uses.

Expression analysis

Quantifying the expression of specific genes is an important tool in many studies. In addition to the challenges of acquiring a good set of transcripts, expression analysis based on RNAseq introduces several other problems.

The most commonly used measure, reads (or fragments) per kilobase per million (RPKM), has some statistical problems.
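For readers who haven't met the measure, here is a minimal sketch of the calculation in Haskell (the function and argument names are illustrative only, not taken from any particular tool):

rpkm :: Double  -- reads mapped to the feature
     -> Double  -- feature length in base pairs
     -> Double  -- total mapped reads in the library
     -> Double
rpkm mapped featureLengthBp totalMapped =
  mapped / (featureLengthBp / 1000) / (totalMapped / 1e6)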

In addition, mapping is challenging, and in particular, mapping reads from transcripts to a genome will struggle with intron/exon boundaries. Of course, you can map to the transcriptome instead, but as we have seen, the reliability of reference transcriptomes are not always as we would like.

Reference genomes

The main goal of a de novo genome project is to produce a correct reference sequence for the genome. Although in many ways a simpler task than the corresponding transcriptome assembly, there are many challenges to overcome. In reality, most genome sequences are fragmented and contain errors. But even in the optimal case, a reference genome may not be representative.

For instance, a reference genome from a single individual is clearly unlikely to accurately model the sex chromosomes of the other sex (but on the other hand, similarities between different sex chromosomes will likely lead to a lot of incorrectly mapped reads). But a single individual is often not representative even of other individuals of the same sex. E.g., an obvious method for identifying a sex chromosome is to compare sequence data from one sex to the draft reference (which in our case happened to be from an individual of the other sex). Unfortunately, it turned out that the marker we found to be absent in the reference (and thus inferred to be unique to the other sex) also occurred in 20% of the other sex. Just not in the sequenced individual.

So while reference genomes may work fairly well for some species, it is less useful for species with high genomic variation (or with B-chromosomes), and simply applying methods developed for mammals to different species will likely give misleading and/or incorrect results.

Genome annotation

I recently attended an interesting talk about pseudogenes in various species. One thing that stuck in my mind, was that the number of pseudogenes in a species is highly correlated with the institution responsible for the gene prediction. Although this is just an observation and not a scientific result, it seems very likely that different institutes use in-house pipelines with varying degrees of sensitivity/specificity. Thus what one pipeline would identify as a real gene, a more conservative pipeline would ignore as a pseudogene. So here too, our precious curated resources may be less valuable than you thought.

Population genomics

In population genomics, there are many different measures of diversity, but the most commonly used is FST. Unfortunately, the accuracy of estimating FST for a SNP is highly dependent on coverage and error rate, and a simulation study I did showed that with the coverage commonly used in projects, the estimated FST has a large variance.
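For reference, one common way (of several) to express FST for a biallelic SNP is in terms of heterozygosity within and between subpopulations,

$$ F_{ST} \;=\; \frac{H_T - H_S}{H_T} $$

where H_T is the expected heterozygosity of the pooled population and H_S the average expected heterozygosity within subpopulations; the various estimators differ mainly in how they correct for finite sample size.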

In addition, the usual problems apply: the genome may not be complete, correct or representative. And even if it were, reads could still be mapped incorrectly. And it is more likely to fail in variable regions, meaning you get less accurate results exactly in the places it matters most.

Variant analysis

Variant analysis suffers from many of the same challenges as population genomics, genome assembly and mapping being what they are. Again, estimated allele frequencies tend to be wildly inaccurate.

Structural variants (non-trivial indels, transpositions, copy number variation, etc) are very common, but difficult to detect accurately from mapped reads. In many cases, such features will just result in lower mapping coverage, further reducing your chances of identifying them.

Curation is expensive

Analysis is challenging even under the best of conditions – with high quality data and using a good reference genome that is assembled and annotated and curated by a reputable institution or heavyweight consortium. But in many cases, data quality is poor and we don’t even have the luxury of good references.

A de novo genome project takes years and costs millions. This is acceptable for species crucial to medicine, like human or mouse, and for food production, like cow or wheat. These are sequenced by large multi-national consortia at tremendous expense. But while we will eventually get there for the more important species, the large majority of organisms – the less commercially or scientifically interesting ones – will never have anything close to this.

Similarly, the databases with unannotated sequences (e.g. UniProt) grow at a much quicker pace than the curated databases (e.g. SwissProt), simply because identifying the function of a protein in the lab takes approximately one post doc., and nobody has the incentives or manpower to do this.

Conclusions

So, what is the take-home message from all of this? Let’s take a moment to sum up:

Curation is less valuable than you think

Everybody loves and anticipates finished genomes, complete with annotated transcripts and proteins. But there’s a world of difference between the finished genomes of widely studied species and the draft genomes of the more obscure ones. Often, the reference resources are incomplete and partially incorrect, and sometimes the best thing that can be said for them, is that as long as we all use it our results will be comparable because we all make the same errors.

Our methods aren’t very robust

Genome assembly, gene annotation, transcriptome assembly, and functional annotation are difficult. Anybody can run some assembler or gene predictor and get a result, but if you run two of them, you will get two different results.

Sequence mapping is not very complicated (at a conference, somebody claimed to have counted over ninety programs for short read mapping), and works well if you have a good reference genome, if you don’t have too many repeats, and if you don’t have too much variation in your genome. In other words, everywhere you don’t really need it.

Dealing with the limitations

Being everyday challenges, I don’t think any of this is insurmountable, but I’ll save that for a later post. Or grant application. In the meantime, do let me know what you think.

Categories: Offsite Blogs

www.stephendiehl.com

del.icio.us/haskell - Tue, 04/08/2014 - 10:57am
Categories: Offsite Blogs

Yesod Web Framework: Proposal: Changes to the PVP

Planet Haskell - Tue, 04/08/2014 - 9:55am

As I mentioned two posts ago, there was a serious discussion on the libraries mailing list about the Package Versioning Policy (PVP).

This blog post presents some concrete changes I'd like to see to the PVP to make it better for both general consumers of Hackage, and for library authors as well. I'll start off with a summary of the changes, and then give the explanations:

  1. The goal of the PVP needs to be clarified. Its purpose is not to ensure reproducible builds of non-published software, but rather to provide for more reliable builds of libraries on Hackage. Reproducible builds should be handled exclusively through version freezing, the only known technique to actually give the necessary guarantees.

  2. Upper bounds should not be included on non-upgradeable packages, such as base and template-haskell (are there others?). Alternatively, we should establish some accepted upper bound on these packages, e.g. many people place base < 5 on their code.

  3. We should be distinguishing between mostly-stable packages and unstable packages. For a package like text, if you simply import Data.Text (Text, pack, reverse), or some other sane subset, there's no need for upper bounds.

    Note that this doesn't provide a hard-and-fast rule like the current PVP, but is rather a matter of discretion. Communication between library authors and users (via documentation or other means) would be vital to making this work well.

  4. For a package version A.B.C, a bump in A or B indicates some level of breaking change. As an opt-in approach, package authors are free to associate meaning to A and B beyond what the PVP requires. Libraries which use these packages are free to rely on the guarantees provided by package authors when placing upper bounds.

    Note that this is very related to point (3).

While I (Michael Snoyman) am the author of this proposal, the following people have reviewed the proposal and support it:

  • Bryan O'Sullivan
  • Felipe Lessa
  • Roman Cheplyaka
  • Vincent Hanquez
Reproducible builds

There are a number of simple cases that can result in PVP-compliant code not being buildable. These aren't just hypothetical cases; in my experience as both a package author and Stackage maintainer, I've seen these come up.

  • Package foo version 1.0 provides an instance for MonadFoo for IO and Identity. Version 1.1 removes the IO instance for some reason. Package bar provides a function:

    bar :: MonadFoo m => Int -> m Double

    Package bar compiles with both version 1.0 and 1.1 of foo, and therefore (following the PVP) adds a constraint to its cabal file foo >= 1.0 && < 1.2.

    Now a user decides to use the bar package. The user never imports anything from foo, and therefore has no listing for foo in the cabal file. The user code depends on the IO instance for MonadFoo. When compiled with foo 1.0, everything works fine. However, when compiled with foo 1.1, the code no longer compiles.

  • Similarly, instead of typeclass instances, the same situation can occur with module export lists. Consider version 1.0 of foo which provides:

    module Foo (foo1, foo2) where

    Version 1.1 removes the foo2 export. The bar package reexports the entire Foo module, and then a user package imports the module from bar. If the user package uses the foo2 function, it will compile when foo-1.0 is used, but not when foo-1.1 is used.

In both of these cases, the issue is the same: transitive dependencies are not being clamped down. The PVP makes an assumption that the entire interface for a package can be expressed in its version number, which is not true. I see three possible solutions to this:

  1. Try to push even more of a burden onto package authors, and somehow make them guarantee that their interface is completely immune to changes elsewhere in the stack. This kind of change was proposed on the libraries list. I'm strongly opposed to some kind of change like this: it makes authors' lives harder, and makes it very difficult to provide backwards compatibility in libraries. Imagine if transformers 0.4 adds a new MonadIO instance; the logical extreme of this position would be to disallow a library from working with both transformers 0.3 and 0.4, which will split Hackage in two.

  2. Modify the PVP so that instead of listing just direct dependencies, authors are required to list all transitive dependencies as well. So it would be a violation to depend on bar without explicitly listing foo in the dependency list. This will work, but it will be incredibly difficult to maintain. It will also greatly increase the time it takes for a new version of a deep dependency to be usable, due to the number of authors who will have to bump version bounds.

  3. Transfer responsibility for this to package users: if you first built your code against foo 1.0, you should freeze that information (see the sketch after this list) and continue building against foo 1.0, regardless of the presence of new versions of foo. Not only does this increase reproducibility, it's just common sense: it's entirely possible that new versions of a library will introduce a runtime bug, performance regression, or even fix a bug that your code depends on. Why should the reliability of my code base be dependent on the actions of some third party that I have no control over?
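As a sketch of what freezing might look like in practice (assuming a cabal-install that honours a project-local cabal.config; the exact file, syntax, and mechanism vary between tools and versions, and the version numbers here are illustrative), the chosen versions are pinned once and kept with the project:

constraints: foo ==1.0,
             bar ==1.2.3

With the versions recorded this way, later releases of foo are simply never selected for this project, no matter what bounds bar declares.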

Non-upgradeable packages

There are some packages which ship with GHC and cannot be upgraded. I'm aware of at least base and template-haskell, though perhaps there are others (haskell98 and haskell2010?). In the past, there was good reason to place upper bounds on base, specifically with the base 3/4 split. However, we haven't had that experience in a while, and don't seem to be moving towards doing that again. In today's world, we end up with the following options:

  • Place upper bounds on base to indicate "I haven't tested this with newer versions of GHC." This then makes it difficult for users to test out that package with newer versions of GHC.
  • Leave off upper bounds on base. Users may then try to install a package onto a version of GHC on which the package hasn't been tested, which will result in either (1) everything working (definitely the common case based on my Stackage maintenance), or (2) getting a compilation error.

I've heard two arguments to push us in the direction of keeping the upper bounds in this case, so I'd like to address them both:

  • cabal error messages are easier to understand than GHC error messages. I have two problems with that:
    • I disagree: cabal error messages are terrible. (I'm told this will be fixed in the next version of cabal.) Take the following output as a sample:

      cabal: Could not resolve dependencies:
      trying: 4Blocks-0.2
      rejecting: base-4.6.0.1/installed-8aa... (conflict: 4Blocks => base>=2 && <=4)
      rejecting: base-4.6.0.1, 4.6.0.0, 4.5.1.0, 4.5.0.0, 4.4.1.0, 4.4.0.0,
                 4.3.1.0, 4.3.0.0, 4.2.0.2, 4.2.0.1, 4.2.0.0, 4.1.0.0, 4.0.0.0,
                 3.0.3.2, 3.0.3.1 (global constraint requires installed instance)

      I've seen a number of users file bug reports not understanding that this message means "you have the wrong version of GHC."

    • Even if the error messages were more user-friendly, they make it more difficult to fix the actual problem: the code doesn't compile with the new version of GHC. Often times, I've been able to report an error message to a library author and, without necessarily even downloading the new version of GHC, he/she has been able to fix the problem.

  • Using upper bounds in theory means that cabal will be able to revert to an older version of the library that is compatible with the new version of GHC. However, I find it highly unlikely that there's often, if ever, a case where an older version of a library is compatible with a later version of GHC.

Mostly-stable, and finer-grained versioning

I'll combine the discussion of the last two points. I think the heart of the PVP debates really comes from mostly-stable packages. Let's contrast with the extremes. Consider a library which is completely stable, never has a breaking change, and has stated with absolute certainty that it never will again. Does anyone care about upper bounds on this library? They're irrelevant! I'd have no problem with including an upper bound, and I doubt even the staunchest PVP advocates would really claim it's a problem to leave it off.

On the other hand, consider an extremely unstable library, which is releasing massively breaking changes on a weekly basis. I would certainly agree in that case that an upper bound on that library is highly prudent.

The sticking point is the middle ground. Consider the following code snippet:

import Data.Text (Text, pack)

foo :: Text
foo = pack "foo"

According to the PVP as it stands today, this snippet requires an upper bound of < 1.2 on the text package. But let's just play the odds here: does anyone actually believe there's a real chance that the next iteration of text will break this code snippet? I highly doubt it; this is a stable subset of the text API, and I doubt it will ever be changing. The same can be said of large subsets of many other packages.
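Concretely, the choice under discussion is roughly the difference between these two build-depends entries in a cabal file (the lower bound here is illustrative):

build-depends: text >= 0.11 && < 1.2   -- the PVP as it stands today
build-depends: text >= 0.11            -- trusting the stable subset of the API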

By putting in upper bounds in these cases, we run a very real risk of bifurcating Hackage into "those demanding the new text version for some new feature" vs "those who haven't yet updated their upper bounds to allow the new version of text."

The PVP currently takes an extremely conservative viewpoint on this, with the goal of solving just one problem: making sure code that compiles now continues to compile. As I demonstrated above, it doesn't actually solve that problem completely. And in addition, in this process, it has created other problems, such as this bifurcation.

So my proposal is that, instead of creating rigid rules like "always put an upper bound no matter what," we allow some common sense into the process, and also let package authors explicitly say "you can rely on this API not changing."

Categories: Offsite Blogs