# News aggregator

### Dominic Steinitz: Gibbs Sampling in R, Haskell, Jags and Stan

It’s possible to Gibbs sampling in most languages and since I am doing some work in R and some work in Haskell, I thought I’d present a simple example in both languages: estimating the mean from a normal distribution with unknown mean and variance. Although one can do Gibbs sampling directly in R, it is more common to use a specialised language such as JAGS or STAN to do the actual sampling and do pre-processing and post-processing in R. This blog post presents implementations in native R, JAGS and STAN as well as Haskell.

Preamble > {-# OPTIONS_GHC -Wall #-} > {-# OPTIONS_GHC -fno-warn-name-shadowing #-} > {-# OPTIONS_GHC -fno-warn-type-defaults #-} > {-# OPTIONS_GHC -fno-warn-unused-do-bind #-} > {-# OPTIONS_GHC -fno-warn-missing-methods #-} > {-# OPTIONS_GHC -fno-warn-orphans #-} > {-# LANGUAGE NoMonomorphismRestriction #-} > module Gibbs ( > main > , m > , Moments(..) > ) where > > import qualified Data.Vector.Unboxed as V > import qualified Control.Monad.Loops as ML > import Data.Random.Source.PureMT > import Data.Random > import Control.Monad.State > import Data.Histogram ( asList ) > import Data.Histogram.Fill > import Data.Histogram.Generic ( Histogram ) > import Data.List > import qualified Control.Foldl as L > > import Diagrams.Backend.Cairo.CmdLine > > import LinRegAux > > import Diagrams.Backend.CmdLine > import Diagrams.Prelude hiding ( sample, render )The length of our chain and the burn-in.

> nrep, nb :: Int > nb = 5000 > nrep = 105000Data generated from .

> xs :: [Double] > xs = [ > 11.0765808082301 > , 10.918739177542 > , 15.4302462747137 > , 10.1435649220266 > , 15.2112705014697 > , 10.441327659703 > , 2.95784054883142 > , 10.2761068139607 > , 9.64347295100318 > , 11.8043359297675 > , 10.9419989262713 > , 7.21905367667346 > , 10.4339807638017 > , 6.79485294803006 > , 11.817248658832 > , 6.6126710570584 > , 12.6640920214508 > , 8.36604701073303 > , 12.6048485320333 > , 8.43143879537592 > ] A Bit of Theory Gibbs SamplingFor a multi-parameter situation, Gibbs sampling is a special case of Metropolis-Hastings in which the proposal distributions are the posterior conditional distributions.

Referring back to the explanation of the metropolis algorithm, let us describe the state by its parameters and the conditional posteriors by where then

where we have used the rules of conditional probability and the fact that

Thus we always accept the proposed jump. Note that the chain is not in general reversible as the order in which the updates are done matters.

Normal Distribution with Unknown Mean and VarianceIt is fairly standard to use an improper prior

The likelihood is

re-writing in terms of precision

Thus the posterior is

We can re-write the sum in terms of the sample mean and variance using

Thus the conditional posterior for is

which we recognise as a normal distribution with mean of and a variance of .

The conditional posterior for is

which we recognise as a gamma distribution with a shape of and a scale of

In this particular case, we can calculate the marginal posterior of analytically. Writing we have

Finally we can calculate

This is the non-standardized Student’s t-distribution .

Alternatively the marginal posterior of is

where is the standard t distribution with degrees of freedom.

The Model in HaskellFollowing up on a comment from a previous blog post, let us try using the foldl package to calculate the length, the sum and the sum of squares traversing the list only once. An improvement on creating your own strict record and using *foldl’* but maybe it is not suitable for some methods e.g. calculating the skewness and kurtosis incrementally, see below.

And re-writing the sample variance using

we can then calculate the sample mean and variance using the sums we have just calculated.

> xBar, varX :: Double > xBar = xSum / n > varX = n * (m2Xs - xBar * xBar) / (n - 1) > where m2Xs = x2Sum / nIn random-fu, the Gamma distribution is specified by the rate paratmeter, .

> beta, initTau :: Double > beta = 0.5 * n * varX > initTau = evalState (sample (Gamma (n / 2) beta)) (pureMT 1)Our sampler takes an old value of and creates new values of and .

> gibbsSampler :: MonadRandom m => Double -> m (Maybe ((Double, Double), Double)) > gibbsSampler oldTau = do > newMu <- sample (Normal xBar (recip (sqrt (n * oldTau)))) > let shape = 0.5 * n > scale = 0.5 * (x2Sum + n * newMu^2 - 2 * n * newMu * xBar) > newTau <- sample (Gamma shape (recip scale)) > return $ Just ((newMu, newTau), newTau)From which we can create an infinite stream of samples.

> gibbsSamples :: [(Double, Double)] > gibbsSamples = evalState (ML.unfoldrM gibbsSampler initTau) (pureMT 1)As our chains might be very long, we calculate the mean, variance, skewness and kurtosis using an incremental method.

> data Moments = Moments { mN :: !Double > , m1 :: !Double > , m2 :: !Double > , m3 :: !Double > , m4 :: !Double > } > deriving Show > moments :: [Double] -> Moments > moments xs = foldl' f (Moments 0.0 0.0 0.0 0.0 0.0) xs > where > f :: Moments -> Double -> Moments > f m x = Moments n' m1' m2' m3' m4' > where > n = mN m > n' = n + 1 > delta = x - (m1 m) > delta_n = delta / n' > delta_n2 = delta_n * delta_n > term1 = delta * delta_n * n > m1' = m1 m + delta_n > m4' = m4 m + > term1 * delta_n2 * (n'*n' - 3*n' + 3) + > 6 * delta_n2 * m2 m - 4 * delta_n * m3 m > m3' = m3 m + term1 * delta_n * (n' - 2) - 3 * delta_n * m2 m > m2' = m2 m + term1In order to examine the posterior, we create a histogram.

> numBins :: Int > numBins = 400 > hb :: HBuilder Double (Data.Histogram.Generic.Histogram V.Vector BinD Double) > hb = forceDouble -<< mkSimple (binD lower numBins upper) > where > lower = xBar - 2.0 * sqrt varX > upper = xBar + 2.0 * sqrt varXAnd fill it with the specified number of samples preceeded by a burn-in.

> hist :: Histogram V.Vector BinD Double > hist = fillBuilder hb (take (nrep - nb) $ drop nb $ map fst gibbsSamples)Now we can plot this.

And calculate the skewness and kurtosis.

> m :: Moments > m = moments (take (nrep - nb) $ drop nb $ map fst gibbsSamples) ghci> import Gibbs ghci> putStrLn $ show $ (sqrt (mN m)) * (m3 m) / (m2 m)**1.5 8.733959917065126e-4 ghci> putStrLn $ show $ (mN m) * (m4 m) / (m2 m)**2 3.451374739494607We expect a skewness of 0 and a kurtosis of for . Not too bad.

The Model in JAGSJAGS is a mature, declarative, domain specific language for building Bayesian statistical models using Gibbs sampling.

Here is our model as expressed in JAGS. Somewhat terse.

model { for (i in 1:N) { x[i] ~ dnorm(mu, tau) } mu ~ dnorm(0, 1.0E-6) tau <- pow(sigma, -2) sigma ~ dunif(0, 1000) }To run it and examine its results, we wrap it up in some R

## Import the library that allows R to inter-work with jags. library(rjags) ## Read the simulated data into a data frame. fn <- read.table("example1.data", header=FALSE) jags <- jags.model('example1.bug', data = list('x' = fn[,1], 'N' = 20), n.chains = 4, n.adapt = 100) ## Burnin for 10000 samples update(jags, 10000); mcmc_samples <- coda.samples(jags, variable.names=c("mu", "sigma"), n.iter=20000) png(file="diagrams/jags.png",width=400,height=350) plot(mcmc_samples) dev.off()And now we can look at the posterior for .

The Model in STANSTAN is a domain specific language for building Bayesian statistical models similar to JAGS but newer and which allows variables to be re-assigned and so cannot really be described as declarative.

Here is our model as expressed in STAN. Again, somewhat terse.

data { int<lower=0> N; real x[N]; } parameters { real mu; real<lower=0,upper=1000> sigma; } model{ x ~ normal(mu, sigma); mu ~ normal(0, 1000); }Just as with JAGS, to run it and examine its results, we wrap it up in some R.

library(rstan) ## Read the simulated data into a data frame. fn <- read.table("example1.data", header=FALSE) ## Running the model fit1 <- stan(file = 'Stan.stan', data = list('x' = fn[,1], 'N' = 20), pars=c("mu", "sigma"), chains=3, iter=30000, warmup=10000) png(file="diagrams/stan.png",width=400,height=350) plot(fit1) dev.off()Again we can look at the posterior although we only seem to get medians and 80% intervals.

PostAmbleWrite the histogram produced by the Haskell code to a file.

> displayHeader :: FilePath -> Diagram B R2 -> IO () > displayHeader fn = > mainRender ( DiagramOpts (Just 900) (Just 700) fn > , DiagramLoopOpts False Nothing 0 > ) > main :: IO () > main = do > displayHeader "diagrams/DataScienceHaskPost.png" > (barDiag > (zip (map fst $ asList hist) (map snd $ asList hist)))The code can be downloaded from github.

### Douglas M. Auclair (geophf): 'A' is for Aleph-null

'A' is for ℵ0

Read: "'A' is for Aleph-null"

ℵ0 is a symbol for infinity. The first one, that is, as there are many infinities in mathematics, and many

*kinds*of infinities, depending on which number system you choose to use (there are ... more than several kinds of numbers, but that may be the entry for 'N,' as in: "'N' is for 'Number.' But I digress, ...

... as usual).

My introductory post to my A-to-Z blog-writing challenge, or my Α-to-Ω blog-writing challenge, but that may be all Greek to you.

An appropriate entry because mathematics, itself, is as big as you want to make of it, or as small as you want to focus in on it. For example, infinity, one of them, ℵ0, is

*countably*infinite. You take the first number (which happens to be 0 (zero)), and add one to it, and you get:

0, 1, 2, 3, ...

... and you keep going until you (don't) reach ℵ0.

The thing is, you

*can*count this infinity. You just did.

But there are other infinities, including one you can't even count, because a 'bigger' infinity (it actually is bigger), is C, the continuum, because there numbers, such as:

π, τ, e, ...

Numbers that can't even be represented by a series of digits or by any function, even. They are the transcendental numbers, and are 'irrational.'

There is no way to rationalize the number π, for example; you just have to live with it, with all its quirkiness and all its irrationality.

So, how do you go along the numberline that includes irrational numbers? You can't. Why? Well, what's the 'next number' after π? There isn't one. It's not that we don't know what's the next number after π, it's just that there is no such thing, there is no 'very next number adjacent to π.'

Unless you invent a number system that counts along the continuum.

Good luck with that.

But, on the other hand, 4,000 years after Euclid, with his celebrated (or infamous) fifth postulate, said something along the lines of 'if two lines are extended into infinity on a plane and they never cross, then they kinda hafta be parallel,' people were still nodding their heads to that.

(The Ancient Greeks said 'kinda hafta' when they were dead serious about stuff like that.)

Then along came conic sections, and, lo, and behold, on those planes, it is possible, easily so, for two non-parallel lines, extended into infinity never to cross.

And the conics opened up whole new vistas of mathematics for us to explore.

So, one could say mathematics is a confining view; limiting, but the constraints are artificial: they are there because we put them there, and we put them there for some set of reasons, even if we don't know it or even if we forgot those reasons. If a particular set of mathematics doesn't do what we want it to do, or does it in an unforgivably cumbersome way, then, ... simply invent a new set of mathematics, see that it does do what we need it to do, and that, importantly, it doesn't do what we don't want it to do, and then we're good, right?

It's just that ... sometimes mathematics, being not particularly tiny — like our tiny, little, rigid brains — does things we don't expect and never look for, and so we get into trouble if we aren't rigorous. That happened to people who thought they invented complete

*and*consistent systems. Frege invented a mathematics based on pure (predicate) logic, except it allowed the paradox of the 'set that has all sets (including itself)' Russell showed him this error.

So Russell invented the mathematics described in the

*Principia Mathematica,*which most of think of, when we think of mathematics, and was rigorous about it, too. It took two-hundred pages of axioms and theories to prove that

1 + 1 = 2

And that holds, meaning that '1' is actually 1, '2' is actually 2, and '+' and '=' are what we'd like to think they are.

When Russell did that, he chorkled with glee,

*"Ha, we're good!"*he shouted,

*"We have the big TOE!*[theory of everything]

*Nothing that is inconsistent exists in this system."*

Then, more than several years later, a little mathematician named Gödel did something amazing.

He said, "Really? Are you sure? Because ..."

And then he proved the system inconsistent.

How?

Well, it involves a little 50-page paper he published. Essentially what he did was, working entirely in Russell's system, he modelled mathematical formulae as numbers. He proved his modelling was consistent in that system, in that a Gödel-number mapped to a formula and that a formula mapped to that number, and that the numbers worked the way you expected the formulae to work.

Then he wrote a number that said: 'This formula (of truth) cannot be proved (is inconsistent) in Russell's system.'

And showed that number existed in Russell's system, a system that Russell showed, empirically, was consistent.

Russell's system, using Russell's axioms and theories, was provably inconsistent.

And Russell never saw that one coming.

Nor did most anyone else.

But Gödel accidentally showed that a system is

*either*incomplete

*or*inconsistent, or: a system

*cannot*be both complete and consistent.

And that proof, way back when, opened the door to mathematics as we are wrestling with today, giving us Quantum theory, allowing us to do neat things, like write blog entries on this little thing we call a laptop.

So, that's that caveat to us mathematicians: we're working with a model that we created. We can stand up and say:

*'Ha! This proves everything!'*And it may be good for what we wanted, but all somebody has to do is remove the limits we've set to explode our neat, little rigorous system.

The good news is the explosion

*may*be a good thing.

'A' is for ℵ0, but that (infinity) is just the start ...

### Lazy Ruby · effluence

### Proposal: Changes to the PVP

### GHC 7.8.1 release has been cut

### Ketil Malde: Some not-so-grand challenges in bioinformatics

Bioinformatics is a field in the intersection of biology, statistics, and computer science. Broadly speaking, biologists will plan an experiment, bioinformaticians will run the analysis, and the biologists will interpret the results. Sometimes it can be rather frustrating to be the bioinformatician squeezed into the middle, realizing that results are far from as robust as they should be.

Here, I try to list a number of areas in bioinformatics that I think are problematic. They are not in any way “grand challenges”, like computing protein structure or identifying non-trivial factors from GWAS analysis. These are more your *everyday* challenges that bioinformaticians need to be aware of, but are too polite to mention.

High throughput sequencing (HTS) has rapidly become an indispensable tool in medicine, biology, and ecology. As a technology, it is crucial in addressing many of our most important challenges: health and disease, food supply and food production, and ecologoy. Scientifically, it has allowed us to reveal the mechanisms of living organisms in unprecedented detail and scale.

New technology invariably poses new challenges to analysis, and in addition to a thorough understanding of relevant biology, robust study design and analysis based on HTS requires a solid understanding of statistics and bioinformatics. A lack of this background often results in study design that is overly optimistic and ambitious, and in analysis that fails to take the many sources of noise and error into account.

Specific examplesThe typical bioinformatics pipeline is (as implied by the name) structured as a chain of operations. The tools used will of course not be perfect, and they will introduce errors, artifacts, and biases. Unfortunately, the next stage in the pipeline usually works from the assumption that the results from the previous stage are correct, and errors thus propagate and accumulate - and in the end, it is difficult to say whether results represent real, biological phenomena, or are simply artifacts of the analysis. What makes matters worse, is that the pipeline is often executed by scientists with a superficial knowledge of the errors that can occur. (A computer said it, so it must be true.)

Gene prediction and transcript assemblyOne important task in genomics is identifying the genes of new organisms. This can be performed with a variety of tools, and in one project, we used both *ab initio* gene prediction (MAKER and Augustus) as well as RNAseq assembly (CLC, abyss). The results varied substantially, with predicted gene counts ranging from 16000 to 45000. To make things worse, clustering revealed that there was not a simple many-to-one relationship between the predicted genes. Clearly, results from analysis of e.g. GO-enrichment or expansion and contraction of gene families will be heavily dependent on which set of transcripts one uses.

Quantifying the expression of specific genes is an important tool in many studies. In addition to the challenges of acquiring a good set of transcripts, expression analysis based on RNAseq introduces several other problems.

The most commonly used measure, reads (or fragments) per kilobase per million (RPKM), has some statistical problems.

In addition, mapping is challenging, and in particular, mapping reads from transcripts to a genome will struggle with intron/exon boundaries. Of course, you can map to the transcriptome instead, but as we have seen, the reliability of reference transcriptomes are not always as we would like.

Reference genomesThe main goal of a *de novo* genome project is to produce a correct reference sequence for the genome. Although in many ways a simpler task than the corresponding transcriptome assembly, there are many challenges to overcome. In reality, most genome sequences are fragmented and contain errors. But even in the optimal case, a reference genome may not be representative.

For instance, a reference genome from a single individual is clearly unlikely to accurately model sex chromosomes of the other sex (but on the other hand, similarities between different sex chromosomes will likely lead to a lot of incorrectly mapped reads). But a single individual is often not representative even for other individuals of the same sex. E.g., an obvious method for identifying a sex chromosome is to compare sequence data from one sex to the draft reference (which in our case happened to be from an individual of the other sex) Unfortunately, it turned out that the marker we found to be absent in the reference - and thus inferred to be unique to the other sex – also occurred in 20% of the other sex. Just not in the sequenced individual.

So while reference genomes may work fairly well for some species, it is less useful for species with high genomic variation (or with B-chromosomes), and simply applying methods developed for mammals to different species will likely give misleading and/or incorrect results.

Genome annotationI recently attended an interesting talk about pseudogenes in various species. One thing that stuck in my mind, was that the number of pseudogenes in a species is highly correlated with the institution responsible for the gene prediction. Although this is just an observation and not a scientific result, it seems very likely that different institutes use in-house pipelines with varying degrees of sensitivity/specificity. Thus what one pipeline would identify as a real gene, a more conservative pipeline would ignore as a pseudogene. So here too, our precious curated resources may be less valuable than you thought.

Population genomicsIn population genomics, there are many different measures of diversity, but the most commonly used is FST. Unfortunately, the accuracy of estimating FST for a SNP is highly dependent on coverage and error rate, and a simulation study I did showed that with the coverage commonly used in projects, the estimated FST has a large variance.

In addition, the usual problems apply: the genome may not be complete, correct or representative. And even if it were, reads could still be mapped incorrectly. And it is *more* likely to fail in variable regions, meaning you get *less* accurate results exactly in the places it matters most.

Variant analysis suffers from many of the same challenges as population genomics, genome assembly and mapping being what they are. Again, estimated allele frequencies tend to be wildly inaccurate.

Structural variants (non-trivial indels, transpositions, copy number variation, etc) are very common, but difficult to detect accurately from mapped reads. In many cases, such features will just result in lower mapping coverage, further reducing your chances of identifying them.

Curation is expensiveAnalysis is challenging even under the best of conditions – with high quality data and using a good reference genome that is assembled and annotated and curated by a reputable institution or heavyweight consortium. But in many cases, data quality is poor and we don’t even have the luxury of good references.

A de novo genome project takes years and costs millions. This is acceptable for species crucial to medicine, like human or mouse, and for food production, like cow or wheat. These are sequenced by large multi-national consortia at tremendous expense. But while we will eventually get there for the more important species, the large majority of organisms – the less commercially or scientifically interesting ones – will never have anything close to this.

Similarly, the databases with unannotated sequences (e.g. UniProt) grow at a much quicker pace than the curated databases (e.g. SwissProt), simply because identifying the function of a protein in the lab takes approximately one post doc., and nobody has the incentives or manpower to do this.

ConclusionsSo, what is the take-home message from all of this? Let’s take a moment to sum up:

Curation is less valuable than you thinkEverybody loves and anticipates finished genomes, complete with annotated transcripts and proteins. But there’s a world of difference between the finished genomes of widely studied species and the draft genomes of the more obscure ones. Often, the reference resources are incomplete and partially incorrect, and sometimes the best thing that can be said for them, is that as long as we all use it our results will be comparable because we all make the same errors.

Our methods aren’t very robustGenome assembly, gene annotation, transcriptome assembly, and functional annotation is *difficult*. Anybody can run some assembler or gene predictor and get a result, but if you run two of them, you will get two different results.

Sequence mapping is not very complicated (at a conference, somebody claimed to have counted over ninety programs for short read mapping), and works well if you have a good reference genome, if you don’t have too many repeats, and if you don’t have too much variation in your genome. In other words, everywhere you don’t really need it.

Dealing with the limitationsBeing everyday challenges, I don’t think any of this is insurmountable, but I’ll save that for a later post. Or grant application. In the meantime, do let me know what you think.

### www.stephendiehl.com

### www.stephendiehl.com

### New gtk2hs 0.12.4 release

Thanks to John Lato and Duncan Coutts for the latest bugfix release! The latest packages should be buildable on GHC 7.6, and the cairo package should behave a bit nicer in ghci on Windows. Thanks to all!

~d