News aggregator

Ketil Malde: Some not-so-grand challenges in bioinformatics

Planet Haskell - Tue, 04/08/2014 - 11:00am

Bioinformatics is a field in the intersection of biology, statistics, and computer science. Broadly speaking, biologists will plan an experiment, bioinformaticians will run the analysis, and the biologists will interpret the results. Sometimes it can be rather frustrating to be the bioinformatician squeezed into the middle, realizing that results are far from as robust as they should be.

Here, I try to list a number of areas in bioinformatics that I think are problematic. They are not in any way “grand challenges”, like computing protein structure or identifying non-trivial factors from GWAS analysis. These are more your everyday challenges that bioinformaticians need to be aware of, but are too polite to mention.

Importance of bioinformatics

High throughput sequencing (HTS) has rapidly become an indispensable tool in medicine, biology, and ecology. As a technology, it is crucial in addressing many of our most important challenges: health and disease, food supply and food production, and ecology. Scientifically, it has allowed us to reveal the mechanisms of living organisms in unprecedented detail and at unprecedented scale.

New technology invariably poses new challenges to analysis, and in addition to a thorough understanding of relevant biology, robust study design and analysis based on HTS requires a solid understanding of statistics and bioinformatics. A lack of this background often results in study design that is overly optimistic and ambitious, and in analysis that fails to take the many sources of noise and error into account.

Specific examples

The typical bioinformatics pipeline is (as implied by the name) structured as a chain of operations. The tools used will of course not be perfect, and they will introduce errors, artifacts, and biases. Unfortunately, the next stage in the pipeline usually works from the assumption that the results from the previous stage are correct, and errors thus propagate and accumulate - and in the end, it is difficult to say whether results represent real, biological phenomena, or are simply artifacts of the analysis. What makes matters worse is that the pipeline is often executed by scientists with only a superficial knowledge of the errors that can occur. (A computer said it, so it must be true.)

Gene prediction and transcript assembly

One important task in genomics is identifying the genes of new organisms. This can be performed with a variety of tools, and in one project, we used both ab initio gene prediction (MAKER and Augustus) as well as RNAseq assembly (CLC, abyss). The results varied substantially, with predicted gene counts ranging from 16000 to 45000. To make things worse, clustering revealed that there was not a simple many-to-one relationship between the predicted genes. Clearly, results from analysis of e.g. GO-enrichment or expansion and contraction of gene families will be heavily dependent on which set of transcripts one uses.

Expression analysis

Quantifying the expression of specific genes is an important tool in many studies. In addition to the challenges of acquiring a good set of transcripts, expression analysis based on RNAseq introduces several other problems.

The most commonly used measure, reads (or fragments) per kilobase per million (RPKM), has some statistical problems.
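For reference, the measure is usually defined as

\[
\mathrm{RPKM} = \frac{10^9 \cdot C}{N \cdot L}
\]

where C is the number of reads mapped to the transcript, N is the total number of mapped reads in the library, and L is the transcript length in base pairs. A common criticism is that dividing by the library total N makes values hard to compare between samples, which is often cited as one of its statistical weaknesses.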

In addition, mapping is challenging; in particular, mapping reads from transcripts to a genome will struggle with intron/exon boundaries. Of course, you can map to the transcriptome instead, but as we have seen, the reliability of reference transcriptomes is not always what we would like.

Reference genomes

The main goal of a de novo genome project is to produce a correct reference sequence for the genome. Although in many ways a simpler task than the corresponding transcriptome assembly, there are many challenges to overcome. In reality, most genome sequences are fragmented and contain errors. But even in the optimal case, a reference genome may not be representative.

For instance, a reference genome from a single individual is clearly unlikely to accurately model the sex chromosomes of the other sex (but on the other hand, similarities between different sex chromosomes will likely lead to a lot of incorrectly mapped reads). But a single individual is often not representative even of other individuals of the same sex. E.g., an obvious method for identifying a sex chromosome is to compare sequence data from one sex to the draft reference (which in our case happened to come from an individual of the other sex). Unfortunately, it turned out that the marker we found to be absent in the reference - and thus inferred to be unique to the sex we had sequence data from - also occurred in 20% of individuals of the reference's own sex. Just not in the sequenced individual.

So while reference genomes may work fairly well for some species, they are less useful for species with high genomic variation (or with B-chromosomes), and simply applying methods developed for mammals to other species will likely give misleading and/or incorrect results.

Genome annotation

I recently attended an interesting talk about pseudogenes in various species. One thing that stuck in my mind was that the number of pseudogenes in a species is highly correlated with the institution responsible for the gene prediction. Although this is just an observation and not a scientific result, it seems very likely that different institutes use in-house pipelines with varying degrees of sensitivity/specificity. Thus what one pipeline would identify as a real gene, a more conservative pipeline would ignore as a pseudogene. So here too, our precious curated resources may be less valuable than you thought.

Population genomics

In population genomics, there are many different measures of diversity, but the most commonly used is FST. Unfortunately, the accuracy of estimating FST for a SNP is highly dependent on coverage and error rate, and a simulation study I did showed that with the coverage commonly used in projects, the estimated FST has a large variance.
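For reference, one common formulation of this statistic (not necessarily the exact estimator used in the simulation study) is

\[
F_{ST} = \frac{H_T - H_S}{H_T}
\]

where H_T is the expected heterozygosity of the total population and H_S is the average expected heterozygosity within subpopulations. Both quantities must be estimated from allele frequencies, which is exactly where limited coverage and sequencing errors enter.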

In addition, the usual problems apply: the genome may not be complete, correct, or representative. And even if it were, reads could still be mapped incorrectly. Mapping is also more likely to fail in variable regions, meaning you get less accurate results exactly in the places where it matters most.

Variant analysis

Variant analysis suffers from many of the same challenges as population genomics, genome assembly and mapping being what they are. Again, estimated allele frequencies tend to be wildly inaccurate.

Structural variants (non-trivial indels, transpositions, copy number variation, etc) are very common, but difficult to detect accurately from mapped reads. In many cases, such features will just result in lower mapping coverage, further reducing your chances of identifying them.

Curation is expensive

Analysis is challenging even under the best of conditions – with high quality data and using a good reference genome that is assembled and annotated and curated by a reputable institution or heavyweight consortium. But in many cases, data quality is poor and we don’t even have the luxury of good references.

A de novo genome project takes years and costs millions. This is acceptable for species crucial to medicine, like human or mouse, and for food production, like cow or wheat. These are sequenced by large multi-national consortia at tremendous expense. But while we will eventually get there for the more important species, the large majority of organisms – the less commercially or scientifically interesting ones – will never have anything close to this.

Similarly, the databases of unannotated sequences (e.g. the uncurated TrEMBL section of UniProt) grow at a much faster pace than the curated databases (e.g. Swiss-Prot), simply because identifying the function of a protein in the lab takes approximately one postdoc, and nobody has the incentives or manpower to do this.

Conclusions

So, what is the take-home message from all of this? Let’s take a moment to sum up:

Curation is less valuable than you think

Everybody loves and anticipates finished genomes, complete with annotated transcripts and proteins. But there's a world of difference between the finished genomes of widely studied species and the draft genomes of the more obscure ones. Often, the reference resources are incomplete and partially incorrect, and sometimes the best thing that can be said for them is that, as long as we all use them, our results will be comparable because we all make the same errors.

Our methods aren’t very robust

Genome assembly, gene annotation, transcriptome assembly, and functional annotation are difficult. Anybody can run some assembler or gene predictor and get a result, but if you run two of them, you will get two different results.

Sequence mapping is not very complicated (at a conference, somebody claimed to have counted over ninety programs for short read mapping), and works well if you have a good reference genome, if you don’t have too many repeats, and if you don’t have too much variation in your genome. In other words, everywhere you don’t really need it.

Dealing with the limitations

Since these are everyday challenges, I don't think any of them are insurmountable, but I'll save that discussion for a later post. Or a grant application. In the meantime, do let me know what you think.

Categories: Offsite Blogs

www.stephendiehl.com

del.icio.us/haskell - Tue, 04/08/2014 - 10:57am
Categories: Offsite Blogs


Yesod Web Framework: Proposal: Changes to the PVP

Planet Haskell - Tue, 04/08/2014 - 9:55am

As I mentioned two posts ago, there was a serious discussion on the libraries mailing list about the Package Versioning Policy (PVP).

This blog post presents some concrete changes I'd like to see to the PVP to make it better for both general consumers of Hackage, and for library authors as well. I'll start off with a summary of the changes, and then give the explanations:

  1. The goal of the PVP needs to be clarified. Its purpose is not to ensure reproducible builds of non-published software, but rather to provide for more reliable builds of libraries on Hackage. Reproducible builds should be handled exclusively through version freezing, the only known technique to actually give the necessary guarantees.

  2. Upper bounds should not be included on non-upgradeable packages, such as base and template-haskell (are there others?). Alternatively, we should establish some accepted upper bound on these packages, e.g. many people place base < 5 on their code.

  3. We should be distinguishing between mostly-stable packages and unstable packages. For a package like text, if you simply import Data.Text (Text, pack, reverse), or some other sane subset, there's no need for upper bounds.

    Note that this doesn't provide a hard-and-fast rule like the current PVP, but is rather a matter of discretion. Communication between library authors and users (via documentation or other means) would be vital to making this work well.

  4. For a package version A.B.C, a bump in A or B indicates some level of breaking change. As an opt-in approach, package authors are free to associate additional meaning with A and B beyond what the PVP requires. Libraries which use these packages are free to rely on the guarantees provided by package authors when placing upper bounds.

    Note that this is very related to point (3).

While I (Michael Snoyman) am the author of this proposal, the following people have reviewed the proposal and support it:

  • Bryan O'Sullivan
  • Felipe Lessa
  • Roman Cheplyaka
  • Vincent Hanquez
Reproducible builds

There are a number of simple cases that can result in PVP-compliant code not being buildable. These aren't just hypothetical cases; in my experience as both a package author and Stackage maintainer, I've seen these come up.

  • Package foo version 1.0 provides an instance for MonadFoo for IO and Identity. Version 1.1 removes the IO instance for some reason. Package bar provides a function:

    bar :: MonadFoo m => Int -> m Double

    Package bar compiles with both version 1.0 and 1.1 of foo, and therefore (following the PVP) adds a constraint to its cabal file foo >= 1.0 && < 1.2.

    Now a user decides to use the bar package. The user never imports anything from foo, and therefore has no listing for foo in the cabal file. The user code depends on the IO instance for MonadFoo. When compiled with foo 1.0, everything works fine. However, when compiled with foo 1.1, the code no longer compiles. (A concrete code sketch of this scenario is given just after this list.)

  • Similarly, instead of typeclass instances, the same situation can occur with module export lists. Consider version 1.0 of foo which provides:

    module Foo (foo1, foo2) where

    Version 1.1 removes the foo2 export. The bar package reexports the entire Foo module, and then a user package imports the module from bar. If the user package uses the foo2 function, it will compile when foo-1.0 is used, but not when foo-1.1 is used.
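To make the first of these scenarios concrete, here is a minimal sketch, collapsed into a single file, with the hypothetical foo and bar packages and the user code marked by comments:

class MonadFoo m where
  liftFoo :: Double -> m Double

-- Provided by foo-1.0; removed in foo-1.1.
instance MonadFoo IO where
  liftFoo = return

-- Provided by package bar; compiles against both foo-1.0 and foo-1.1.
bar :: MonadFoo m => Int -> m Double
bar = liftFoo . fromIntegral

-- User code: it never mentions foo in its cabal file, but silently relies
-- on the IO instance, so it breaks as soon as the solver picks foo-1.1.
main :: IO ()
main = bar 42 >>= print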

In both of these cases, the issue is the same: transitive dependencies are not being clamped down. The PVP makes an assumption that the entire interface for a package can be expressed in its version number, which is not true. I see three possible solutions to this:

  1. Try to push even more of a burden onto package authors, and somehow make them guarantee that their interface is completely immune to changes elsewhere in the stack. This kind of change was proposed on the libraries list. I'm strongly opposed to some kind of change like this: it makes authors' lives harder, and makes it very difficult to provide backwards compatibility in libraries. Imagine if transformers 0.4 adds a new MonadIO instance; the logical extreme of this position would be to disallow a library from working with both transformers 0.3 and 0.4, which will split Hackage in two.

  2. Modify the PVP so that instead of listing just direct dependencies, authors are required to list all transitive dependencies as well. So it would be a violation to depend on bar without explicitly listing foo in the dependency list. This will work, but it will be incredibly difficult to maintain. It will also greatly increase the time it takes for a new version of a deep dependency to be usable, due to the number of authors who will have to bump version bounds.

  3. Transfer responsibility for this to package users: if you first built your code against foo 1.0, you should freeze that information and continue building against foo 1.0, regardless of the presence of new versions of foo. Not only does this increase reproducibility, it's just common sense: it's entirely possible that new versions of a library will introduce a runtime bug, performance regression, or even fix a bug that your code depends on. Why should the reliability of my code base be dependent on the actions of some third party that I have no control over?
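As a rough illustration of option 3, exact versions can be pinned in a cabal.config file next to the project (the package names here are the hypothetical foo and bar from the examples above):

constraints: foo ==1.0,
             bar ==1.2.3

With the constraints frozen like this, later uploads of foo have no effect on the project's build until the file is deliberately updated.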

Non-upgradeable packages

There are some packages which ship with GHC and cannot be upgraded. I'm aware of at least base and template-haskell, though perhaps there are others (haskell98 and haskell2010?). In the past, there was good reason to place upper bounds on base, specifically with the base 3/4 split. However, we haven't had that experience in a while, and don't seem to be moving towards doing that again. In today's world, we end up with the following options:

  • Place upper bounds on base to indicate "I haven't tested this with newer versions of GHC." This then makes it difficult for users to test out that package with newer versions of GHC.
  • Leave off upper bounds on base. Users may then try to install a package onto a version of GHC on which the package hasn't been tested, which will result in either (1) everything working (definitely the common case based on my Stackage maintenance), or (2) getting a compilation error.
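In cabal terms, the two alternatives look roughly like this (choose one; the version numbers are purely illustrative):

-- Option 1: cap base at the latest version the package was tested with
build-depends: base >= 4.5 && < 4.7

-- Option 2: leave base unbounded above
build-depends: base >= 4.5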

I've heard two arguments to push us in the direction of keeping the upper bounds in this case, so I'd like to address them both:

  • cabal error messages are easier to understand than GHC error messages. I have two problems with that:
    • I disagree: cabal error messages are terrible. (I'm told this will be fixed in the next version of cabal.) Take the following output as a sample:

      cabal: Could not resolve dependencies:
      trying: 4Blocks-0.2
      rejecting: base-4.6.0.1/installed-8aa... (conflict: 4Blocks => base>=2 && <=4)
      rejecting: base-4.6.0.1, 4.6.0.0, 4.5.1.0, 4.5.0.0, 4.4.1.0, 4.4.0.0, 4.3.1.0, 4.3.0.0, 4.2.0.2, 4.2.0.1, 4.2.0.0, 4.1.0.0, 4.0.0.0, 3.0.3.2, 3.0.3.1 (global constraint requires installed instance)

      I've seen a number of users file bug reports not understanding that this message means "you have the wrong version of GHC."

    • Even if the error messages were more user-friendly, they make it more difficult to fix the actual problem: the code doesn't compile with the new version of GHC. Often times, I've been able to report an error message to a library author and, without necessarily even downloading the new version of GHC, he/she has been able to fix the problem.

  • Using upper bounds in theory means that cabal will be able to revert to an older version of the library that is compatible with the new version of GHC. However, I find it highly unlikely that there's often - if ever - a case where an older version of a library is compatible with a later version of GHC.

Mostly-stable, and finer-grained versioning

I'll combine the discussion of the last two points. I think the heart of the PVP debates really comes from mostly-stable packages. Let's contrast with the extremes. Consider a library which is completely stable, never has a breaking change, and has stated with absolute certainty that it never will again. Does anyone care about upper bounds on this library? They're irrelevant! I'd have no problem with including an upper bound, and I doubt even the staunchest PVP advocates would really claim it's a problem to leave it off.

On the other hand, consider an extremely unstable library, which is releasing massively breaking changes on a weekly basis. I would certainly agree in that case that an upper bound on that library is highly prudent.

The sticking point is the middle ground. Consider the following code snippet:

import Data.Text (Text, pack)

foo :: Text
foo = pack "foo"

According to the PVP as it stands today, this snippet requires an upper bound of < 1.2 on the text package. But let's just play the odds here: does anyone actually believe there's a real chance that the next iteration of text will break this code snippet? I highly doubt it; this is a stable subset of the text API, and I doubt it will ever be changing. The same can be said of large subsets of many other packages.
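For concreteness, the bound the PVP asks for here, assuming the snippet was developed against text 1.1, would look like:

build-depends: text >= 1.1 && < 1.2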

By putting in upper bounds in these cases, we run a very real risk of bifurcating Hackage into "those demanding the new text version for some new feature" vs "those who haven't yet updated their upper bounds to allow the new version of text."

The PVP currently takes an extremely conservative viewpoint on this, with the goal of solving just one problem: making sure code that compiles now continues to compile. As I demonstrated above, it doesn't actually solve that problem completely. And in addition, in this process, it has created other problems, such as this bifurcation.

So my proposal is that, instead of creating rigid rules like "always put an upper bound no matter what," we allow some common sense into the process, and also let package authors explicitly say "you can rely on this API not changing."

Categories: Offsite Blogs

Noam Lewis: Generate Javascript classes for your .NET types

Planet Haskell - Tue, 04/08/2014 - 9:36am

We open-sourced another library: ClosureExterns.NET (on github and nuget). It generates Javascript classes from .NET-based backend types, to preserve type “safety” (as safe as Javascript gets) across front- and backend. As a bonus you get Google closure annotations. The type annotations are understood by WebStorm (and other editors) and improve your development experience. Also, if you use Google Closure to compile or verify your code, it will take these types into account. We use it extensively with C#. We haven’t tried it with F#, but it’s supposed to work with any .NET type.

ClosureExterns.NET makes it easier to keep your frontend models in sync with your backend. The output is customizable – you can change several aspects of the generated code. For example you can change the constructor function definition, to support inheritance from some other Javascript function. For more details see ClosureExternOptions.

Getting Started

First, install it. Using nuget, install the package ClosureExterns.

Then, expose a method that generates your externs. For example, a console application:

public static class Program
{
    public static void Main()
    {
        var types = ClosureExternsGenerator.GetTypesInNamespace(typeof(MyNamespace.MyType));
        var output = ClosureExternsGenerator.Generate(types);
        Console.Write(output);
    }
}

You can also customize the generation using a ClosureExternsOptions object.

Example input/output

Input

class B
{
    public int[] IntArray { get; set; }
}

Output

var Types = {};

// ClosureExterns.Tests.ClosureExternsGeneratorTest+B
/** @constructor */
Types.B = function() {};
/** @type {Array.<number>} */
Types.B.prototype.intArray = null;

For a full example see the tests.


Tagged: .NET, c#, Javascript
Categories: Offsite Blogs

Why is this algorithm so much slower in Haskell than in C?

Haskell on Reddit - Tue, 04/08/2014 - 7:39am

Code in Haskell: http://pastebin.com/29wRDiU5 (compiled with ghc -O2 filename)
Code in C: http://pastebin.com/L33n9S88 (compiled with gcc -O3 --std=c99 -lm filename)

I need to generate fifty 40-bit numbers where each bit is set with probability p (in my examples p is set to 0.001). In the Haskell code this is represented by fractions of 10000 for speedup, so sorry if it's a bit unclear. It's part of a bigger algorithm, but according to a profiler it's the slowest part. My question is: why is the Haskell version so much slower than C? (~4 sec vs ~0.25 sec)
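For readers who don't want to open the pastebin links, here is a minimal, unoptimized Haskell sketch of the task being described (this is not the poster's code, and the choice of random generator is arbitrary):

import Control.Monad (replicateM)
import Data.Bits (setBit)
import Data.Word (Word64)
import System.Random (randomRIO)

-- Build one 40-bit number, setting each bit independently with probability p.
randomWord :: Double -> IO Word64
randomWord p = go 0 [0 .. 39]
  where
    go acc []     = return acc
    go acc (i:is) = do
      r <- randomRIO (0, 1) :: IO Double
      go (if r < p then setBit acc i else acc) is

-- Generate fifty such numbers with p = 0.001 and print them.
main :: IO ()
main = mapM_ print =<< replicateM 50 (randomWord 0.001)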

submitted by Envielox
[link] [34 comments]
Categories: Incoming News

Syntax highlighted markdown in this subreddit

Haskell on Reddit - Tue, 04/08/2014 - 5:28am

Any chance of getting codeblocks syntax highlighted as Haskell in this subreddit? The standard way these days is you write

``` haskell
x = foo `bar` mu where y = 2
```

E.g. this works on Github and Hakyll. Right now in this subreddit, if you write the above, you get:

haskell x = foo `bar` mu where y = 2

Which generates:

<p><code>haskell x = foo `bar` mu where y = 2 </code></p>

Meaning some nifty JavaScript could find such elements (identified by <p><code>…</code></p> whose content starts with haskell followed by a newline) and highlight them as Haskell code. If reddit ever explicitly supports code fences, then we can just turn this off.

Question is: can moderators add scripts to a subreddit due to the security issue? Maybe if a reddit admin approved it?

submitted by cookedbacon
[link] [10 comments]
Categories: Incoming News

New gtk2hs 0.12.4 release

gtk2hs - Wed, 11/21/2012 - 12:56pm

Thanks to John Lato and Duncan Coutts for the latest bugfix release! The latest packages should be buildable on GHC 7.6, and the cairo package should behave a bit nicer in ghci on Windows. Thanks to all!

~d

Categories: Incoming News