# News aggregator

### Problem with type in a function

### How to store complex values with Persistent

I'm trying to store a record with a field of type [Text] using Yesod/Persistent. This (fairly old) blog post mentions in the section labelled "Complex Data Structures Support!" that Persistent now supports entities with such fields, but I can't figure out how to actually do that. I get errors when I try the obvious thing (i.e., just declaring an entity with a field of type [Text]). Does anyone know how to do this?

Edit: The plot thickens:

When I give my entity a field questions :: [Text] and start the development server, I get an SQL error:

```
Migrating: CREATE TEMP TABLE "user_backup"("id" INTEGER PRIMARY KEY,"ident" VARCHAR NOT NULL,"password" VARCHAR NULL,"questions" VARCHAR NOT NULL,CONSTRAINT "unique_user" UNIQUE ("ident"))
Migrating: INSERT INTO "user_backup"("id","ident","password") SELECT "id","ident","password" FROM "user"
devel.hs: user error (SQLite3 returned ErrorConstraint while attempting to perform step.)
```

Second edit:

I deleted my database and Yesod created a new one which works, but is this avoidable?
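For reference, Persistent's entity syntax does accept list types like `[Text]` directly in the `config/models` block (or the `persistLowerCase` quasi-quoter). Here is a sketch matching the table in the error above; the `default='[]'` attribute is an assumption, but it should give the migration a value for the new NOT NULL column on existing rows instead of failing the constraint:

```
User
    ident Text
    password Text Maybe
    questions [Text] default='[]'
    UniqueUser ident
```

Without a default, adding a NOT NULL column to a table that already contains rows is exactly the kind of migration that fails, which would explain why a freshly created database works.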

submitted by seriousreddit[link] [3 comments]

### Why does GHC always box polymorphic fields in datastructures?

Hello /haskell/, I was reading Johan Tibell's talk on GHC performance and one thing baffled me, and I quote:

Polymorphic fields are always stored as pointer-to-thing, which increases memory usage and decreases cache locality. Compare:

```haskell
data Tree a = Leaf | Bin a !(Tree a) !(Tree a)

data IntTree = IntLeaf | IntBin {-# UNPACK #-} !Int !IntTree !IntTree
```

Specialized data types can be faster, but at the cost of code duplication. Benchmark your code and only use them if really needed.

It turns out that Haskell's polymorphic types are not a zero-cost abstraction. That would be perfectly normal in a unityped language, where generics can only be implemented through runtime checks, but can't Haskell do better than that? After all, GHC has type inference, it knows the types of everything at compile time, it can resolve typeclass instances at compile time, making typeclass methods zero-overhead, etc. So seeing that a Tree Int for some reason cannot be represented the same way as an IntTree is... baffling. Is there a fundamental reason for that, or is it just a wart peculiar to GHC?
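For anyone wanting to experiment, here is a minimal sketch of the manual-specialization workaround the quote implies (names are mine, not from the talk): convert the polymorphic tree into the unpacked one at the boundary, so hot code runs on `IntTree` while the rest of the program stays generic.

```haskell
-- The two representations from the talk: the polymorphic field in 'Bin'
-- is stored as a pointer-to-thing, while 'IntBin' stores its Int unboxed.
data Tree a = Leaf | Bin a !(Tree a) !(Tree a)

data IntTree = IntLeaf | IntBin {-# UNPACK #-} !Int !IntTree !IntTree

-- Manual specialization: copy a Tree Int into the unpacked representation.
specialize :: Tree Int -> IntTree
specialize Leaf        = IntLeaf
specialize (Bin x l r) = IntBin x (specialize l) (specialize r)

-- Hot-path code then works on the compact IntTree.
sumInt :: IntTree -> Int
sumInt IntLeaf        = 0
sumInt (IntBin x l r) = x + sumInt l + sumInt r
```

The conversion costs one traversal and one extra copy, which is the code-duplication trade-off the talk warns about in data form.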

P.S. I'm in no way saying that the performance cost is terrible or that we should all stop using parametric polymorphism, I'm just baffled as to possible reasons for this strange decision.

Thank you.

submitted by aicubierre[link] [29 comments]

### PhD studentship on interval computation in Haskell

### How to get good performance when processing binary data with conduit + vector/bytestring?

I've been playing around with some data compression algorithms. Basically, code that loads some data from disk and then transforms, analyzes, shuffles around, compresses etc. I thought conduits and vector / bytestring would be a natural representation for those algorithms in the Haskell world, but I'm having a very hard time producing code that is elegant and fast.

Conduit seems to have a very large overhead that makes yielding individual words between stages unacceptably slow. Here:

```haskell
runResourceT $ CB.sourceFile "test.dat" =$ CC.concat $$ CL.fold (+) 0
```

Just adding up all the bytes in a file, basically. That already takes like .5s/MB, which is pretty horrendous. It seems clear that the way to get decent performance is to always await / yield some chunk of data.
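One way to act on that observation (a sketch; `sumChunk` is a name I made up) is to fold over whole ByteString chunks with a strict fold, so no individual Word8 ever travels through the pipeline, e.g. `CB.sourceFile "test.dat" $$ CL.fold sumChunk 0` with:

```haskell
import qualified Data.ByteString as B

-- Strict, chunk-at-a-time accumulation: the conduit delivers whole
-- ByteString chunks, and we fold over each one with B.foldl'.
sumChunk :: Int -> B.ByteString -> Int
sumChunk acc bs = B.foldl' (\a w -> a + fromIntegral w) acc bs
```

The per-element work then happens inside bytestring's tight fold loop rather than in conduit's await/yield machinery.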

The first issue I have with this is that it renders many useful combinators unusable. For instance, I was hoping to use the conduitVector combinator to turn the input file into 1MB chunks of vector data for a blocksorting transform, but that seems impractical knowing how conduit performs with streams of singular Word8s.

Further, I struggle with actually outputting those bytestring chunks. Imagine a conduit performing some transformation, like run length coding or huffman compression. You read data from the input, and sometimes write a byte or more to the output. Just a buffer that fills till we can yield a sizeable chunk.

Haskell, to my knowledge, lacks a structure like C++'s std::vector, with amortized O(1) append and the compact storage of an array. We could use a mutable vector for the chunk and progressively fill it, but then we're faced with the problem of efficiently converting a vector back into a bytestring. While possible, it's a bit of a low-level mess and there is no direct support for it.
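For reference, the bytestring package does expose the pieces for filling a buffer in place, though it is admittedly the low-level mess in question. A sketch (`fillBytes` is a made-up name) that builds a fixed-size ByteString without an intermediate vector:

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Internal as BI
import Data.Word (Word8)
import Foreign.Storable (pokeByteOff)

-- Allocate a ByteString of known length and fill it byte-by-byte in place
-- via bytestring's internal unsafeCreate, avoiding a vector -> bytestring copy.
fillBytes :: Int -> (Int -> Word8) -> B.ByteString
fillBytes n f = BI.unsafeCreate n $ \p ->
    mapM_ (\i -> pokeByteOff p i (f i)) [0 .. n - 1]
```

In a conduit, one would fill such a buffer up to the chunk size and then yield it, instead of accumulating a builder.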

There are no mutable / growing bytestrings (sure, I know why), and the best construct we have seems to be a bytestring builder. It seems fairly wasteful to build up mconcat'enated chunks of builders, but I gave that a shot. Neither brief, nor simple, but here is what I came up with:

```haskell
type OutputBS = (Int, Int, BB.Builder)

emptyOutputBS :: Int -> OutputBS
emptyOutputBS chunkSize = (chunkSize, 0, mempty)

outputByte :: Monad m => Word8 -> OutputBS -> forall i. ConduitM i B.ByteString m OutputBS
outputByte w8 obs@(chunkSize, numBytes, _)
    | numBytes >= chunkSize = flushOutputBS obs >> addW8 (emptyOutputBS chunkSize)
    | otherwise             = addW8 obs
  where
    addW8 (chunkSize', numBytes', builder') =
        return (chunkSize', numBytes' + 1, builder' <> BB.word8 w8)

flushOutputBS :: Monad m => OutputBS -> forall i. ConduitM i B.ByteString m ()
flushOutputBS (_, numBytes, builder)
    | numBytes > 0 = yield (BL.toStrict $ BB.toLazyByteString builder)
    | otherwise    = return ()

processAndChunkOutput :: Monad m => Conduit B.ByteString m B.ByteString
processAndChunkOutput = flip evalStateT (emptyOutputBS 65536) loop
  where
    loop = lift await >>= \case
        Nothing -> get >>= lift . flushOutputBS
        Just bs -> do
            forM_ [0 .. B.length bs - 1] $ \i ->
                outputByteState (bs `B.index` i)
            loop
    outputByteState w8 = get >>= (\obs -> lift $ outputByte w8 obs) >>= put
```

This works as expected, but the performance is also at .5s/MB. Replacing the builder with a simple list that's reversed before being packed into a bytestring in the end is ~40% or so faster, but still too slow.

Looking further down the road for my compression code, I see more potential issues with this approach. Like if I want to use repa for a wavelet transform or FFT or so at a stage, again having to convert between vector and bytestring.

Can anybody recommend a way to speed up this conduit pipeline? Do you think this is in general a sound way of structuring the input, output, analysis and transformation stages of a compression program?

Thanks!

submitted by SirRockALot1[link] [30 comments]

### Closed type families, apartness, and occurs check

### Help installing GHC

I'm having a problem installing GHC on my device running Apple ARM Darwin (it's a jailbroken device). I already have gcc, so I was hoping to build a version of GHC for the device.

When I run 'distrib/hc-build' I get an error saying it can't work out the build platform.

Does anybody know how to fix this or install ghc on ARM Darwin?

http://i.imgur.com/obl031r.jpg

submitted by H1DD3NT3CH[link] [6 comments]

### Request for code review: Anagrammer (first Haskell program)

I'm working on a Scrabble AI, and this is my implementation of a Trie: https://github.com/int3/scrabble-solver/blob/53f5774453cc4f218117051f7075923455370754/Trie.hs

I spent quite a while trying to see if it was possible to rewrite findConstraints and findAnagrams as folds on the trie, but I couldn't wrap my head around it. Do you think it is possible / reasonable to attempt this, or does the algorithm not fit the mold of a fold?
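On the fold question: if the Trie is shaped roughly like the sketch below (a guess; the type in the linked repo may differ), a generic fold (catamorphism) can be written once and reused, e.g. to count stored words. Whether findAnagrams fits depends on whether it threads state across sibling branches.

```haskell
import qualified Data.Map as M

-- A hypothetical trie shape; the actual one in the repo may differ.
data Trie = Trie { isWord :: Bool, children :: M.Map Char Trie }

-- A fold: combine the node's word flag with the already-folded children,
-- each labelled by the character on its edge.
foldTrie :: (Bool -> [(Char, r)] -> r) -> Trie -> r
foldTrie f (Trie w cs) = f w [ (c, foldTrie f t) | (c, t) <- M.toList cs ]

-- Example: count the words stored in the trie.
countWords :: Trie -> Int
countWords = foldTrie $ \w subs -> (if w then 1 else 0) + sum (map snd subs)
```

Algorithms that prune branches based on remaining letters (as an anagrammer does) can still fit this mold if the fold result is a function of the remaining-letter state.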

I'm also interested in comments on general code quality!

submitted by int3_[link] [2 comments]

### Dead-simple graphing of custom tree-like data types

Hi, there! I just wrote a package (link) that's a wrapper around the boilerplate necessary to graph data structures; if you've written up any data structures that are similar to trees, this should make it extremely straightforward to generate .png files for the current state of your ADT.

Though the purpose of this post is to make it easier for everybody to graph their stuff, unsolicited feedback is more than welcome, should you be so inclined ;)

submitted by gravityman-142857[link] [23 comments]

### Ketil Malde: Expected site information from SNPs

Lately, I’ve been working on selecting SNPs. The main goal is often to classify individuals as belonging to some specific population. For instance, we might want to genotype a salmon to see whether it is from the local population, an escapee from a sea farm, or perhaps a migrant from a neighboring river. And if it’s an escapee, we might want to know which farm it escaped from. In short, we want to find SNPs that are *diagnostic*.

Typically, this is done by sequencing pools of individuals, mapping the reads to the reference genome, identifying variant positions, and ranking them - typically using F_ST, sometimes also using p-values for the confidence in an actual allele difference, and maybe filtering on sequencing coverage and base or mapping quality. However, F_ST really isn’t a suitable tool for this purpose. I’m therefore proposing the following. Let me know if it makes sense or not.

For diagnostic SNPs, what we really would like to know is the amount of *information* observing each site contributes. Using Bayes’ theorem, observing an allele *a* in some individual *N* gives us the following posterior probability for *N* belonging to some population *A*, where the allele frequency, P(a|A), is known:

P(A|a) = P(a|A) P(A) / P(a)

Here, P(A) is our prior probability of *N* belonging to *A*, which after observing *a* is modified by a factor of

P(a|A) / P(a)

In order to assign *N* to one of several populations (either *A* or *B*, say), we are interested in the *relative* probabilities of the two hypotheses. In other words, we would like the *odds* for *N* belonging to one population or the other. Given the probabilities P(a|A) and P(a|B), and initial odds P(A)/P(B), we get

P(A|a) / P(B|a) = [P(a|A) P(A) / P(a)] / [P(a|B) P(B) / P(a)]

Canceling out P(a), we find that the prior odds are modified by:

P(a|A) / P(a|B)

That is, the ratio of this allele’s frequencies in each of the populations. For practical reasons, it is common to take the logarithm of the odds. This gives us scores that are additive and symmetric (so that switching the two populations gives us the same score with the opposite sign). Specifically, base two logarithms will give us the score in bits.
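To make the additivity and symmetry concrete, a tiny sketch (the function name is mine): the bit score contributed by observing an allele with frequency p in population A and q in population B.

```haskell
-- Log-odds score in bits for one observed allele, with frequency p in
-- population A and q in population B. Positive favors A, negative favors B.
score :: Double -> Double -> Double
score p q = logBase 2 (p / q)
```

For example, score 0.8 0.2 is +2 bits toward A; swapping the populations gives -2 bits, and scores from independent sites simply add.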

When observing a site, we may of course also encounter the alternative allele. By the same reasoning as above, we find that this allele modifies the odds by

[1 − P(a|A)] / [1 − P(a|B)]

Lacking any prior information, we can consider each population equally likely, and the likelihood of observing a particular allele is the average of the likelihood in each population. The information gain from each possible allele is then averaged, weighted by this average likelihood. For a biallelic site with major allele frequencies *p* and *q* (and consequently, minor allele frequencies of 1 − *p* and 1 − *q*) in the two populations, the expected added information from the site then becomes:

I = [(p + q)/2] |log2(p/q)| + [((1 − p) + (1 − q))/2] |log2((1 − p)/(1 − q))|

Note that we are here only interested in the amount of information gained, regardless of which hypothesis it favors, and thus we take the absolute values. For a site with multiple alleles enumerated by *i* and with frequency vectors **p** and **q** in the two populations, this generalizes to the weighted sum of log2(p_i / q_i).

Unlike measures like F_ST, the measure *I* is *additive* (assuming independence between sites), so the information gained from observing multiple sites is readily calculated. From the information gained at each site, we can also compare different sets of sites, e.g., compare the value of a single site with a minor allele frequency (MAF) of, say, 0.1 and 0.3 in the two populations to two sites with MAFs of 0.2 and 0.3.

It may also be instructive to compare this procedure to sequence alignment and position specific score matrices (PSSMs). In sequence alignment, a sequence of nucleotides or amino acids is scored by comparing its match to a target sequence against its match to some base model, using log-odds scores. The base model to compare against is often implicit (typically sequences of random composition), but more elaborate models are also possible. Similarly, position specific frequency matrices are often converted to position specific score matrices using log odds. Calculating the information value from a set of observed alleles is then analogous to scoring an “alignment” of the set of observed alleles against two different sets of allele frequencies.

**Allele frequency confidence intervals**

In order to apply the above method in practice, we need to measure the allele frequencies in the population. This is problematic for two reasons. First, we do not have precise knowledge of the allele frequencies; we can only estimate them from our sequenced sample. This introduces sampling bias. Second, the sequencing process introduces additional artifacts. For instance, sequencing errors often result in substitutions, which are observed as apparent alleles. In addition, sequences can be incorrectly mapped, contain contamination, the reference genome can contain collapsed repeats, and the chemistry of the sequencing process is usually also biased – for instance, coverage is often biased by GC content. These artifacts often give the false appearance of variant positions.

One challenge with calculating site information from sequencing data (as opposed to using allele frequencies directly) is that such errors in the data can lead to vastly overestimated information content. For instance, if an allele appears to be fixed in one population, then any other observed allele will assign the individual to the alternative population - regardless of any other alleles. It is easy to see that an allele frequency of zero sends the odds to either zero or infinity, and thus the log odds to either positive or negative infinity.

For diagnostic SNP discovery, it is more important to ensure that identified SNPs are informative than to precisely estimate the information content. Thus, we take a conservative approach and use upper and lower limits for the allele frequencies, calculating confidence intervals with the Agresti-Coull method. In addition, the limits are adjusted by a factor *ε* corresponding to the sequencing error rate.
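As a sketch of that conservative step (names are mine; the varan tool may compute it differently), the Agresti-Coull interval for x observed alleles out of n reads, with z = 1.96 for a 95% interval:

```haskell
-- Agresti-Coull confidence interval for a binomial proportion:
-- add z^2 pseudo-observations, then apply the normal approximation
-- around the adjusted estimate, clamping to [0, 1].
agrestiCoull :: Double -> Int -> Int -> (Double, Double)
agrestiCoull z x n = (max 0 (p - hw), min 1 (p + hw))
  where
    n' = fromIntegral n + z * z
    p  = (fromIntegral x + z * z / 2) / n'
    hw = z * sqrt (p * (1 - p) / n')
```

Using the interval limit nearest the other population's frequency keeps an apparently fixed allele (x = 0 or x = n) from producing infinite log odds, since the limit never reaches exactly 0 or 1 for finite n.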

I’ve implemented this (that is, the conservative measure) as described above in a software tool called varan. It parses sequence alignments in the standard “mpileup” format as output by the **samtools mpileup** command. It can currently output several different statistics and estimators, including expected site information. This is a work in progress, so please get in touch if you wish to try it out.

### Evaluate and compare two expressions

```haskell
import Test.HUnit (runTestTT, Test (TestLabel, TestList), (~?))
import Data.Function (on)

-- | A very simple data type for expressions.
data Expr = Const Int | Add Expr Expr deriving Show

-- | 'Expr' is an instance of 'Num'. You will get warnings because
-- many required methods are not implemented.
instance Num Expr where
    fromInteger = Const . fromInteger
    (+) = Add

-- | Equality of 'Expr's modulo associativity.
instance Eq Expr where
    (==) = error "Not yet implemented: (==)"

-- | A test expression.
testexpression1 :: Expr
testexpression1 = 3 + (4 + 5)

-- | A test expression.
testexpression2 :: Expr
testexpression2 = (3 + 4) + 5

-- | A test expression.
testexpression3 :: Expr
testexpression3 = 2 + 5 + 5
```

---

testexpression1 should be equal to testexpression2. And they should NOT be equal to testexpression3.

Yes, it's homework, so I'm searching for hints, no solutions! :)

Thanks in advance.

submitted by fuuman1[link] [7 comments]

### Expressing "complexity" of higher kinds/type classes?

Hi there! The StackOverflow answer here talks about monads in terms of "Turing-complete compositions", with a good bit of debate in its comments about whether that's a meaningful statement.

Since it seems so natural to talk about, e.g., Functor-Applicative-Monad in terms of relative "power", I wonder if a formalization of that "power" in terms of computational complexity exists (possibly, e.g., a way towards showing whether a given kind allows for "Turing-completeness" in some way). Are there any references around for this kind of work?

Apologies if this isn't clear; because the information surrounding it is so scarce, I'm having a little trouble trying to pose the question clearly.

submitted by babblingbree[link] [14 comments]

### use of -Werror

### New gtk2hs 0.12.4 release

Thanks to John Lato and Duncan Coutts for the latest bugfix release! The latest packages should be buildable on GHC 7.6, and the cairo package should behave a bit nicer in ghci on Windows. Thanks to all!

~d