News aggregator

How to store complex values with Persistent

Haskell on Reddit - Wed, 07/02/2014 - 5:50pm

I'm trying to store a record with a field of type [Text] using Yesod/Persistent. This (fairly old) blog post mentions in the section labelled "Complex Data Structures Support!" that Persistent now supports entities with such fields, but I can't figure out how to actually do that. I get errors when I try the obvious thing (i.e., just declaring an entity with a field of type [Text]). Does anyone know how to do this?

Edit: The plot thickens:

When I give my entity a field questions :: [Text] and start the development server, I get an SQL error:

Migrating: CREATE TEMP TABLE "user_backup"("id" INTEGER PRIMARY KEY,"ident" VARCHAR NOT NULL,"password" VARCHAR NULL,"questions" VARCHAR NOT NULL,CONSTRAINT "unique_user" UNIQUE ("ident"))
Migrating: INSERT INTO "user_backup"("id","ident","password") SELECT "id","ident","password" FROM "user"
devel.hs: user error (SQLite3 returned ErrorConstraint while attempting to perform step.)

Second edit:

I deleted my database and Yesod created a new one which works, but is this avoidable?
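For reference, a minimal sketch of such an entity declaration (a standalone sketch with field names guessed from the error above, not the poster's actual model). Persistent serializes a [Text] field into a single column, which is why the migration adds a NOT NULL "questions" column - and SQLite cannot backfill that column for rows that already exist, hence the constraint error on a non-empty database:

{-# LANGUAGE EmptyDataDecls, FlexibleContexts, GADTs, OverloadedStrings,
             QuasiQuotes, TemplateHaskell, TypeFamilies #-}
import Data.Text (Text)
import Database.Persist.TH

share [mkPersist sqlSettings, mkMigrate "migrateAll"] [persistLowerCase|
User
    ident Text
    password Text Maybe
    questions [Text]     -- stored serialized in a single column
    UniqueUser ident
|]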

submitted by seriousreddit
[link] [3 comments]
Categories: Incoming News

Why does GHC always box polymorphic fields in datastructures?

Haskell on Reddit - Wed, 07/02/2014 - 3:20pm

Hello /haskell/, I was reading Johan Tibell's talk on GHC performance and one thing baffled me, and I quote:

Polymorphic fields are always stored as pointer-to-thing, which increases memory usage and decreases cache locality. Compare:

data Tree a  = Leaf    | Bin a !(Tree a) !(Tree a)
data IntTree = IntLeaf | IntBin {-# UNPACK #-} !Int !IntTree !IntTree

Specialized data types can be faster, but at the cost of code duplication. Benchmark your code and only use them if really needed.

It turns out that Haskell's polymorphic types are not a zero-cost abstraction. That would be perfectly normal in a unityped language, where generics can only be implemented through runtime checks, but can't Haskell do better than that? After all, GHC has type inference, it knows the types of everything at compile time, and it resolves typeclass instances at compile time, making typeclass methods zero-overhead. So seeing that a Tree Int for some reason cannot be represented the same way as an IntTree is... baffling. Is there a fundamental reason for that, or is it just a wart peculiar to GHC?
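A plausible way to see the constraint (the standard uniform-representation argument, not an official GHC statement): GHC compiles exactly one copy of each polymorphic function and constructor, so the same machine code must handle every instantiation of a. For example:

data Tree a = Leaf | Bin a !(Tree a) !(Tree a)

-- This single compiled function must walk a Tree Int, a Tree String,
-- or a Tree (Tree Bool) with identical machine code, so every 'a'
-- field has to have the same runtime layout: a pointer to a heap
-- object (which may also be an unevaluated thunk, since 'a' is lazy).
size :: Tree a -> Int
size Leaf        = 0
size (Bin _ l r) = 1 + size l + size r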

P.S. I'm in no way saying that the performance cost is terrible or that we should all stop using parametric polymorphism, I'm just baffled as to possible reasons for this strange decision.

Thank you.

submitted by aicubierre
[link] [29 comments]
Categories: Incoming News

PhD studentship on interval computation in Haskell

General haskell list - Wed, 07/02/2014 - 11:50am
-------------------------------------------------
PhD studentship - Interval Computation in Haskell
-------------------------------------------------
* Applicants should have a strong background in real analysis and functional programming.
* The closing date for applications is 18th July 2014.
* The project will be supervised by Michal Konečný, Aston University.
* The student will receive a 3-year studentship of £15,500/year.
* UK/EU students' fees are covered; non-EU students' fee is £10,914 in 2014/2015.
* The student will act as a teaching assistant for a distance learning course, approx. 7h/week.
* For more information, use the following links:
  o Description of the potential project topics[1]
  o AERN - a Haskell interval computation library[2]
  o Supervisor's home page[3]
  o Department research home page[4]
  o Details of the studentship and links to application forms[5]
(The above advert is also available at http:/
Categories: Incoming News

How to get good performance when processing binary data with conduit + vector/bytestring?

Haskell on Reddit - Wed, 07/02/2014 - 10:53am

I've been playing around with some data compression algorithms. Basically, code that loads some data from disk and then transforms, analyzes, shuffles around, compresses etc. I thought conduits and vector / bytestring would be a natural representation for those algorithms in the Haskell world, but I'm having a very hard time producing code that is elegant and fast.

Conduit seems to have a very large overhead that makes yielding individual words between stages unacceptably slow. Here:

runResourceT $ CB.sourceFile "test.dat" $= CC.concat $$ CL.fold (+) 0

Just adding up all the bytes in a file, basically. That already takes like .5s/MB, which is pretty horrendous. It seems clear that the way to get decent performance is to always await / yield some chunk of data.
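For comparison, here is a hedged sketch of the same sum written chunk-wise (assuming the conduit-1.x operators and module abbreviations used in this post), so the conduit machinery runs once per chunk rather than once per byte:

{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString as B
import qualified Data.Conduit.Binary as CB
import qualified Data.Conduit.List as CL
import Data.Conduit (($$))
import Control.Monad.Trans.Resource (runResourceT)

-- Sum all bytes of a file; the inner loop is B.foldl' over each
-- strict chunk, with conduit overhead paid only once per chunk.
sumBytes :: FilePath -> IO Int
sumBytes path = runResourceT $
    CB.sourceFile path $$ CL.fold step 0
  where
    step !acc = B.foldl' (\a w -> a + fromIntegral w) acc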

The first issue I have with this is that it renders many useful combinators unusable. For instance, I was hoping to use the conduitVector combinator to turn the input file into 1MB chunks of vector data for a block-sorting transform, but that seems impractical knowing how conduit performs on streams of single Word8s.

Further, I struggle with actually outputting those bytestring chunks. Imagine a conduit performing some transformation, like run-length coding or Huffman compression: you read data from the input and sometimes write a byte or more to the output, essentially a buffer that fills until we can yield a sizeable chunk.

Haskell, to my knowledge, lacks a structure like C++'s std::vector, with amortized O(1) append and the compact storage of an array. We could use a mutable vector for the chunk and progressively fill it, but then we face the problem of efficiently converting a vector back into a bytestring. While possible, it's a bit of a low-level mess and there is no direct support for it.
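For what it's worth, a sketch of that conversion under one assumption: the chunk is (or can be frozen into) a storable vector of Word8, in which case the vector's buffer can be reused as the bytestring's buffer without copying:

import qualified Data.ByteString.Internal as BI
import qualified Data.Vector.Storable as VS
import Data.Word (Word8)

-- O(1): reinterpret the frozen vector's ForeignPtr as ByteString
-- storage (a mutable chunk would be VS.unsafeFreeze'd first).
vectorToByteString :: VS.Vector Word8 -> BI.ByteString
vectorToByteString v =
    let (fp, off, len) = VS.unsafeToForeignPtr v
    in  BI.fromForeignPtr fp off len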

There are no mutable / growing bytestrings (sure, I know why), and the best construct we have seems to be a bytestring builder. It seems fairly wasteful to build up mconcat'enated chunks of builders, but I gave it a shot. Neither brief nor simple, but here is what I came up with:

type OutputBS = (Int, Int, BB.Builder)

emptyOutputBS :: Int -> OutputBS
emptyOutputBS chunkSize = (chunkSize, 0, mempty)

outputByte :: Monad m => Word8 -> OutputBS -> forall i. ConduitM i B.ByteString m OutputBS
outputByte w8 obs@(chunkSize, numBytes, _)
    | numBytes >= chunkSize = flushOutputBS obs >> addW8 (emptyOutputBS chunkSize)
    | otherwise             = addW8 obs
  where
    addW8 (chunkSize', numBytes', builder') =
        return (chunkSize', numBytes' + 1, builder' <> BB.word8 w8)

flushOutputBS :: Monad m => OutputBS -> forall i. ConduitM i B.ByteString m ()
flushOutputBS (_, numBytes, builder)
    | numBytes > 0 = yield (BL.toStrict $ BB.toLazyByteString builder)
    | otherwise    = return ()

processAndChunkOutput :: Monad m => Conduit B.ByteString m B.ByteString
processAndChunkOutput = flip evalStateT (emptyOutputBS 65536) loop
  where
    loop = lift await >>= \case
        Nothing -> get >>= lift . flushOutputBS
        Just bs -> do
            forM_ [0 .. B.length bs - 1] $ \i ->
                outputByteState (bs `B.index` i)
            loop
    outputByteState w8 = get >>= (\obs -> lift $ outputByte w8 obs) >>= put

This works as expected, but the performance is again around .5s/MB. Replacing the builder with a simple list that's reversed before being packed into a bytestring at the end is roughly 40% faster, but still too slow.

Looking further down the road for my compression code, I see more potential issues with this approach. Like if I want to use repa for a wavelet transform or FFT or so at a stage, again having to convert between vector and bytestring.

Can anybody recommend a way to speed up this conduit pipeline? Do you think this is in general a sound way of structuring the input, output, analysis and transformation stages of a compression program?

Thanks!

submitted by SirRockALot1
[link] [30 comments]
Categories: Incoming News

Closed type families, apartness, and occurs check

glasgow-user - Wed, 07/02/2014 - 9:19am
From the user manual, it sounds like a clause of a closed type family should be rejected once no substitution of the type could make it unify with the clause. If so, it doesn't seem to do an occurs check:

type family IsEq a b :: Bool where
  IsEq a a = True
  IsEq a b = False

forall a . IsEq a a :: Bool = forall (a :: k). 'True
forall a . IsEq a [a] :: Bool = forall a. IsEq a [a]

I came across this while trying to use Generics to find the immediate children of a term - this sort of non-reduction happens while comparing a type like (Term var) with a constructor argument of type var.

Brandon
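A self-contained illustration of the asymmetry (behaviour as observed with GHC 7.8-era closed type families; the helper names here are made up):

{-# LANGUAGE DataKinds, TypeFamilies #-}
import Data.Proxy (Proxy)

type family IsEq a b :: Bool where
  IsEq a a = 'True
  IsEq a b = 'False

-- Concrete arguments reduce as expected: Int and [Int] are apart,
-- so the first clause is skipped and the second fires.
concrete :: Proxy (IsEq Int [Int]) -> Proxy 'False
concrete = id

-- The analogous skolem version is rejected: GHC leaves IsEq a [a]
-- stuck, even though an occurs check would show that no substitution
-- can make 'a' unify with '[a]'.
-- stuck :: Proxy (IsEq a [a]) -> Proxy 'False
-- stuck = id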
Categories: Offsite Discussion

Help installing GHC

Haskell on Reddit - Wed, 07/02/2014 - 8:29am

I'm having a problem installing GHC on my device running Apple ARM Darwin (it's a jailbroken device). I already have gcc, so I was hoping to build a version of GHC for the device.

When I run 'distrib/hc-build' I get an error saying it can't work out the build platform.

Does anybody know how to fix this or install ghc on ARM Darwin?

http://i.imgur.com/obl031r.jpg

submitted by H1DD3NT3CH
[link] [6 comments]
Categories: Incoming News

Request for code review: Anagrammer (first Haskell program)

Haskell on Reddit - Wed, 07/02/2014 - 7:42am

I'm working on a Scrabble AI, and this is my implementation of a Trie: https://github.com/int3/scrabble-solver/blob/53f5774453cc4f218117051f7075923455370754/Trie.hs

I spent quite a while trying to see if it was possible to rewrite findConstraints and findAnagrams as folds on the trie, but I couldn't wrap my head around it. Do you think it is possible / reasonable to attempt this, or does the algorithm not fit the mold of a fold?
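One shape such a fold could take, sketched against a hypothetical Trie type rather than the one in the linked repo: if the trie exposes a catamorphism, traversals can often be phrased as instances of it.

import qualified Data.Map as M

data Trie = Trie { isWord :: Bool, children :: M.Map Char Trie }

-- The catamorphism: collapse a trie bottom-up, given one function
-- that combines a node's flag with its already-folded children.
foldTrie :: (Bool -> M.Map Char r -> r) -> Trie -> r
foldTrie f (Trie w cs) = f w (M.map (foldTrie f) cs)

-- Example: count the words stored in the trie.
countWords :: Trie -> Int
countWords = foldTrie $ \w cs -> (if w then 1 else 0) + sum (M.elems cs)

Whether findConstraints and findAnagrams fit this mold depends on whether they only need the children's results, or also the path taken to reach them; the latter usually needs an accumulating parameter threaded through the fold.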

I'm also interested in comments on general code quality!

submitted by int3_
[link] [2 comments]
Categories: Incoming News

Dead-simple graphing of custom tree-like data types

Haskell on Reddit - Wed, 07/02/2014 - 3:04am

Hi, there! I just wrote a package (link) that's a wrapper around the boilerplate necessary to graph data structures; if you've written up any data structures that are similar to trees, this should make it extremely straightforward to generate .png files for the current state of your ADT.

Though the purpose of this post is to make it easier for everybody to graph their stuff, unsolicited feedback is more than welcome, should you be so inclined ;)

submitted by gravityman-142857
[link] [23 comments]
Categories: Incoming News

Ketil Malde: Expected site information from SNPs

Planet Haskell - Wed, 07/02/2014 - 2:00am

Lately, I’ve been working on selecting SNPs; the main goal is often to classify individuals as belonging to some specific population. For instance, we might want to genotype a salmon to see whether it is from the local population, an escapee from a sea farm, or perhaps a migrant from a neighboring river. And if it’s an escapee, we might want to know which farm it escaped from. In short, we want to find SNPs that are diagnostic.

Typically, this is done by sequencing pools of individuals, mapping the reads to the reference genome, identifying variant positions, and ranking them - typically using FST, sometimes also using p-values for the confidence in an actual allele difference, and maybe filtering on sequencing coverage and base- or mapping quality. However, FST really isn’t a suitable tool for this purpose. I’m therefore proposing the following; let me know whether it makes sense.

Expected Site Information

For diagnostic SNPs, what we really would like to know is the amount of information that observing each site contributes. Using Bayes’ theorem, observing an allele a in some individual N gives us the following posterior probability for N belonging to some population A, where the allele frequency P(a|A) is known:

P(A|a) = P(a|A)P(A)/P(a)

Here, P(A) is our prior probability of N belonging to A, which after observing a is modified by a factor of

P(a|A)/P(a)

In order to assign N to one of several populations (either A or B, say), we are interested in the relative probabilities of the two hypotheses. In other words, we would like the odds for N belonging to one population or the other. Given the probabilities P(a|A) and P(a|B), and initial odds P(A)/P(B), we get

P(A|a)/P(B|a) = [P(a|A)P(A)/P(a)]/[P(a|B)P(B)/P(a)]

Canceling out P(a), we find that the prior odds are modified by:

P(a|A)/P(a|B)

That is, the ratio of this allele’s frequencies in each of the populations. For practical reasons, it is common to take the logarithm of the odds. This gives us scores that are additive and symmetric (so that switching the two populations gives us the same score with the opposite sign). Specifically, base two logarithms will give us the score in bits.

When observing a site, we may of course also encounter the alternative allele. By the same reasoning as above, we find that this allele modifies the odds by

[1-P(a|A)]/[1-P(a|B)]

Lacking any prior information, we can consider each population equally likely, and the likelihood of observing a particular allele is the average of the likelihoods in each population. The information gain from each possible allele is then averaged, weighted by this average likelihood. For a biallelic site with major allele frequencies p and q (and consequently, minor allele frequencies of 1-p and 1-q) in the two populations, the expected added information from the site then becomes:

I(p,q) = |(p+q)/2 log_2(p/q)| + |(1-(p+q)/2)log_2((1-p)/(1-q)) |

Note that we are here only interested in the amount of information gained, regardless of which hypothesis it favors, and thus we take the absolute values. For a site with multiple alleles enumerated by i, with frequency vectors p and q in the two populations, this generalizes to the weighted sum of log_2(p_i/q_i).

Unlike measures such as FST, I is additive (assuming independence between sites), so the information gained from observing multiple sites is readily calculated. Knowing the information gained from each site, we can also compare different sets of sites, and e.g. weigh a single site with minor allele frequencies (MAF) of, say, 0.1 and 0.3 against two sites with MAF of 0.2 and 0.3.
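As a sketch of how directly the site-information formula above translates into code (a hypothetical helper, not something from varan):

-- Expected site information I(p,q) for a biallelic site, where p and
-- q are the major allele frequencies in the two populations.
siteInfo :: Double -> Double -> Double
siteInfo p q =
    abs (wMajor * logBase 2 (p / q))
  + abs (wMinor * logBase 2 ((1 - p) / (1 - q)))
  where
    wMajor = (p + q) / 2   -- average likelihood of the major allele
    wMinor = 1 - wMajor    -- average likelihood of the minor allele

-- Additivity makes comparing site sets a matter of summation, e.g.
-- siteInfo at one site versus the sum of siteInfo over two others.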

It may also be instructive to compare this procedure to sequence alignment and position specific score matrices (PSSMs). In sequence alignment, a sequence of nucleotides or amino acids is scored by comparing its match to a target sequence against its match to some base model, using log odds scores. The base model to compare against is often implicit (typically sequences of random composition), but more elaborate models are also possible. Similarly, position specific frequency matrices are often converted to position specific score matrices using log odds. Calculating the information value from a set of observed alleles is then analogous to scoring an “alignment” of the set of observed alleles against two different sets of allele frequencies.

Allele frequency confidence intervals

In order to apply the above method in practice, we need to measure the allele frequencies in the population. This is problematic for two reasons. First, we do not have precise knowledge of the allele frequencies; we can only estimate them from our sequenced sample, which introduces sampling bias. Second, the sequencing process introduces additional artifacts. For instance, sequencing errors often result in substitutions, which are observed as apparent alleles. In addition, sequences can be incorrectly mapped, contain contamination, the reference genome can contain collapsed repeats, and the chemistry of the sequencing process is usually also biased – for instance, coverage is often biased by GC content. These artifacts often give the false appearance of variant positions.

One challenge with calculating site information from sequencing data (as opposed to using allele frequencies directly), is that such errors in the data can vastly overestimate the information content. For instance, an allele that appears to be fixed in one population means that any other observed allele will assign the individual to the alternative population - regardless of any other alleles. It is easy to see that an allele frequency of zero results in the odds going either to zero or infinity, and thus the log odds will go to either positive or negative infinity.

For diagnostic SNP discovery, it is more important to ensure that identified SNPs are informative than to precisely estimate the information content. Thus, we take a conservative approach and use upper and lower limits for the allele frequencies, calculating confidence intervals with the Agresti–Coull method. In addition, the limits are adjusted by a factor ε corresponding to the sequencing error rate.
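A sketch of those conservative bounds (a hypothetical helper; not necessarily how varan implements it): the Agresti–Coull interval for k observations of an allele out of n reads, widened by the error term ε.

-- Conservative allele-frequency bounds, clamped to [0,1]. z is the
-- normal quantile for the desired confidence (1.96 for ~95%), eps
-- the assumed sequencing error rate.
confBounds :: Double -> Double -> Int -> Int -> (Double, Double)
confBounds z eps k n = (max 0 (p - d - eps), min 1 (p + d + eps))
  where
    n' = fromIntegral n + z * z             -- adjusted sample size
    p  = (fromIntegral k + z * z / 2) / n'  -- adjusted frequency
    d  = z * sqrt (p * (1 - p) / n')        -- half-width of the interval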

Software implementation

I’ve implemented this (that is, the conservative measure) as described above in a software tool called varan. It parses sequence alignments in the standard “mpileup” format as output by the samtools mpileup command. It can currently output several different statistics and estimators, including expected site information. This is a work in progress, so please get in touch if you wish to try it out.

Categories: Offsite Blogs

Evaluate and compare two expressions

Haskell on Reddit - Wed, 07/02/2014 - 1:47am

import Test.HUnit (runTestTT, Test(TestLabel, TestList), (~?))
import Data.Function (on)

-- | A very simple data type for expressions.

data Expr = Const Int | Add Expr Expr deriving Show

-- | 'Expr' is an instance of 'Num'. You will get warnings because
-- many required methods are not implemented.

instance Num Expr where
  fromInteger = Const . fromInteger
  (+) = Add

-- | Equality of 'Expr's modulo associativity.

instance Eq Expr where
  (==) = error "Not yet implemented: (==)"

-- | A test expression.

testexpression1 :: Expr
testexpression1 = 3 + (4 + 5)

-- | A test expression.

testexpression2 :: Expr
testexpression2 = (3 + 4) + 5

-- | A test expression.

testexpression3 :: Expr
testexpression3 = 2 + 5 + 5


testexpression1 should be equal to testexpression2, and both should be NOT equal to testexpression3.

Yes, it's homework, so I'm searching for hints, no solutions! :)
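In the spirit of a hint rather than a solution: the template's otherwise-unused import of Data.Function (on) suggests comparing expressions through some derived value. A sketch of that shape (whether flattening to leaves is the right canonical form is the exercise):

-- (==) `on` f  means  \x y -> f x == f y. Picking an f that erases
-- associativity, e.g. the left-to-right list of Const leaves, is the
-- key idea.
toLeaves :: Expr -> [Int]
toLeaves (Const n) = [n]
toLeaves (Add l r) = toLeaves l ++ toLeaves r

-- instance Eq Expr where
--   (==) = (==) `on` toLeaves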

Thanks in advance.

submitted by fuuman1
[link] [7 comments]
Categories: Incoming News

Expressing "complexity" of higher kinds/type classes?

Haskell on Reddit - Tue, 07/01/2014 - 9:11pm

Hi there! The StackOverflow answer here talks about monads in terms of "Turing-complete compositions", with a good bit of debate in its comments about whether that's a meaningful statement.

Since it seems so natural to talk about, e.g., Functor-Applicative-Monad in terms of relative "power", I wonder if a formalization of that "power" in terms of computational complexity exists (possibly, e.g., a way towards showing whether a given kind allows for "Turing-completeness" in some way). Are there any references around for this kind of work?

Apologies if this isn't clear; because the information surrounding it is so scarce, I'm having a little trouble trying to pose the question clearly.

submitted by babblingbree
[link] [14 comments]
Categories: Incoming News

wren gayle romano: Haskell Planet

Planet Haskell - Tue, 07/01/2014 - 6:44pm

Hello all,

This is just a short blurb to announce that I'm changing it so that Haskell Planet only picks up on posts tagged with haskell planet. For the next while here, I'll probably be writing about social justice, gender theory, and personal topics more often than Haskell and theoretical mathematics; so I don't want to flood the Planet with too much off-topic material.

Worry not, the pendulum will swing back again— it always does. But lately, this is where my thoughts have been focused.



comments
Categories: Offsite Blogs

use of -Werror

haskell-cafe - Tue, 07/01/2014 - 6:25pm
Is there a consensus within the Haskell community on the use of -Werror in packages uploaded to Hackage? Most places where I've worked, building with -Werror has been considered a laudable ideal, but fixing all the old warnings in the codebase was too much work, so it wasn't done. I suppose -Werror could also cause problems in heterogeneous build environments.
Categories: Offsite Discussion

Twan van Laarhoven: Dependent equality with the interval

Planet Haskell - Tue, 07/01/2014 - 4:56pm

Here is a way to represent heterogeneous or dependent equalities, based on an interval type. In Homotopy Type Theory the interval is usually presented as a Higher Inductive Type with two constructors and a path between them. Here I will just give the two constructors; the path is implicit.

data I : Set where
  i₁ : I
  i₂ : I
  -- there is usually a path, i-edge : i₁ ≡ i₂

The eliminator is

i-elim : ∀ {a} {A : I → Set a} → (x₁ : A i₁) → (x₂ : A i₂) → (Eq A x₁ x₂) → (i : I) → A i
i-elim x₁ x₂ eq i₁ = x₁
i-elim x₁ x₂ eq i₂ = x₂

Here the type Eq is the dependent equality, which has type

Eq : ∀ {a} (A : I → Set a) → (x₁ : A i₁) → (x₂ : A i₂) → Set a

so we take a type parametrized by an interval, and two values of that type at the two endpoints of this interval. We can also define "heterogeneous reflexivity", a generalization of the usual refl function:

refl : ∀ {a} {A : I → Set a} → (x : (i : I) → A i) → Eq A (x i₁) (x i₂)

This function can be used to extract the third part of i-elim, with the reduction

refl (i-elim x₁ x₂ eq) = eq

I believe this can be used as the basis for an observational type theory, where Eq A and refl x reduce. The above is the first case for refl, the rest is "just" tedious structural recursion such as

Eq (\i → A i × B i) x y = Eq A (proj₁ x) (proj₁ y) × Eq B (proj₂ x) (proj₂ y)
refl (\i → x i , y i) = refl x , refl y

and

Eq (\i → A i → B i) f g = {x : A i₁} → {y : A i₂} → Eq A x y → Eq B (f x) (g y)
refl (\i → \(x : A i) → f i x) = \{x} {y} xy → refl (\i → f i (i-elim x y xy i))

or we can actually use the dependent equality and be more general

Eq (\i → Σ (x₁ : A i) (B i x₁)) x y =
  Σ (x₁y₁ : Eq A (proj₁ x) (proj₁ y))
    (Eq (\i → B i (i-elim (proj₁ x) (proj₁ y) x₁y₁ i)) (proj₂ x) (proj₂ y))
Eq (\i → (x : A i) → B i) f g =
  {x : A i₁} → {y : A i₂} → (xy : Eq A x y) →
  Eq (\i → B i (i-elim x y xy i)) (f x) (g y)

Of course there is a lot more to it, but that is not the subject of this post.

As a final remark: if you are not too touchy about typing, then refl could even be implemented with the path i-edge between i₁ and i₂

i-edge : Eq (\_ → I) i₁ i₂
i-elim x₁ x₂ eq i-edge = eq
refl foo = foo i-edge

But I'd rather not do that.

Categories: Offsite Blogs

New gtk2hs 0.12.4 release

gtk2hs - Wed, 11/21/2012 - 12:56pm

Thanks to John Lato and Duncan Coutts for the latest bugfix release! The latest packages should be buildable on GHC 7.6, and the cairo package should behave a bit nicer in ghci on Windows. Thanks to all!

~d

Categories: Incoming News