News aggregator

alex + happy parser problem

haskell-cafe - Sat, 05/17/2008 - 2:22am
Hi, To learn alex and happy, I'm trying to write a parser for a simple expression language. When I wrote my own lexer and just used happy, it was fine. When I used the basic wrapper of alex it was also fine. However, when I use the posn wrapper to get position information, I get a strange exception when the parse error occurs at the end of the input. For example, parsing "1 + " yields "Internal Happy error" rather than something like "Parse error at line 1, column 5" The lexer and parser are attached. Can anyone see what I'm doing wrong? calling parse "1+" yields a "Internal Happy error" instead of a parse error as I would expect. Thanks, Sean ------------------
Categories: Offsite Discussion

ezekiel smithburg: car modding

Planet Haskell - 2 hours 5 min ago

i’ve had a strong urge over the past while to combine my interest in all things mechanical, electronic/electrical, and programming by getting an oldish (late eighties/early nineties) car and modding it to have all sorts of goodies and gadgets like newer cars do, as well as ones that newer cars in general don’t have yet.  unfortunately, i have neither the time or money to get into such involved and generally pointless endeavors at the moment — but this post is my commitment to myself that i WILL do it if i can manage it in the future.

that is all.

Categories: Offsite Blogs

Building GHC

glasgow-user - 3 hours 8 min ago
When I look at the instructions http://hackage.haskell.org/trac/ghc/wiki/Building/GettingTheSources they say the distribution is split in two but when I look at http://darcs.haskell.org/ it seems to be split in three ghc-HEAD-2007-12-16-..> 16-Dec-2007 14:25 102M ghc-HEAD-2007-12-16-..> 16-Dec-2007 14:20 92M ghc-HEAD-2007-12-16-..> 16-Dec-2007 14:12 57M I'm assuming I unpack these on top of each other but I'd be grateful if someone could confirm this. Thanks, Dominic.
Categories: Offsite Discussion

Proving my point

haskell-cafe - 3 hours 35 min ago
"test.l" (line 7, column 1): unexpected end of input expecting "(", Lambda abstraction, Let binding, Atom, end of input or Function application I obviously don't know anything about Parsec's inner workings. I'm going to investigate as soon as I stopped despairing.
Categories: Offsite Discussion

IntMap intersectionWith type is too restrictive

libraries list - 4 hours 59 min ago
Hi, This is a simple bug report, more or less. The type of Data.IntMap.intersectionWith is too restrictive. Presently (GHC 8.2, http://darcs.haskell.org/packages/containers/Data/) it's declared as (a->b->a) -> IntMap a -> IntMap b -> IntMap a, but if you just remove the type declarations it's inferred as (a->b->c) -> IntMap a -> IntMap b -> IntMap c. Similarly for intersectionWithKey. The more general a->b->c type is also the type of Data.Map.intersectionWith, and it's quite useful, such as (intersectionWith (,)) Scott
Categories: Offsite Discussion

ChessLibrary project

haskell-cafe - 5 hours 48 min ago
All, I've made my first code drop for a new project I'm working on (though it's really an extension of work I've been doing for about 5 years). Information and code can be found at http://code.haskell.org/ChessLibrary/ . Comments and suggestions warmly welcomed. Also, I'm looking for an experienced haskeller to serve as a mentor for this project - just to provide regular code reviews and architectural advice, mainly. Thanks for looking!
Categories: Offsite Discussion

Edward Kmett: Forgetful Laziness

Planet Haskell - 5 hours 52 min ago

Does anyone know of any work on "forgetful laziness?"

The basic idea being that for each thunk instead of overwriting it with the answer as usual in call-by-need, you'd just write a forwarding pointer, and allow GC to collect the answers over time.

This results in recalculation and may be subject to thrashing, so the obvious fix would be either

  1. a 'forget at most once' policy, which would only mitigate the kind of memory leaks you get from laziness under limited conditions, but which has a worst case payout of doubling the workload or
  2. an exponential backoff on how often you'll try to recollect a given value, which should preserve for practical purposes the asymptotic behavior of any algorithm, but with a much larger constant for pathological access patterns. [Edit: it may affect asymptotic behavior, because you could lose sharing]

This would allow the recollection of large CAFs, etc. eventually once they had bitrotted long enough.

Not sure if its worth the cost of recalculating and of storing any backoff counter, but most of the horror stories you hear about Haskell center around its occasional horrific memory usage profile.

Tuning points might include studying average thunk lifetimes to construct thunk access profiles rather than use a naive exponential backoff.

It also may exascerbate the opposite problem where naive code often builds up a tower of thunks it needs to evaluate all at once in order to provide an answer (i.e. when working with a lazy accumulating parameter).

Categories: Offsite Blogs

elem of infinite set of tuple

haskell-cafe - 9 hours 35 min ago
I don't know how Haskell should behave on this. Consider this function: elemOf (x,y) = (x,y) `elem` [ (a,b) | a <- [0..], b <- [0..] ] If I try to query elemOf (1,1), the program keeps searching and searching but it never makes it. But if I query elemOf (0,1) (or anything as long as the first element is 0), it can find it easily. I wonder how it's handled. from (0,0), which will never finish since the limit of the second element is infinite.
Categories: Offsite Discussion

Std lib equivalent for liftM concat . sequence

haskell-cafe - 11 hours 5 min ago
A couple of days ago I had need for: but found no such thing in the std libs, except perhaps for msum (I don't want to add instances for MonadPlus. Should I have to?). Have I missed something trivial? Alistair
Categories: Offsite Discussion

How to use arrays efficiently?

haskell-cafe - 12 hours 58 min ago
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe< at >haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Categories: Offsite Discussion

"Osfameron": More Liverpool talks: Perl, Erlang

Planet Haskell - 13 hours 39 min ago
Thom has posted the details of the next Liverpool geekup, Tuesday May 27th at 3345 Parr Street. Last time, I admitted to programming in Perl and got ribbed about it being unreadable. Fueled by the fervour of the righteous (and maybe a pint or two...

Read and post comments | Send to a friend

Categories: Offsite Blogs

fgl Data.Graph examples

haskell-cafe - 15 hours 40 min ago
Can someone post some simple examples of using the Data.Graph library? So I can define a simple graph let g = buildG (1,2) [(1,2), (2,1)] but how do i label my edges and nodes? i want to be able to name my nodes, and add weights to the edges and be able to update those weights as well. Thanks, Anatoly
Categories: Offsite Discussion

"Osfameron": F# talk in Liverpool

Planet Haskell - 20 hours 29 min ago


I'd been out of Liverpool for 3 years or so, and I completely missed GeekUp: a loosely affiliated, grassroots tech meetup society in the North West. The Liverpool branch is pretty active, and linked with various other groups, such as the DotNet ...

Read and post comments | Send to a friend

Categories: Offsite Blogs

Proposal/Idea: Split up base even further

libraries list - 21 hours 4 min ago
Hello librarians, Here is an idea to resolve some of the problems that remain with the base package: The problem: Users can not update their version of the 'base' package, since it is tied to Ghc. If a useful new function is added using it requires updating the compiler. On the one hand this puts people off using these new functions. On the other hand it also makes updates to for instance Data.List take way longer to reach the users because we have to wait for the next Ghc release. The proposal: User code should NEVER import the library that is now called 'base', it should provide an internal API only. For example, base can include the modules Internal.Data.List Internal.Control.Monad Internal.IO etc. Which are as minimal and low level as possible, containing just the things that Ghc RELIES on like: - type classes for which instances can be derived (Eq, Ord, etc.) - data types and classes with syntactic sugar (Bool, [], Num, Monad) - low level stuff like IO and
Categories: Offsite Discussion

getting output from shell commands

haskell-cafe - Thu, 05/15/2008 - 8:18pm
Hi, I would like to be able to run a shell command like curl and process the output in my Haskell program, but I couldn't find anything helpful about it. Something like main = do args <- getArgs inp <- exec "curl" args putStrLn (processInput inp) would be very helpful. Regards Philip
Categories: Offsite Discussion

-optc-O2 considered useful

glasgow-user - Thu, 05/15/2008 - 7:47pm
I discovered something today I didn't know. gcc -O2 can optimise out the computed jumps GHC produces in tight loops. Consider this program, import Data.Array.Vector import Data.Bits main = print . sumU . mapU (*2) . mapU (`shiftL` 2) $ replicateU (100000000 :: Int) (5::Int) Yields this core: $wfold :: Int# -> Int# -> Int# $wfold = \ (ww_sMp :: Int#) (ww1_sMt :: Int#) -> case ww1_sMt of wild_X10 { __DEFAULT -> $wfold (+# ww_sMp 40) (+# wild_X10 1); 100000000 -> ww_sMp And -O2 -fasm: Main_zdwfold_info: movq %rdi,%rax cmpq $100000000,%rax jne .LcOk movq %rsi,%rbx jmp *(%rbp) .LcOk: incq %rax addq $40,%rsi movq %rax,%rdi jmp Main_zdwfold_info $ time ./sum 4000000000 ./sum 0.19s user 0.00s system 101% cpu 0.188 total -O2 -fvia-C -optc-O: Main_zdwfold_info: cmpq $100000000, %rdi jne .L3 movq
Categories: Offsite Discussion

Write Haskell as fast as C. [Was: Re: GHCpredictability]

haskell-cafe - Thu, 05/15/2008 - 7:31pm
I've written an extended post on how to understand and reliably optimise code like this, looking at it all the way down to the assembly. The result are some simple rules to follow for generated code as good as gcc -O2. Enjoy, http://cgi.cse.unsw.edu.au/~dons/blog/2008/05/16#fast
Categories: Offsite Discussion

runInteractiveCommand: program ends before writingor reading all the output

haskell-cafe - Thu, 05/15/2008 - 6:40pm
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe< at >haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Categories: Offsite Discussion

Type checker loops with innocent looking associated type synonym

glasgow-user - Thu, 05/15/2008 - 2:46pm
I've written this cute ;-) piece of code involving an associated type synonym, {-# OPTIONS_GHC -fglasgow-exts #-} class ZipWithA a where type Elem a :: * zipWithA :: [Elem a] -> a instance ZipWithA [a] where type Elem [a] = a zipWithA xs = xs instance ZipWithA b => ZipWithA ([a] -> b) where type Elem ([a] -> b) = a -> Elem b zipWithA fs = zipWithA . zipWith ($) fs and it behaves as intended: *Main> zipWithA [2, 3] :: [Int] [2,3] *Main> zipWithA [pred, succ] [2, 3] :: [Int] [1,4] *Main> zipWithA [(+), (-)] [2, 3] [5, 7] :: [Int] [7,-4] However, as soon as the overloading cannot be resolved, the type checker loops: *Main> zipWithA [head, last] ["more than", "haskell"] :: String "ml" *Main> zipWithA [head, last] ["more than", "haskell"] < Just to be clear, I do not expect GHC to be able to type this last example, I would just have thought it would want to tell me in finite time that it cannot resolv
Categories: Offsite Discussion

Don Stewart (dons): Write Haskell as fast as C: exploiting strictness, laziness and recursion

Planet Haskell - Thu, 05/15/2008 - 1:00pm

In a recent mailing list thread Andrew Coppin complained of poor performance with "nice, declarative" code for computing the mean of a very large list of double precision floating point values: import System.Environment import Text.Printf mean :: [Double] -> Double mean xs = sum xs / fromIntegral (length xs) main = do [d] <- map read `fmap` getArgs printf "%f\n" (mean [1 .. d])

Which, when executed, churns away for a while, but eventually runs out of memory: $ ghc -O2 --make A.hs $ time ./A 1e9 A: out of memory (requested 1048576 bytes) ./A 1e9 12.74s user 1.20s system 99% cpu 13.997 total

Well, it got 13 seconds into the job doing something. At least it was making progress.

While Andrew was citing this as an example of GHC Haskell being unpredictable (it's not an example of that, as we'll see), there is something here -- this little program reveals a lot about the interaction between strictness, laziness and approaches to compiler optimisation in Haskell.

This post is about writing reliably fast code, and then how to write more naive code that is also reliably fast. In particular, how recursion consistently produces excellent code, competitive with heavily optimised C, and how to get similar results from higher order functions. Throughout this post I'll be using GHC 6.8.2, with -O2. Sometimes I'll switch to the C backend (-fvia-C -optc-O2). Also, I don't care about smarter ways to implement this -- we're simply interested in understanding the compiler transformations involved in the actual loops involved. Understanding strictness and laziness

First, let's work out why this ran out of memory. The obvious hard constraint is that we request a list of 1e9 doubles: that is very, very large. It's 8 * 10^9 bytes, if allocated directly, or about 7.5G. Inside a list there is yet further overhead, for the list nodes, and their pointers. So no matter the language, we simply cannot allocate this all in one go without burning gigabytes of memory. The only approach that will work is to lazily generate the list somehow. To confirm this, we can use a strict list data type, just to see how far we get.

First, let's define a strict list data type. That is one whose head and tail are always fully evaluated. In Haskell, strictness is declared by putting bangs on the fields of the data structure, though it can also be inferred based on how values are used: data List a = Empty | Cons !a !(List a)

This is a new data type, so we'll need some new functions on it: length' :: List a -> Int length' = go 0 where go n Empty = n go n (Cons _ xs) = go (n+1) xs sum' :: Num a => List a -> a sum' xs = go 0 xs where go n Empty = n go n (Cons x xs) = go (n+x) xs enumFromTo' :: (Num a, Ord a) => a -> a -> List a enumFromTo' x y | x > y = Empty | otherwise = go x where go x = Cons x (if x == y then Empty else go (x+1))

I've written these in a direct worker/wrapper transformed, recursive style. It's a simple, functional idiom that's guaranteed to get the job done.

So now we can compute the length and sum of these things, and generate one given a range. Our 'mean' function stays much the same, just the type changes from a lazy to strict list: mean :: List Double -> Double mean xs = sum' xs / fromIntegral (length' xs)

Testing this, we see it works fine for small examples: $ time ./B 1000 500.5 ./A 1000 0.00s user 0.00s system 0% cpu 0.004 total

But with bigger examples, it quickly exhausts memory: $ time ./B 1e9 Stack space overflow: current size 8388608 bytes. Use `+RTS -Ksize' to increase it. ./A 1e9 0.02s user 0.02s system 79% cpu 0.050 tota

To see that we can't even get as far as summing the list, we'll replace the list body with undefined, and just ask for the list argument to be evaluated before the body with a bang patterns (a strictness hint to the compiler, much like a type annotation suggests the desired type): mean :: List Double -> Double mean !xs = undefined main = do [d] <- map read `fmap` getArgs printf "%f\n" (mean (enumFromTo' 1 d))

And still, there's no hope to allocate that thing: $ time ./A 1e9 Stack space overflow: current size 8388608 bytes. Use `+RTS -Ksize' to increase it. ./A 1e9 0.01s user 0.04s system 93% cpu 0.054 total

Strictness is not the solution here. Traversing structures and garbage collection

So why is the naive version actually allocating the whole list anyway? It was a lazy list, shouldn't it somehow avoid allocating the elements? To understand why it didn't work, we can look at what: mean :: [Double] -> Double mean xs = sum xs / fromIntegral (length xs)

actually means. To reason about the code, one way is to just look at the final, optimised Haskell GHC produces, prior to translation to imperative code. This reduced Haskell is known as "core", and is the end result of GHC's optimisations-by-transformation process, which iteratively rewrites the original source into more and more optimised versions.

Referring to the core is useful if you're looking for C-like performance, as you can state precisely, to the register level, what your program will do -- there are no optimisations to perform that would confuse the interpretation. It is an entirely unambiguous form of Haskell, so we'll use it to clarify any uncertainties. (For everyday programming, a comfortable knowledge of laziness and strictness is entirely adequate, so don't panic if you don't read core!).

To view the core, we can use ghc -O2 -ddump-simpl, or the ghc-core tool. Looking at it for the naive program we see that 'mean' is translated to: $wlen :: [Double] -> Int# -> Int# ... $wsum'1 :: [Double] -> Double# -> Double# ... main = ... shows ( let xs :: [Double] xs = enumFromTo1 lvl3 d_a6h in case $wsum'1 xs 0.0 of { _ -> case $wlen xs 0 of { z -> case ww_s19a /## (int2Double# z) of { i -> D# i )

It looks like a lot of funky pseudo-Haskell, but its made up of very simple primitives with a precise meaning. First, length and sum are specialised -- their polymorphic components replaced with concrete types, and in this case, those types are simple, atomic ones, so Double and Int are replaced with unlifted (or unboxed) values. That's good to know -- unlifted values can be kept in registers.

We can also confirm that the input list is allocated lazily. let xs :: [Double] xs = enumFromTo1 lvl3 d_a6h in ..

The 'let' here is the lazy allocation primitive. If you see it on non-unboxed type, you can be sure that thing will be lazily allocated on the heap. Remeber: core is like Haskell, but there's only 'let' and 'case' (one for laziness, one for evaluation).

So that's good. It means the list itself isn't costing us anything to write. This lazy allocation also explains why we're able to make some progress before running out of memory resources -- the list is only allocated as we traverse it.

However, all is not perfect, the next lines reveal: case $wsum'1 xs 0.0 of { _ -> case $wlen xs 0 of { z -> case ww_s19a /## (int2Double# z) of { i -> D# i

Now, 'case' is the evaluation primitive. It forces evaluation to the outermost constructor of the scrutinee -- the expression we're casing on -- right where you see it.

In this way we get an ordering of operations set up by the compiler. In our naive program, first the sum of the list is performed, then the length of the list is computed, until finally we divide the sum by the length.

That sounds fine, until we think about what happened to our lazily allocated list. The initial 'let' does no work, when allocating. However, the first 'sum' will traverse the entire list, computing the sum of its elements. To do this it must evaluate the entire huge list.

Now, on its own, that is still OK -- the garbage collector will race along behind 'sum', deallocating list nodes that aren't needed anymore. However, there is a fatal flaw in this case. 'length' still refers to the head of the list. So the garbage collector cannot release the list: it is forced to keep the whole thing alive.

'sum', then, will allocate the huge list, and it won't be collected till after 'length' has started consuming it -- which is too late. And this is exactly as we observed. The original poster's unpredictably is an entirely predictable result if you try to traverse a 7G lazy list twice.

Note that this would be even worse if we'd been in a strict language: the initial 'let' would have forced evaluation, so you'd get nowhere at all.

Now we have enough information to solve the problem: make sure we make only one traversal of the huge list. so we don't need to hang onto it. Lazy lists are OK

The simple thing to do then, and the standard functional approach for the last 40 years of functional programming is to write a loop to do both sum and length at once: mean :: [Double] -> Double mean = go 0 0 where go :: Double -> Int -> [Double] -> Double go s l [] = s / fromIntegral l go s l (x:xs) = go (s+x) (l+1) xs

We just store the length and sum in the function parameters, dividing for the mean once we reach the end of the list. In this way we make only one pass. Simple, straight forward. Should be the standard approach in the Haskell hackers kit.

By writing it in an obvious recursive style that walks the list as it is created, we can be sure to run in constant space. Our code compiles down to: $wgo :: Double# -> Int# -> [Double] -> Double# $wgo x y z = case z of [] -> /## x (int2Double# y); x_a9X : xs_a9Y -> case x_a9X of D# y_aED -> $wgo (+## x y_aED) (+# y 1) xs_a9Y

We see that it inspects the list, determining if it is an empty list, or one with elements. If empty, return the division result. If non-empty, inspect the head of the list, taking the raw double value out of its box, then loop, On a small list: $ time ./B 1000 500.5 ./B 1000 0.00s user 0.00s system 0% cpu 0.004 total

On the 7 gigabyte list: $ time ./B 1e9 500000000.067109 ./B 1e9 68.92s user 0.19s system 99% cpu 1:09.57 total

Hooray, we've computed the result! The lazy list got us home.

But can we do better? Looking at the GC and memory statistics (run the program with +RTS -sstderr ) we can see how much work was going on: 24,576 bytes maximum residency (1 sample(s)) %GC time 1.4% (4.0% elapsed) Alloc rate 2,474,068,932 bytes per MUT second Productivity 98.6% of total user

What does this tell us? Firstly, garbage collection made up a tiny fraction of the total time (1.4%), meaning the program was doing productive thing 98.6% of the time -- that's very good.

Also, we can see it ran in constant space: 24k bytes maximum allocated in the heap. And the program was writing and releasing these list nodes at an alarming rate. 2G a second (stunning, I know). Good thing allocation is cheap.

However, this does suggest that we can do better -- much better -- by avoiding any list node allocation at all and keeping off the bus. We just need to prevent list nodes from being allocated at all -- by keeping only the current list value in a register -- if we can stay out of memory, we should see dramatic improvements. Recursion kicks arse

So let's rewrite the loop to no longer take a list, but instead, the start and end values of the loop as arguments: mean :: Double -> Double -> Double mean n m = go 0 0 n where go :: Double -> Int -> Double -> Double go s l x | x > m = s / fromIntegral l | otherwise = go (s+x) (l+1) (x+1) main = do [d] <- map read `fmap` getArgs printf "%f\n" (mean 1 d)

We've moved the list lower and upper bounds into arguments to the function. Now there are no lists whatsover. This is a fusion transformation, instead of two separate loops, one for generation of the list, and one consuming it, we've fused them into a single loop with all operations interleaved. This should now avoid the penalty of the list memory traffic. We've also annotated the types of the functions, so there can be no ambiguity about what machine types are used.

Looking at the core: $wgo_s17J :: Double# -> Int# -> Double# -> Double# $wgo_s17J x y z = case >## z m of False -> $wgo_s17J (+## x z) (+# y 1) (+## z 1.0); True -> /## x (int2Double# y)

it is remarkably simple. All parameters are unboxed, the loop is tail recursive, and we can expect zero garbage collection, all list parameters in registers, and we'd expect excellent performance. The >## and +## are GHC primitives for double comparison and addition. +# is normal int addition.

Let's compile it with GHC's native code generator first (-O2 -fexcess-precision): $ time ./C 1e9 500000000.067109 ./C 1e9 3.75s user 0.00s system 99% cpu 3.764 total

Great. Very good performance! The memory statistics tell a similar rosy story: 20,480 bytes maximum residency (1 sample(s)) %GC time 0.0% (0.0% elapsed) Alloc rate 10,975 bytes per MUT second Productivity 100.0% of total user

So it ran in constant space, did no garbage collection, and was allocating only a few bytes a second. Recursion is the breakfast of champions.

And if we use the C backend to GHC: (-O2 -fexcess-precision -fvia-C -optc-O2), things get seriously fast: $ time ./C 1e9 500000000.067109 ./C 1e9 1.77s user 0.00s system 99% cpu 1.776 total

Wow, much faster. GCC was able to really optimise the loop GHC produced. Good job.

Let's see if we can get this kind of performance in C itself. We can translate the recursive loop directly into a for loop, where function arguments becomes the variables in the loop: #include <stdio.h> #include <stdlib.h> int main(int argc, char **argv) { double d = atof(argv[1]); double n; long long a; // 64 bit machine double b; // go_s17J :: Double# -> Int# -> Double# -> Double# for (n = 1, a = 0, b = 0; n <= d; b+=n, n++, a++) ; printf("%f\n", b / a); return 0; }

That was quite straight forward, and the C is nicely concise. It is interesting to see how function parameters are updated in place explicitly in the C code, while its implicitly reusing stack slots (or registers) in the functional style. But they're really describing the same code. Now, compiling the C without optimisations: $ time ./a.out 1e9 500000000.067109 ./a.out 1e9 4.72s user 0.00s system 99% cpu 4.723 total

Ok. That's cool. We got the same results, and optimised Haskell beat unoptimsed C. Now with -O, gcc is getting closer to catch GHC: $ time ./a.out 1e9 500000000.067109 ./a.out 1e9 2.10s user 0.00s system 99% cpu 2.103 total

And with -O2 it makes it in front by a nose: $ time ./a.out 1e9 500000000.067109 ./a.out 1e9 1.76s user 0.00s system 99% cpu 1.764 total

gcc -O2 wins the day (barely?), just sneaking past GHC -O2, which comes in ahead of gcc -O and -Onot.

We can look at the generated assembly (ghc -keep-tmp-files) to find out what's really going on.. GCC generates the rather nice: .L6: addsd %xmm1, %xmm2 incq %rax addsd %xmm3, %xmm1 ucomisd %xmm1, %xmm0 jae .L6 .L8: cvtsi2sdq %rax, %xmm0 movl $.LC2, %edi movl $1, %eax divsd %xmm0, %xmm2

Note the very tight inner loop, and final division. Meanwhile, GHC produces, with its native code backend: s1bn_info: ucomisd 5(%rbx),%xmm6 ja .Lc1dz movsd %xmm6,%xmm0 addsd .Ln1dB(%rip),%xmm0 leaq 1(%rsi),%rax addsd %xmm6,%xmm5 movq %rax,%rsi movsd %xmm0,%xmm6 jmp s1bn_info .Lc1dz: cvtsi2sdq %rsi,%xmm0 divsd %xmm0,%xmm5 Quite a bit more junk in the inner loop, which explains the slowdown with -fasm. That native code gen needs more work for competitve floating point work. But with the GHC C backend: s1bn_info: ucomisd 5(%rbx), %xmm6 ja .L10 addsd %xmm6, %xmm5 addq $1, %rsi addsd .LC1(%rip), %xmm6 jmp s1bn_info .L10: cvtsi2sdq %rsi, %xmm7 divsd %xmm7, %xmm5

Almost identical to GCC, which explains why the performance was so good! Some lessons

Lesson 1: To write predictably fast Haskell -- the kind that competes with C day in and out -- use tail recursion, and ensure all types are inferred as simple machine types, like Int, Word, Float or Double that simple machine representations. The performance is there if you want it.

Lesson 2: Laziness has an overhead -- while it allows you to write new kinds of programs (where lists may be used as control structures), the memory traffic that results can be a penalty if it appears in tight inner loops. Don't rely laziness to give you performance in your inner loops.

Lesson 3: For heavy optimisation, the C backend to GHC is still the way to go. Later this year a new bleeding edge native code generator will be added to GHC, but until then, the C backend is still an awesome weapon.

In a later post we'll examine how to get the same performance by using higher order functions.

Categories: Offsite Blogs