GHC

md5 :: LazyByteString -> MD5Ctx

Submitted by TomMD on Wed, 11/07/2007 - 7:14pm.

Introduction

In June 2007 I developed a Haskell MD5 implementation intended to be competitive with the 'C' version while remaining readable. It is a lazy ByteString implementation that makes heavy use of inlining, unboxing and strictness annotations. During this process I [re?]discovered some quirks of GHC optimizations and Data.Binary.

The Facts

The first relatively operational version used unsafePerformIO to extract the Word32s from the ByteString. Minor modifications were made to use Data.Binary - this is, after all, an intended use of Data.Binary.

There are four incarnations, formed by two independent choices: the rounds are either manually unrolled or expressed as a foldl', and the Word32s are extracted either with Data.Binary or with unsafePerformIO + Ptr operations.

The benchmark is the time, in seconds, to hash a 200MB file; performance is also shown relative to the standard md5sum.

GHC 6.6.1 results

MD5.unroll.unsafe:  5s      6x
MD5.unroll.Binary:  10s     12x
MD5.rolled.Binary:  45s     58x
MD5.rolled.unsafe:  27s     35x
md5sum:             0.78s   1x (by definition)

UnsafeIO v. Binary

Sadly, changing the (getNthWord32 :: Int -> ByteString -> Word32) function from unsafePerformIO to Data.Binary made the program take twice as long. I had hoped that Data.Binary was optimized to the same degree as its unsafePerformIO counterpart.
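
For reference, here is a minimal sketch of the two extraction styles. The bodies are illustrative only (the actual pureMD5 code may differ); the unsafe variant assumes a little-endian host, ignores alignment, and works on a strict chunk, while the Binary variant works on a lazy ByteString, purely to keep each version short:

    import qualified Data.ByteString as B
    import qualified Data.ByteString.Lazy as L
    import qualified Data.ByteString.Unsafe as BU
    import Data.Binary.Get (getWord32le, runGet, skip)
    import Data.Word (Word32)
    import Foreign.Ptr (castPtr)
    import Foreign.Storable (peekElemOff)
    import System.IO.Unsafe (unsafePerformIO)

    -- unsafePerformIO + Ptr: peek the nth Word32 straight out of the
    -- ByteString's buffer (host byte order, assumed little-endian).
    getNthWord32Unsafe :: Int -> B.ByteString -> Word32
    getNthWord32Unsafe n bs = unsafePerformIO $
        BU.unsafeUseAsCString bs $ \p -> peekElemOff (castPtr p) n

    -- Data.Binary: skip n words, then decode one little-endian Word32.
    getNthWord32Binary :: Int -> L.ByteString -> Word32
    getNthWord32Binary n = runGet (skip (4 * n) >> getWord32le)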

Manual unrolling? This isn't (insert hated language/compiler)

As you can see, manual unrolling saved quite a bit of time. Why did unrolling save computation time? I had optimistically hoped that GHC would eliminate the unneeded arrays/accesses when unfolding (among other optimizations).
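
To make the rolled/unrolled distinction concrete, here is a deliberately simplified sketch (a single Word32 stands in for the four-word MD5 state and the round function is invented, so this is illustrative only, not the pureMD5 code):

    {-# LANGUAGE BangPatterns #-}
    import Data.Bits (rotateL, xor)
    import Data.List (foldl')
    import Data.Word (Word32)

    -- Four illustrative (constant, shift) pairs; real MD5 has 64 rounds.
    roundParams :: [(Word32, Int)]
    roundParams = zip [0xd76aa478, 0xe8c7b756, 0x242070db, 0xc1bdceee] [7, 12, 17, 22]

    -- Rolled: fold the round parameters over the state.
    rolled :: Word32 -> Word32
    rolled s0 = foldl' step s0 roundParams
      where step !s (k, r) = ((s + k) `rotateL` r) `xor` s

    -- Unrolled: the same rounds written out by hand, which lets GHC keep
    -- the state in registers instead of walking a list and re-boxing it.
    unrolled :: Word32 -> Word32
    unrolled s0 =
      let !s1 = ((s0 + 0xd76aa478) `rotateL` 7)  `xor` s0
          !s2 = ((s1 + 0xe8c7b756) `rotateL` 12) `xor` s1
          !s3 = ((s2 + 0x242070db) `rotateL` 17) `xor` s2
          !s4 = ((s3 + 0xc1bdceee) `rotateL` 22) `xor` s3
      in s4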

Profiling funkiness

When profiling (-p -auto-all), the 'Binary' version attributes much memory to 'applyMD5Rounds', while the 'unsafe' version attributes more to the ff, gg, hh, ii functions. I am guessing the 'unsafe' version is correct... is the profile of the Data.Binary version wrong?

Dreams of SuperO

I am not trying to be Data.Binary/GHC centric. Matter of fact, I look forward to seeing what YHC/supero would do - I just need to be able to get it (and know how to use it).

Observations from core (-ddump-simpl)

I am not patient or knowledgeable enough to know whether my understanding of core is correct... but that doesn't stop me from interpreting it. It looks like the programs are boxing/unboxing the context between every iteration of foldl'! If this is true, it explains some of the performance issues. Folding over rounds would cost 64 box/unboxes per block in the rolled version (once for every byte hashed). Folding over blocks would cost one box/unbox per block even in the unrolled version (32 million box/unboxings when hashing 200MB). If this is true, it is an obvious place for performance improvement.
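
If the boxing is real, the usual cure is to make the folded state a strict record with unpacked fields so GHC's worker/wrapper transformation can pass the four words around unboxed. A hypothetical sketch (the type and field names are illustrative, not the actual pureMD5 definitions):

    import Data.Word (Word32)

    -- Strict, unpacked state: with -O2, GHC can usually keep these four
    -- words in registers across foldl' iterations instead of allocating
    -- a fresh boxed context on every step.
    data MD5Partial = MD5Par
        { md5A :: {-# UNPACK #-} !Word32
        , md5B :: {-# UNPACK #-} !Word32
        , md5C :: {-# UNPACK #-} !Word32
        , md5D :: {-# UNPACK #-} !Word32
        }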

Minor Alterations With Unexpected Results

Eliminating md5Finalize resulted in a sub-2s run (>2x speed increase, ~3x slower than 'C'). Finalize only runs once (at the end of the hash computation) and is profiled at 1 tick. The only explanation I can see is that md5Finalize's use of md5Update complicates the optimization. Inlining md5Update doesn't help, and neither does making/using identical copies of md5Update (md5Update') and all sub-functions.

Edit: Benchmark includes nanoMD5 too!

GHC 6.8.0.20070914 Results

                     Time    v. 'C'   v. GHC 6.6.1
md5.unroll.unsafe    5s      6x       1x
md5.roll.unsafe      17s     21x      0.63x
md5.nano             0.9s    1.15x    -

So I see a significant improvement in the rolled version thanks to everyone involved in GHC.

Any suggestions on how to close the remaining performance gap would be welcome.

Style

Yes, I know, it looks ugly! I don't feel like cleaning it up much right now. If someone voices that they care, that would be a motivator.

Summary (new from edit)

  • 6x slower than 'C'.
  • Roughly 5x slower than 'nanoMD5'.
  • Could be twice as fast, assuming I am right about this compiler bug.
  • All Haskell.
  • I discovered that it is loading the entire damn file into memory (new since I last made sure it didn't), so I'll be fixing that stupid bug; a chunk-wise sketch follows this list.
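
A minimal sketch of the chunk-wise fix, assuming the context can be updated one strict chunk at a time; the article's md5Update and starting MD5Ctx would be passed in as the update function and initial context, with md5Finalize applied to the result:

    import qualified Data.ByteString as B
    import qualified Data.ByteString.Lazy as L
    import Data.List (foldl')

    -- Generic chunk-wise fold over a file: L.readFile reads lazily in
    -- chunks, and the strict foldl' consumes each strict chunk as it
    -- arrives, so only the current chunk plus the context stays live.
    hashFileWith :: (ctx -> B.ByteString -> ctx) -> ctx -> FilePath -> IO ctx
    hashFileWith update ctx0 path = do
        contents <- L.readFile path
        return $! foldl' update ctx0 (L.toChunks contents)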

Footer (for the really curious)

Hardware: 2.2GHz AMD x86_64 (Athlon 3400+) with 1GB of memory.
Software: Linux 2.6.17, GHC (6.6.1 and 6.8.0 as indicated), md5sum 5.97.

Flags: -O2 -funfolding-use-threshold66 -funfolding-creation-threshold66 -funfolding-update-in-place -fvia-c -optc-funroll-all-loops -optc-ffast-math

CODE

EDIT: Now on darcs:

    darcs get http://code.haskell.org/~tommd/pureMD5

Glasgow Haskell Compiler gains SMP support

Submitted by shapr on Tue, 05/10/2005 - 1:06am.

In a recent IRC discussion with one of the developers of the Glorious Glasgow Haskell Compiler, it was heard that the development version of GHC can now use multiple processors for the same program.

Largest number of cores in a single machine you own? (GHC is getting SMP support)

Submitted by shapr on Tue, 05/10/2005 - 12:59am.

  • 1 - 0% (0 votes)
  • 2 - 100% (1 vote)
  • 4 - 0% (0 votes)
  • 8 - 0% (0 votes)
  • I work for IBM, nothing less than sixteen cores on my desk... - 0% (0 votes)

Total votes: 1

GHC 6.4 is released!

Submitted by jgoerzen on Fri, 03/11/2005 - 6:49am.

The event we've all been waiting for!

GHC 6.4 is released! Click the link for the full announcement as well as download links. There are also release notes available.

GHC survey announced

Submitted by simonmar on Fri, 03/04/2005 - 9:02am.

The GHC Team announced a user survey, giving you the chance to comment on all aspects of GHC, from your favourite features and wishlist items to the development model.

hs-plugins 0.9.8 released

Submitted by shapr on Wed, 02/23/2005 - 3:21am.

Don Stewart's hs-plugins is a cool tool for dynamically loading Haskell modules at runtime. The history of hs-plugins begins with GHCi, which was the first interpreter-like software for GHC. Some time later, Andre Pang wrote the RuntimeLoader, which turned GHCi into a library allowing any Haskell application to do dynamic loading. RuntimeLoader stayed very close to GHCi; the user needed to handle library dependencies themselves, for example. hs-plugins is a complete solution to dynamic loading in Haskell: it has lots of features and can do all sorts of nifty things.

hs-plugins includes spiffy features like eval, make, and load. The eval function compiles, loads, and executes a single string's worth of code. The make function checks a source module and its dependencies for any changes since the last compilation, and recompiles and reloads all changed modules.
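
As a taste of the eval interface, here is a small sketch along the lines of the example in the hs-plugins documentation; the module name (System.Eval.Haskell), the empty import list, and the exact signature of eval are remembered rather than checked, so treat them as assumptions:

    import System.Eval.Haskell (eval)

    -- Compile and run a one-line expression at runtime: eval returns
    -- Nothing if compilation fails, otherwise Just the value.
    main :: IO ()
    main = do
        result <- eval "1 + 6 :: Int" [] :: IO (Maybe Int)
        case result of
            Just n  -> print n
            Nothing -> putStrLn "eval failed"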

hs-plugins is already used in several applications. One especially nice example is the Yi editor from the same author, where hs-plugins is used to allow the entire application to be dynamically reloaded from a single Boot.hs startup file.