TomMD's blog

Xen and Fedora

Submitted by TomMD on Wed, 03/19/2008 - 7:36pm.

So this is a little off topic for sequence.complete.org, but I'm going to document some of my run-ins with Xen and Fedora here. Also, I am looking for somewhere more generic to move my blog - any suggestions? EDIT: I just decided to go with wordpress, so you can find my ramblings at tommd.wordpress.com

Installing Xen
I had some issues with installing/using Xen from source. I mudded the waters at the start by trying out various versions / packagings (Fedora Xen package, xen-3.0.4-testing, xen-3.2-testing, xen-3.2.0) and I discovered that:

  1. 'make clean' will remove ALL xen kernels (rm /boot/*xen*). This eliminated a perfecting fine xen-3.0.4 install when I was screwing around with 3.2-testing.
  2. 'make clean' missed some python scripts - causing the xm tool funky behavior till I manually cleaned /usr/lib and reinstalled Xen.
  3. Fedoras Xen package is fine (currently a Xen-3.1 derivative) , unless you have special desires like ACM - in which case you certainly need to use a source distribution. Good work Fedora!

Installing Fedora PVMs
Using the virt-install tool (package: python-virtinst) it is dead simple to install stable Fedora distros (of coarse the GUI tool, virt-manager, exists too):


# virt-install -n F8 -r 512 --vcpus=1 -f /var/lib/xen/images/F8.img \
--vnc -p -b virbr0 -l "http://mirrors.kernel.org/fedora/releases/8/Fedora/i386/os/"

But be sure to enable packet forwarding by adding to your /etc/sysctl.conf. If you don't, you'll find the packets won't get forwarded from your bridge (xenbr0 or virbr0) to your network.


# Controls IP packet forwarding
net.ipv4.ip_forward = 1

Or use sysctl or directly alter /proc/sys/net/ipv4/ip_forwarding, if you are rustic like me.

Dealing with Rawhide
I figured I could be a rawhide tester by running it as a DomU - I was right and wrong.

  1. Rawhide can not act as Dom0. It seem that, due to the PV_Ops task, Fedora 9 is not intended as Dom0 - Fedora 8 and 10 will fill the host role.
  2. Rawhide doesn't currently run as a paravirtual machine (PVM - a type of DomU). I've reported bug 437217, which has an obvious fix, and bug 437706, which is where I'm at now.

An Exersize in Over-Engineering

Submitted by TomMD on Wed, 02/20/2008 - 9:44pm.

Intro

Inspired by an old article on over engineering I decided to try and over design / bloat an event system (aka a timeout system). This is the same system I posted about earlier, but with fewer features and many more LOC.

Conception

The initial concept was clean, simple. It would handle hundreds of events, be useful for 80% of the cases (I do like the 80% rule). My event system state was this:

data EventState = EvtState { esEvents :: [Event], esAlarm :: ClockTime, esNextId :: EventId }

Each new event gets a new Id number, the 'NextId' is incremented, and the event is added to the time ordered list, 'esEvents'. 'esAlarm' is updated as necessary (atomically by the thread adding the event with an earlier expire time).

Functionally, there are 2 threads in the system and 'n' threads of event producers. One thread, eventExpire, would threadDelay until the soonest event needs to be expired. The other thread, watchAlarm, would keep an eye on esAlarm - if it was changed to an earlier time then an exception would be thrown to the eventExpire thread.

Theoretical Bugs!

Oh no! You could theoretically overflow the Int used to express EventId. I could use Integer, but that still has a theoretical limit and the memory used for ids would grow (eventually). So, I set off to redefine EventId from an Int to a (ClockTime,Int) combo. How do I know this doesn't overflow? Because my new state field which keeps track of the next event Id for each ClockTime (less the expired times), and if it does overflow for a given clock time I refuse to add the event.

Realistic Performance Issues

That's right! A close runner up for theoretical bugs is fixing performance. Why is this slow for large numbers of events? Well, adding events is O(n), expiring events can are in order in the list, and updating the alarm time is O(1).

Lets turn _everything_ into a mapping. This way most operations of interest are ~O(ln(n)) The new state will look like this:

type EventSet = (EventNumber, Map EventNumber (IO ())) data EventSystem = EvtSys { esEvents :: TVar (Map ClockTime EventSet), -- Pending Events esThread :: TVar (Maybe ThreadId), -- Thread for TimerReset exceptions esAlarm :: TVar ClockTime, -- Time of soonest event esNewAlarm :: TVar Bool, -- new/earlier alarm }

What is this in English? esEvents is a mapping of ClockTimes to a mapping of EventNumbers to IO Actions (and the next unique EventNumber is in there). Also, there is now esNewAlarm, so the alarmWatch can avoid throwing expires unnecessarily.

All in all this does OK, 3 seconds to add and execute 100,000 actions (vs a previous 38 seconds for 10,000).

Fixing Dropped Events

Now we have eliminated theoretical bugs and improved performance - whats left? Fixing real bugs of coarse! When running the test code (shamelessly stolen from control-timeout - thanks Adam!) I found some tests never finished and no more events were pending. The cause seems to be that between the event list being pruned of old events and those events being executed the thread was getting an exception (a new event was added with an earlier alarm). The solution is yet another thread and TVar (that's right, I've an irrational dislike of 'block'). Old events are atomically pruned and written to an 'expired' list - then the third thread, monitorExpiredQueue, can zero the queue and execute the actions. My new state has the one more TVar:

esExpired :: TVar [[EventSet]]

A Useful API

So now that I've finished all this, and we are two or three days after my original post, I need to make the API something useful. Current code looks like:

sys <- initEventSystem (TOD sec ps) <- getClockTime eid <- addEvent sys (TOD (sec+delay) ps) evt ... if dontNukeThem then cancelEvent sys eid else return () -- explosives concept borrowed from recent -cafe example

Adam had a better idea for most use cases in that they work based on deltas. So I wrote a shim to provide the Control.Timeout API with the Control.Event engine (Control.Event.Timeout).

eid <- addTimeout act delay if dontNukeThem then cancelTimeout eid else return ()
And an informal performance comparison is:
   Control.Event can add 100K events in 3  seconds
   Control.Timeout       100K           12 seconds
   Control.Event.Timeout 100K           58 seconds

Control.Event.Timeout calls getClockTime on every event add, so driving the Control.Event code with time deltas is really slow. Control.Timeout is obviously very good with time deltas.

Control.Event Released

Submitted by TomMD on Sat, 02/16/2008 - 11:22am.

I've released control-event (on hackage.haskell.org), which provides a system to schedule and cancel IO events for given System.Time.ClockTimes. While designed prior to the release of Control.Timeout, it is very similar, the main differences are:

  1. Control.Event requires initilization while Control.Timeout hides this via unsafePerformIO.
  2. Control.Event can add events in the STM monad using addEventSTM. While Control.Timeout has an addEventAtomic, they aren't drop in replacements for each other (addEventAtomic is an IO action that returns an STM operation).
  3. Control.Timeout externally interfaces by time deltas (Float) while Control.Event is based on absolute ClockTimes.
  4. Control.Timeout uses only one Haskell thread while Control.Event needs two.
  5. Control.Event will not duplicate EventIds once the counter overflows. The possibility of this happening in Control.Timeout is entirely theoretical - I'm not claiming there is a practical difference.

Future Changes:

  • I'll either expand or eliminate the extra features of Event Preprocessing.
  • If no one uses it, I'll probably be getting rid of the ability to run expired actions in the event system thread.
  • I'm guessing Data.Time works on Windows. So long as I don't learn otherwise, I'll move from the old-time package to the 'time' package.
  • I'll add a shim to allow adding events using time deltas.

EDIT: I should add that Control.Timeout performs MUCH better.

Let me know if you find this useful!

Improving Haskell MD5

Submitted by TomMD on Fri, 11/09/2007 - 7:42pm.

I cleaned up the code!

  • Fixed the ugly 'blocks' function. This results in better performance and fixed the space leak!
  • Changed the license from GPL2 to BSD3.
  • Eliminating the 'finalize' doesn't give any further performance, so I think this was a really odd bug in my code, not GHC.

    Again, the code is at:
    darcs get http://code.haskell.org/~tommd/pureMD5

    And here is the quick and dirty performance of the unrolled version:

    $ time ./hsMD5 ../2GB
    b8f1f1faa6dda5426abffb3a7811c1fb

    real 1m22.685s
    user 0m31.730s
    sys 0m4.676s

    $ time md5sum ../2GB
    b8f1f1faa6dda5426abffb3a7811c1fb ../2GB

    real 1m21.343s
    user 0m25.330s
    sys 0m4.528s

    So, at least on my clunker, the Haskell code shows good time on the 2GB test.


    $ time ./hsMD5 ../200MB
    eda9a9889837ac4bc81d6387d92c1bec

    real 0m2.267s
    user 0m2.148s
    sys 0m0.116s

    $ time md5sum ../200MB
    eda9a9889837ac4bc81d6387d92c1bec ../200MB

    real 0m0.775s
    user 0m0.704s
    sys 0m0.068s

    And its just under 3 time slower than C on that one.

  • md5 :: LazyByteString -> MD5Ctx

    Submitted by TomMD on Wed, 11/07/2007 - 7:14pm.

    Introduction

    In June 2007 I developed a Haskell MD5 implementation intended to be competitive with the 'C' version while still remaining readable. This is a lazy ByteString implementation that makes heavy use of inlining, unboxing and strictness annotations. During this process I [re?]discovered some querks about GHC optimizations and Data.Binary.

    The Facts

    The first relatively operational version used unsafePerformIO to extract the Word32s from the ByteString. Minor modifications were made to use Data.Binary - this is, after all, an intended use of Data.Binary.

    There are four incarnations. First, there is the manual round unrolling and a foldl' version; second, there is a version that uses Data.Binary to extract the Word32s and another that uses unsafePerformIO + Ptr operations.

    I standardized the benchmark as the time it takes to hash a 200MB file (seconds) and show performance relative to standard md5sum.

    GHC 6.6.1 results

    MD5.unroll.unsafe: 5s 6x MD5.unroll.Binary: 10s 12x MD5.rolled.Binary: 45s 58x MD5.rolled.unsafe: 27s 35x md5sum: 0.78s 1x, by def.

    UnsafeIO v. Binary

    Sadly, changing the (getNthWord32 :: Int -> ByteString -> Word32) function from unsafePerformIO to using Data.Binary made the system take twice as long. It was my hope that Data.Binary was optimized to the same degree as its unsafePerformIO counterpart.

    Manual unrolling? This isn't (insert hated language/compiler)

    As you can see, the manual unrolling has saved quite a bit of time. Why did unrolling save computation time? I was optimistic in hoping GHC would eliminate unneeded arrays/accesses when unfolding (among other optimizations).

    Profiling funkyness

    When profiling (-p -auto-all), the 'Binary' version attributes much memory to 'applyMD5Rounds' while the 'unsafe' version attributes more to the ff,gg,hh,ii functions. I am guessing the 'unsafe' version is correct... is the program profile using Data.Binary wrong?

    Dreams of SuperO

    I am not trying to be Data.Binary/GHC centric. Matter of fact, I look forward to seeing what YHC/supero would do - I just need to be able to get it (and know how to use it).

    Observations from core (-ddump-simpl)

    I am not patient or knowledgeable enough to know if my understanding of core is correct... that doesn't stop me from interpreting it. It looks like the programs are boxing/unboxing the context between every fold of foldl'! If this is True it explains some of the performance issues. Folding over rounds would cost 64 box/unboxes per block in the rolled version (once for every byte hashed). Folding between each round would cost one box/unbox per block even in the unrolled version (32 million box/unboxings when hashing 200MB). If this is true, it is an obvious place for performance improvement

    Minor Alterations With Unexpected Results

    Eliminating md5Finalize resulted in a sub 2s run (>2x speed increase, ~3x slower than 'C'). Finalize only runs once (at the end of the hash computation) and is profiled at 1 tick. The only explanation I can see is that md5Finalize use of md5Update complicates the optimization. Inlining md5Update doesn't help and neither does making/using identical copies of md5Update (md5Update') and all sub-functions.

    Edit: Benchmark includes nanoMD5 too!

    GHC 6.8.0.20070914 Results

    Time v. 'C' v. GHC 6.6.1 md5.unroll.unsafe 5s 6x 1x md5.roll.unsafe 17s 21x 0.63x md5.nano 0.9s 1.15x -

    So I see a significant improvement in the rolled version thanks to everyone involved in GHC.

    Any suggestions on how to close the remaining performance gap would be welcome.

    Style

    Yes, I know, it looks ugly! I don't feel like cleaning it up much right now. If someone voices that they care, that would be a motivator.

    Summary (new from edit)

  • 6x slower than 'c'
  • 5 ish times slower than 'nanoMD5'.
  • Could be twice as fast, assuming I am right about this compiler bug.
  • All Haskell
  • I discovered that it is loading up the entire damn file to memory (new since I last made sure it didn't). So I'll be fixing that stupid bug.

    Footer (for the really curious)

    Hardware: 2.2Ghz AMD x86_64 (Athlon 3400+) with 1GB of memory. Software: Linux 2.6.17, GHC (6.6.1 and 6.8.0 as indicated), md5sum 5.97.

    flags: -O2 -funfolding-use-threshold66 -funfolding-creation-threshold66 -funfolding-update-in-place -fvia-c -optc -funroll-all-loops -optc-ffast-math

    CODE

    EDIT: Now on darcs:

    darcs get http://code.haskell.org/~tommd/pureMD5