News aggregator

FP Complete: Efficient binary serialization

Planet Haskell - Mon, 03/14/2016 - 1:45am

FP Complete has multiple openings for software and devops engineers. If you're interested in working on Haskell, container-based deployment, build systems, or just about anything you read about on this blog, send us your CV.

We do a lot of work at FP Complete in the realm of multi-machine computing for our customers. One of the things we need to do efficiently to get good speedups from adding more machines is optimizing serialization. Since our multi-machine work often revolves around scientific and mathematical computing, we primarily need to make sure that there is efficient serialization of primitive data types (like Int64 and Double), strict/unpacked product types (e.g. data Foo = Foo !Int64 !Double), and vectors of any of these.

For the impatient: benchmark results and repo

Not too long ago Francesco discovered and reported some issues with the binary package's serialization of Doubles. In response, we internally switched over to the cereal package instead, which resulted in a few more performance-related PRs. However, when the FP Complete team discussed serialization, and particularly the needs of our multi-machine computing and other client projects, we realized that we might have an opportunity to do better.

Both the binary and cereal packages are very featureful, and with these features comes some overhead. We actually have much simpler requirements for our standard serialization cases, e.g.:

  • There's no need for backwards compatible formats, since two identical executables will be running on two machines and will need to communicate with each other
  • Similarly, we don't need to have a consistent endianness for our encoding; using host byte order will always work for our use case
  • Given the structure of data we work with, we will almost always have a very cheap method of determining the serialized size of a piece of data. For example, a vector of 50 Doubles will be one Int64 (to hold the length of the vector) plus 50 * sizeOf (undefined :: Double), or in other words: 51 * 8. By leveraging this, we can possibly do a more efficient serialization step by allocating a buffer of exactly the correct size
  • We can get away with requiring that all data be in memory, since all of our use cases require that the data be fully evaluated at once (no lazy processing of larger-than-memory data sets)
  • There's no need for any kind of fancy deserialization involving backtracking
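The cheap-size observation above is just arithmetic. As a minimal sketch (assuming an 8-byte Double, as on the platforms we target; `vectorDoubleSize` is a hypothetical name, not from the benchmark code):

```haskell
import Foreign.Storable (sizeOf)

-- Serialized size of a vector of Doubles of the given length:
-- an Int64 length prefix plus fixed-width elements.
vectorDoubleSize :: Int -> Int
vectorDoubleSize len = 8 + len * sizeOf (undefined :: Double)

main :: IO ()
main = print (vectorDoubleSize 50)  -- 408, i.e. 51 * 8
```

Knowing this total up front means the encoder can allocate one exactly-sized buffer rather than growing or copying.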

With these ideas in mind, I've spent the past few days putting together a variety of different implementations of both encoding and decoding, and believe that for the limited cases I just described, we can get significantly better serialization performance.

Vector SomeData

The benchmarks I've used for this experiment focus on a fairly simple encoding and decoding test for a boxed vector of SomeData values. The SomeData type has a single data constructor with strict fields of Int64, Word8, and Double. This is a fairly good representation of the kind of data that we deal with regularly. Serializing more complex datatypes, or more variable-length types like a ByteString, is certainly supported by all of the schemes I've described here, but the performance impact of that has not been tested (since it's not the main focus of our work).

Representations

There are three different binary representations of the data used in this benchmark suite:

  • Simple little-endian (host-endian) serialization. In this case, the values are serialized in exactly the same format as they are stored in memory. (This assumes of course a little-endian architecture, which is what I did all testing on and our target architecture in production in almost all cases.) Vectors are serialized by storing the length of the Vector as an Int64 followed by each successive element from the Vector. This format is used by the following functions:

    • encodeSimpleRaw
    • encodeSimplePoke
    • encodeSimplePokeMonad
    • encodeSimplePokeRef
    • encodeSimplePokeRefMonad
    • encodeBuilderLE
    • decodeSimplePeek
    • decodeSimplePeekEx
    • decodeRawLE
  • The big-endian format of the above. This format is used by the following functions:

    • encodeBuilderBE
    • encodeCereal
    • decodeRawBE
    • decodeCereal
  • The binary format, used exclusively by the binary package. The special magic in this format is that Doubles are encoded as a pair of Integer and Int. This is the original performance bug that Francesco found and reported on Reddit, and which kicked off this whole analysis.

Let's start with the benchmark results, and then do some more analysis:

Unsurprisingly, binary performed the worst by far, mainly due to its Double serialization. Similarly, the big-endian approaches were all slower than the little-endian approaches, though the cereal package performed significantly worse than the raw bytestring-builder encoding and direct ByteString/bit-shifting decoding.

Since the little-endian encoding wins as far as representation goes, we'll spend the rest of this post analyzing the different trade-offs in encoding and decoding, based on the 6 encoding and 3 decoding functions implemented and tested here.

Continuation vs mutable reference

When both encoding and decoding, we need to keep track of the current offset being written to/read from within the buffer. In the case of decoding, we also need to have some way to fail out if the data parsed is not what we expect, or there are insufficient bytes to consume. I tested two approaches for this:

  • A continuation-based approach, where each computation is passed the next computation to perform. That next computation takes a parameter of the updated offset, and in the case of decoding returns a Maybe value to allow for failure. This technique is used by the functions:

    • encodeSimpleRaw
    • encodeSimplePoke
    • encodeSimplePokeMonad
    • decodeSimplePeek
    • decodeRawLE
  • A mutable reference approach. In this case, instead of each function needing to take an updated offset value, the value is stored in a mutable reference. This allows the environment available to each function to be identical (i.e., a pure ReaderT setup), which GHC is good at optimizing. On the other hand, GHC also tends to do much better at optimizing code without mutable references in it. To deal with failed decoding in this setup, we use runtime exceptions. This technique is used by the functions:

    • encodeSimplePokeRef
    • encodeSimplePokeRefMonad
    • decodeSimplePeekEx
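To make the contrast concrete, here is a minimal sketch of the continuation style, with hypothetical names and a plain byte list standing in for a real buffer (the benchmark code operates on raw pointers):

```haskell
{-# LANGUAGE RankNTypes #-}

import Data.Word (Word8)

type Offset = Int

-- Each decoding step receives the buffer and current offset, plus the
-- continuation to run with the updated offset; Nothing signals failure.
newtype Peek a = Peek
  { runPeek :: forall r. [Word8] -> Offset -> (Offset -> a -> Maybe r) -> Maybe r }

peekWord8 :: Peek Word8
peekWord8 = Peek $ \buf off k ->
  if off < length buf
    then k (off + 1) (buf !! off)  -- pass the advanced offset onward
    else Nothing                   -- insufficient bytes

decode :: Peek a -> [Word8] -> Maybe a
decode p buf = runPeek p buf 0 (\_ x -> Just x)

main :: IO ()
main = do
  print (decode peekWord8 [42])  -- Just 42
  print (decode peekWord8 [])    -- Nothing
```

Note how the offset is threaded through the continuations rather than stored anywhere mutable, and failure short-circuits via Maybe.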

As you can see from the results, the continuation-based approach gave a slight performance advantage. While this is what I would have expected, the difference between the two was smaller than I had expected. Additionally, in some earlier testing, before I'd added more optimization, the reference-based implementation actually outperformed the continuation-based implementation. The details of what's going on at the Core level would be an interesting follow-up analysis.

Optimizing the mutable reference

If you dive into the code, you may be a bit surprised at how the offset reference is represented. Instead of a more standard IORef, we have:

newtype OffsetRef = OffsetRef (Data.Vector.Unboxed.Mutable.MVector RealWorld Offset)

The idea here is to take advantage of the more efficient unboxed representation and byte array operations provided by the vector package and GHC. This also avoids a lot of garbage collection overhead used by the boxed nature of IORef. This is the same technique leveraged by the mutable-containers package. Also, check out the Stack Overflow question on the topic.
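A sketch of how such an offset reference might be used (the helper names here are hypothetical, not the benchmark's actual API):

```haskell
import qualified Data.Vector.Unboxed.Mutable as MV

type Offset = Int

-- A one-element unboxed mutable vector serves as an Int cell backed by a
-- raw byte array, avoiding the boxed indirection of an IORef.
newOffsetRef :: IO (MV.IOVector Offset)
newOffsetRef = MV.replicate 1 0

-- Bump the offset by n bytes, returning the previous value.
advance :: MV.IOVector Offset -> Int -> IO Offset
advance ref n = do
  off <- MV.unsafeRead ref 0
  MV.unsafeWrite ref 0 (off + n)
  return off

main :: IO ()
main = do
  ref <- newOffsetRef
  _ <- advance ref 8      -- e.g. after poking an Int64
  off <- MV.unsafeRead ref 0
  print off               -- 8
```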

Storable, bit-shifting, and bytestring-builder

When it comes to storing primitive datatypes in a pointer, it's unsurprising to see the Storable type class. This class's peek and poke operations are implemented under the surface with GHC primops. We get away with this because, in our implementation, we are always using host byte order. By contrast, the cereal, binary, and bytestring-builder packages need to do more direct bit-shifting, such as:

word64be :: ByteString -> Word64
word64be = \s ->
      (fromIntegral (s `SU.unsafeIndex` 0) `shiftL` 56) .|.
      (fromIntegral (s `SU.unsafeIndex` 1) `shiftL` 48) .|.
      (fromIntegral (s `SU.unsafeIndex` 2) `shiftL` 40) .|.
      (fromIntegral (s `SU.unsafeIndex` 3) `shiftL` 32) .|.
      (fromIntegral (s `SU.unsafeIndex` 4) `shiftL` 24) .|.
      (fromIntegral (s `SU.unsafeIndex` 5) `shiftL` 16) .|.
      (fromIntegral (s `SU.unsafeIndex` 6) `shiftL`  8) .|.
      (fromIntegral (s `SU.unsafeIndex` 7))

Leveraging Storable whenever possible simplifies our codebase and makes it more efficient. In fact, the Simple typeclass - used for most of our implementations here - provides default implementations of all functions in terms of Storable, so that the instances for primitive types just look like:

instance Simple Int64
instance Simple Word8
instance Simple Double

You may be wondering: if Storable is so great, why not just base a serialization library on it directly? There's one downside: it assumes that every datatype has a fixed serialized width. While this works great for Ints and the like, it fails with a Vector, where we need to know - at runtime - the actual length of the Vector to determine its serialized size.

A naive implementation of this would look like:

simpleSize :: a -> Int

However, this signature implies that every type's size may change at runtime. This would require that, when serializing a Vector, we always inspect every single value, which introduces significant CPU overhead. For example, given that the representation of a Vector is an 8-byte integer length plus the sizes of all elements, this would look like:

simpleSize :: Vector a -> Int
simpleSize v = 8 + V.sum (V.map simpleSize v)

But in the case of statically-known sizes, we'd like to replace that sum with a simple multiplication. To make this work, we have a slightly smarter type signature for the size function:

simpleSize :: Either Int (a -> Int)

If this function returns Left, there's no need to inspect individual values. For Right, we're stuck with checking each thing. Putting this together, here's our implementation for Vectors.

simpleSize :: Either Int (Vector a -> Int)
simpleSize = Right $ \v ->
    case simpleSize of
        Left s  -> s * V.length v + 8
        Right f -> V.sum (V.map f v) + 8

Notice how, for a Vector Int64, we'd be able to follow the Left branch and avoid the costly sum. This ends up giving us really cheap knowledge of the total buffer size needed, which we take advantage of when encoding, e.g.:

encodeSimpleRaw :: Simple a => a -> ByteString
encodeSimpleRaw x = unsafeCreate
    (either id ($ x) simpleSize)
    (\p -> simpleRawPoke p 0 x)

Monadic vs Monoidal interface

Ever since blaze-html famously provided a broken Monad interface in the name of performance, there's been a concern (at least on my part) that providing a proper Monad instance instead of just a Monoid instance could slow things down. Therefore, this benchmark included encoding functions that are both monadic (encodeSimplePokeMonad and encodeSimplePokeRefMonad) and monoidal (encodeSimplePoke and encodeSimplePokeRef). To my surprise, they performed nearly identically.

The monadic interface though has many advantages over a monoidal one:

  • You can still provide a Monoid instance for any Monad, so you actually get both interfaces for free
  • You get to use do notation if desired
  • Subjectively, I'd guess that people are more used to monadic combinators (e.g., mapM_ vs foldMap).
  • In my testing, I found a major slowdown due to the usage of foldMap from Vector. I'm guessing this is due to a missing INLINE in the default implementation of foldMap in Data.Foldable, but I haven't had a chance to investigate yet. Again, since monadic combinators seem to be more widely used, odds are that they will also be better optimized.
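The "Monoid instance for free" point follows from sequencing: any monadic action with a unit result appends like a monoid. A sketch (PokeM is a hypothetical stand-in, not the post's actual type):

```haskell
import Data.IORef

-- Hypothetical stand-in for a monadic poke action; sequencing two
-- actions is exactly the monoidal append of their encodings.
newtype PokeM = PokeM (IO ())

instance Semigroup PokeM where
  PokeM a <> PokeM b = PokeM (a >> b)

instance Monoid PokeM where
  mempty = PokeM (return ())

main :: IO ()
main = do
  ref <- newIORef ""
  let put s = PokeM (modifyIORef ref (++ s))
      PokeM act = put "a" <> put "b" <> put "c"
  act
  readIORef ref >>= putStrLn  -- abc
```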
Overall abstraction overhead

In addition to the nice newtype-wrapped monads and monoids, I implemented a raw version of both encoding and decoding, just to see if the abstraction added an overhead. Fortunately, this was not the case, which lets us stick with relatively simple high-level code such as:

simplePeek = SomeData <$> simplePeek <*> simplePeek <*> simplePeek

instead of:

readInt64 bs $ \bsX x ->
    readWord8 bsX $ \bsY y ->
    readDouble bsY $ \bsZ z -> do
        MV.unsafeWrite mv idx (SomeData x y z)
        loop (idx + 1) bsZ

ST-like monad

If you look at the implementation of the Binary instance for Vector, you'll see that it needs to resort to unsafePerformIO. This is because, in order to efficiently create a Vector, it needs to perform mutations, but the Get monad from binary does not provide any ST or IO like behavior for mutating a buffer.

Since we're designing a new decoding interface from scratch, and have Vector as a first-class use case, I decided to bake in ST-like semantics to allow directly playing around with mutable vectors. As a result, the type of Peek has an extra state token parameter, just like ST:

newtype Peek s a

Also, there is a PrimMonad instance for Peek, which allows for the mutable vector operations to occur within the monad without any lifting. To see how this works, take a look at the implementation of simplePeek for Vectors:

simplePeek :: Simple a => Peek s (Vector a)
simplePeek = do
    len :: Int64 <- simplePeek
    let len' = fromIntegral len
    mv <- MV.new len'
    let loop i
            | i >= len' = V.unsafeFreeze mv
            | otherwise = do
                x <- simplePeek
                MV.unsafeWrite mv i x
                loop $! i + 1
    loop 0

All of this also applies to the PeekEx type.

Results

The criterion link I shared above has the results of all of these benchmarks, but for the lazy, here's the result graph again:

As you can see: the little-endian encoding regularly outperformed the other two formats by a significant margin. This isn't surprising, given that there is less CPU work that needs to be done, and that we are able to leverage primops from GHC to do most of the work.

Similarly, while bytestring-builder is a very highly tuned library, for a narrow use case like ours where the resulting buffer size is easily computed in advance, constructing ByteStrings manually gives a significant performance boost.

To my surprise, the mutable reference/exception based approach to peeking and poking was fairly competitive with the continuation-based approach. Nonetheless, the continuation-based approach outperforms it overall, and is the more elegant approach.

While the monadic and monoidal interfaces for poking/encoding give similar results, the performance of the monoidal interface can be unpredictable if used incorrectly. Since monadic combinators are more commonly optimized and more ubiquitous, it's safer to expose the monadic interface. (And, for what it's worth, we can always provide a Monoid instance for any Monad.)

And thankfully, the abstractions we use to make our encoding and decoding easy to work with and nicely composable do not introduce a slowdown versus the low-level implementation. So no need to force people to drop down to manual continuation passing.

What's next

Based on this analysis, it seems like we have a clear set of choices for a new serialization library. This blog post already clarifies the design goals of such a library, and its limitations (e.g., no streaming decoding). Such a library is something we'll almost certainly want to leverage for our multi-machine computation work at FP Complete.

I'm interested in feedback from others on this topic, including ways to better optimize the code, alternative strategies, and additional use cases you might have for such a library.

Categories: Offsite Blogs

ANNOUNCE: Frown, Ralf Hinze's LALR(k) parser generator for Haskell, now slightly upkept ☺

haskell-cafe - Sun, 03/13/2016 - 8:05pm
Some years ago, Ralf Hinze wrote this parser generator [0], and i have since been using it, so i cleaned it up and uploaded it to Hackage as "frown" [1].

My repo is here: https://github.com/strake/frown

[0] http://www.cs.ox.ac.uk/ralf.hinze/frown/
[1] https://hackage.haskell.org/package/frown
Categories: Offsite Discussion

Module name to package name index

haskell-cafe - Sun, 03/13/2016 - 1:10pm
Is there something available which can answer questions like this - what all packages on Hackage export the "Data.List" module? Another useful question could be - what all packages export a symbol named 'x'? Hackage search or hoogle are not helpful in providing a precise answer to these.

I am assuming that related functionality usually lands up in the same part of the module namespace. This fact can help us in discovering related or alternative packages or functions available in a given area. For example when I am looking for list related functions I can search for who all exports "Data.List".

I guess creating such an index should not be difficult. All the necessary information is available in the packages' .cabal files, we just need to extract it and create a module name to package name index.

-harendra

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe< at >haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Categories: Offsite Discussion

Can't get any memory profile data

haskell-cafe - Sat, 03/12/2016 - 6:54am
Here is the command line log of my attempt following Real World Haskell's memory profiling example:

src λ stack exec -- ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.10.3

src λ cat Main.hs
module Main where

main :: IO ()
main = do
  let x = {-# SCC sum #-} sum [1..500]
  print x
  {-# SCC helloworld #-} putStrLn "hello world"

src λ stack exec -- ghc -O2 --make Main.hs -prof -auto-all -caf-all -fforce-recomp
[1 of 1] Compiling Main ( Main.hs, Main.o )
Linking Main ...

src λ ./Main +RTS -hc -p
125250
hello world

src λ cat Main.hp
JOB "Main +RTS -hc -p"
DATE "Fri Mar 11 22:51 2016"
SAMPLE_UNIT "seconds"
VALUE_UNIT "bytes"
BEGIN_SAMPLE 0.00
END_SAMPLE 0.00
BEGIN_SAMPLE 0.00
END_SAMPLE 0.00
Categories: Offsite Discussion

CFP SBLP 2016: 20th Brazilian Symposium on ProgrammingLanguages

General haskell list - Fri, 03/11/2016 - 11:34pm
[Apologies if you receive multiple copies of this CFP]

=======================================================
20th Brazilian Symposium on Programming Languages
http://cbsoft.org/sblp2016/call-for-papers

The Brazilian Symposium on Programming Languages is a well-established symposium which provides a venue for researchers and practitioners interested in the fundamental principles and innovations in the design and implementation of programming languages and systems. SBLP 2016 will be held in Maringá, in the Southern region of Brazil, and will be the 20th edition of the symposium.

IMPORTANT DATES
Abstract submission: April 8th 2016
Paper submission: April 15th 2016
Author notification: May 27th 2016
Camera ready deadline: June 10th 2016

Authors are invited to submit original research on any relevant topic which can be either in the form of regular or short papers.

TOPICS
Topics of interest include, but are not limited to:
- Program generation and transformation, including domain-specific languag
Categories: Incoming News

MPTC vs. constructor class

haskell-cafe - Fri, 03/11/2016 - 8:13pm
I've been messing around with type-aligned sequences for a while, and I'm currently exploring the design space of classes for them. The type-aligned package uses this:

class TASequence s where
  tempty :: s c x x
  tsingleton :: c x y -> s c x y
Categories: Offsite Discussion

Prelude.head: empty list

haskell-cafe - Fri, 03/11/2016 - 5:56pm
In a rather large program I made some changes, and now I get the runtime error:

ampersand.exe: Prelude.head: empty list

Of course I know that head is partial, and should be used with care. It is used many times (we have 100+ modules in the program). Is there an elegant way to get some information about where the specific call to head is being made?

Thanks!
Categories: Offsite Discussion

Alternative Proxy

libraries list - Thu, 03/10/2016 - 12:49am
This is the only possible definition... so it should exist:

instance Alternative Proxy where
  empty = Proxy
  _ <|> _ = Proxy

_______________________________________________
Libraries mailing list
Libraries< at >haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
Categories: Offsite Discussion

Implement traverseMaybe in Data.Map, Data.IntMap, etc

libraries list - Tue, 03/08/2016 - 10:43am
As far as I know, the most general form of a function that allows traversing and filtering is:

type Filter s t a b = forall f. Applicative f => (a -> f (Maybe b)) -> s -> f t

In my witherable[0] package, I defined `Witherable` as a subclass of `Traversable` to provide such an operation for various containers.

class T.Traversable t => Witherable t where
  wither :: Applicative f => (a -> f (Maybe b)) -> t a -> f (t b)
  ...

However, the `wither` for `Map` is currently inefficient because it is defined in terms of `traverse` and `mapMaybe`, so it traverses the container twice. An efficient implementation would have to use the hidden constructors.

I would like to propose adding `traverseMaybe` and `traverseMaybeWithKey` for `Data.Map`, `Data.IntMap`, and their strict variants (I'm suggesting a more conservative name because wither might sound too unusual or poetic for a standard library. I like 'wither' though). A possible implementation would be like this:

traverseMaybeWithKey :: Applicative f => (k -> a ->
Categories: Offsite Discussion

Add `take`/`drop`/`splitAt` to `Data.Map`/`Data.Set`

libraries list - Tue, 03/08/2016 - 2:14am
I would like to propose adding `take`/`drop`/`splitAt` to both `Data.Map` and `Data.Set` as originally requested in: https://github.com/haskell/containers/issues/135

The motivation behind this proposal is three-fold:

* for convenience - these functions are commonly used to implement pagination or previews of maps/sets
* for type accuracy - the public API imposes an unnecessary `Ord` constraint
* for efficiency - these can be implemented more efficiently using the internal API

Currently the only way you can implement this functionality via the public API is to use `lookupIndex`/`elemAt` + `split`. For example, one way to implement `Data.Set.take` is:

take :: Ord a => Int -> Set a -> Set a
take n m
    | n < 0       = empty
    | size m <= n = m
    | otherwise   = lt
  where
    (lt, _) = split k m
    k = elemAt n m
{-# INLINE take #-}

This implementation incurs an unnecessary `Ord` constraint due to a roundabout way of computing `take`: this ext
Categories: Offsite Discussion

New gtk2hs 0.12.4 release

gtk2hs - Wed, 11/21/2012 - 12:56pm

Thanks to John Lato and Duncan Coutts for the latest bugfix release! The latest packages should be buildable on GHC 7.6, and the cairo package should behave a bit nicer in ghci on Windows. Thanks to all!

~d

Categories: Incoming News