# News aggregator

### Jan Stolarek: Parallel Haskell challenge (also, how to make your research project fail)

In September 2012, after playing with Haskell for a couple of months, I decided to get serious about functional programming as a research topic. By that time I had come across many papers and presentations about parallel Haskell, all of them saying how great Haskell is for doing parallel and concurrent computations. This looked very interesting, especially since in my PhD thesis I used an algorithm that seemed to be embarrassingly parallel in nature. I wanted to start my research by applying Haskell to something I was already familiar with, so I decided to write an efficient, parallel Haskell implementation of the algorithm I used in my PhD. This attempt was supposed to be a motivation to learn various approaches to parallel programming in Haskell: Repa, DPH, Accelerate and possibly some others. The task seemed simple and I estimated it should take me about 5 months.

I was wrong and I failed. After about 6 months I abandoned the project. Despite my initial optimism, upon closer examination the algorithm turned out not to be embarrassingly parallel. I could not find a good way of parallelizing it, and working in a functional setting made things even more difficult. I don’t think I will ever get back to this project, so I’m putting the code on GitHub. In this post I will give a brief overview of the algorithm, discuss the parallelization strategies I came up with, and describe the state of the implementation. I hope that someone will pick it up and solve the problems I was unable to solve. Consider this a challenge problem in parallel programming in Haskell. I think that if a solution is found it might be worth a paper (unless it is something obvious that escaped me). In any case, please let me know if you’re interested in continuing my work.

**Lattice structure**

The algorithm I wanted to parallelize is called the “lattice structure”. It is used to compute a Discrete Wavelet Transform (DWT) of a signal [1]. I will describe how it works but will not go into details of why it works the way it does (if you’re interested in the gory details take a look at this paper).

Let’s begin by defining a two-point base operation:

This operation takes two floating-point values x and y as input and returns two new values x’ and y’ created by performing a simple matrix multiplication. In other words:

    x’ = x·cos α + y·sin α
    y’ = x·sin α − y·cos α

where α is a real parameter. The base operation is visualised like this:

(The idea behind the base operation is almost identical to that of the butterfly diagram used in Fast Fourier Transforms.)
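Assuming the rotation-like parameterization above (my reading of the structure; the exact matrix used in the original code may differ), the base operation can be sketched in Haskell:

```haskell
module Main where

-- One two-point base operation of the lattice structure.
-- Assumption: the layer is parameterized by an angle a, giving a
-- rotation-like 2x2 matrix; the original implementation may use a
-- different (but structurally similar) matrix.
baseOp :: Double -> (Double, Double) -> (Double, Double)
baseOp a (x, y) = (x * cos a + y * sin a, x * sin a - y * cos a)

main :: IO ()
main = print (baseOp (pi / 6) (1.0, 2.0))
```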

The lattice structure accepts input of even length, sends it through a series of layers, and outputs a transformed signal of the same length as the input. The lattice structure is organised into layers of base operations connected like this:

The number of layers may be arbitrary; the number of base operations depends on the length of the input signal. Within each layer all base operations are identical, i.e. they share the same value of α. Each layer is shifted by one relative to its preceding layer. At the end of the signal there is a cyclic wrap-around, as denoted by the wrap-around arrows. This has to do with edge effects. By edge effects I mean the question of what to do at the ends of a signal, where we might have fewer samples than required to actually perform our signal transformation (because the signal ends and the samples are missing). There are various approaches to this problem. The cyclic wrap-around performed by this structure means that a finite-length signal is in fact treated as if it were an infinite, cyclic signal. This approach does not give the best results, but it is very easy to implement. I decided to use it and focus on more important issues.

Note that if we don’t need to keep the original signal the lattice structure could operate in place. This allows for a memory-efficient implementation in languages that have destructive updates. If we want to keep the original signal it is enough that the first layer copies data from old array to a new one. All other layers can operate in place on the new array.

**Parallelization opportunities**

One look at the lattice structure and you see that it is parallel – base operations within a single layer are independent of each other and can easily be processed in parallel. This approach seems very appropriate for the CUDA architecture. But since I am not familiar with GPU programming, I decided to begin by exploring parallelism opportunities on a standard CPU.

For CPU computations you can divide the input signal into chunks containing many base operations and distribute these chunks to threads running on different cores. The Repa library uses this parallelization strategy under the hood. The major problem here is that after each layer has been computed we need to synchronize the threads to assemble the result. The question is whether the gains from parallelism are larger than this cost.

After some thought I came up with another parallelization strategy. Instead of synchronizing after each layer, I would give each thread its own chunk of the signal to propagate through all the layers, and then merge the results at the end. This approach requires that each thread is given an input chunk that is slightly larger than the expected output. This results from the fact that here we do not perform cyclic wrap-around but instead narrow down the signal. This idea is shown in the image below:

This example assumes dividing the signal between two threads. Each thread receives an input signal of length 8 and produces output of length 4. A couple of issues arise with this approach. As you can see, there is some overlap of computations between neighbouring threads, which means we will compute some base operations twice. I derived a formula to estimate the amount of duplicated computations and concluded that in practice this issue can be completely neglected. Another issue is that the original signal has to be enlarged, because we don’t perform a wrap-around but instead expect the wrapped signal components to be part of the signal (these extra operations are marked in grey in the image above). This means that we need to create an input vector that is longer than the original one and fill it with appropriate data. We then need to slice that input into chunks, pass each chunk to a separate thread, and once all threads are finished assemble the result. Chunking the input signal and assembling the results at the end are extra costs, but they allow us to avoid synchronizing threads between layers. Again, this approach might be implemented with Repa.
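The chunking step described above can be sketched like this (a hedged illustration, not the original code; how much overlap each chunk needs depends on the number of layers and is left to the caller as an assumption):

```haskell
module Main where

-- Split xs into chunks of length (core + overlap) whose cores tile xs.
-- The overlap is the extra context each thread needs so it can narrow
-- its chunk down through the layers without performing a wrap-around;
-- the signal is extended cyclically at the end, as the post describes.
overlappingChunks :: Int -> Int -> [a] -> [[a]]
overlappingChunks core overlap xs =
    [ take (core + overlap) (drop i xs') | i <- [0, core .. length xs - 1] ]
  where
    xs' = xs ++ take overlap xs  -- cyclic extension

main :: IO ()
main = print (overlappingChunks 4 4 [1 .. 8 :: Int])
```

With a signal of length 8 split between two threads, each thread gets 8 input samples (4 core + 4 overlap), matching the figure.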

A third approach I came up with was a form of nested parallelism: distribute overlapping chunks to separate threads and have each thread compute base operations in parallel, e.g. by using SIMD instructions.

**Methodology**

My plan was to implement various versions of the above parallelization strategies and compare their performance. When I worked in Matlab I used its profiling capabilities to get precise execution times for my code. So one of the first questions I had to answer was “how do I measure the performance of my code in Haskell?” After some googling I quickly came across the criterion benchmarking library. Criterion is really convenient to use because it automatically runs the benchmarked function multiple times and performs statistical analysis of the results. It also plots the results in a very accessible form.

While criterion offered a lot of features I needed, it also raised many questions and issues. One question was whether forcing the lazily generated benchmark input data distorts the benchmark results. It took me several days to come up with experiments that answered this question. Another issue was the reliability of the results. For example, I observed that results can differ significantly across runs. This is of course to be expected in a multi-tasking environment. I tried to eliminate the problem by switching my Linux to single-user mode, where I could disable all background services. Still, it happened that some results differed significantly across multiple runs, which suggests that running benchmarks is not a good way to precisely answer the question “which of the implementations is the most efficient?”. Another observation I made about criterion was that the results of benchmarking functions that use the FFI depend on their placement in the benchmark suite. I was not able to solve that problem, and it undermined my trust in the criterion results. Later during my work I decided to benchmark not only the functions performing the Discrete Wavelet Transform but also all the smaller components that comprise them. Some of the results were impossible for me to interpret in a meaningful way. I ended up not really trusting results from criterion.

Another tool I used for measuring parallel performance was Threadscope. This nifty program visualizes CPU load during program execution as well as garbage collection and some other events like activating threads or putting them to sleep. Threadscope provided me with some insight into what is going on when I run my program. Information from it was very valuable although I couldn’t use it to get the most important information I needed for a multi-threaded code: “how much time does the OS need to start multiple threads and synchronize them later?”.

**Implementation**

As already mentioned, one of my goals for this project was to learn various parallelization techniques and libraries. This resulted in implementing the algorithms described above in a couple of ways. First of all, I used three different approaches to handle the cyclic wrap-around of the signal between the layers:

- **cyclic shift** – after computing one layer, perform a cyclic shift of the intermediate transformed signal: the first element of the signal becomes the last, and all other elements are shifted by one to the front. This is rather inefficient, especially for lists.
- **signal extension** – instead of doing a cyclic shift, extend the initial signal and then shorten it after each layer (this approach is required for the second parallelization strategy, but it can be used in the first one as well). Constructing the extended signal is time consuming, but once the lattice structure computations are started, the transition between layers becomes much faster for lists. For other data structures, like vectors, it is time consuming because my implementation creates a new, shorter signal and copies data from the existing vector to a new one. Since vectors provide constant-time indexing it would be possible to avoid copying by using smarter indexing. I don’t remember why I didn’t implement that.
- **smart indexing** – the most efficient way of implementing cyclic wrap-around is using indexing that shifts the base operations by one on the odd layers. Obviously, to be efficient it requires a data structure that provides constant-time indexing. It requires no copying or any other modification of a layer’s output data, and thus carries no memory or execution overhead.
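As a minimal illustration of the cyclic-shift strategy, here is a hedged sketch of a sequential lattice on lazy lists; `baseOp` stands in for whatever two-point matrix the real code uses (a rotation-like form is assumed here):

```haskell
module Main where

-- Stand-in two-point base operation (assumed rotation-like form).
baseOp :: Double -> (Double, Double) -> (Double, Double)
baseOp a (x, y) = (x * cos a + y * sin a, x * sin a - y * cos a)

-- Apply base operations pairwise within one layer.
layer :: Double -> [Double] -> [Double]
layer a (x : y : rest) = let (x', y') = baseOp a (x, y) in x' : y' : layer a rest
layer _ xs             = xs

-- Cyclic shift between layers: the first element becomes the last,
-- so the next layer's base operations land shifted by one.
cyclicShift :: [Double] -> [Double]
cyclicShift []       = []
cyclicShift (x : xs) = xs ++ [x]

-- One angle per layer; shifting after every layer is a simplification
-- of the real structure, which only interleaves consecutive layers.
lattice :: [Double] -> [Double] -> [Double]
lattice angles signal = foldl (\s a -> cyclicShift (layer a s)) signal angles

main :: IO ()
main = print (lattice [0] [1, 2, 3, 4])
```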

Now that we know how to implement cyclic wrap-around, let’s focus on the actual implementations of the lattice structure. I only implemented the first parallelization strategy, i.e. the one that requires thread synchronization after each layer. I admit I don’t remember the exact reasons why I didn’t implement the signal-chunking strategy. I think I did some preliminary measurements and concluded that the overhead of chunking the signal was way too big. Obviously, the strategy that was supposed to use nested parallelism was also not implemented, because it relied on the chunking strategy. So all of the code uses parallelism within a single layer and synchronizes threads after each layer.

Below is an alphabetic list of what you will find in my source code in the Signal.Wavelet.* modules:

- **Signal.Wavelet.C1** – I wanted to at least match the performance of C, so I made a sequential implementation in C (see cbits/dwt.c) and linked it into Haskell using FFI bindings. I had serious doubts that the overhead of calling C via FFI might distort the results, but luckily it turned out that it does not – see this post. This implementation uses smart indexing to perform cyclic wrap-around. It also operates in place (except for the first layer, as described earlier).
- **Signal.Wavelet.Eval1** – this implementation uses lists and the Eval monad. It uses cyclic shift of the input signal between layers. This implementation was not actually a serious effort. I don’t expect anything that operates on lazy lists to have decent performance in numerical computations. Surprisingly though, adding Eval turned out to be a performance killer compared to the sequential implementation on lists. I never investigated why this happens.
- **Signal.Wavelet.Eval2** – same as Eval1, but uses signal extension instead of cyclic shift. Performance is also very poor.
- **Signal.Wavelet.List1** – sequential implementation on lazy lists with cyclic shift of the signal between the layers. Written as a reference implementation to test other implementations with QuickCheck.
- **Signal.Wavelet.List2** – same as the previous one, but uses signal extension. I wrote it because it was only about 10 lines of code.
- **Signal.Wavelet.Repa1** – parallel and sequential implementation using Repa with cyclic shift between layers. Uses unsafe Repa operations (unsafe = no bounds checking when indexing), forces each layer after it is computed, and is as strict as possible.
- **Signal.Wavelet.Repa2** – same as the previous one, but uses signal extension.
- **Signal.Wavelet.Repa3** – this implementation uses internals of the Repa library. To make it run you need to install a modified version of Repa that exposes its internal modules. In this implementation I created a new type of Repa array that represents a lattice structure. With this implementation I wanted to see if I could get better performance from Repa by placing the lattice computations inside the array representation. It uses smart indexing.
- **Signal.Wavelet.Vector1** – this implementation is a Haskell rewrite of the C algorithm that was supposed to be my baseline. It uses mutable vectors and lots of unsafe operations. The code is ugly – it is in fact an imperative algorithm written in a functional language.

In most of the above implementations I tried to write my code in a way that is idiomatic to functional languages. After all, this is what the Haskell propaganda advertised – parallelism (almost) for free! The exceptions are the Repa3 and Vector1 implementations.

**Results**

Criterion tests each of the above implementations by feeding it a vector containing 16384 elements and then performing a 6-layer transformation. Each implementation is benchmarked 100 times. Based on these 100 runs criterion computes the average runtime, standard deviation, influence of outlying results on the average, and a few more things like plotting the results. Below are the benchmarking results on an Intel i7 M620 CPU using two cores:

“DWT” prefix of all the benchmarks denotes the forward DWT. There is also the IDWT (inverse DWT) but the results are similar so I elided them. “Seq” suffix denotes sequential implementation, “Par” suffix denotes parallel implementation. As you can see there are no results for the Eval* implementations. The reason is that they are so slow that differences between other implementations become invisible on the bar chart.

The results are interesting. First of all, the C implementation is really fast. The only Haskell implementation that comes close to it is Vector1. Too bad the code of Vector1 relies on tons of unsafe operations and isn’t written in a functional style at all. All Repa implementations are noticeably slower. The interesting part is that for Repa1 and Repa2 using parallelism slows down execution by a factor of 2. For some reason this is not the case for Repa3, where parallelism improves performance. Sadly, Repa3 is as slow as the implementations that use lazy lists.

The detailed results, which I’m not presenting here because there’s a lot of them, raise more questions. For example in one of the benchmarks run on a slower machine most of the running times for the Repa1 implementation were around 3.25ms. But there was one that was only around 1ms. What to make of such a result? Were all the runs, except for this one, slowed down by some background process? Is it some mysterious caching effect? Or is it just some criterion glitch? There were many such questions where I wasn’t able to figure out the answer by looking at the criterion results.

There are more benchmarks in the sources – see the benchmark suite file.

**Mistakes and other issues**

Looking back, I can identify several mistakes that eventually led to the failure of this project. Firstly, I think that focusing on CPU implementations instead of the GPU was wrong. My plan was to quickly deal with the CPU implementations, which I thought I knew how to do, and then figure out how to implement these algorithms on a GPU. However, the CPU implementation turned out to be much slower than I expected, and I spent a lot of time trying to actually make my CPU code faster. In the end I never even attempted a GPU implementation.

An important theoretical issue that I should have addressed early in the project is how big an input signal is needed to benefit from parallelism. Parallelism based on multiple threads comes with the cost of launching and synchronizing threads. Given that the Repa implementations do that for each layer, I really pay a lot of extra cost. As you’ve seen, my benchmarks use vectors with 16K elements. The problem is that this seems not to be enough to benefit from parallelism, and at the same time it is much more than encountered in typical real-world applications of DWT. So perhaps there is no point in parallelizing the lattice structure, other than using SIMD instructions?

I think the main cause of this project’s failure is that I did not have sufficient knowledge of parallelism. I had read several papers on Repa and DPH and thought that I knew enough to implement a parallel version of an algorithm I was familiar with. I struggled to understand the benchmark results I got from criterion, but in hindsight I think this was not a good approach. The right thing to do was to look at the generated assembly, something that I did not know how to do at that time. I should also have had a deeper understanding of hardware and of thread handling by the operating system. As a side note, I think this shows that parallelism is not really for free and still requires some arcane knowledge from the programmer. I guess there is a lot to do in the research on parallelism in Haskell.

**Summary**

I undertook a project that seemed like a relatively simple task but ended up as a failure. This was not the first and probably not the last time in my career – it’s just the way science is. I think the major factor that contributed to the failure was me not realizing that I had insufficient knowledge. But I don’t consider my work on this a wasted effort. I learned how to use the FFI and how to benchmark and test my code. This in turn led to many posts on this blog.

What remains is an unanswered question: how does one implement an efficient, parallel lattice structure in Haskell? I hope that, thanks to this post and to putting my code on GitHub, someone will answer it.

**Acknowledgements**

During my work on this project I contacted Ben Lippmeier, author of the Repa library. Ben helped me realize some things that I had missed in my work. That sped up my decision to abandon this project, and I thank Ben for that.

**UPDATE (28/05/2014)**

One of the comments below suggests it would be interesting to see the performance of a parallel implementation in C++ or Rust. In fact I have attempted a parallel implementation in C using SSE3 SIMD instructions. I undertook that effort a few months after giving up on the project, with the sole purpose of seeing whether the C implementation could be made faster. I haven’t finished that attempt, so I did not describe it in the original post, but since the subject was raised I’ll briefly describe what I have accomplished. The idea was to modify the C1 implementation and rewrite the computations using Intel SSE intrinsics. That turned out to be quite simple, although at one point I ran into some unexpected segmentation faults that I was unable to debug. Since this was taking more time than I was planning to dedicate to this experiment, I gave up. I tested this implementation just now and surprisingly there are no segfaults. Still, the code is incomplete – signal wrap-around in the odd layers is not implemented, and by eye-balling the results I suspect there might be some other bugs in the implementation. I’ve run the benchmarks and the results show that using SSE3 speeds up the C implementation by about 25%–30%, which is quite a lot. Implementing signal wrap-around will certainly slow the implementation down, but I still think the performance gain will remain significant. I pushed my work to the sse3 branch. Feel free to finish the implementation.

1. Orthogonal transform, to be more precise. It is possible to construct lattice structures for biorthogonal wavelets, but that is well beyond the scope of this post.

### An experiment with typed time

### Ken T Takusagawa: [xjrnkknh] RandT ST

Here is a brief example of combining the RandT monad transformer with the ST monad. We write a random value into an STRef, then read it. The magic function is `lift :: (MonadTrans t, Monad m) => m a -> t m a`.

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
module Main where {

import Control.Monad.Random(RandT, getRandomR, evalRandT);
import Control.Monad.ST.Lazy(ST, runST);
import System.Random(RandomGen, StdGen, mkStdGen);
import Control.Monad.Trans(lift);
import Data.STRef.Lazy(STRef, writeSTRef, readSTRef, newSTRef);

-- We could use a shortcut like this, but will not for pedagogical purposes.
type RS s a = RandT StdGen (ST s) a;

doWrite :: (RandomGen g) => STRef s Int -> RandT g (ST s) ();
doWrite v = do {
  r :: Int <- getRandomR (1, 6);
  lift $ writeSTRef v r;
};

foo :: (RandomGen g) => RandT g (ST s) Int;
foo = do {
  v :: STRef s Int <- lift $ newSTRef 0;
  doWrite v;
  out :: Int <- lift $ readSTRef v;
  return out;
};

runAll :: Int;
runAll = runST $ evalRandT foo $ mkStdGen 12345;

main :: IO ();
main = print runAll;

}
```

Here is the output, typical of the flaw in random number generation of the first sample.

6

Previously, an example of ErrorT and ST.

### Eric Kidd: Learning Middle Egyptian with Anki, slowly

Although I don't usually mention it here, one of my hobbies is learning languages. French is my strongest by far, but I've been experimenting with seeing just how slowly I can learn Middle Egyptian. Normally, I need to reach a certain minimum degree of obsession to actually make progress, but it turns out that software can help a bit, as I explain in this post on the Beeminder blog.

But when I decided to learn Egyptian, I was faced with a dilemma: I couldn't justify spending more than an hour per week on it. Hieroglyphs are cool, but come on—it's a dead language. Unfortunately, it's hard to learn a language in slow motion, because two things always go wrong:

- I get distracted, and I never actually put in that hour per week…
- I forget everything I learn between lessons…

Of course, one key tool here is Anki, which cleverly exploits the
spacing effect of human memory. To oversimplify, if I'm forced
to recall something shortly before I would have otherwise forgotten it,
I'll remember it at least twice as long the next time. This allows
remembering things for *O(2^N)* time for *N* effort, which is a nice trick.

On a related note, I have a new toy up on GitHub: hierogloss, which extends Markdown with support for interlinear glosses rendered using JSesh:

```
H: z:A1*Z1 |
```

### Design / Data Structure advice wanted

Greetings /r/haskell,

I have been hacking away at some Haskell code to evaluate some models for optimal inventory control, and I could use some design / data structure advice. I need to evaluate this math expression.

For some background, my code is structured such that I have ProblemParams and SupplierParams data types which hold a couple of parameters that define the shape of the probability density functions and provide K, T, and M. I want to solve this problem over many parameterizations, and M will be in [2..6]. When this term is evaluated, s0 and i are fixed.

This term is a piece of a function with the following definition:

```haskell
cr_t2 :: Double -> Int -> (ProblemParams, SupplierParams) -> Double
cr_t2 s0 i config = sum $ map (\ri -> cr_t2_inner s0 ri i config) [1..mlim]
  where mlim = (_pp_m (fst config)) + 1
```

The f_{0} and f_{i} are probability density functions. I have a make_density_function with type:

The integral has type

```haskell
cr_t2_inner :: Double -> Int -> Int -> (ProblemParams, SupplierParams) -> Double
cr_t2_inner s0 ri i pp = undefined
```

The idea behind cr_t2_inner is that I can generate an integrand and then integrate it. The dimension of the integrand is M+1. This is where I am running into trouble. I am using Numeric.Integration to do the integration. It provides the function trap which is defined as:

```haskell
trap :: (Double -> Double) -> Double -> Double -> [Result]
```

where the arguments are integrand, lower bound, and upper bound. To do the multidimensional integration, I defined:

```haskell
ttrap f xmin xmax = (ans, err)
  where
    res = absolute 1e-6 $ parTrap f xmin xmax
    ans = result res
    err = errorEstimate res

-- f ylower yupper (fn for x lower) (fn for x upper)
ttrap2 f y1 y2 g1 g2 = ttrap h y1 y2
  where
    h y = fst $ ttrap (flip f y) (g1 y) (g2 y)

ttrap2_fixed f y1 y2 x1 x2 = ttrap2 f y1 y2 (const x1) (const x2)

ttrap3_fixed :: (Double -> Double -> Double -> Double)
             -> Double -> Double -> Double -> Double -> Double -> Double -> Double
ttrap3_fixed f z1 z2 y1 y2 x1 x2 = fst $ ttrap h z1 z2
  where h z = fst $ ttrap2_fixed (f z) x1 x2 y1 y2
```

and proceed to define ttrapN in terms of ttrap{N-1}, and similarly for ttrapN_fixed. The _fixed postfix denotes the idea that the bounds are fixed.

I had initially made the type of the integrand [Double] -> Double. My rationalization for this was that, since M could vary, I would make the integrand of type [Double] -> Double and then keep track of the implicit arity of the function on my own. Then, I can use this ugly code to convert a function to one of proper arity:

```haskell
static_1 :: ([Double] -> Double) -> (Double -> Double)
static_1 f = f' where f' x = f [x]

static_2 :: ([Double] -> Double) -> (Double -> Double -> Double)
static_2 f = f' where f' x y = f [x,y]

static_3 :: ([Double] -> Double) -> (Double -> Double -> Double -> Double)
static_3 f = f' where f' x y z = f [x,y,z]

static_4 :: ([Double] -> Double) -> (Double -> Double -> Double -> Double -> Double)
static_4 f = f' where f' x y z w = f [x,y,z,w]

static_5 :: ([Double] -> Double) -> (Double -> Double -> Double -> Double -> Double -> Double)
static_5 f = f' where f' x y z w u = f [x,y,z,w,u]

int_listfn :: ([Double] -> Double) -> [(Double, Double)] -> Double
int_listfn f b
  | f_dim == 1 = fst $ ttrap (static_1 f) (fst (b !! 0)) (snd (b !! 0))
  | f_dim == 2 = fst $ ttrap2_fixed (static_2 f) (fst (b !! 0)) (snd (b !! 0)) (fst (b !! 1)) (snd (b !! 1))
  | f_dim == 3 = ttrap3_fixed (static_3 f) (fst (b !! 0)) (snd (b !! 0)) (fst (b !! 1)) (snd (b !! 1)) (fst (b !! 2)) (snd (b !! 2))
  | f_dim == 4 = ttrap4_fixed (static_4 f) (fst (b !! 0)) (snd (b !! 0)) (fst (b !! 1)) (snd (b !! 1)) (fst (b !! 2)) (snd (b !! 2)) (fst (b !! 3)) (snd (b !! 3))
  | f_dim == 5 = ttrap5_fixed (static_5 f) (fst (b !! 0)) (snd (b !! 0)) (fst (b !! 1)) (snd (b !! 1)) (fst (b !! 2)) (snd (b !! 2)) (fst (b !! 3)) (snd (b !! 3)) (fst (b !! 4)) (snd (b !! 4))
  | f_dim == 6 = ttrap6_fixed (static_6 f) (fst (b !! 0)) (snd (b !! 0)) (fst (b !! 1)) (snd (b !! 1)) (fst (b !! 2)) (snd (b !! 2)) (fst (b !! 3)) (snd (b !! 3)) (fst (b !! 4)) (snd (b !! 4)) (fst (b !! 5)) (snd (b !! 5))
  | f_dim == 7 = ttrap7_fixed (static_7 f) (fst (b !! 0)) (snd (b !! 0)) (fst (b !! 1)) (snd (b !! 1)) (fst (b !! 2)) (snd (b !! 2)) (fst (b !! 3)) (snd (b !! 3)) (fst (b !! 4)) (snd (b !! 4)) (fst (b !! 5)) (snd (b !! 5)) (fst (b !! 6)) (snd (b !! 6))
  | otherwise = error "Unsupported integral size"
  where f_dim = length b
```

The idea of int_listfn is that I could pass it a function of type ([Double] -> Double) and then the pseudo-arity of that function is implied by the list of (lower bound, upper bound) tuples. However,

- This seems really ugly, and not the FP / Haskell way of doing things
- For this integral, I cannot express the bounds as a fixed list of tuples, since the inner bounds depend on the outer variables. ttrap2 shows this, but I am not sure how to generalize it.

I was given some help on Stack Overflow using TypeFamilies and FlexibleInstances to make an integration function which can operate on types (Double -> Double -> Double -> ... -> Double), but this does not solve the bounds issue.
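One possible direction (my suggestion, not from the thread): keep the [Double] -> Double integrand, but make every bound a function of the outer variables already fixed, so dependent bounds like those in ttrap2 fit the same interface. A minimal self-contained sketch, with a naive fixed-step trapezoid rule standing in for Numeric.Integration's adaptive routines:

```haskell
module Main where

-- Composite trapezoid rule on n subintervals (assumption: fixed n,
-- no error control, unlike Numeric.Integration.TanhSinh).
trapz :: Int -> (Double -> Double) -> Double -> Double -> Double
trapz n f a b =
    h * (0.5 * (f a + f b) + sum [f (a + h * fromIntegral k) | k <- [1 .. n - 1]])
  where h = (b - a) / fromIntegral n

-- Integrate a [Double] -> Double integrand over a list of bounds,
-- outermost first. Each bound pair is a function of the outer
-- variables already fixed, so dependent bounds need no special case.
intNested :: Int -> ([Double] -> Double) -> [[Double] -> (Double, Double)] -> Double
intNested n f = go []
  where
    go xs []       = f (reverse xs)
    go xs (b : bs) = let (lo, hi) = b (reverse xs)
                     in trapz n (\x -> go (x : xs) bs) lo hi

main :: IO ()
main =
  -- Integrating 1 over 0 <= x <= y, 0 <= y <= 1 gives the area 1/2;
  -- the inner bound depends on the outer variable y.
  print (intNested 200 (const 1) [const (0, 1), \[y] -> (0, y)])
```

The arity problem disappears because the dimension is just the length of the bounds list, at the cost of losing static checking of that arity.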

I would really appreciate some advice on a better approach to this.

submitted by steve_dot_planetbarr [link] [2 comments]

### digitalcommons.mcmaster.ca


### A cli fuzzy filter in Haskell. Can I get a code review?

Hey everyone, I ported Gary Bernhardt's Selecta to Haskell: https://github.com/cgag/selecth. It's a pretty cool tool, see the original selecta readme for more info on how to use it.

Porting it has been a great learning experience, but I feel like I've done a lot of things sub-optimally and was hoping I could get some advice from r/haskell. I have duplicated code with regards to closing the handle to the tty. I also need to refactor it to get the current tty state before I start doing my rendering, and then ensure that state is restored before I exit; I wasn't sure of the best way to go about achieving that. Any feedback is appreciated, thanks.

submitted by curtisg[link] [12 comments]

### Essence of Compilation

### Haskell for OCaml programmers


### language-puppet: 7 Startups - part 4 - Adding an asynchronous backend

This one will be an experimental post, as I have just added a ton of new code. I did not have enough time to do this properly, so it looks like a giant tangle of STM functions now. I will probably rewrite a large part of it, but I still think the thought process that led me to this could be interesting to others, so here we are.

TLDR: I wrote a lot of code, it sucks, and I will rewrite a large part of it for the next episode.

**Stuff that was refactored since last time**

A few minor items:

- I fixed the PrettyElement problem here
- I changed the type of playerActionsDialog so that it only accepts NonEmpty lists.
- I did the same for allowableActions. I also rendered the function partial by mistake – can you spot the bug? :)

The big change came from the fact that I realized my operation types were wrong. In particular this one :

```haskell
AskCard :: Age -> PlayerId -> NonEmpty Card -> Message -> GameInstr Card
```

This type seemed right for writing the game rules, and the console version. However, it does suck for a multiplayer version, as this code will ask the second player for his choice only after the first one has answered. This will slow down the game considerably. We should query all players at once, and wait for their answers after that. I decided to model this as an abstract promise, *i.e.* a value container that will eventually be filled. There is a new type parameter p for the GameInstr type, along with a new GetPromise instruction.
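A hedged sketch of what such a reshaped instruction type could look like; the constructor shapes and the placeholder game types below are guesses at the idea, not the actual 7 Startups API:

```haskell
{-# LANGUAGE GADTs #-}
module Main where

import Data.List.NonEmpty (NonEmpty (..))

-- Placeholder types standing in for the game's real ones (assumptions).
type Age      = Int
type PlayerId = String
type Message  = String
data Card = Card deriving (Show, Eq)

-- AskCard now yields a promise (p Card) instead of blocking for a
-- Card, and GetPromise waits until a promise is fulfilled. The
-- interpreter chooses the promise type p.
data GameInstr p a where
    AskCard    :: Age -> PlayerId -> NonEmpty Card -> Message -> GameInstr p (p Card)
    GetPromise :: p a -> GameInstr p a

-- For a pure backend the promise type can simply be Identity:
-- the value is available immediately.
newtype Identity a = Identity { runIdentity :: a }

main :: IO ()
main = print (runIdentity (Identity Card))
```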

Now, all players are consulted at the same time, and the game then waits for them to answer (code).

This is all very abstract, but in practice things are not that simple, and the promise might not get fulfilled. One problem is a player disconnecting from a backend. One way to do this would be to make the return value of the getPromise function be an Either of some sort. But the only sensible option in the game logic would be to throw an error, so instead the interpreter’s getPromise can fail, but not the version that’s exposed to the game.

For the pure backend, the promise type is just Identity, as seen here.
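To illustrate what this buys us (a hypothetical sketch, not the actual interpreter code): with Identity, creating and reading a promise are both immediate, which is all a pure, synchronous backend needs.

```haskell
import Data.Functor.Identity (Identity(..))
import Data.List.NonEmpty (NonEmpty(..))

-- In the pure backend a "promise" is just the value itself
-- (askCardPure is a made-up name; the real code differs).
askCardPure :: NonEmpty card -> Identity card
askCardPure (c :| _) = Identity c  -- the pure backend answers right away

-- Reading the promise is immediate too.
getPromisePure :: Identity a -> a
getPromisePure = runIdentity
```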

An ambitious step

I decided to get a bit more ambitious for this episode. I wanted to implement code that would be able to run several games at once, over several media at once, with players connected on any combination of media. I did not start by writing a simple backend to get a feel of where the actual problems were; instead, I decided to write the code top-down.

So here is what I had in mind:

So basically, there would be a “Hub” that would hold the player list and track who is playing which game, and that would also run the games. Backends would interact with it to provide IO with the players. As IRC and XMPP have the same kind of communication model, they would be merged into a single “virtual backend” that would communicate with a pair of “real backends”. Now how to do that?

Asynchronous communication

Both backends need to listen to two kinds of events:

- Requests from the game, such as asking a specific player to choose a card.
- Instant messages from a server, a channel, or a person.

From the point of view of a game, the messages from the players are usually of no interest. It just needs them to choose a card, or an action to play, from time to time. The backends, however, will need to watch out for administrative commands. This means there should be a lot of filtering.

The main problem resides in asking a player something, and *getting his answer*. An extremely naive way to go about this would be something along the lines of:
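Concretely, the naive approach could look like this (a sketch with stand-in names and a fake single-queue session so it actually runs):

```haskell
import Data.IORef
import Data.List.NonEmpty (NonEmpty(..))

-- Minimal stand-ins so the sketch runs (all names are assumptions):
type Card     = String
type PlayerId = String
newtype Session = Session (IORef [String])  -- queue of incoming messages

sendMessage :: Session -> PlayerId -> String -> IO ()
sendMessage _ _ _ = return ()

getMessage :: Session -> IO String
getMessage (Session q) = do
  msgs <- readIORef q
  case msgs of
    (m:ms) -> writeIORef q ms >> return m
    []     -> error "no more messages"

-- The naive version: send the question, then treat the very next
-- incoming message as the answer, whoever it comes from.
askCardNaive :: Session -> PlayerId -> NonEmpty Card -> IO Card
askCardNaive session pid (c :| cs) = do
  sendMessage session pid ("choose one of: " ++ unwords (c:cs))
  getMessage session
```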

This would be completely wrong, because there is no guarantee that the next message will be the one we are expecting. So instead we need to implement some sort of callbacks: when a message arrives, the backend checks whether we were expecting something from the sender, and reacts accordingly. This means we need an extra thread that just waits for messages, and handles administrative commands and callbacks. So something like:

```haskell
forkIO $ forever $ do
    msg <- getMessage session
    let pid = getPlayerId msg
    case callbacks ^. at pid of
        Just cb -> cb msg
        Nothing -> case getContent msg of
            "xxx" -> adminCommandXxx
            "yyy" -> adminCommandYyy
            _     -> return ()
runGame
```

Where the game would do something like this to implement, for example, the “ask card” function:

```haskell
askCard :: PlayerId -> NonEmpty Card -> m (promise Card)
askCard pid necards = do
    let msg = buildAskCardMessage necards
    sendMessage session pid msg
    p <- newPromise
    addCallback pid $ \msg -> case decodeMsg of
        Just x  -> fulfill p x
        Nothing -> askAgain
    return p
```

Multiple backends

So that was cool, but what if there are multiple backends? All of them must be able to fulfill that promise! What I would like is to be able to return a promise that will be fulfilled at a later date in another part of my program. Usually, this would be something like:

```haskell
promise    :: IO a -> IO (Promise a)
getPromise :: Promise a -> IO a
```

But I decided that most of my functions will live in STM (I did not document this choice, but STM is so useful it's a no-brainer), and I wanted to write code anyway.

Because of “not invented here”, I wrote my own implementation, gave it a bad name (PubSub), and wrote buggy Monoid instances. The code is here, and it is wrong on many levels. It exposes this set of functions:

```haskell
newPubSub    :: STM (PubFP e a, SubFP e a)
fulfillPub   :: PubFP e a -> a -> STM ()
failPub      :: PubFP e a -> e -> STM ()
getResult    :: SubFP e a -> STM (Either e a)
addPubAction :: STM () -> PubFP e a -> PubFP e a
```

newPubSub gives you a pair of values: one you can publish to (using fulfillPub for success and failPub for failure), and one you can get results from (with a blocking getResult).
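For reference, a minimal correct version of this API can be sketched on top of a TMVar (my own sketch, not the repository code): the TMVar is written at most once, and getResult blocks until it is filled. I leave out addPubAction, which, as discussed below, was not needed.

```haskell
import Control.Concurrent.STM
import Control.Monad (void)

-- A publisher is "something you can hand a result to"; a subscriber is
-- "a blocking read of that result".
newtype PubFP e a = PubFP (Either e a -> STM ())
newtype SubFP e a = SubFP (STM (Either e a))

newPubSub :: STM (PubFP e a, SubFP e a)
newPubSub = do
  v <- newEmptyTMVar
  -- tryPutTMVar returns False if the TMVar is already full, so any
  -- publication after the first one is silently ignored.
  let pub x = void (tryPutTMVar v x)
  -- readTMVar retries until the TMVar is filled, and leaves the value
  -- in place, so every getResult sees the same answer.
  return (PubFP pub, SubFP (readTMVar v))

fulfillPub :: PubFP e a -> a -> STM ()
fulfillPub (PubFP p) = p . Right

failPub :: PubFP e a -> e -> STM ()
failPub (PubFP p) = p . Left

getResult :: SubFP e a -> STM (Either e a)
getResult (SubFP s) = s
```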

The name is wrong because getResult will always return the same value, so this does not behave at all like a stream, which could be implied by the PubSub name.

The addPubAction function is horrible too. I only had a sketchy idea that I would need callbacks at some point when I wrote this module, and that these callbacks should be run as soon as possible, so probably as soon as a response is published. This is wrong because:

- The type lets you do a ton of stuff, so it's not obvious what this function does or how it is supposed to be used.
- It is not actually useful. I realized later this was not needed.

The Monoid instances suffer from the same problem, as they are probably not useful. Even worse, one of them doesn't even work!

```haskell
instance Monoid (PubFP e a) where
    mempty = PubFP (const (return ()))
    PubFP a `mappend` PubFP b = PubFP (a >> b)
```

It actually used the Monad instance of (->), not STM, which, if I desugar it properly, does something like this:

```haskell
a >> b
a >>= \_ -> b
\z -> (\_ -> b) (a z) z
\z -> b z
b
```

So it basically only kept the last PubFP. The correct implementation of mappend should have been:

```haskell
instance Monoid (PubFP e a) where
    PubFP a `mappend` PubFP b = PubFP (\e -> a e >> b e)
```

Abstracting a backend

Now that I have decided to have a “hub” connecting several backends, I need to find a way to abstract them. I will need to keep a list of them, and it must be possible to add and remove them dynamically, so I need some way to identify each backend. I also need a way to tell a backend that it is about to be removed, so that it can clean things up and say goodbye. Finally, I need a way to talk to a backend, and a way for the backend to interact with the hub.

Here is the set of threads I should need, without undue multiplexing:

- One thread per active game. I don’t think it’s possible to combine them in a single thread due to the API I exposed, and I don’t think it’s worth worrying about this anyway.
- One thread per backend, that will listen to messages from the outside world.

That is probably all I need, but because I started writing code *before* thinking about my architecture, I introduced a bad abstraction. I decided I would talk to the backends using a TChan, which means:

- One additional thread per backend, listening to messages from the games.

So backends are defined here. A better abstraction for backendChan :: TChan Interaction would be backendTell :: Interaction -> STM (). You might notice that the comments talk about a set of functions, which was my first shot, and which was indeed a better idea.
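As a sketch of that better abstraction (hypothetical, not the repository code): a backend exposes a single STM action to be told about interactions, and a TChan-based implementation still fits behind it, without the extra forwarding thread.

```haskell
import Control.Concurrent.STM

-- Placeholder for the real Interaction type (an assumption).
newtype Interaction = Interaction String deriving (Eq, Show)

-- A backend is identified by name and told about interactions through
-- a single STM action, instead of exposing its TChan directly.
data Backend = Backend
  { backendName :: String
  , backendTell :: Interaction -> STM ()
  }

-- A TChan-based backend still fits this interface.
chanBackend :: String -> TChan Interaction -> Backend
chanBackend name chan = Backend name (writeTChan chan)
```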

Abstracting communication

Communications from the game to the player are currently of three types:

- Asking the player what card he would like to play during normal play. This returns a pair of a PlayerAction and an Exchange.
- Asking the player to choose between several cards. This returns a Card.
- Telling various things to the player. There is no return value expected from the player.

In all cases, we need to contact each backend and tell it to contact a player or broadcast some message.

For the first two cases, we also need to set up some callback machinery and wait for an answer. The function performing this should quickly return some kind of promise object that will be filled once a player has answered.

We would like to write a function with a type looking somewhat like this:

```haskell
communicate :: TVar GlobalState -> a -> STM (Promise b)
```

Where a is one of the three message types, and b the corresponding answer. Obviously, we can't write such a function with this type. But we can use the TypeFamilies extension to write it:

```haskell
data IAskingAction  = IAskingAction PlayerId Age (NonEmpty Card) GameState Turn
data IAskingCard    = IAskingCard PlayerId Age (NonEmpty Card) GameState Message
data ISimpleMessage = ISimpleMessage CommunicationType GameId

type family InteractionResult a :: *
type instance InteractionResult IAskingAction  = (PlayerAction, Exchange)
type instance InteractionResult IAskingCard    = Card
type instance InteractionResult ISimpleMessage = ()

communicate :: TVar GlobalState -> a -> STM (Promise (InteractionResult a))
```

Now the type of a and the type of the Promise are statically linked, which will be useful for writing generic functions.
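To see the pattern in isolation, here is a self-contained toy version (the question types are made up; the real IAsking* types carry game state):

```haskell
{-# LANGUAGE TypeFamilies #-}

-- Two toy "question" types with different answer types.
data AskNumber = AskNumber
data AskName   = AskName

-- The type family maps each question type to its answer type.
type family Answer q
type instance Answer AskNumber = Int
type instance Answer AskName   = String

-- Generic code can now be written once over `q` and `Answer q`.
class Question q where
  defaultAnswer :: q -> Answer q

instance Question AskNumber where defaultAnswer _ = 0
instance Question AskName   where defaultAnswer _ = "nobody"
```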

Conclusion

This episode was about hasty decisions and quickly written code. I was not exactly in my comfort zone with the multiple-backends concept, and should probably have aimed lower to get a feel for the problem first.

I will rework all of this before the next episode, which will be about concurrency, the STM, and how to mix IO and STM actions.