News aggregator

Ian Ross: Non-diffusive atmospheric flow #9: speeding up KDE

Planet Haskell - Fri, 01/30/2015 - 9:21am
Non-diffusive atmospheric flow #9: speeding up KDE January 30, 2015

The Haskell kernel density estimation code in the last article does work, but it’s distressingly slow. Timing with the Unix time command (not all that accurate, but it gives a good idea of orders of magnitude) reveals that this program takes about 6.3 seconds to run. For a one-off, that’s not too bad, but in the next article, we’re going to want to run this type of KDE calculation thousands of times, in order to generate empirical distributions of null hypothesis PDF values for significance testing. So we need something faster.

It’s quite possible that the Haskell code here could be made quite a bit faster. I didn’t spend a lot of time thinking about optimising it. I originally tried a slightly different approach using a mutable matrix to accumulate the kernel values across the data points, but this turned out to be slower than very simple code shown in the previous article (something like ten times slower!). It’s clear though that this calculation is very amenable to parallelisation – the unnormalised PDF value at each point in the (θ,ϕ) grid can by calculated independently of any other grid point, and the calculation for each grid point accesses all of the data points in a very regular way.

So, let’s parallelise. I have an NVIDIA graphics card in my machine here, so it’s very tempting to do something with CUDA. If I was a real Haskell programmer, I’d use Accelerate, which is an embedded DSL for data parallel array processing that has a CUDA backend. Unfortunately, a few experiments revealed that it would take me a while to learn how to use Accelerate, which has some restrictions on how you can structure algorithms that didn’t quite fit with the way I was trying to do things. So I gave up on that.

However, I’ve been writing quite a lot of CUDA C++ recently, so I decided to use the simple FFI bindings to the CUDA runtime API in the Haskell cuda package, and to write the CUDA code itself in C++. If you’re at all familiar with CUDA C++, running kernels from Haskell turns out to be really pretty easy.

I’m not going to get into a long description of CUDA itself here. You can read all about it on the NVIDIA website or there are a couple of MOOCs that cover parallel programming with CUDA1.

Here’s the CUDA code for the KDE calculation:

#include <cuda.h> #include <cmath> // Use a regular 2.5 deg. x 2.5 deg. theta/phi grid on unit sphere. const int Nphi = 2 * 360 / 5, Ntheta = 2 * 180 / 5; // Integration steps. const double dphi = 2.0 * M_PI / Nphi, dtheta = M_PI / Ntheta; // Density kernel bandwidth. const double bandwidth = M_PI / 6.0; extern "C" __global__ void make_pdf_kernel (unsigned int D, const double * __restrict__ d_data, double *d_pdf) { unsigned int c = blockIdx.x * blockDim.x + threadIdx.x; unsigned int r = blockIdx.y * blockDim.y + threadIdx.y; double th = (0.5 + r) * dtheta, ph = c * dphi; double gx = sin(th) * cos(ph), gy = sin(th) * sin(ph), gz = cos(th); if (r > Ntheta || c > Nphi) return; double sum = 0.0; for (unsigned int i = 0; i < D; ++i) { double dx = d_data[3 * i], dy = d_data[3 * i + 1], dz = d_data[3 * i + 2]; double u = acos(dx * gx + dy * gy + dz * gz) / bandwidth; if (u < 1) sum += 1 - u * u; } d_pdf[r * Nphi + c] = sum; }

I’ll just say a couple of things about it:

  • We launch CUDA threads in two-dimensional blocks set up to cover the whole of the (θ,ϕ) grid: the row and column in the grid are calculated from the CUDA block and thread indexes in the first two lines of the kernel.

  • The main part of the computation is completely straightforward: each thread determines the coordinates of the grid point it’s working on, then loops over all the data points (which are stored flattened into the d_data vector) accumulating values of the KDE Epanechnikov kernel, finally writing the result out to the d_pdf vector (also a two-dimensional array flattened into a vector in row-major order).

  • We do almost nothing to optimise the CUDA code: this is just about the simplest CUDA implementation of this algorithm that’s possible, and it took about five minutes to write. The only “clever” thing is the declaration of the input data point array as const double * __restrict__ d_data. This little bit of magic tells the CUDA C++ compiler that the d_data array will not be aliased, contains data that will not be modified during the execution of the CUDA kernel, and is accessed in a consistent pattern across all threads running the kernel. The upshot of this is that the compiler can generate code that causes the GPU to cache this data very aggressively (actually, on more recent GPUs, in a very fast L1 cache). Since every thread accesses all of the data points stored in d_data, this can lead to a large reduction in accesses to global GPU memory. Much optimisation of GPU code comes down to figuring out ways to reduce global memory bandwidth, since global memory is much slower than the registers and local and shared memory in the multiprocessors in the GPU. The usual approach to this sort of thing is to load chunks of data into shared memory (“shared” in this context means shared between GPU threads within a single thread block), which is faster than global memory. This requires explicitly managing this loading though, which isn’t always very convenient. In contrast, telling the CUDA compiler that it can cache d_data in this way has more or less the same effect, with next to no effort.

The CUDA code is compiled to an intermediate format called PTX using the NVIDIA C++ compiler, nvcc:

nvcc -O2 -ptx -arch=compute_50 -code=sm_50

The arch and code options tell the compiler to produce code for NVIDIA devices with “compute capability” 5.0, which is what the GPU in my machine has.

Calling the CUDA code from Haskell isn’t too hard. Here, I show just the parts that are different from the previous Haskell-only approach:

main :: IO () main = do -- CUDA initialisation. CUDA.initialise [] dev <- CUDA.device 0 ctx <- CUDA.create dev [] ptx <- B.readFile "make_pdf.ptx" CUDA.JITResult time log mdl <- CUDA.loadDataEx ptx [] fun <- CUDA.getFun mdl "make_pdf_kernel" ... code omitted ... -- Convert projections to 3-D points, flatten to vector and allocate -- on device. let nData = rows projsin dataPts = SV.concat $ map projToPt $ toRows $ cmap realToFrac projsin dDataPts <- CUDA.mallocArray $ 3 * nData SV.unsafeWith dataPts $ \p -> CUDA.pokeArray (3 * nData) p dDataPts -- Calculate kernel values for each grid point/data point -- combination and accumulate into grid. dPdf <- CUDA.mallocArray $ ntheta * nphi :: IO (CUDA.DevicePtr Double) let tx = 32 ; ty = 16 blockSize = (tx, ty, 1) gridSize = ((nphi - 1) `div` tx + 1, (ntheta - 1) `div` ty + 1, 1) CUDA.launchKernel fun gridSize blockSize 0 Nothing [CUDA.IArg (fromIntegral nData), CUDA.VArg dDataPts, CUDA.VArg dPdf] CUDA.sync res <- (ntheta * nphi) SVM.unsafeWith res $ \p -> CUDA.peekArray (ntheta * nphi) dPdf p unnormv <- SV.unsafeFreeze res let unnorm = reshape nphi unnormv

First, we need to initialise the CUDA API and load our compiled module. The PTX format is just-in-time compiled to binary code for the installed GPU during this step, and we output some information about the compilation process.

Allocating memory on the GPU is done using the cuda package’s versions of functions like mallocArray, and the pokeArray function is used to transfer data from the host (i.e. the CPU) to the GPU memory. The CUDA kernel is then run using the launchKernel function, which takes arguments specifying the layout of threads and thread blocks to use (the values shown in the code above were the best I found from doing a couple of quick timing experiments), as well as the parameters to pass to the kernel function. Once the kernel invocation is finished, the results can be retrieved using the peekArray function.

This is all obviously a little bit grungy and it would definitely be nicer from an aesthetic point of view to use something like Accelerate, but if you already know CUDA C++ and are familiar with the CUDA programming model, it’s really not that hard to do this sort of mixed language CPU/GPU programming.

And is it fast? Oh yeah. Again, using the very unsophisticated approach of measuring elapsed time using the Unix time utility, we find that the CUDA version of the KDE code runs in about 0.4 seconds on my machine. That’s more than ten times faster than the Haskell only version, on a not very beefy GPU. And as I’ve said, I’ve not put any real effort into optimising the CUDA code. I’m sure it could be made to go faster. But for our purposes, this is good enough. In the next article, we’ll want to run this code 10,000 times (you’ll see why), which will take a little over an hour with the CUDA code, rather than nearly 17 hours with the Haskell-only code. For writing library code, or for more performance-critical applications, you would obviously put a lot more effort into optimisation (think about all the FFT stuff from earlier articles!), but for this kind of one-off data analysis task, there’s no benefit to spending more time. It’s easy to set off an analysis job, go for lunch and have the results ready when you come back.

  1. The Udacity course is pretty good, as is the Coursera Heterogeneous Parallel Programming course.

data-analysis haskell <script src="" type="text/javascript"></script> <script type="text/javascript"> (function () { var articleId = fyre.conv.load.makeArticleId(null); fyre.conv.load({}, [{ el: 'livefyre-comments', network: "", siteId: "290329", articleId: articleId, signed: false, collectionMeta: { articleId: articleId, url: fyre.conv.load.makeCollectionUrl(), } }], function() {}); }()); </script>
Categories: Offsite Blogs

Why doesn't cabal work like this?

Haskell on Reddit - Fri, 01/30/2015 - 6:24am

Why doesn't cabal just follow this algorithm for handling installed libraries:

  1. every time a library is needed, install the latest version of that library, that is within the range of the package you are trying to install, in a different folder (instead of updating the one already there)
  2. if a library is already installed, use that library otherwise follow step 1

You can even minimise and optimise this by analising all projects dependencies (or keeping some data structure with that information) and see which version of any library covers the largest number of requirements and install that instead of another one or see if you can safely update one that already exists without breaking installations.

Or maybe you can just base the above algorithm on the version number, so that you are sure that for every version x.y of a library you can just update it to the latest x.y.z without breaking changes for other libraries.

Where's the flaw in this reasoning? Is installing every library every time in different sandboxes really that better?

submitted by notjefff
[link] [8 comments]
Categories: Incoming News

Any suggestions on how I could make my ray tracer faster?

Haskell on Reddit - Fri, 01/30/2015 - 1:53am

This is what I've got so far. If you run Test.hs, you will notice the raytracer is pretty slow. It is very incomplete yet, mostly just the Octree and a few intersection functions. Enough to realize it is slow, though. Since I haven't really optimized anything, that's expectable. But I don't know where to go. I've tried using "-prof" to have an idea, but I get a "Could not find module" error message for pretty much anything I imported. Seems like the problem is I didn't build my libraries with profiling enabled. Considering it took me a good 2~3 days to build them, I don't have time to pass through that again so I guess that tool is out my reach right now :(

Any ideas? Also, suggestion on my code in general is welcome since this is the first time I stopped to actually try coding something in Haskell.

submitted by SrPeixinho
[link] [12 comments]
Categories: Incoming News

(GHC) The Kind of (->) seems magic

Haskell on Reddit - Fri, 01/30/2015 - 1:28am

A dialogue with GHCi (version 7.6.3) while playing with types and kinds...

Prelude> :set -XMagicHash Prelude> :m +GHC.Types Prelude GHC.Types> :info I# data Int = I# Int# -- Defined in `GHC.Types' Prelude GHC.Types> :t I# I# :: Int# -> Int Prelude GHC.Types> :k Int# Int# :: # Prelude GHC.Types> :k Int Int :: * Prelude GHC.Types> :t I# I# :: Int# -> Int Prelude GHC.Types> :k Int# -> Int Int# -> Int :: * Prelude GHC.Types> :k (->) (->) :: * -> * -> *

Huh, that's odd, how is Int# -> Int a valid type, since Int# is of kind #? Maybe # is a subkind of *...

Prelude GHC.Types> :k Maybe Maybe :: * -> * Prelude GHC.Types> :k Maybe Int# <interactive>:1:7: Expecting a lifted type, but `Int#' is unlifted In a type in a GHCi command: Maybe Int#


So the type constructor -> seems really magic... how is that?

EDIT: It gets weirder... (->) Int# Int# is a valid type, but (->) Int# isn't:

Prelude GHC.Types> :k (->) Int# <interactive>:1:6: Expecting a lifted type, but `Int#' is unlifted In a type in a GHCi command: (->) Int# Prelude GHC.Types> :k (->) Int# Int# (->) Int# Int# :: * submitted by ryani
[link] [6 comments]
Categories: Incoming News

Compiling Setup.hs with -threaded

haskell-cafe - Fri, 01/30/2015 - 12:59am
I found a weird bug at, There is an awesome tool called shellcheck: git clone and we want to run its test suite. Basically: * If you run the test suite using runghc, then it works fine. * If you compile Setup.hs to setup, then the test suite hangs. * But if you compile Setup.hs to setup using -threaded, it works? Normally I would blabber on about what I think the problem could be, but the WTF factor here is just too high. Should we (always?) be compiling Setup.hs with -threaded? Or is this a bug in the runtime or a library or REALITY ITSELF?
Categories: Offsite Discussion

Use Haskell for shell scripting

Haskell on Reddit - Fri, 01/30/2015 - 12:56am
Categories: Incoming News

Gabriel Gonzalez: Use Haskell for shell scripting

Planet Haskell - Fri, 01/30/2015 - 12:19am

Right now dynamic languages are popular in the scripting world, to the dismay of people who prefer statically typed languages for ease of maintenance.

Fortunately, Haskell is an excellent candidate for statically typed scripting for a few reasons:

  • Haskell has lightweight syntax and very little boilerplate
  • Haskell has global type inference, so all type annotations are optional
  • You can type-check and interpret Haskell scripts very rapidly
  • Haskell's function application syntax greatly resembles Bash

However, Haskell has had a poor "out-of-the-box" experience for a while, mainly due to:

  • Poor default types in the Prelude (specifically String and FilePath)
  • Useful scripting utilities being spread over a large number of libraries
  • Insufficient polish or attention to user experience (in my subjective opinion)

To solve this, I'm releasing the turtle library, which provides a slick and comprehensive interface for writing shell-like scripts in Haskell. I've also written a beginner-friendly tutorial targeted at people who don't know any Haskell.


turtle is a reimplementation of the Unix command line environment in Haskell. The best way to explain this is to show what a simple "turtle script" looks like:

#!/usr/bin/env runhaskell

{-# LANGUAGE OverloadedStrings #-}

import Turtle

main = do
cd "/tmp"
mkdir "test"
output "test/foo" "Hello, world!" -- Write "Hello, world!" to "test/foo"
stdout (input "test/foo") -- Stream "test/foo" to stdout
rm "test/foo"
rmdir "test"
sleep 1
die "Urk!"

If you make the above file executable, you can then run the program directly as a script:

$ chmod u+x example.hs
$ ./example.hs
Hello, world!
example.hs: user error (Urk!)

The turtle library renames a lot of existing Haskell utilities to match their Unix counterparts and places them under one import. This lets you reuse your shell scripting knowledge to get up and going quickly.

Shell compatibility

You can easily invoke an external process or shell command using proc or shell:

#!/usr/bin/env runhaskell

{-# LANGUAGE OverloadedStrings #-}

import Turtle

main = do
mkdir "test"
output "test/file.txt" "Hello!"
proc "tar" ["czf", "test.tar.gz", "test"] empty

-- or: shell "tar czf test.tar.gz test" empty

Even people unfamiliar with Haskell will probably understand what the above program does.


"turtle scripts" run on Windows, OS X and Linux. You can either compile scripts as native executables or interpret the scripts if you have the Haskell compiler installed.


You can build or consume streaming sources. For example, here's how you print all descendants of the /usr/lib directory in constant memory:

#!/usr/bin/env runhaskell

{-# LANGUAGE OverloadedStrings #-}

import Turtle

main = view (lstree "/usr/lib")

... and here's how you count the number of descendants:

#!/usr/bin/env runhaskell

{-# LANGUAGE OverloadedStrings #-}

import qualified Control.Foldl as Fold
import Turtle

main = do
n <- fold (lstree "/usr/lib") Fold.length
print n

... and here's how you count the number of lines in all descendant files:

#!/usr/bin/env runhaskell

{-# LANGUAGE OverloadedStrings #-}

import qualified Control.Foldl as Fold
import Turtle

descendantLines = do
file <- lstree "/usr/lib"
True <- liftIO (testfile file)
input file

main = do
n <- fold descendantLines Fold.length
print nException Safety

turtle ensures that all acquired resources are safely released in the face of exceptions. For example, if you acquire a temporary directory or file, turtle will ensure that it's safely deleted afterwards:

example = do
dir <- using (mktempdir "/tmp" "test")
liftIO (die "The temporary directory will still be deleted!")

However, exception safety comes at a price. turtle forces you to consume all streams in their entirety so you can't lazily consume just the initial portion of a stream. This was a tradeoff I chose to keep the API as simple as possible.


turtle supports Patterns, which are like improved regular expressions. Use Patterns as lightweight parsers to extract typed values from unstructured text:

$ ghci
>>> :set -XOverloadedStrings
>>> import Turtle
>>> data Pet = Cat | Dog deriving (Show)
>>> let pet = ("cat" *> return Cat) <|> ("dog" *> return Dog) :: Pattern Pet
>>> match pet "dog"
>>> [Dog]
>>> match (pet `sepBy` ",") "cat,dog,cat"

You can also use Patterns as arguments to commands like sed, grep, find and they do the right thing:

>>> stdout (grep (prefix "c") "cat") -- grep '^c'
>>> stdout (grep (has ("c" <|> "d")) "dog") -- grep 'cat\|dog'
>>> stdout (sed (digit *> return "!") "ABC123") -- sed 's/[[:digit:]]/!/g'

Unlike many Haskell parsers, Patterns are fully backtracking, no exceptions.


turtle supports typed printf-style string formatting:

>>> format ("I take "%d%" "%s%" arguments") 2 "typed"
"I take 2 typed arguments"

turtle even infers the number and types of arguments from the format string:

>>> :type format ("I take "%d%" "%s%" arguments")
format ("I take "%d%" "%s%" arguments") :: Text -> Int -> Text

This uses a simplified version of the Format type from the formatting library. Credit to Chris Done for the great idea.

The reason I didn't reuse the formatting library was that I spent a lot of effort keeping the types as simple as possible to improve error messages and inferred types.

Learn more

turtle doesn't try to ambitiously reinvent shell scripting. Instead, turtle just strives to be a "better Bash". Embedding shell scripts in Haskell gives you the the benefits of easy refactoring and basic sanity checking for your scripts.

You can find the turtle library on Hackage or Github. Also, turtle provides an extensive beginner-friendly tutorial targeted at people who don't know any Haskell at all.

Categories: Offsite Blogs

Haskell with s-expression syntax?

Haskell on Reddit - Thu, 01/29/2015 - 7:59pm

Hello all, I really love the succinct expressiveness Haskell gives me. I especially love the strong typing and high level abstractions Haskell exposes. I do, however, prefer the s-expression syntax of the lisp family. Syntax may seem unimportant, and the macro capabilities are usually boasted as lisp's strong suit (though I feel you don't need them as much with Haskell's higher level abstractions), the syntax is so simple and beautiful IMO.

I would like to make, or find, a lisp with Haskell's strong typing and high level abstractions. Does anyone know of any actively developed projects in this vein? Or maybe the ability to write Haskell directly as an AST?

submitted by briticus557
[link] [22 comments]
Categories: Incoming News

Is it possible to communicate with a Haskell/GHCI REPL in a client + server fashion, like with the Clojure nREPL?

Haskell on Reddit - Thu, 01/29/2015 - 6:30pm

I would like to write a program that throws strings of Haskell code at a process, and have that process evaluate the strings, just as if I were typing into a GHCI. :)

Clojure nREPL =

Cheers, Louis

submitted by MrPopinjay
[link] [7 comments]
Categories: Incoming News

Haskell and Web - Thu, 01/29/2015 - 4:15pm
Categories: Offsite Blogs

Haskell and Web - Thu, 01/29/2015 - 4:15pm
Categories: Offsite Blogs

Robert Harper: Structure and Efficiency of Computer Programs

Planet Haskell - Thu, 01/29/2015 - 3:21pm

For decades my colleague, Guy Blelloch, and I have promoted a grand synthesis of the two “theories” of computer science, combinatorial theory and logical theory.  It is only a small exaggeration to say that these two schools of thought work in isolation.  The combinatorial theorists concern themselves with efficiency, based on hypothetical translations of high-level algorithms to low-level machines, and have no useful theory of composition, the most important tool for developing large software systems.  Logical theorists concern themselves with composition, emphasizing the analysis of the properties of components of systems and how those components are combined; the heart of logic is a theory of composition (entailment).  But relatively scant attention is paid to efficiency, and, to a distressingly large extent, the situation is worsening, and not improving.

Guy and I have argued, through our separate and joint work, for the applicability of PL ideas to algorithms design, leading. for example, to the concept of adaptive programming that Umut Acar has pursued aggressively over the last dozen years.  And we have argued for the importance of cost analysis, for various measures of cost, at the level of the code that one actually writes, and not how it is compiled.  Last spring, prompted by discussions with Anindya Banerjee at NSF in the winter of 2014, I decided to write a position paper on the topic, outlining the scientific opportunities and challenges that would arise in an attempt to unify the two, disparate theories of computing.  I circulated the first draft privately in May, and revised it in July to prepare for a conference call among algorithms and PL researchers (sponsored by NSF) to find common ground and isolate key technical challenges to achieving its goals.

There are serious obstacles to be overcome if a grand synthesis of the “two theories” is to be achieved.  The first step is to get the right people together to discuss the issues and to formulate a unified vision of what are the core problems, and what are promising directions for short- and long-term research.  The position paper is not a proposal for funding, but is rather a proposal for a meeting designed to bring together two largely (but not entirely) disparate communities.  In summer of 2014 NSF hosted a three-hour long conference call among a number of researchers in both areas with a view towards hosting a workshop proposal in the near future.  Please keep an eye out for future developments.

I am grateful to Anindya Banerjee at NSF for initiating the discussion last winter that led to the paper and discussion.  And I am very happy to learn of Swarat’s endorsement; it will go a long way towards helping attract interest from both sides of the great divide.

[Update: word smithing, corrections, updating, removed discussion of cost models for later, fuller treatment]

Filed under: Research Tagged: algorithms, programming languages, research
Categories: Offsite Blogs