News aggregator

Michael Snoyman: Call for new Stackage Curator

Planet Haskell - Sun, 12/18/2016 - 6:00pm

On behalf of the Stackage curators, I'm putting out a call for a volunteer to join the team. The Stackage curators are responsible for the day-to-day management of the Stackage project: merging PRs, managing builds, reporting bounds issues and build or test failures, and every once in a while making a judgement call on when to bump to newer package versions.

The curator team currently consists of Adam Bergmark, Dan Burton, Jens Petersen, and me. The processes we follow are documented in the curator guide. We each rotate shifts, taking one week at a time on duty. Overall, the workload is light and straightforward: usually less than 30 minutes a day.

We're putting out this call for a new volunteer since we've been running as a four-person show for a while now, and we want to make sure our processes are documented well enough for newcomers to come up to speed easily. While we are currently asking for just one new addition to the team, if there are multiple enthusiastic candidates, we will definitely consider expanding the team further.

Joining the curator team is a great way to both contribute to the Haskell ecosystem, and to become more familiar with how things work. You'll also get an opportunity to interact (albeit in brief discussions) with many package maintainers in the Haskell community.

If you're interested in applying, please send a message to the Stackage mailing list or, if you prefer applying privately, send me an email and I'll pass the message along to the rest of the team.

Categories: Offsite Blogs

Edward Z. Yang: The problem of reusable and composable specifications

Planet Haskell - Sat, 12/17/2016 - 4:54am

It's not too hard to convince people that version bounds are a poor approximation for the particular API that we depend on. What do we mean when we say >= 1.0 && < 1.1? A version bound is a proxy for some set of modules and functions, with some particular semantics, that a library needs in order to be built. Version bounds are imprecise; what does a change from 1.0 to 1.1 mean? Clearly, we should instead write down the actual specification (either types or contracts) of what we need.
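As a rough, hypothetical sketch of the gap (the queue package, its bound, and the QueueLike class below are all invented for illustration), the bound is standing in for a handful of declarations we actually rely on:

-- The proxy we write today in a .cabal file:
--
--     build-depends: queue >= 1.0 && < 1.1
--
-- A fragment of what we actually mean by it, rendered here as a class
-- for concreteness (the Backpack signatures later in this post play
-- the same role):
module QueueSpec where

class QueueLike q where
  empty :: q a
  push  :: a -> q a -> q a
  pop   :: q a -> Maybe (q a, a)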

This all sounds like a good idea until you actually try to put it into practice, at which point you realize that version numbers had one great virtue: they're very short. Specifications, on the other hand, can get quite large: even just writing down the types of all the functions you depend on can take pages, let alone executable contracts describing more complex behavior. To make matters worse, the same function will be depended upon repeatedly; the specification must be provided in each case!

So we put on our PL hats and say, "Aha! What we need is a mechanism for reuse and composition of specifications. Something like... a language of specification!" But at this point, there is disagreement about how this language should work.

Specifications are code. If you talk to a Racketeer, they'll say, "Well, contracts are just code, and we know how to reuse and compose code!" You have primitive contracts to describe values, compose them together into contracts that describe functions, and then further compose these together to form contracts about modules. You can collect these contracts into modules and share them across your code.

There is one interesting bootstrapping problem: you're using your contracts to represent versions, but your contracts themselves live in a library, so should you version your contracts? Current thinking is that you shouldn't.

But maybe you shouldn't compose them the usual way. One of the things that stuck out to me when I was reading the frontmatter of Clojure's spec documentation is the principle that map specs should be of keysets only, and how they deal with its consequences.

The core principle of spec's design is that specifications for records should NOT take the form { name: string, age: int }. Instead, the specification is split into two pieces: a set of keys { name, age }, and a mapping from keys to specifications which, once registered, apply to all occurrences of a key in all map specifications. (Note that keys are all namespaced, so it is not some insane free-for-all in a global namespace.) The justification for this:

In Clojure we gain power by dynamically composing, merging and building up maps. We routinely deal with optional and partial data, data produced by unreliable external sources, dynamic queries etc. These maps represent various sets, subsets, intersections and unions of the same keys, and in general ought to have the same semantic for the same key wherever it is used. Defining specifications of every subset/union/intersection, and then redundantly stating the semantic of each key is both an antipattern and unworkable in the most dynamic cases.
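A rough Haskell rendering of that split (a sketch with invented keys, not Clojure itself): the per-key specification is registered exactly once, and a record specification is nothing more than a set of keys.

import           Data.Map (Map)
import qualified Data.Map as Map

-- Each (namespaced) key gets its specification exactly once.
keySpecs :: Map String (String -> Bool)
keySpecs = Map.fromList
  [ ("person/name", not . null)
  , ("person/age",  all (`elem` ['0' .. '9']))
  ]

-- A record spec is just a keyset; the per-key semantics are shared
-- by every record spec that mentions the key.
conformsTo :: [String] -> Map String String -> Bool
conformsTo keys record = all ok keys
  where
    ok k = case (Map.lookup k keySpecs, Map.lookup k record) of
             (Just check, Just value) -> check value
             _                        -> False

main :: IO ()
main = print $ conformsTo ["person/name", "person/age"] $
         Map.fromList [("person/name", "Ada"), ("person/age", "36")]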

Back to the land of types. Contracts can do all this because they are code, and we know how to reuse code. But in (non-dependently) typed languages, the language of types tends to be far more impoverished than the language of values. To take Backpack as an (unusually expressive) example, the only operations we can perform on signatures are to define them (with full definitions for types) and to merge them together. So Backpack signatures run headlong into the redundancy problem identified by spec: because the signature of a module includes the signatures of its functions, you end up having to repeat these function signatures whenever you write slightly different iterations of a module.

To adopt the Clojure model, you would have to write a separate signature per module (each in their own package), and then have users combine them together by adding a build-depends on every signature they wanted to use:

-- In Queue-push package
signature Queue where
    data Queue a
    push :: a -> Queue a -> Queue a

-- In Queue-pop package
signature Queue where
    data Queue a
    pop :: Queue a -> Maybe (Queue a, a)

-- In Queue-length package
signature Queue where
    data Queue a
    length :: Queue a -> Int

-- Putting them together (note that Queue is defined
-- in each signature; mix-in linking merges these
-- abstract data types together)
build-depends: Queue-push, Queue-pop, Queue-length

In our current implementation of Backpack, this is kind of insane: to write the specification for a module with a hundred methods, you'd need a hundred packages. The ability to concisely define multiple public libraries in a single package might help but this involves design that doesn't exist yet. (Perhaps the cure is worse than the disease. The package manager-compiler stratification rears its ugly head again!) (Note to self: signature packages ought to be treated specially; they really shouldn't be built when you instantiate them.)

Conclusions. A lot of my thinking here did not crystallize until I started reading about how dynamic languages like Clojure were grappling with the specification problem: I think this just goes to show how much we can learn by paying attention to other systems, even if their context is quite different. (If Clojure believed in data abstraction, I think they could learn a thing or two from how Backpack mix-in links abstract data declarations.)

In Clojure, the inability to reuse specs is a deal breaker, which led them to spec's current design. In Haskell, the inability to reuse type signatures flirts on the edge of unusability: types are just short enough and copy-pasteable enough to be tolerable. Documentation for these types, less so; this is what led me down my search for better mechanisms for signature reuse.

Although Backpack's current design is "good enough" to get things done, I still wonder if we can't do something better. One tempting option is to allow for downstream signatures to selectively pick out certain functions from a larger signature file to add to their requirements. But if you require Queue.push, you had better also require Queue.Queue (without which, the type of push cannot even be stated: the avoidance problem); this could lead to a great deal of mystery as to what exactly is required in the end. Food for thought.

Categories: Offsite Blogs

Edward Z. Yang: Thoughts about Spec-ulation (Rich Hickey)

Planet Haskell - Fri, 12/16/2016 - 6:26pm

Rich Hickey recently gave a keynote at Clojure/conj 2016, meditating on the problems of versioning, specification and backwards compatibility in language ecosystems. In it, Rich considers the "extremist" view: what if we built a language ecosystem where you never, ever broke backwards compatibility?

A large portion of the talk is spent grappling with the ramifications of this perspective. For example:

  1. Suppose you want to make a backwards-compatibility breaking change to a function. Don't mutate the function, Rich says; give the function another name.
  2. OK, but how about if there is some systematic change you need to apply to many functions? That's still not an excuse: create a new namespace, and put all the functions there.
  3. What if there's a function you really don't like, and you really want to get rid of it? No, don't remove it, create a new namespace with that function absent.
  4. Does this sound like a lot of work to remove things? Yeah. So don't remove things!

In general, Rich wants us to avoid breakage by turning all changes into accretion, where the old and new can coexist. "We need to bring functional programming [immutability] to the library ecosystem," he says, "dependency hell is just mutability hell." And to do this, there need to be tools for you to make a commitment to what it is that a library provides and requires, and not accidentally breaking this commitment when you release new versions of your software.
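As a minimal Haskell sketch of accretion (module and function names invented): the old name keeps its old meaning forever, and new behaviour arrives under a new name beside it rather than by mutating the old one.

module Acme.Parser
  ( parseConfig     -- the original, kept unchanged forever
  , parseConfigV2   -- revised behaviour accreted under a new name
  ) where

-- The original function: never altered, never removed.
parseConfig :: String -> [(String, String)]
parseConfig = map (break (== '=')) . lines

-- The revision lives alongside it: it also strips the '=' and
-- rejects malformed lines instead of passing them through.
parseConfigV2 :: String -> Either String [(String, String)]
parseConfigV2 = traverse entry . lines
  where
    entry l = case break (== '=') l of
                (k, '=' : v) -> Right (k, v)
                _            -> Left ("malformed line: " ++ l)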

He says a lot more in the talk, so I encourage you to give it a watch if you want to hear the whole picture.

In general, I'm in favor of this line of thinking, because my feeling is that a large amount of the breakage associated with software change is just a product of negligence; breakage not for any good reason, breakage that could have been avoided if there was a little more help from tooling.

That being said, I do have some thoughts about topics that are not so prominently featured in his talk.

Accretion is not a silver bullet... if you believe in data hiding. In his talk, Rich implies that backwards compatibility can be maintained simply by committing not to "remove things". As a Haskeller, this sounds obviously false to me: if I change the internal representation of some abstract type (or even the internal invariants), I cannot just load up both old and new copies of the library and expect to pass values of this type between the two. Indeed, the typechecker won't even let you do this even if the representation hasn't changed.
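A single-file Haskell sketch of that point (the stack type and both representations are invented): two versions of the same abstract type that differ only in their hidden representation are simply different types, so values cannot flow between them even though the visible API is identical.

import qualified Data.Sequence as Seq

-- "Version 1" of an abstract stack: list-backed.
newtype StackV1 a = StackV1 [a]

pushV1 :: a -> StackV1 a -> StackV1 a
pushV1 x (StackV1 xs) = StackV1 (x : xs)

-- "Version 2" keeps the same visible interface, but the hidden
-- representation is now a Seq.
newtype StackV2 a = StackV2 (Seq.Seq a)

pushV2 :: a -> StackV2 a -> StackV2 a
pushV2 x (StackV2 xs) = StackV2 (x Seq.<| xs)

main :: IO ()
main = do
  let s1 = pushV1 'a' (StackV1 [])
  -- pushV2 'b' s1   -- rejected: StackV1 Char is not StackV2 Char
  case s1 of StackV1 xs -> print xs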

But, at least for Clojure, I think Rich is right. The reason is this: Clojure doesn't believe in data hiding! The prevailing style of Clojure code is that data types consist of immutable records with public fields that are passed around. And so a change to the representation of the data is possibly a breaking change; non-breaking representation changes are simply not done. (I suspect a similar ethos explains why duplicated dependencies in node.js work as well as they do.)

I am not sure how I feel about this. I am personally a big believer in data abstraction, but I often admire the pragmatics of "everything is a map". (I tweeted about this earlier today, which provoked some thoughtful discussion.)

Harmful APIs. At several points in the talk, Rich makes fun of developers who are obsessed with taking away features from their users. ("I hate this function. I hate it, I hate it, I hate that people call it, I just want it out of my life.") This downplays the very real, very important reasons why infinite backwards compatibility has been harmful to the software we write today.

One need look no further than the systems with decades of backwards compatibility that Rich cites: the Unix APIs, Java and HTML. In all these cases, backwards compatibility has led to harmful APIs sticking around far longer than they should: strncpy, gets, legacy parsers of HTML (XSS), Java antipatterns, etc. And there are examples galore in Android, C libraries, everywhere.

In my opinion, library authors should design APIs in such a way that it is easy to do the right thing, and hard to do the wrong thing. And yes, sometimes that means you need to stop people from using insecure or easy-to-get-wrong library calls.

Semantic versioning doesn't cause cascading version bumps; lack of version ranges is the cause. In the slide "Do deps force Versioning?", Rich describes a problem in the Clojure ecosystem, which is that, when following semantic versioning, a new release of a package often causes cascading version bumps in the system.

While the problem of cascading version bumps is a real question that applies to semantic versioning in general, the "cascading version bumps" Rich is referring to in the Clojure ecosystem stem from a much more mundane source: best practice is to specify an exact version of a dependency in your package metadata. When a new version of a dependency comes out, you need to bump the version of your package so that you can update the recorded version of the dependency... and so forth.

I'm not saying that Clojure is wrong for doing things this way (version ranges have their own challenges), but in his talk Rich implies that this is a failure of semantic versioning... which it's not. If you use version ranges and aren't in the habit of reexporting APIs from your dependencies, updating the version range of a dependency is not a breaking change. If you have a solver that picks a single copy of a library for the entire application, you can even expose types from your dependency in your API.
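For a concrete contrast in Cabal syntax (the foo dependency and its versions are hypothetical): pinning forces a metadata change, and therefore a new release, every time the dependency moves, while a range lets new dependency versions land without touching this package at all.

-- Pinned: every release of foo forces a release of this package too.
build-depends: foo == 1.2.0

-- Ranged: new foo versions inside the range require no change here.
build-depends: foo >= 1.2 && < 1.4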

Overall, I am glad that Clojure is thinking about how to put backwards compatibility first and foremost: often, it is in the most extreme applications of a principle that we learn the most. Is it the end of the story? No; but I hope that all languages continue slowly moving towards explicit specifications and tooling to help you live up to your promises.

Categories: Offsite Blogs

Stroustrup's Rule and Layering Over Time

Lambda the Ultimate - Fri, 12/16/2016 - 2:25am

Dave Herman is the voice of the oppressed: syntax is important, contrary to what you have been told!

To illustrate he discusses what he calls Stroustrup's Rule:

  • For new features, people insist on LOUD explicit syntax.
  • For established features, people want terse notation.

Categories: Offsite Discussion

Do Be Do Be Do

Lambda the Ultimate - Thu, 12/15/2016 - 3:20pm

Monads and algebraic effects are general concepts that give a definition of what a "side-effect" can be: an instance of monad, or an instance of algebraic effect, is a specific realization of a side-effect. While most programming languages provide a fixed family of built-in side-effects, monads or algebraic effects give a structured way to introduce a new notion of effect as a library.

A recent avenue of programming language research is how to locally define several realizations of the same effect interface/signature. There may be several valid notions of "state" or "non-determinism" or "probabilistic choice", and different parts of a program may not require the same realization of those -- a typical use-case would be mocking an interaction with the outside world, for example. Can we let users locally define a new interpretation of an effect, or write code that is generic over the specific interpretation? There are several existing approaches, such as monad transformer stacks, free monad interpreters, monad reification and, lately, effect handlers, as proposed in the programming language Eff.
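As a rough Haskell illustration of "several interpretations of one effect signature" (a hand-rolled free-monad sketch with an invented Teletype interface; this is not Frank, nor any particular effects library): the same program can be run against the real world or against a pure mock.

{-# LANGUAGE DeriveFunctor #-}

-- A tiny effect signature.
data Teletype k
  = GetLine (String -> k)
  | PutLine String k
  deriving Functor

-- A minimal free monad over that signature.
data Free f a = Pure a | Free (f (Free f a))

instance Functor f => Functor (Free f) where
  fmap f (Pure a) = Pure (f a)
  fmap f (Free g) = Free (fmap (fmap f) g)
instance Functor f => Applicative (Free f) where
  pure = Pure
  Pure f <*> x = fmap f x
  Free g <*> x = Free (fmap (<*> x) g)
instance Functor f => Monad (Free f) where
  Pure a >>= f = f a
  Free g >>= f = Free (fmap (>>= f) g)

getL :: Free Teletype String
getL = Free (GetLine Pure)

putL :: String -> Free Teletype ()
putL s = Free (PutLine s (Pure ()))

echo :: Free Teletype ()
echo = getL >>= putL

-- Interpretation 1: real I/O.
runIO :: Free Teletype a -> IO a
runIO (Pure a)             = pure a
runIO (Free (GetLine k))   = getLine >>= runIO . k
runIO (Free (PutLine s k)) = putStrLn s >> runIO k

-- Interpretation 2: a pure mock, fed from a list of input lines and
-- collecting the output lines.
runMock :: [String] -> Free Teletype a -> (a, [String])
runMock _   (Pure a)             = (a, [])
runMock ins (Free (GetLine k))   = case ins of
  (i : rest) -> runMock rest (k i)
  []         -> runMock [] (k "")
runMock ins (Free (PutLine s k)) =
  let (a, outs) = runMock ins k in (a, s : outs)

main :: IO ()
main = print (runMock ["hello"] echo)  -- ((),["hello"]); runIO echo would really echo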

Frank, presented in the publication below, is a new language with user-defined effects that makes effect handling a natural part of basic functional programming, instead of a separate, advanced feature. It is a significant advance in language design, simplifying effectful programming. Functions, called operators, do not just start computing a result from the value of their arguments; they interact with the computation of those arguments by having the opportunity to handle any side-effects arising during their evaluation. It feels like a new programming style, a new calling convention that blends call-by-value and effect handling -- Sam Lindley suggested the name call-by-handling.

Frank also proposes a new type-and-effect system that corresponds to this new programming style. Operators handle some of the effects raised by their arguments, and silently forward the rest to the computation context; their argument types indicate which effects they handle. In other words, the static information carried by an argument's type is the delta/increment between the effects permitted by the ambient computation and the effects of evaluating this argument. Frank calls this an adjustment over the ambient ability. This delta/increment style results in lightweight types for operators that can be used in different contexts (a form of effect polymorphism) without explicit quantification over effect variables. This design takes part in a research conversation on how to make type-and-effect systems usable, which is the major roadblock for their wider adoption -- Koka and Links are also promising in that regard, and had to explore elaborate conventions to elide their polymorphic variables.

Another important part of Frank's type-system design is the explicit separation between values that are and computations that do. Theoretical works have long made this distinction (for example with Call-By-Push-Value), but programmers are offered the dichotomy of either having only effectful expressions or expressing all computations as values (Haskell's indirect style). Frank puts that distinction in the hands of the user -- this is different from distinguishing pure from impure computations, as done in F* or WhyML.

If you wish to play with the language, a prototype implementation is available.


Do Be Do Be Do (arXiv)
Sam Lindley, Conor McBride, Craig McLaughlin
2017

We explore the design and implementation of Frank, a strict functional programming language with a bidirectional effect type system designed from the ground up around a novel variant of Plotkin and Pretnar’s effect handler abstraction.

Effect handlers provide an abstraction for modular effectful programming: a handler acts as an interpreter for a collection of commands whose interfaces are statically tracked by the type system. However, Frank eliminates the need for an additional effect handling construct by generalising the basic mechanism of functional abstraction itself. A function is simply the special case of a Frank operator that interprets no commands. Moreover, Frank’s operators can be multihandlers which simultaneously interpret commands from several sources at once, without disturbing the direct style of functional programming with values.

Effect typing in Frank employs a novel form of effect polymorphism which avoids mentioning effect variables in source code. This is achieved by propagating an ambient ability inwards, rather than accumulating unions of potential effects outwards.

We introduce Frank by example, and then give a formal account of the Frank type system and its semantics. We introduce Core Frank by elaborating Frank operators into functions, case expressions, and unary handlers, and then give a sound small-step operational semantics for Core Frank.

Programming with effects and handlers is in its infancy. We contribute an exploration of future possibilities, particularly in combination with other forms of rich type system.

Categories: Offsite Discussion

Philip Wadler: Option A vs B: Decision Time

Planet Haskell - Thu, 12/15/2016 - 1:35pm

Tomorrow Edinburgh City Council will decide between Options A and B for the East-West Cycle route, after deferring a decision last September.  Some of the recent coverage:
  • A visualisation of Roseburn Option A (above).
  • A comparison of road layout, current against Option A (below).
  • A letter in the Edinburgh Evening News from East-West nemesis Pete Gregson.
  • A letter from Transport Committee head Leslie Hinds, rebutting the previous letter.
  • A two-page spread in the Edinburgh Evening News.
  • A blog post from Daisy Narayanan of Sustrans.
Daisy's post hit the mark:
We have seen narratives that create an ‘us’ and ‘them’ – pitting ‘motorists’ against ‘cyclists’ against ‘pedestrians’. With such projects, it is hugely disheartening to see what should have been a force for positive change become a focus for anger. It is equally disheartening to see strong evidence and the policies of the Scottish Government which support a more active, greener Scotland being undermined by such opposition.

In darker moments, I have been tempted to draw parallels to the post-fact world that we seem to inhabit at present. Previously:
  • Option A vs B: Kicked into the long grass
  • Option A: Think about the children
  • Roseburn to Leith Walk A vs B: time to act!
  • Ride the Route in support of Option A
Categories: Offsite Blogs

Mark Jason Dominus: Let's decipher a thousand-year-old magic square

Planet Haskell - Thu, 12/15/2016 - 3:36am

The Parshvanatha temple in Madhya Pradesh, India was built around 1,050 years ago. Carved at its entrance is this magic square:

The digit signs have changed in the past thousand years, but it's a quick and fun puzzle to figure out what they mean using only the information that this is, in fact, a magic square.

A solution follows. No peeking until you've tried it yourself!

There are 9 one-digit entries and 7 two-digit entries, so we can guess that the entries are the numbers 1 through 16, as is usual, and that the magic sum is 34. One of the digit signs appears in the tens position of every two-digit entry, so it must be 1. The sign in the units position of one of the two-digit entries never appears on its own, so it must be 0 and that entry must be 10; any nonzero digit would also appear somewhere as a one-digit entry, as the units signs of the other two-digit entries all do.

One of the signs is easily misread as a modern 4, and it is tempting to imagine that it stands for 4. But we can see it's not so. Call that sign a; it appears on its own in the rightmost column, whose other entries are a two-digit entry (call its units sign b), the entry 11, and the sign b on its own. Adding up the rightmost column, we get

(10 + b) + 11 + a + b = 34,

so 2b + a = 13 and a must be an odd number. We know a isn't 1 (the tens sign is 1), and it can't be 7 or 9, because a also appears in the units position of an entry in the bottom row and there is no 17 or 19. So a must be 3 or 5.

Now if a were 3, that bottom-row entry would be 13, and the third column, which contains 1, 10, that entry, and one more sign c on its own, would give

1 + c + 10 + 13 = 34,

and then c would be 10, which is too big. So a must be 5, and this means that b is 4 and c is 8. (c appears only as a single-digit numeral, which is consistent with it being 8.)

The top row consists of a sign d on its own, a two-digit entry with units sign e, the entry 1, and the entry 14:

d + (10 + e) + 1 + 14 = 34,

so that d + e = 9. The sign d never appears inside a two-digit entry, so it must be 7, 8 or 9 (the digits 0 through 6 all occur as units digits of the entries 10 through 16); we have already used 8, and 9 is too big, since it would make e the zero sign, which appears only in 10. So d must be 7, and then e is 2.

Exactly one sign now remains that appears only as a single-digit numeral, and since 7 and 8 are already accounted for, it must be 9. The leftmost column then tells us that its one remaining entry is 16, and the last two entries are easily discovered to be 13 and 3. The decoded square is:

 7  12   1  14
 2  13   8  11
16   3  10   5
 9   6  15   4

I like that people look at the right-hand column and immediately see 18 + 11 + 4 + 8 but it's actually 14 + 11 + 5 + 4.

This is an extra-special magic square: not only do the ten rows, columns, and diagonals all add up to 34, so do all the four-cell subsquares, so do any four squares arranged symmetrically about the center, and so do all the broken diagonals that you get by wrapping around at the edges.

[ Addendum: It has come to my attention that the digit symbols in the magic square are not too different from the current forms of the digit symbols in the Gujarati script. ]

[ Addendum 20161217: The temple is not very close to Gujarat or to the area in which Gujarati is common, so I guess that the digit symbols in Indian languages have evolved in the past thousand years, with the Gujarati versions remaining closest to the ancient forms, or else perhaps Gujarati was spoken more widely a thousand years ago. I would be interested to hear about this from someone who knows. ]

Categories: Offsite Blogs

Philip Wadler: Roseburn to Leith Walk A vs B: time to act!

Planet Haskell - Wed, 12/14/2016 - 10:16am
On 2 August, I attended a meeting in Roseburn organised by those opposed to the new cycleway planned by the city. Local shopkeepers fear they will see a reduction in business, unaware this is a common cycling fallacy: study after study has shown that adding cycleways increases business, not the reverse, because pedestrians and cyclists find the area more attractive.

Feelings in Roseburn run strong. The locals don't trust the council: who can blame them after the fiasco over trams? But the leaders of the campaign are adept at cherry picking statistics, and, sadly, neither side was listening to the other.

On 30 August, the Edinburgh Council Transport and Environment Committee will decide between two options for the cycle route, A and B. Route A is direct. Route B goes round the houses, adding substantial time and rendering the whole route less attractive. If B is built, the opportunity to shift the area away from cars, to make it a more pleasant place to be and draw more business from those travelling by foot, bus, and cycle, goes out the window.

Locals like neither A nor B, but in a spirit of compromise the Transport and Environment Committee may opt for B. This will be a disaster, as route B will be far less likely to draw people out of their cars and onto their cycles, undermining Edinburgh's ambitious programme to attract more people to cycling before it even gets off the ground.

Investing in cycling infrastructure can make an enormous difference. Scotland suffers 2000 deaths per year due to pollution, and 2500 deaths per year due to inactivity. The original proposal for the cycleway estimates benefits of £14.5M over ten years (largely from improved health of those attracted to cycling) vs a cost of £5.7M, a staggering 2.5x return on investment. Katie Cycles to School is a brilliant video from Pedal on Parliament that drives home how investment in cycling will improve lives for cyclists and non-cyclists alike.

Want more detail? Much has been written on the issues.
  Roseburn Cycle Route: Evidence-based local community support.
  Conviction Needed.

The Transport Committee will need determination to carry the plan through to a successful conclusion. This is make or break: will Edinburgh be a city for cars or a city for people? Please write to your councillors and the transport and environment committee to let them know your views.


Previously
Roseburn to Leith Walk Cycleway: A vs B
Roseburn to Leith Walk Cycleway: the website
Roseburn to Leith Walk

Subsequently:
Ride the Route in support of Option A
Option A: Think about the children
Categories: Offsite Blogs

Michael Snoyman: An extra benefit of open sourcing

Planet Haskell - Mon, 12/12/2016 - 6:00pm

This isn't any deep thought, and I'm sure others have mentioned it before. But I haven't seen it called out explicitly, so I thought it was worth getting it down.

Recently I was working on a customer project which required a specific feature (generate a Docker image with some libraries precompiled into it). I'll probably blog more about the specific case later, and give credit at that time to the company that gave permission for the code to be open sourced.

It turns out this is a problem that various FP Complete engineers have solved for customers (and internal purposes) a few times already. Creating a single open-source tool that can be shared among projects is a clear win, and plays to all the strengths of open source software. (And in this case, the initial version was really simple to implement, so it was almost a no brainer.)

Not long after I released that first version, I needed to update some Docker image build code for a different customer, who until now had been using a custom solution. So I moved them over to the new tool, added some features that they needed, and got to the goal of a working Docker image quicker than expected. Yay code sharing! And now others can take advantage of this work, and contribute patches that both projects using it will be able to take advantage of.

However, these are all the standard benefits of open sourcing. In this process, I rediscovered something I've seen happen multiple times:

When you're forced to separate out a tool or library, you create a more general solution, and make your code more maintainable in the long run.

When you write a "throw-away" tool or an "internal" library, odds are you won't think very hard about an extensible interface. You may embed assumptions into its design. And then the code will sit in a closed-source codebase for months or years, likely without anyone touching it in the interim. When it turns out one of your assumptions was wrong, or the interface needs to be extended, it's often much harder than updating a general purpose tool or library.

That's not to say that everything that can be generalized should be generalized and open sourced. There are some things which are so specific to a project that it's unlikely that any other use case will exist. Or the cognitive overhead of figuring out a good interface is simply not warranted.

But for those sweet-spot cases where the overhead of doing something general isn't too high, where you have the prerogative to open source, and where there's likely at least one other person or project in the world who can use it, you'll often thank yourself in the future for having taken the time to open source it.

Categories: Offsite Blogs

Neil Mitchell: New XML Parser, Hexml

Planet Haskell - Mon, 12/12/2016 - 4:45pm

Summary: I've released a new Haskell library, Hexml, which is an incomplete-but-fast XML parser.

I've just released Hexml, a new C/Haskell library for DOM-style XML parsing that is fast, but incomplete. To unpack that a bit:

  • Hexml is an XML parser: you give it a string representing an XML document, it parses that string, and it returns either a parse error or a representation of that document. Once you have the document, you can get the child nodes/attributes, walk around the document, and extract the text.

  • Hexml is really a C library, which has been designed to be easy to wrap in Haskell, and then a Haskell wrapper on top. It should be easy to use Hexml directly from C if desired.

  • Hexml has been designed for speed. In the very limited benchmarks I've done it is typically just over 2x faster at parsing than Pugixml, where Pugixml is the gold standard for fast XML DOM parsers. In my uses it has turned XML parsing from a bottleneck to an irrelevance, so it works for me.

  • To gain that speed, Hexml cheats. Primarily it doesn't do entity expansion, so &amp; remains as &amp; in the output. It also doesn't handle CData sections (but that's because I'm lazy) and comment locations are not remembered. It also doesn't deal with most of the XML standard, ignoring the DOCTYPE stuff.

If you want a more robust version of Hexml then the Haskell pugixml binding on Hackage is a reasonable place to start, but be warned that it has memory issues that can cause segfaults. It also requires C++, which makes use through GHCi more challenging.

Speed techniques

To make Hexml fast I first read the chapter on fast parsing with Pugixml, and stole all those techniques. After that, I introduced a number of my own.

  • I only work on UTF8, which for the bits of UTF8 I care about, is the same as ASCII - I don't need to do any character decoding.

  • Since I don't do entity expansion, all strings are available in the source document, so everything simply provides offsets into the input string. In the Haskell API I use constant-time bytestring slices into the source string to present a nice API (see the sketch after this list).

  • The memory model for a document is an array of attributes, an array of nodes, and a root node from the list of nodes. To make sure that scanning a document is fast, each node describes their attributes and direct child nodes in terms of a start and length within the attribute and node arrays. For example, the root node might have attributes 1..5 in the attribute array, and direct children 4..19 in the node array. When scanning the child nodes there are no linked-list operations and everything is cache friendly.

  • To keep the memory compact for attributes, I just have an array and reallocate/copy as necessary. By always doubling the number of attributes on exhaustion I ensure a worst-case of 1-copy per attribute on average.

  • To keep the memory compact for nodes is a bit more complex, as the direct child nodes are not necessarily allocated consecutively, as child nodes may themselves have child nodes. The solution is to have an array of nodes, with contiguous allocation of used child nodes starting at the beginning. To ensure the child nodes are contiguous I first put the nodes at the end of the array, then copy them after a child is complete -- in effect using the end of the array as a stack. By always doubling the number of nodes on exhaustion I ensure a worst-case of 2-copies per node on average.

  • When parsing the text in the body of a document, since I don't care about &, the only character that is of any interest is <. That allows me to process much of the document with the highly-optimised memchr.

  • I initially allocate a single buffer that contains the document, a small number of attributes and a small number of nodes, in a single call to malloc. If more attributes/nodes are required they allocate a fresh buffer and just ignore the initially provided one. That ensures that for small documents they don't pay for multiple malloc calls, at the cost of wasting the initial attribute/node allocation on larger documents (which are more memory heavy anyway - so it doesn't matter).

  • I'm pretty sure Hexml could be optimised further. Specifically, I have a recursive descent parser, and it should be a single function with goto. I also process some characters multiple times, mostly to ensure predictable abstraction barriers around the parsing functions, but that could be eliminated with a goto-based approach.
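A small Haskell sketch of the offset/length and memchr points above (illustrative only; this is not Hexml's actual code): slicing a strict bytestring is O(1) because take and drop only adjust offset and length fields, and elemIndex is implemented with memchr.

import qualified Data.ByteString.Char8 as BS

-- A node or attribute can be described purely by where its text
-- lives in the original document.
data Span = Span { spanOffset :: Int, spanLength :: Int }

-- O(1): no bytes are copied, only the offset/length are adjusted.
sliceOf :: BS.ByteString -> Span -> BS.ByteString
sliceOf doc (Span off len) = BS.take len (BS.drop off doc)

-- Scanning body text for the next '<' is a single memchr call.
nextTagStart :: BS.ByteString -> Maybe Int
nextTagStart = BS.elemIndex '<'

main :: IO ()
main = do
  let doc = BS.pack "<greeting>hello world</greeting>"
  print (sliceOf doc (Span 10 11))       -- "hello world"
  print (nextTagStart (BS.drop 1 doc))   -- index of the closing tag's '<'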

Categories: Offsite Blogs

Neil Mitchell: Installing the Haskell Network library on Windows

Planet Haskell - Mon, 12/12/2016 - 4:29pm

Summary: This post describes how to install the Haskell network library on Windows, again.

I recently bought a new computer, and tried to install GHC 8.0.1 then upgrade the network library using Cabal. As I have come to expect, it didn't work. Using Git Bash, I got the error:

$ cabal install network-2.6.3.1
Resolving dependencies...
Configuring network-2.6.3.1...
Failed to install network-2.6.3.1
Build log ( C:\Users\Neil\AppData\Roaming\cabal\logs\network-2.6.3.1.log ):
Configuring network-2.6.3.1...
configure: WARNING: unrecognized options: --with-compiler
checking for gcc... C:\ghc\GHC-80~1.1┼║
checking whether the C compiler works... no
configure: error: in `C:/Neil':
configure: error: C compiler cannot create executables
See `config.log' for more details
cabal: Leaving directory '.'
cabal.exe: Error: some packages failed to install:
old-time-1.1.0.3 failed during the configure step. The exception was:
ExitFailure 77

Running -v3 shows the CC variable is being set to C:\ghc\GHC-80~1.1┼║, which looks like a buffer corruption or encoding issue. I tried my previous solution, but it didn't work. My new solution is:

$ cabal unpack network-2.6.3.1
$ cd network-2.6.3.1
$ cabal configure
... fails with a similar error to above ...
$ sh ./configure
$ cabal build
$ cabal copy
$ cabal register

I had to repeat the same pattern for the latest version of old-time, and it worked there too.

Another way that works is to use Stack.
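For example, something like this works from any directory (Stack ships its own MSYS2 toolchain on Windows, which generally lets configure-based packages such as network build; adjust the package or resolver to taste):

$ stack build network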


Categories: Offsite Blogs

Mark Jason Dominus: Another Git catastrophe cleaned up

Planet Haskell - Mon, 12/12/2016 - 2:45pm

My co-worker X had been collaborating with a front-end designer on a very large change, consisting of about 406 commits in total. The sum of the changes was to add 18 new files of code to implement the back end of the new system, and also to implement the front end, a multitude of additions to both new and already-existing files. Some of the 406 commits modified just the 18 back-end files, some modified just the front-end files, and many modified both.

X decided to merge and deploy just the back-end changes, and then, once that was done and appeared successful, to merge the remaining front-end changes.

His path to merging the back-end changes was unorthodox: he checked out the current master, and then, knowing that the back-end changes were isolated in 18 entirely new files, did

git checkout topic-branch -- new-file-1 new-file-2 … new-file-18

He then added the 18 files to the repo, committed them, and published the resulting commit on master. In due course this was deployed to production without incident.

The next day he wanted to go ahead and merge the front-end changes, but he found himself in “a bit of a pickle”. The merge didn't go forward cleanly, perhaps because of other changes that had been made to master in the meantime. And trying to rebase the branch onto the new master was a complete failure. Many of those 406 commits included various edits to the 18 back-end files that no longer made sense now that the finished versions of those files were in the master branch he was trying to rebase onto.

So the problem is: how to land the rest of the changes in those 406 commits, preferably without losing the commit history and messages.

The easiest strategy in a case like this is usually to go back in time: if the problem was caused by the unorthodox checkout-add-commit, then reset master to the point before that happened and try doing it a different way. That strategy wasn't available because X had already published the master with his back-end files, and a hundred other programmers had copies of them.

The way I eventually proceeded was to rebase the 406-commit work branch onto the current master, but to tell Git meantime that conflicts in the 18 back-end files should be ignored, because the version of those files on the master branch was already perfect.

Merge drivers

There's no direct way to tell Git to ignore merge conflicts in exactly 18 files, but there is a hack you can use to get the same effect. The repo can contain a .gitattributes file that lets you specify certain per-file options. For example, you can use .gitattributes to say that the files in a certain directory are text, that when they are checked out the line terminators should be converted to whatever the local machine's line terminator convention is, and they should be converted back to NLs when changes are committed.

Some of the per-file attributes control how merge conflicts are resolved. We were already using this feature for a certain frequently-edited file that was a list of processes to be performed in a certain order:

do A
then do B

Often different people would simultaneously add different lines to the end of this file:

# Person X's change:
do A
then do B
then do X

# Person Y's change:
do A
then do B
then do Y

X would land their version on master and later there would be a conflict when Y tried to land their own version:

do A
then do B
<<<<<<<<
then do X
--------
then do Y
>>>>>>>>

Git was confused: did you want new line X or new line Y at the end of the file, or both, and if both then in what order? But the answer was always the same: we wanted both, X and then Y, in that order:

do A
then do B
then do X
then do Y

With the merge attribute set to union for this file, Git automatically chooses the correct resolution.

So, returning to our pickle, I wanted to set the merge attribute for the 18 back-end files to tell Git to always choose the version already in master, and always ignore the changes from the branch I was merging.

There is not exactly a way to do this, but the mechanism that is provided is extremely general, and it is not hard to get it to do what we want in this case.

The merge attribute in .gitattributes specifies the name of a “driver” that resolves merge conflicts. The driver can be one of a few built-in drivers, such as the union driver I just described, or it can be the name of a user-supplied driver, configured in .gitconfig. The first step is to use .gitattributes to tell Git to use our private, special-purpose driver for the 18 back-end files:

new-file-1 merge=ours
new-file-2 merge=ours
…
new-file-18 merge=ours

(The name ours here is completely arbitrary. I chose it because its function was analogous to the -s ours and -X ours options of git-merge.)

Then we add a section to .gitconfig to say what the ours driver should do:

[merge "ours"] name = always prefer our version to the one being merged driver = true

The name is just a human-readable description and is ignored by Git. The important part is the deceptively simple-appearing driver = true line. The driver is actually a command that is run when there is a merge conflict. The command is run with the names of three files containing different versions of the target file: the main file being merged into, and temporary files containing the version with the conflicting changes and the common ancestor of the first two files. It is the job of the driver command to examine the three files, figure out how to resolve the conflict, and modify the main file appropriately.

In this case merging the two or three versions of the file is very simple. The main version is the one on the master branch, already perfect. The proposed changes are superfluous, and we want to ignore them. To modify the main file appropriately, our merge driver command needs to do exactly nothing. Unix helpfully provides a command that does exactly nothing, called true, so that's what we tell Git to use to resolve merge conflicts.

With this configured, and the changes to .gitattributes checked in, I was able to rebase the 406-commit topic branch onto the current master. There were some minor issues to work around, so it was not quite routine, but the problem was basically solved and it wasn't a giant pain.

I didn't actually use git-rebase

I should confess that I didn't actually use git-rebase at this point; I did it semi-manually, by generating a list of commit IDs and then running a loop that cherry-picked them one at a time:

tac /tmp/commit-ids |
  while read commit; do
    git cherry-pick $commit || break
  done

I don't remember why I thought this would be a better idea than just using git-rebase, which is basically the same thing. (Superstitious anxiety, perhaps.) But I think the process and the result were pretty much the same. The main drawback of my approach is that if one of the cherry-picks fails, and the loop exits prematurely, you have to hand-edit the commit-ids file before you restart the loop, to remove the commits that were successfully picked.

Also, it didn't work on the first try

My first try at the rebase didn't quite work. The merge driver was working fine, but some commits that it wanted to merge modified only the 18 back-end files and nothing else. Then there were merge conflicts, which the merge driver said to ignore, so that the net effect of the merged commit was to do nothing. But git-rebase considers that an error, says something like

The previous cherry-pick is now empty, possibly due to conflict resolution.
If you wish to commit it anyway, use:

    git commit --allow-empty

and stops and waits for manual confirmation. Since 140 of the 406 commits modified only the 18 perfect files I was going to have to intervene manually 140 times.

I wanted an option that told git-cherry-pick that empty commits were okay and just to ignore them entirely, but that option isn't in there. There is something almost as good though; you can supply --keep-redundant-commits and instead of failing it will go ahead and create commits that make no changes. So I ended up with a branch with 406 commits of which 140 were empty. Then a second git-rebase eliminated them, because the default behavior of git-rebase is to discard empty commits. I would have needed that final rebase anyway, because I had to throw away the extra commit I added at the beginning to check in the changes to the .gitattributes file.

A few conflicts remained

There were three or four remaining conflicts during the giant rebase, all resulting from the following situation: Some of the back-end files were created under different names, edited, and later moved into their final positions. The commits that renamed them had unresolvable conflicts: the commit said to rename A to B, but to Git's surprise B already existed with different contents. Git quite properly refused to resolve these itself. I handled each of these cases manually by deleting A.

I made this up as I went along

I don't want anyone to think that I already had all this stuff up my sleeve, so I should probably mention that there was quite a bit of this I didn't know beforehand. The merge driver stuff was all new to me, and I had to work around the empty-commit issue on the fly.

Also, I didn't find a working solution on the first try; this was my second idea. My notes say that I thought my first idea would probably work but that it would have required more effort than what I described above, so I put it aside planning to take it up again if the merge driver approach didn't work. I forget what the first idea was, unfortunately.

Named commits

This is a minor, peripheral technique which I think is important for everyone to know, because it pays off far out of proportion to how easy it is to learn.

There were several commits of interest that I referred to repeatedly while investigating and fixing the pickle. In particular:

  • The last commit on the topic branch
  • The first commit on the topic branch that wasn't on master
  • The commit on master from which the topic branch diverged

Instead of trying to remember the commit IDs for these I just gave them mnemonic names with git-branch: last, first, and base, respectively. That enabled commands like git log base..last … which would otherwise have been troublesome to construct. Civilization advances by extending the number of important operations which we can perform without thinking of them. When you're thinking "okay, now I need to rebase this branch" you don't want to derail the train of thought to remember where the bottom of the branch is every time. Being able to refer to it as first is a big help.

Other approaches

After it was all over I tried to answer the question “What should X have done in the first place to avoid the pickle?” But I couldn't think of anything, so I asked Rik Signes. Rik immediately said that X should have used git-filter-branch to separate the 406 commits into two branches, branch A with just the changes to the 18 back-end files and branch B with just the changes to the other files. (The two branches together would have had more than 406 commits, since a commit that changed both back-end and front-end files would be represented in both branches.) Then he would have had no trouble landing branch A on master and, after it was deployed, landing branch B.

At that point I realized that git-filter-branch also provided a less peculiar way out of the pickle once we were in: Instead of using my merge driver approach, I could have filtered the original topic branch to produce just branch B, which would have rebased onto master just fine.

I was aware that git-filter-branch was not part of my personal toolkit, but I was unaware of the extent of my unawareness. I would have hoped that even if I hadn't known exactly how to use it, I would at least have been able to think of using it. I plan to set aside an hour or two soon to do nothing but mess around with git-filter-branch so that next time something like this happens I can at least consider using it.

It occurred to me while I was writing this that it would probably have worked to make one commit on master to remove the back-end files again, and then rebase the entire topic branch onto that commit. But I didn't think of it at the time. And it's not as good as what I did do, which left the history as clean as was possible at that point.

I think I've written before that this profusion of solutions is the sign of a well-designed system. The tools and concepts are powerful, and can be combined in many ways to solve many problems that the designers didn't foresee.

Categories: Offsite Blogs

Douglas M. Auclair (geophf): October 2016 1Liner 1HaskellADay problem and solutions

Planet Haskell - Sat, 12/10/2016 - 11:34am
  • October 21st, 2016:
    You have l1 :: [(v, [(k, x)])]
    You need the transformation l2 :: [(k, [(v, x)])]
    Redistribute v and k in one line
    Props for elegance
    • Francisco T @aiceou:
      redist xs = fromListWith (++) $ concat $ (map f xs)
        where f (a,ys) = map (\(x,y) -> (x,[(a,y)])) ys
      ... but k has to be 'Ord'
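    For reference, a self-contained version of the same idea (a sketch: it finishes with toList so the result matches the requested list type; as noted, k needs an Ord instance):

    import qualified Data.Map as Map

    redist :: Ord k => [(v, [(k, x)])] -> [(k, [(v, x)])]
    redist xs = Map.toList $ Map.fromListWith (++)
                  [ (k, [(v, x)]) | (v, kxs) <- xs, (k, x) <- kxs ]

    main :: IO ()
    main = print $ redist [ ("a", [(1 :: Int, 'x'), (2, 'y')])
                          , ("b", [(1, 'z')]) ]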
Categories: Offsite Blogs

Philip Wadler: Do you have Q?

Planet Haskell - Sat, 12/10/2016 - 5:24am

A study conducted at Northeastern analyses the factors that contribute to success in science. Age is not one of them. The research team began by focusing on career physicists. It ransacked the literature going back to 1893, identifying 2,856 physicists with careers of 20 years or more who published at least one paper every five years — widely cited findings rated as “impact” papers — and the team analyzed when in a career those emerged. ...
[K]eeping productivity equal, the scientists were as likely to score a hit at age 50 as at age 25. The distribution was random; choosing the right project to pursue at the right time was a matter of luck.
Yet turning that fortuitous choice into an influential, widely recognized contribution depended on another element, one the researchers called Q.
Q could be translated loosely as “skill,” and most likely includes a broad variety of factors, such as I.Q., drive, motivation, openness to new ideas and an ability to work well with others. Or, simply, an ability to make the most of the work at hand: to find some relevance in a humdrum experiment, and to make an elegant idea glow.
“This Q factor is so interesting because it potentially includes abilities people have but may not recognize as central,” said Zach Hambrick, a professor of psychology at Michigan State University. “Clear writing, for instance. Take the field of mathematical psychology. You may publish an interesting finding, but if the paper is unreadable, as so many are, you can’t have wide impact because no one understands what you’re talking about.”

Benedict Carey, New York Times, When It Comes to Success, Age is Just a Number.
Categories: Offsite Blogs

Contextual isomorphisms

Lambda the Ultimate - Fri, 12/09/2016 - 3:37pm

Contextual Isomorphisms
Paul Blain Levy
2017

What is the right notion of "isomorphism" between types, in a simple type theory? The traditional answer is: a pair of terms that are inverse, up to a specified congruence. We firstly argue that, in the presence of effects, this answer is too liberal and needs to be restricted, using Führmann’s notion of thunkability in the case of value types (as in call-by-value), or using Munch-Maccagnoni’s notion of linearity in the case of computation types (as in call-by-name). Yet that leaves us with different notions of isomorphism for different kinds of type.

This situation is resolved by means of a new notion of “contextual” isomorphism (or morphism), analogous at the level of types to contextual equivalence of terms. A contextual morphism is a way of replacing one type with the other wherever it may occur in a judgement, in a way that is preserved by the action of any term with holes. For types of pure λ-calculus, we show that a contextual morphism corresponds to a traditional isomorphism. For value types, a contextual morphism corresponds to a thunkable isomorphism, and for computation types, to a linear isomorphism.

This paper is based on a very simple idea that everyone familiar with type-systems can enjoy. It then applies it in a technical setting in which it brings a useful contribution. I suspect that many readers will find that second part too technical, but they may still enjoy the idea itself. To facilitate this, I will rephrase the abstract above in a way that I hope makes it accessible to a larger audience.

The problem that the paper solves is: how do we know what it means for two types to be equivalent? For many languages there are reasonable definitions of equivalence (such as: there exists a bijection between these two types in the language), but for more advanced languages these definitions break down. For example, in the presence of hidden mutable state, one can build a pair of functions from the one-element type unit to the two-element type bool and back that are the identity when composed together -- the usual definition of bijection -- while these two types should probably not be considered "the same". Those two functions share some hidden state, so they "cheat". Other, more complex notions of type equivalence have been given in the literature, but how do we know whether they are the "right" notions, or whether they may disappoint us in the same way?
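A small Haskell sketch of that "cheating" pair (using IO and an IORef to stand in for a language with hidden mutable state; an illustration, not the paper's formal setting): the two maps compose to the identity only because they secretly share a reference.

import Data.IORef

-- Effectful maps Bool -> () and () -> Bool that share hidden state.
mkCheatingPair :: IO (Bool -> IO (), () -> IO Bool)
mkCheatingPair = do
  ref <- newIORef False
  pure (writeIORef ref, \() -> readIORef ref)

main :: IO ()
main = do
  (toUnit, fromUnit) <- mkCheatingPair
  b <- toUnit True >>= fromUnit
  print b  -- True: the round trip "preserves" the Bool, by cheating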

To define what it means for two program fragments to be equivalent, we have a "gold standard", which is contextual equivalence: two program fragments are contextually equivalent if we can replace one for the other in any complete program without changing its behavior. This is simple to state, it is usually clear how to instantiate this definition for a new system, and it gives you a satisfying notion of equivalence. It may not be the most convenient one to work with, so people define other, more specific notions of equivalence (typically beta-eta-equivalence or logical relations); it is fine if they are more sophisticated, and their definitions harder to justify or understand, because they can always be compared to this simple definition to gain confidence.

The simple idea in the paper above is to use this exact same trick to define what it means for two types to be equivalent. Naively, one could say that two types are equivalent if, in any well-typed program, one can replace some occurrences of the first type by occurrences of the second type, all other things being unchanged. This does not quite work, as changing the types that appear in a program without changing its terms would create ill-typed terms. So instead, the paper proposes that two types are equivalent when we are told how to transform any program using the first type into a program using the second type, in a way that is bijective (invertible) and compositional -- see the paper for details.

Then, the author can validate this definition by showing that, when instantiated to languages (simple or complex) where existing notions of equivalence have been proposed, this new notion of equivalence corresponds to the previous notions.

(Readers may find that even the warmup part of the paper, namely the sections 1 to 4, pages 1 to 6, are rather dense, with a compactly exposed general idea and arguably a lack of concrete examples that would help understanding. Surely this terseness is in large part a consequence of strict page limits -- conference articles are the tweets of computer science research. A nice side-effect (no pun intended) is that you can observe a carefully chosen formal language at work, designed to expose the setting and perform relevant proofs in minimal space: category theory, and in particular the concept of naturality, is the killer space-saving measure here.)

Categories: Offsite Discussion

Mark Jason Dominus: Ysolo has been canceled

Planet Haskell - Thu, 12/08/2016 - 5:58pm

An earlier article discussed how I discovered that a hoax item in a Wikipedia list had become the official name of a mountain, Ysolo Mons, on the planet Ceres.

I contacted the United States Geological Survey to point out the hoax, and on Wednesday I got the following news from their representative:

Thank you for your email alerting us to the possibility that the name Ysolo, as a festival name, may be fictitious.

After some research, we agreed with your assessment. The IAU and the Dawn Team discussed the matter and decided that the best solution was to replace the name Ysolo Mons with Yamor Mons, named for the corn/maize festival in Ecuador. The WGPSN voted to approve the change.

Thank you for bringing the matter to our attention.

(“WGPSN” is the IAU's Working Group for Planetary System Nomenclature. Here's their official announcement of the change, the USGS record of the old name and the USGS record of the new name.)

This week we cleaned up a few relevant Wikipedia articles, including one on Italian Wikipedia, and Ysolo has been put to rest.

I am a little bit sad to see it go. It was fun while it lasted. But I am really pleased about the outcome. Noticing the hoax, following it up, and correcting the name of this mountain is not a large or an important thing, but it's a thing that very few people could have done at all, one that required my particular combination of unusual talents. Those opportunities are seldom.

[ Note: The USGS rep wishes me to mention that the email I quoted above is not an official IAU communication. ]

Categories: Offsite Blogs

FP Complete: Concurrency and Node

Planet Haskell - Tue, 12/06/2016 - 8:00pm

Example code can be found here.

When Node.JS first came onto the scene it successfully popularized the event-loop. Ryan Dahl correctly identified a serious problem with the way that I/O is generally handled in concurrent environments. Many web servers, for example, achieve concurrency by creating a new thread for every connection. On most platforms this comes at a substantial cost: the default thread stack size in Java is 512KB, which means that if you have 1000 concurrent connections, your program will consume half a gigabyte of memory just for stack space. Additionally, forking threads in most systems costs an enormous amount of time, as does performing a context switch between two threads.

To address these issues, Node.JS uses a single thread with an event-loop. In this way, Node can handle thousands of concurrent connections without any of the traditional drawbacks associated with threads: there is essentially no per-connection memory overhead, and there is no context switching.

When compared to traditional threading architectures, this is exactly the right approach to handling concurrency. In fact, Erlang's OTP libraries function in a very similar way. Each actor has a queue of messages, and only processes a single message at a time. Only after fully processing one message does it move on to the next.

However, this programming model imposes a heavy cost on the legibility of code. Consider the following two examples.

var request = require('request');

request('http://example.com/random-number', function (error, response, body) {
  if (!error && response.statusCode === 200) {
    console.log(JSON.parse(body));
  }
});

versus:

var request = require('request');

var response = request('http://example.com/random-number');
if (response.statusCode === 200) {
  console.log(JSON.parse(response.body));
}

In the first example we see what an HTTP request would typically look like in Node. In the second we see what it could be like if Node had threads.

As a direct result of the event-loop, we have lost the ability to express a program as a linear series of steps. This becomes even more of a problem when we need to make many I/O calls in the same function.

request('http://example.com/random-number', function(error, response1, body) {
  request('http://example.com/random-number', function(error, response2, body) {
    request('http://example.com/random-number', function(error, response3, body) {
      ...
    });
  });
});

versus:

var response1 = request('http://example.com/random-number');
var response2 = request('http://example.com/random-number');
var response3 = request('http://example.com/random-number');

This leads to the 'callback hell' that we all know and hate. Part of the bill of goods we accept when using Node is that, in exchange for better time and space characteristics, we lose the thread as an abstraction.

What is the value of a thread

It is important to keep in mind that a thread isn't just a primitive for dealing with concurrency. It is also a powerful abstraction to make the existence of latency invisible to the developer.

{-# LANGUAGE OverloadedStrings #-}

import Data.Aeson
import Network.HTTP.Simple
import Network.HTTP.Types.Status

main :: IO ()
main = do
  res <- httpJSON "http://httpbin.org/get"
  if getResponseStatus res == status200
    then print (getResponseBody res :: Value)
    else error $ show $ getResponseStatus res

Looking at the code above, we notice a characteristic difference compared to the asynchronous JavaScript examples: there are no callbacks. When we say res <- httpJSON ..., the program halts until we get a response back from the server. We have written a piece of code that is linear, from step to step, and the underlying system knows that the server on the other end will take time to respond and handles it accordingly.

This is possible because a thread keeps track of the state belonging to an ongoing computation. By tracking that state, a thread can halt and resume a computation arbitrarily. In the event of I/O, a thread will halt a computation while the I/O is occurring, and only resume it when a response comes back, giving your program the appearance of having no delay.

A thread, therefore, is both a way of running multiple computations concurrently and a way of accounting for the asynchronous nature of I/O. Because Node does not expose this abstraction to the developer, it suffers from two major drawbacks as a concurrency-oriented platform.

In Node.JS concurrency can only occur when adjacent to I/O

Node.JS only exposes one primary thread to a program. While it may spawn new threads under the hood to deal with input buffers, you don't have any control over them. As a result, you cannot write a program that takes advantage of multiple cores in a machine: your program will only be able to perform two actions concurrently if at least one of them is bound by I/O.

To demonstrate, we have set up an example that you can download alongside this article; it is linked to at the top of the article. Once you have downloaded the example, look in the README.md for the "Local vs. Remote Test" instructions.

We start by defining a web server that supports a slow operation: in our case, generating Fibonacci numbers with the most naive implementation. Then we define two tests in Node.JS and two tests in Haskell, using each language's native async libraries to attempt to acquire multiple Fibonacci numbers concurrently. In each language, one test times how long it takes to concurrently request two Fibonacci numbers from our web server; the other times how long it takes to do the same operation locally. All tests are also run without the async library for comparison.
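For reference, the Haskell "Local" test has roughly the following shape. This is a minimal sketch rather than the code in the linked repository: the naive fib definition and the use of the async package's concurrently function are assumptions about how such a test would be written.

import Control.Concurrent.Async (concurrently)
import Control.Exception (evaluate)

-- Deliberately naive Fibonacci: the slow, CPU-bound operation under test.
fib :: Int -> Integer
fib n
  | n < 2     = fromIntegral n
  | otherwise = fib (n - 1) + fib (n - 2)

main :: IO ()
main = do
  -- Try to compute both numbers at the same time.
  (x, y) <- concurrently (evaluate (fib 43)) (evaluate (fib 44))
  print (x, y)

Compiled with GHC's threaded run-time system (-threaded, run with +RTS -N), the two computations can land on separate cores, which is consistent with the speedup shown for Haskell - Local in the table below.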

Test Name         Without Async Time   With Async Time   Async / No Async
Node - Local      6.439s               6.306s            0.979
Node - Remote     4.391s               2.500s            0.569
Haskell - Local   3.849s               2.487s            0.646
Haskell - Remote  4.117s               2.521s            0.612

Taking a look at the first row in our table, we find that Node.JS is unable to give us any time savings when attempting to concurrently compute two Fibonacci numbers, in spite of the tests being run on a multi-core machine.

return async.parallel([
  (callback) => {
    x = getFibs(43);
    callback();
  },
  (callback) => {
    y = getFibs(44);
    callback();
  }
]).then(() => {

Even when the functions that generate the numbers are run inside async.parallel, they are unable to run concurrently. This is because the Node.JS event-loop only allows one callback to be executed at any given time. So instead of running the two callbacks defined above in parallel, Node runs them sequentially. If we make I/O calls inside our callbacks, as we do in the Node - Remote test, the system is able to issue both requests to a web server concurrently.

return async.parallel([
  (callback) => {
    request('http://localhost:8080/api/fibs/43')
      .then((xStr) => {
        x = JSON.parse(xStr).fibResult;
        callback();
      });
  },
  (callback) => {
    request('http://localhost:8080/api/fibs/44')
      .then((yStr) => {
        y = JSON.parse(yStr).fibResult;
        callback();
      });
  }
]).then(() => {

In Node.JS concurrent computations starve each other of resources

This inability to handle multiple computations at the same time hampers Node.JS even in its intended use case as a web server. The phenomenon is referred to as "starvation" and we can demonstrate it by reversing our previous tests.

To run the demonstration code, download the example project linked at the top of the article. Then look in the README.md for instructions on the "Starvation Test."

In this demonstration we set up a web server in each language. The web server contains two routes. The first route is a very fast computation: calculating the square of the input number. The second route is a very slow computation: calculating the Nth Fibonacci number based on the input. We then constructed a program that can run two tests against each web server.
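To make the setup concrete, here is a minimal sketch of what the Haskell side of such a server could look like. This is not the code from the linked repository: the use of Scotty, and any route or field names other than /api/fibs/:n and fibResult (which appear in the test code above), are assumptions made here for illustration.

{-# LANGUAGE OverloadedStrings #-}

import Data.Aeson (object, (.=))
import Web.Scotty

-- Deliberately naive Fibonacci: the slow route's computation.
fib :: Int -> Integer
fib n
  | n < 2     = fromIntegral n
  | otherwise = fib (n - 1) + fib (n - 2)

main :: IO ()
main = scotty 8080 $ do
  -- Fast route: square the input number.
  get "/api/square/:n" $ do
    n <- param "n"
    json (object ["squareResult" .= (n * n :: Integer)])
  -- Slow route: the Nth Fibonacci number, computed naively.
  get "/api/fibs/:n" $ do
    n <- param "n"
    json (object ["fibResult" .= fib n])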

In the first test, labeled "Fast Only," the test program counts the number of requests each server can respond to in 5 seconds, using only the fast route. In the second test, labeled "Fast with Slow," the test program makes one request to the slow route, then counts how many requests to the fast route the server can respond to in 5 seconds.

To make this demonstration as compelling as possible, we have also disabled threading in the Haskell run-time system on the Haskell web server. The result is that the Haskell server will only be able to use its internal thread model. It will not be able to create additional operating system threads, and it will not be able to take advantage of more than one core.

Test Name                  Request Throughput (higher is better)
Node - Fast Only           572
Node - Fast with Slow      114
Haskell - Fast Only        591
Haskell - Fast with Slow   589

The results are quite striking. From Haskell's baseline of 591, having a background computation reduced the throughput by only 2 requests, which is likely smaller than our margin of error. Node.JS, on the other hand, was reduced from its baseline to one fifth of its capacity.

The difference here is stark because in Node.JS's execution model, the moment it receives a request on the slow route it must fully complete the computation for that route before it can begin on a new request. There is no I/O action taking place which will allow the server to jump to processing a new request.

In contrast, Haskell can preempt threads mid-execution. This means that when the Haskell server is chugging along on the slow request and a request on the fast route comes in, it can jump over and process it, then return to the slow request while the network is delivering the fast response back to the client.

Haskell does not suffer from traditional threading overheads

When Ryan Dahl originally presented Node, his biggest concern with threaded systems was their memory consumption. He pointed to an example of Apache versus Nginx to show that Nginx had a massively lighter footprint because of its event-loop architecture (whereas Apache used one thread per connection). The difference in memory usage was at least an order of magnitude as the connection count increased.

To demonstrate that Haskell's threading architecture does not suffer this problem, we have constructed one final example, again available at the repository linked above. Look in the README.md for "Thread Spawn".

In this example we create 100,000 threads, each with a 'start' and a 'stop' MVar (an MVar being Haskell's equivalent of a locked variable). We then issue a start command on each start MVar, wait three seconds, and finally issue a stop command on each stop MVar.
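The actual program is in the linked repository; the following is a minimal sketch of the same idea, with names and structure chosen here for illustration rather than taken from the repository.

import Control.Concurrent
import Control.Monad (forM, forM_)

main :: IO ()
main = do
  let n = 100000 :: Int
  -- One (start, stop) MVar pair per thread.
  pairs <- forM [1 .. n] $ \_ -> do
    start <- newEmptyMVar
    stop  <- newEmptyMVar
    _ <- forkIO $ do
      takeMVar start   -- wait for the start signal
      takeMVar stop    -- then wait for the stop signal and exit
    return (start, stop)
  forM_ pairs $ \(start, _) -> putMVar start ()
  putStrLn ("Created " ++ show n ++ " threads.")
  threadDelay (3 * 1000 * 1000)   -- wait three seconds
  forM_ pairs $ \(_, stop) -> putMVar stop ()
  putStrLn ("Destroyed " ++ show n ++ " threads.")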

When we run the program, we do so with threading enabled and with the -s option passed to the run-time system, which gives us the following output.

stack exec -- haskell-thread-spawn +RTS -sout
Created 100000 threads.
Destroyed 100000 threads.

cat out
/home/andrew/src/fpcomplete/fpco-article-examples/concurrency-and-node/.stack-work/install/x86_64-linux-dkda49f7ca9b244180d3cfb1987cbc9743/lts-7.11/8.0.1/bin/haskell-thread-spawn +RTS -N -sout
     137,593,048 bytes allocated in the heap
     405,750,472 bytes copied during GC
      77,103,392 bytes maximum residency (11 sample(s))
      10,891,144 bytes maximum slop
             165 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       104 colls,   104 par    0.332s   0.082s     0.0008s    0.0186s
  Gen  1        11 colls,    10 par    0.620s   0.147s     0.0134s    0.0425s

  Parallel GC work balance: 28.08% (serial 0%, perfect 100%)

  TASKS: 18 (1 bound, 17 peak workers (17 total), using -N8)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    0.312s  (  3.110s elapsed)
  GC      time    0.952s  (  0.229s elapsed)
  EXIT    time    0.000s  (  0.003s elapsed)
  Total   time    1.316s  (  3.343s elapsed)

  Alloc rate    441,003,358 bytes per MUT second

  Productivity  27.7% of total user, 10.9% of total elapsed

gc_alloc_block_sync: 104599
whitehole_spin: 0
gen[0].sync: 504
gen[1].sync: 2465

Looking near the top of the output, we see that Haskell's run-time system was able to create 100,000 threads while using only 165 megabytes of memory: roughly 1.65 kilobytes per thread. So an average web server, which might see 2,000 concurrent connections at the top end, would expect to use only around 3 megabytes of memory for being multi-threaded.

Why Haskell

Node and Haskell both offer a solution to the traditional woes of using threading in a web server. They both allow you to deal with very high numbers of concurrent connections without a huge memory overhead. However, Haskell doesn't impose an additional burden on the design of your software to accomplish that goal. You don't have to worry about routes that take varying amounts of time. You don't have to dump long-running tasks off to an event queue. You don't have to spawn multiple processes to utilize multiple cores of a machine.

We've only just barely scratched the surface of what you can do with Haskell. If you're interested in learning more, please check out our Haskell Syllabus for a recommended learning route. There's also lots of content on the haskell-lang get started page.

FP Complete also provides corporate and group webinar training sessions. Please check out our training page for more information, or see our consulting page for how we can help your team succeed with devops and functional programming.

Contact FP Complete

Or you can email us at consulting@fpcomplete.com

Categories: Offsite Blogs

Neil Mitchell: Undefined Behaviour in C

Planet Haskell - Tue, 12/06/2016 - 12:15pm

Summary: I tripped over undefined behaviour in C. It's annoying.

I've recently been writing some C code to parse XML quickly. While working on that project, I inadvertently wrote some code which is undefined according to the C language standard. The code compiled and ran fine using Visual Studio, but under gcc (even at -O0) it corrupted memory, sometimes leading to a segfault, but usually just leading to a wrong answer. The code in question was (see full code at GitHub):

d->nodes.nodes[0].nodes = parse_content(d);

To give some context, d is a structure that contains various pieces of state - what the string to be parsed is, how much we have parsed, along with a pointer to the output nodes. The parse_content function parses the bit inside an XML tag, returning the indices in nodes which it used.

The complication comes from nodes not being a fixed size: it is dynamically resized if the number of nodes exceeds the capacity. For big documents that means parse_content will reallocate d->nodes.nodes.

According to the C spec, the compiler can evaluate the LHS and RHS of an assignment in any order. Since gcc computes the location of d->nodes.nodes[0] before calling parse_content, it uses the address of the node before reallocation. After reallocation the address will have changed, and the assignment will be made to the wrong location.

I spotted the bug by inserting printf statements, and in doing so, I had to rewrite the code to:

str content = parse_content(d);
d->nodes.nodes[0].nodes = content;

That fixes the issue, since now the evaluation order is strictly defined. As a simplified example of the same issue:

char* array;

char f() {
array = malloc(42);
return 'x';
}

void test() {
array = malloc(0);
array[0] = f();
}

Here the line array[0] = f() might assign to either the result of malloc(0) or malloc(42), at the compiler's discretion.

I manually checked if I had made any other such mistakes, and I couldn't find any. Naturally, I wanted to find a static checker that could detect such a mistake, so I tried a bunch of them. I wasn't very successful:

  • Visual Studio 2015 code analysis made me write assert after each malloc, but nothing further.
  • PVS Studio found nothing.
  • The Clang undefined behaviour sanitizer found nothing, and seemingly doesn't work on Windows.
  • The GCC undefined behaviour sanitizer found nothing, and seemingly doesn't work on Windows.
  • RV-Match hit a stack-overflow when running the program.
Categories: Offsite Blogs

Paul Johnson: What duties do software developers owe to users?

Planet Haskell - Sat, 12/03/2016 - 5:12pm
I was reading this blog post, entitled "The code I’m still ashamed of". 
TL;DR: back in 2000 the poster, Bill Sourour, was employed to write a web questionnaire aimed at teenage girls that purported to advise the user about her need for a particular drug. In reality, unless the user said she was allergic to it, the questionnaire always concluded that she needed the drug. Shortly afterwards, Sourour read about a teenage girl who had possibly committed suicide due to side effects of this drug. He is still troubled by this.
Nothing the poster or his employer did was illegal. It may not even have been unethical, depending on exactly which set of professional ethics you subscribe to. But it seems clear to me that there is something wrong in a program that purports to provide impartial advice while actually trying to trick you into buying medication you don't need. Bill Sourour clearly agrees.
Out in meatspace we have a clearly defined set of rules for this kind of situation. Details vary between countries, but if you consult someone about legal, financial or medical matters then they are generally held to have a "fiduciary duty" to you. The term derives from the Latin for "faithful". If X has a fiduciary duty to Y, then X is bound at all times to act in the best interests of Y. In such a case X is said to be "the fiduciary" while Y is the "beneficiary".
In many cases fiduciary duties arise in clearly defined contexts and have clear bodies of law or other rules associated with them. If you are the director of a company then you have a fiduciary duty to the shareholders, and most jurisdictions have a specific law for that case. But courts can also find fiduciary duties in other circumstances. In English law the general principle is as follows:

"A fiduciary is someone who has undertaken to act for and on behalf of another in a particular matter in circumstances which give rise to a relationship of trust and confidence."

It seems clear to me that this describes precisely the relationship between a software developer and a user. The user is not in a position to create the program they require, so they use one developed by someone else. The program acts as directed by the developer, but on behalf of the user. The user has to trust that the program will do what it promises, and in many cases the program will have access to confidential information which could be disclosed to others against the user's wishes.

These are not theoretical concerns. "Malware" is a very common category of software, defined as:
"any software used to disrupt computer or mobile operations, gather sensitive information, gain access to private computer systems, or display unwanted advertising."

Sometimes malware is illicitly introduced by hacking, but in many cases the user is induced to run the malware by promises that it will do something that the user wants. In that case, software that acts against the interests of the user is an abuse of the trust placed in the developer by the user. In particular, the potential for software to "gather sensitive information" and "gain access to private computer systems" clearly shows that the user must have a "relationship of trust and confidence" with the developer, even if they have never met.

One argument against my thesis came up when I posted a question about this to the Legal forum on Stack Exchange. The answer I got from Dale M argued that:

"Engineers (including software engineers) do not have this [relationship of confidence] and AFAIK a fiduciary duty between an engineer and their client has never been found, even where the work is a one-on-one commission."

I agree that all the current examples of a fiduciary duty, unlike the case of a software developer, involve a relationship in which the fiduciary is acting directly. The fiduciary has immediate knowledge of the circumstances of the particular beneficiary, and decides from moment to moment to take actions that may or may not be in the beneficiary's best interest. In contrast, a software developer is separated in time from the user, and may have little or no knowledge of the user's situation.

I didn't argue with Dale M because Stack Exchange is for questions and answers, not debates. However I don't think that the distinction drawn by Dale M holds for software. An engineer designing a bridge is not in a position to learn the private information of those who cross the bridge, but a software engineer is often in a position to learn a great deal about the users of their product. It seems to me that this leads inescapably to the conclusion that software engineers do have a relationship of confidence with the user, and that this therefore creates a fiduciary duty.

Of course, as Dale M points out, nobody has ever persuaded a judge that software developers owe a fiduciary duty, and it's likely that in practice it's going to be a hard sell. But to go back to the example at the top, I think that Bill Sourour, or his employer, did owe a fiduciary duty to the people who ran the questionnaire software he wrote, because they disclosed private information in the expectation of getting honest advice, and the fact that they disclosed it to a program instead of a human makes no difference at all.


Addendum: Scope of duty

This section looks at exactly what the scope of the fiduciary duty is. It doesn't fit within the main text of this essay, so I've put it here.

Fortunately there is no need for a change in the law regarding fiduciary duty. The existence of a fiduciary duty is based on the nature of the relationship between principal and agent, although in some countries specific cases such as company directors are covered by more detailed laws.

First it is necessary to determine exactly who the fiduciary is. So far I have talked about "the software developer", but in practice software is rarely written by a single individual. We have to look at the authority that is directing the effort and deciding what functions will be implemented. If the software is produced by a company then treating the company as the fiduciary would seem to be the best approach, although it might be more appropriate to hold a senior manager liable if they have exceeded their authority.

As for the scope, I'm going to take the fiduciary duties imposed on company directors and ask whether an analogous duty should apply to a software developer:

  • Duty of care: for directors this is the duty to inform themselves and take due thought before making a decision.  One might argue that a software developer should have a similar duty of care when writing software, but this is already handled through normal negligence. Elevating the application of normal professional skill to a fiduciary duty is not going to make life better for the users. However there is one area where this might be applied: lack of motive to produce secure software is widely recognised as a significant problem, and is also an area where the "confidence" aspect of fiduciary duty overlaps with a duty of care. Therefore developers who negligently fail to consider security aspects of their software should be considered to have failed in their fiduciary duty.
  • Duty of loyalty: for directors this is the duty not to use their position to further their private interests. For a software developer this is straightforward: the developer should not use their privileged access to the user's computer to further their private interests. So downloading information from the user's computer (unless the user explicitly instructs this to happen) should be a breach of fiduciary duty. So would using the processing power or bandwidth owned by the user for the developer's own purposes, for instance by mining bitcoins or sending spam.
  • Duty of good faith: the developer should write code that will advance the user's interests and act in accordance with the user's wishes at all times.
  • Duty of confidentiality: if the developer is entrusted with user information, for example because the software interfaces with cloud storage, then this should be held as confidential and not disclosed for the developer's benefit.
  • Duty of prudence: This does not map onto software development.
  • Duty of disclosure: for a director this means providing all relevant information to the shareholders. For a software developer, it means completely and honestly documenting what the software does, and particularly drawing attention to any features which a user might reasonably consider against their interests. Merely putting some general clauses in the license is not sufficient; anything that could reasonably be considered to be contrary to the user's interests should be prominently indicated in a way that enables the user to prevent it.
One gray area in this is software that is provided in exchange for personal data. Many "free" apps are paid for by advertisers who, in addition to the opportunity to advertise to the user, also pay for data about the users. On one hand, this involves the uploading of personal data that the user may not wish to share, but on the other hand it is done as part of an exchange that the user may be happy with. This comes under the duty of disclosure. The software should inform the user that personal data will be uploaded, and should also provide a detailed log of exactly what has been sent. Thus users can make informed decisions about the value of the information they are sending, and possibly alter their behavior when they know it is being monitored.


Categories: Offsite Blogs

Douglas M. Auclair (geophf): November 2016 1HaskellADay Problems and Solutions

Planet Haskell - Wed, 11/30/2016 - 9:38pm
Categories: Offsite Blogs