News aggregator

Functional Jobs: Software Engineer (Haskell/Clojure) at Capital Match (Full-time)

Planet Haskell - Fri, 10/28/2016 - 3:15am

Overview

CAPITAL MATCH is a leading marketplace lending and invoice financing platform in Singapore. Our in-house platform, mostly developed in Haskell, has in the last year processed more than USD 15 million in business loans, with strong monthly growth (currently USD 1.5-2.5 million per month). We are also eyeing expansion into new geographies and product categories. Very exciting times!

We have just secured another funding round to build world-class technology as the key business differentiator. The key components include a credit risk engine, seamless banking integration and end-to-end product automation from loan origination to debt collection.

Responsibilities

We are looking to hire a software engineer with a minimum of 2-3 years of coding experience.

The candidate should have been involved in the development of multiple web-based products from scratch, and should be interested in all aspects of the creation, growth and operations of a secure web-based platform: front-to-back feature development, distributed deployment and automation in the cloud, build and test automation, etc.

A background in fintech, especially the lending / invoice financing space, would be a great advantage.

Requirements

Our platform is primarily developed in Haskell with an Om/ClojureScript frontend. We are expecting our candidate to have experience working with a functional programming language e.g. Haskell/Scala/OCaml/F#/Clojure/Lisp/Erlang.

Deployment and production is managed with Docker containers using standard cloud infrastructure so familiarity with Linux systems, command-line environment and cloud-based deployment is mandatory. Minimum exposure to and understanding of XP practices (TDD, CI, Emergent Design, Refactoring, Peer review and programming, Continuous improvement) is expected.

We are looking for candidates who live in or are willing to relocate to Singapore.

Offer

We offer a combination of salary and equity depending on experience and skills of the candidate.

Most expats who relocate to Singapore do not have to pay their home country taxes, and the effective local tax rate in Singapore is roughly 5% on the proposed salary range.

Visa sponsorship will be provided.

Singapore is a great place to live: a vibrant city rich in diverse cultures, with a very strong financial sector and a central location in Southeast Asia.

www.capital-match.com

Get information on how to apply for this position.

Categories: Offsite Blogs

Gabriel Gonzalez: Electoral vote distributions are Monoids

Planet Haskell - Thu, 10/27/2016 - 7:50am

I'm a political junkie and I spend way too much time following the polling results on FiveThirtyEight's election forecast.

A couple of days ago I was surprised that FiveThirtyEight gave Trump a 13.7% chance of winning, which seemed too high to be consistent with the state-by-state breakdowns. After reading their methodology I learned that this was due to them not assuming that state outcomes were independent. In other words, if one swing state breaks for Trump this might increase the likelihood that other swing states also break for Trump.

However, I still wanted to work through the exercise of asking: what would Hillary's chance of winning be if each state's outcome were truly independent of the others? Let's write a program to find out!

Raw data

A couple of days ago (2016-10-24) I collected the state-by-state data from FiveThirtyEight's website (by hand) and recorded:

  • the name of the state
  • the chance that Hillary Clinton would win the state
  • the number of electoral college votes for that state

I recorded this data as a list of 3-tuples:

probabilities :: [(String, Double, Int)]
probabilities =
[ ("Alabama" , 0.003, 9)
, ("Alaska" , 0.300, 3)
, ("Arizona" , 0.529, 11)
, ("Arkansas" , 0.012, 6)
, ("California" , 0.999, 55)
, ("Colorado" , 0.889, 9)
, ("Connecticut" , 0.977, 7)
, ("Delaware" , 0.948, 3)
, ("District of Columbia", 0.999, 3)
, ("Florida" , 0.731, 29)
, ("Georgia" , 0.259, 16)
, ("Hawaii" , 0.996, 4)
, ("Idaho" , 0.026, 4)
, ("Illinois" , 0.994, 20)
, ("Indiana" , 0.121, 11)
, ("Iowa" , 0.491, 6)
, ("Kansas" , 0.089, 6)
, ("Kentucky" , 0.042, 8)
, ("Louisiana" , 0.009, 8)
, ("Maine" , 0.852, 2)
, ("Maine - 1" , 0.944, 1)
, ("Maine - 2" , 0.517, 1)
, ("Maryland" , 0.999, 10)
, ("Massachussetts" , 0.998, 11)
, ("Michigan" , 0.929, 16)
, ("Minnesota" , 0.886, 10)
, ("Mississippi" , 0.034, 6)
, ("Missouri" , 0.168, 10)
, ("Montana" , 0.119, 3)
, ("Nebraska" , 0.040, 2)
, ("Nebraska - 1" , 0.154, 1)
, ("Nebraska - 2" , 0.513, 1)
, ("Nebraska - 3" , 0.014, 1)
, ("Nevada" , 0.703, 6)
, ("New Hampshire" , 0.868, 4)
, ("New Jersey" , 0.981, 14)
, ("New Mexico" , 0.941, 5)
, ("New York" , 0.998, 29)
, ("North Carolina" , 0.689, 15)
, ("North Dakota" , 0.070, 3)
, ("Oklahoma" , 0.002, 7)
, ("Ohio" , 0.563, 18)
, ("Oregon" , 0.957, 7)
, ("Pennsylvania" , 0.880, 20)
, ("Rhode Island" , 0.974, 4)
, ("South Carolina" , 0.086, 9)
, ("South Dakota" , 0.117, 3)
, ("Tennessee" , 0.025, 11)
, ("Texas" , 0.166, 38)
, ("Utah" , 0.067, 6)
, ("Vermont" , 0.984, 3)
, ("Virginia" , 0.943, 13)
, ("Washington" , 0.975, 12)
, ("West Virginia" , 0.006, 5)
, ("Wisconsin" , 0.896, 10)
, ("Wyoming" , 0.021, 3)
]

Note that some states (like Maine) apportion electoral votes in a weird way:

probabilities :: [(String, Double, Int)]
probabilities =
    ...
    , ("Maine" , 0.852, 2)
    , ("Maine - 1" , 0.944, 1)
    , ("Maine - 2" , 0.517, 1)
    ...
    ]

Maine apportions two of its electoral votes based on a state-wide vote (i.e. "Maine" in the above list), and then one further electoral vote is apportioned per congressional district (i.e. "Maine - 1" and "Maine - 2"). FiveThirtyEight computes the probabilities for each subset of electoral votes, so we just record them separately.

Combinatorial explosion

So how might we compute Hillary's chances of winning, assuming the independence of each state's outcome?

One naïve approach would be to loop through all possible electoral outcomes and compute the probability and electoral vote for each outcome. Unfortunately, that's not very efficient since the number of possible outcomes doubles with each additional entry in the list:

>>> 2 ^ length probabilities
72057594037927936

... or approximately 7.2 * 10^16 outcomes. Even if I only spent a single CPU cycle to compute each outcome (which is unrealistic) on a 2.5 GHz processor that would take almost a year to compute them all. The election is only a couple of weeks away so I don't have that kind of time or computing power!
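To spell out that back-of-the-envelope estimate:

7.2 * 10^16 cycles ÷ 2.5 * 10^9 cycles/second ≈ 2.9 * 10^7 seconds ≈ 334 days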

Distributions

Fortunately, we can do much better than that! We can efficiently solve this using a simple "divide-and-conquer" approach where we subdivide the large problem into smaller problems until the solution is trivial.

The central data structure we'll use is a probability distribution which we'll represent as a Vector of Doubles:

import Data.Vector.Unboxed (Vector)

newtype Distribution = Distribution (Vector Double)
    deriving (Show)

This Vector will always have 539 elements, one element per possible final electoral vote count that Hillary might get. Each element is a Double representing the probability of that corresponding electoral vote count. We will maintain an invariant that all the probabilities (i.e. elements of the Vector) must sum to 1.
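As a quick sanity check on that invariant, one could write something like the following helper (isValid is my name; this is not part of the original post):

-- Check the Distribution invariants: 539 entries summing to (roughly) 1.
isValid :: Distribution -> Bool
isValid (Distribution xs) =
       Data.Vector.Unboxed.length xs == 539
    && abs (Data.Vector.Unboxed.sum xs - 1) < 1e-9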

For example, if the Distribution were:

[1, 0, 0, 0, 0 ... ]

... that would represent a 100% chance of Hillary getting 0 electoral votes and a 0% chance of any other outcome. Similarly, if the Distribution were:

[0, 0.5, 0, 0.5, 0, 0, 0 ... ]

... then that would represent a 50% chance of Hillary getting 1 electoral vote and a 50% chance of Hillary getting 3 electoral votes.

In order to simplify the problem we need to subdivide it into smaller problems. For example, if we want to compute the final electoral vote probability distribution for all 50 states, we can break that down into smaller steps:

  • Split the 50 states into two sub-groups of 25 states each
  • Compute an electoral vote probability distribution for each sub-group of 25 states
  • Combine probability distributions for each sub-group into the final distribution

In order to do that, I need to define a function that combines two smaller distributions into a larger distribution:

import qualified Data.Vector.Unboxed

combine :: Distribution -> Distribution -> Distribution
combine (Distribution xs) (Distribution ys) = Distribution zs
  where
    zs = Data.Vector.Unboxed.generate 539 totalProbability

    totalProbability i =
        Data.Vector.Unboxed.sum
            (Data.Vector.Unboxed.generate (i + 1) probabilityOfEachOutcome)
      where
        probabilityOfEachOutcome j =
            Data.Vector.Unboxed.unsafeIndex xs j
          * Data.Vector.Unboxed.unsafeIndex ys (i - j)

The combine function takes two input distributions named xs and ys and generates a new distribution named zs. To compute the probability of getting i electoral votes in our composite distribution, we just add up all the different ways we can get i electoral votes from the two sub-distributions.

For example, to compute the probability of getting 4 electoral votes for the entire group, we add up the probabilities for the following 5 outcomes:

  • We get 0 votes from our 1st group and 4 votes from our 2nd group
  • We get 1 votes from our 1st group and 3 votes from our 2nd group
  • We get 2 votes from our 1st group and 2 votes from our 2nd group
  • We get 3 votes from our 1st group and 1 votes from our 2nd group
  • We get 4 votes from our 1st group and 0 votes from our 2nd group

The probabilityOfEachOutcome function computes the probability of each one of these outcomes and then the totalProbability function sums them all up to compute the total probability of getting i electoral votes.
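In other words, combine computes the discrete convolution of the two distributions:

zs(i) = xs(0) * ys(i) + xs(1) * ys(i - 1) + ... + xs(i) * ys(0)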

We can also define an empty distribution representing the probability distribution of electoral votes given zero states:

empty :: Distribution
empty = Distribution (Data.Vector.Unboxed.generate 539 makeElement)
  where
    makeElement 0 = 1
    makeElement _ = 0

This distribution says that given zero states you have a 100% chance of getting zero electoral college votes and 0% chance of any other outcome. This empty distribution will come in handy later on.

Divide and conquer

There's no limit to how many times we can subdivide the problem. In the extreme case we can sub-divide the problem down to individual states (or districts for weird states like Maine and Nebraska):

  • subdivide our problem into 56 sub-groups (one group per state or district)
  • compute the probability distribution for each sub-group, which is trivial
  • combine all the probability distributions to retrieve the final result

In fact, this extreme solution is surprisingly efficient!

All we're missing is a function that converts each entry in our original probabilities list into a Distribution:

toDistribution :: (String, Double, Int) -> Distribution
toDistribution (_, probability, votes) =
    Distribution (Data.Vector.Unboxed.generate 539 makeElement)
  where
    makeElement 0 = 1 - probability
    makeElement i | i == votes = probability
    makeElement _ = 0

This says that the probability distribution for a single state has only two possible outcomes:

  • Hillary Clinton has probability x of winning n votes for this state
  • Hillary Clinton has probability 1 - x of winning 0 votes for this state
  • Every other outcome for this state has probability 0

Let's test this out on a couple of states:

>>> toDistribution ("Alaska" , 0.300, 3)
Distribution [0.7,0.0,0.0,0.3,0.0,0.0,...
>>> toDistribution ("North Dakota", 0.070, 3)
Distribution [0.9299999999999999,0.0,0.0,7.0e-2,0.0...

This says that:

  • Alaska has a 30% chance of giving Clinton 3 votes and 70% chance of 0 votes
  • North Dakota has a 7% chance of giving Clinton 3 votes and a 93% chance of 0 votes

We can also verify that combine works correctly by combining the electoral vote distributions of both states. We expect the new distribution to be:

  • 2.1% chance of 6 votes (the probability of winning both states)
  • 65.1% chance of 0 votes (the probability of losing both states)
  • 32.8% chance of 3 votes (the probability of winning just one of the two states)
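These expectations are just products of the independent probabilities:

0.300 * 0.070 = 0.021 (win both)
0.700 * 0.930 = 0.651 (lose both)
0.300 * 0.930 + 0.700 * 0.070 = 0.279 + 0.049 = 0.328 (win exactly one)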

... and this is in fact what we get:

>>> let alaska = toDistribution ("Alaska" , 0.300, 3)
>>> let northDakota = toDistribution ("North Dakota", 0.070, 3)
>>> combine alaska northDakota
Distribution [0.6509999999999999,0.0,0.0,0.32799999999999996,0.0,0.0,2.1e-2,0.0,...

Final result

To compute the total probability of winning, we just transform each element of the list to the corresponding distribution:

distributions :: [Distribution]
distributions = map toDistribution probabilities

... then we reduce the list to a single value by repeatedly applying the combine function, falling back on the empty distribution if the entire list is empty:

import qualified Data.List

distribution :: Distribution
distribution = Data.List.foldl' combine empty distributions

... and if we want to get Clinton's chances of winning, we just add up the probabilities for all outcomes greater than or equal to 270 electoral college votes:

chanceOfClintonVictory :: Double
chanceOfClintonVictory =
    Data.Vector.Unboxed.sum (Data.Vector.Unboxed.drop 270 xs)
  where
    Distribution xs = distribution

main :: IO ()
main = print chanceOfClintonVictory

If we compile and run this program we get the final result:

$ stack --resolver=lts-7.4 build vector
$ stack --resolver=lts-7.4 ghc -- -O2 result.hs
$ ./result
0.9929417642334847

In other words, Clinton has a 99.3% chance of winning if each state's outcome is independent of every other outcome. This is significantly higher than the probability estimated by FiveThirtyEight at that time: 86.3%.

These results differ for the same reason I noted above: FiveThirtyEight assumes that state outcomes are not necessarily independent and that a Trump win in one state could correlate with Trump wins in other states. This possibility of correlated victories favors the person who is behind in the race.

As a sanity check, we can also verify that the final probability distribution has probabilities that add up to approximately 1:

>>> let Distribution xs = distribution
>>> Data.Vector.Unboxed.sum xs
0.9999999999999994

Exercise: Expand on this program to plot the probability distribution
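If you want a head start on the exercise, here is a minimal text-based sketch (printTally is my name, not the post's): print each electoral vote count whose probability exceeds a cutoff, with a crude bar proportional to the probability.

printTally :: Distribution -> IO ()
printTally (Distribution xs) =
    mapM_ printBar (Data.Vector.Unboxed.toList (Data.Vector.Unboxed.indexed xs))
  where
    -- Skip negligible outcomes; scale the rest to a row of asterisks
    printBar (votes, p)
        | p < 0.001 = return ()
        | otherwise = putStrLn (show votes ++ " " ++ replicate (round (1000 * p)) '*')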

Efficiency

Our program is also efficient, running in 30 milliseconds:

$ bench ./result
benchmarking ./result
time 30.33 ms (29.42 ms .. 31.16 ms)
0.998 R² (0.997 R² .. 1.000 R²)
mean 29.43 ms (29.13 ms .. 29.81 ms)
std dev 710.6 μs (506.7 μs .. 992.6 μs)

This is a significant improvement over a year's worth of running time.

We could even speed this up further using parallelism. Thanks to our divide-and-conquer approach we can subdivide this problem among up to 56 CPUs (one per state or district) to accelerate the solution. However, after a certain point the overhead of splitting up the work might outweigh the benefits of parallelism.
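As a sketch of what that might look like (assuming the parallel package and a -threaded build; parallelCombine and pairs are my names, not the post's), we can combine the distributions pairwise in parallel, halving the list each round:

import Control.Parallel.Strategies (parMap, rseq)

-- Combine distributions pairwise in parallel; each pass halves the list.
-- rseq suffices here because Distribution is a newtype over an unboxed
-- vector, and forcing an unboxed vector to WHNF builds the whole array.
parallelCombine :: [Distribution] -> Distribution
parallelCombine []  = empty
parallelCombine [d] = d
parallelCombine ds  = parallelCombine (parMap rseq (uncurry combine) (pairs ds))
  where
    pairs (x:y:rest) = (x, y) : pairs rest
    pairs [x]        = [(x, empty)]
    pairs []         = []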

Monoids

People more familiar with Haskell will recognize that this solution fits cleanly into a standard Haskell interface known as the Monoid type class. In fact, many divide-and-conquer solutions tend to be Monoids of some sort.

The Monoid typeclass is defined as:

class Monoid m where
    mappend :: m -> m -> m

    mempty :: m

-- An infix operator that is a synonym for `mappend`
(<>) :: Monoid m => m -> m -> m
x <> y = mappend x y

... and the Monoid class has three rules that every implementation must obey, which are known as the "Monoid laws".

The first rule is that mappend (or the equivalent (<>) operator) must be associative:

x <> (y <> z) = (x <> y) <> z

The second and third rules are that mempty must be the "identity" of mappend, meaning that mempty does nothing when combined with other values:

mempty <> x = x

x <> mempty = x

A simple example of a Monoid is integers under addition, which we can implement like this:

instance Monoid Integer where
    mappend = (+)
    mempty  = 0

... and this implementation satisfies the Monoid laws thanks to the laws of addition:

(x + y) + z = x + (y + z)

0 + x = x

x + 0 = x

However, Distributions are Monoids, too! Our combine and empty definitions both have the correct types to implement the mappend and mempty methods of the Monoid typeclass, respectively:

instance Monoid Distribution where
    mappend = combine

    mempty = empty
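With this instance in place, the explicit fold from before can be spelled simply as mconcat (which is what the complete program at the end of this post uses). One caveat from the future: on GHC 8.4 and later, Monoid has Semigroup as a superclass, so an extra instance would also be required:

-- Needed on GHC 8.4+ (not required when this post was written):
instance Semigroup Distribution where
    (<>) = combine

distribution :: Distribution
distribution = mconcat distributions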

Both mappend and mempty for Distributions satisfy the Monoid laws:

  • mappend is associative (Proof omitted)
  • mempty is the identity of mappend

We can prove the identity law using the following rules for how Vectors behave:

-- These rules assume that all vectors involved have 539 elements

-- If you generate a vector by just indexing into another vector, you just get back
-- the other vector
Data.Vector.Unboxed.generate 539 (Data.Vector.Unboxed.unsafeIndex xs) = xs

-- If you index into a vector generated by a function, that's equivalent to calling
-- that function
Data.Vector.Unboxed.unsafeIndex (Data.Vector.Unboxed.generate 539 f) i = f i

Equipped with those rules, we can then prove that mappend (Distribution xs) mempty = Distribution xs:

mappend (Distribution xs) mempty

-- mappend = combine
= combine (Distribution xs) mempty

-- Definition of `mempty`
= combine (Distribution xs) (Distribution ys)
  where
    ys = Data.Vector.Unboxed.generate 539 makeElement
      where
        makeElement 0 = 1
        makeElement _ = 0

-- Definition of `combine`
= Distribution zs
  where
    zs = Data.Vector.Unboxed.generate 539 totalProbability

    totalProbability i =
        Data.Vector.Unboxed.sum
            (Data.Vector.Unboxed.generate (i + 1) probabilityOfEachOutcome)
      where
        probabilityOfEachOutcome j =
            Data.Vector.Unboxed.unsafeIndex xs j
          * Data.Vector.Unboxed.unsafeIndex ys (i - j)

    ys = Data.Vector.Unboxed.generate 539 makeElement
      where
        makeElement 0 = 1
        makeElement _ = 0

-- Data.Vector.Unboxed.unsafeIndex (Data.Vector.Unboxed.generate 539 f) i = f i
= Distribution zs
  where
    zs = Data.Vector.Unboxed.generate 539 totalProbability

    totalProbability i =
        Data.Vector.Unboxed.sum
            (Data.Vector.Unboxed.generate (i + 1) probabilityOfEachOutcome)
      where
        probabilityOfEachOutcome j =
            Data.Vector.Unboxed.unsafeIndex xs j
          * makeElement (i - j)

        makeElement 0 = 1
        makeElement _ = 0

-- Case analysis on `j`
= Distribution zs
  where
    zs = Data.Vector.Unboxed.generate 539 totalProbability

    totalProbability i =
        Data.Vector.Unboxed.sum
            (Data.Vector.Unboxed.generate (i + 1) probabilityOfEachOutcome)
      where
        probabilityOfEachOutcome j
            | j == i =
                Data.Vector.Unboxed.unsafeIndex xs j
              * 1 -- makeElement (i - j) = makeElement 0 = 1
            | otherwise =
                Data.Vector.Unboxed.unsafeIndex xs j
              * 0 -- makeElement (i - j) = 0

-- x * 1 = x
-- y * 0 = 0
= Distribution zs
  where
    zs = Data.Vector.Unboxed.generate 539 totalProbability

    totalProbability i =
        Data.Vector.Unboxed.sum
            (Data.Vector.Unboxed.generate (i + 1) probabilityOfEachOutcome)
      where
        probabilityOfEachOutcome j
            | j == i    = Data.Vector.Unboxed.unsafeIndex xs j
            | otherwise = 0

-- Informally: "The sum of a vector with one non-zero element is just that element"
= Distribution zs
  where
    zs = Data.Vector.Unboxed.generate 539 totalProbability

    totalProbability i = Data.Vector.Unboxed.unsafeIndex xs i

-- Data.Vector.Unboxed.generate 539 (Data.Vector.Unboxed.unsafeIndex xs) = xs
= Distribution xs

Exercise: Prove the associativity law for combine

Conclusion

I hope people find this an interesting example of how you can apply mathematical design principles (in this case: Monoids) in service of simplifying and speeding up programming problems.

If you would like to test this program out yourself the complete program is provided below:

import Data.Vector.Unboxed (Vector)

import qualified Data.List
import qualified Data.Vector.Unboxed

probabilities :: [(String, Double, Int)]
probabilities =
    [ ("Alabama" , 0.003, 9)
    , ("Alaska" , 0.300, 3)
    , ("Arizona" , 0.529, 11)
    , ("Arkansas" , 0.012, 6)
    , ("California" , 0.999, 55)
    , ("Colorado" , 0.889, 9)
    , ("Connecticut" , 0.977, 7)
    , ("Delaware" , 0.948, 3)
    , ("District of Columbia", 0.999, 3)
    , ("Florida" , 0.731, 29)
    , ("Georgia" , 0.259, 16)
    , ("Hawaii" , 0.996, 4)
    , ("Idaho" , 0.026, 4)
    , ("Illinois" , 0.994, 20)
    , ("Indiana" , 0.121, 11)
    , ("Iowa" , 0.491, 6)
    , ("Kansas" , 0.089, 6)
    , ("Kentucky" , 0.042, 8)
    , ("Louisiana" , 0.009, 8)
    , ("Maine" , 0.852, 2)
    , ("Maine - 1" , 0.944, 1)
    , ("Maine - 2" , 0.517, 1)
    , ("Maryland" , 0.999, 10)
    , ("Massachusetts" , 0.998, 11)
    , ("Michigan" , 0.929, 16)
    , ("Minnesota" , 0.886, 10)
    , ("Mississippi" , 0.034, 6)
    , ("Missouri" , 0.168, 10)
    , ("Montana" , 0.119, 3)
    , ("Nebraska" , 0.040, 2)
    , ("Nebraska - 1" , 0.154, 1)
    , ("Nebraska - 2" , 0.513, 1)
    , ("Nebraska - 3" , 0.014, 1)
    , ("Nevada" , 0.703, 6)
    , ("New Hampshire" , 0.868, 4)
    , ("New Jersey" , 0.981, 14)
    , ("New Mexico" , 0.941, 5)
    , ("New York" , 0.998, 29)
    , ("North Carolina" , 0.689, 15)
    , ("North Dakota" , 0.070, 3)
    , ("Oklahoma" , 0.002, 7)
    , ("Ohio" , 0.563, 18)
    , ("Oregon" , 0.957, 7)
    , ("Pennsylvania" , 0.880, 20)
    , ("Rhode Island" , 0.974, 4)
    , ("South Carolina" , 0.086, 9)
    , ("South Dakota" , 0.117, 3)
    , ("Tennessee" , 0.025, 11)
    , ("Texas" , 0.166, 38)
    , ("Utah" , 0.067, 6)
    , ("Vermont" , 0.984, 3)
    , ("Virginia" , 0.943, 13)
    , ("Washington" , 0.975, 12)
    , ("West Virginia" , 0.006, 5)
    , ("Wisconsin" , 0.896, 10)
    , ("Wyoming" , 0.021, 3)
    ]

newtype Distribution = Distribution { getDistribution :: Vector Double }
    deriving (Show)

combine :: Distribution -> Distribution -> Distribution
combine (Distribution xs) (Distribution ys) = Distribution zs
  where
    zs = Data.Vector.Unboxed.generate 539 totalProbability

    totalProbability i =
        Data.Vector.Unboxed.sum
            (Data.Vector.Unboxed.generate (i + 1) probabilityOfEachOutcome)
      where
        probabilityOfEachOutcome j =
            Data.Vector.Unboxed.unsafeIndex xs j
          * Data.Vector.Unboxed.unsafeIndex ys (i - j)

empty :: Distribution
empty = Distribution (Data.Vector.Unboxed.generate 539 makeElement)
  where
    makeElement 0 = 1
    makeElement _ = 0

instance Monoid Distribution where
    mappend = combine

    mempty = empty

toDistribution :: (String, Double, Int) -> Distribution
toDistribution (_, probability, votes) =
    Distribution (Data.Vector.Unboxed.generate 539 makeElement)
  where
    makeElement 0 = 1 - probability
    makeElement i | i == votes = probability
    makeElement _ = 0

distributions :: [Distribution]
distributions = map toDistribution probabilities

distribution :: Distribution
distribution = mconcat distributions

chanceOfClintonVictory :: Double
chanceOfClintonVictory =
    Data.Vector.Unboxed.sum (Data.Vector.Unboxed.drop 270 xs)
  where
    Distribution xs = distribution

main :: IO ()
main = print chanceOfClintonVictory

Categories: Offsite Blogs

Functional Jobs: Clojure Engineer at ROKT (Full-time)

Planet Haskell - Tue, 10/25/2016 - 11:58pm

ROKT is hiring thoughtful, talented functional programmers, at all levels, to expand our Clojure team in Sydney, Australia. (We're looking for people who already have the right to work in Australia, please.)

ROKT is a successful startup with a transaction marketing platform used by some of the world's largest ecommerce sites. Our Sydney-based engineering team supports a business that is growing rapidly around the world.

Our Clojure engineers are responsible for ROKT's "Data Platform", a web interface for our sales teams, our operations team, and our customers to extract and upload the data that drives our customers' businesses and our own. We write Clojure on the server-side, and a ClojureScript single-page application on the frontend.

We don't have a Hadoop-based neural net diligently organising our customer data into the world's most efficiently balanced red-black tree (good news: we won't ask you to write one in an interview) — instead, we try to spend our time carefully building the simplest thing that'll do what the business needs done. We're looking for programmers who can help us build simple, robust systems — and we think that means writing in a very functional style — whether that involves hooking some CV-enhancing buzzword technology on the side or not.

If you have professional Clojure experience, that's excellent, we'd like to hear about it. But we don't have a big matrix of exacting checkboxes to measure you against, so if your Clojure isn't fluent yet, we'll be happy to hear how you've been writing functional code in whatever language you're most comfortable in, whether it be Haskell or JavaScript, Common Lisp or Scala. We have the luxury of building out a solid team of thoughtful developers — no "get me a resource with exactly X years of experience in technology Y, stat!"

Get information on how to apply for this position.

Categories: Offsite Blogs

Joachim Breitner: Showcasing Applicative

Planet Haskell - Tue, 10/25/2016 - 10:00pm

My plan for this week’s lecture of the CIS 194 Haskell course at the University of Pennsylvania is to dwell a bit on the concept of Functor, Applicative and Monad, and to highlight the value of the Applicative abstraction.

I quite like the example that I came up with, so I want to share it here. In the interest of long-term archival and stand-alone presentation, I include all the material in this post.1

Imports

In case you want to follow along, start with these imports:

import Data.Char
import Data.Maybe
import Data.List
import System.Environment
import System.IO
import System.Exit

The parser

The starting point for this exercise is a fairly standard parser-combinator monad, which happens to be the result of the student’s homework from last week:

newtype Parser a = P (String -> Maybe (a, String))

runParser :: Parser t -> String -> Maybe (t, String)
runParser (P p) = p

parse :: Parser a -> String -> Maybe a
parse p input = case runParser p input of
    Just (result, "") -> Just result
    _ -> Nothing -- handles both no result and leftover input

noParserP :: Parser a
noParserP = P (\_ -> Nothing)

pureParserP :: a -> Parser a
pureParserP x = P (\input -> Just (x,input))

instance Functor Parser where
    fmap f p = P $ \input -> do
        (x, rest) <- runParser p input
        return (f x, rest)

instance Applicative Parser where
    pure = pureParserP
    p1 <*> p2 = P $ \input -> do
        (f, rest1) <- runParser p1 input
        (x, rest2) <- runParser p2 rest1
        return (f x, rest2)

instance Monad Parser where
    return = pure
    p1 >>= k = P $ \input -> do
        (x, rest1) <- runParser p1 input
        runParser (k x) rest1

anyCharP :: Parser Char
anyCharP = P $ \input -> case input of
    (c:rest) -> Just (c, rest)
    []       -> Nothing

charP :: Char -> Parser ()
charP c = do
    c' <- anyCharP
    if c == c' then return ()
               else noParserP

anyCharButP :: Char -> Parser Char
anyCharButP c = do
    c' <- anyCharP
    if c /= c' then return c'
               else noParserP

letterOrDigitP :: Parser Char
letterOrDigitP = do
    c <- anyCharP
    if isAlphaNum c then return c else noParserP

orElseP :: Parser a -> Parser a -> Parser a
orElseP p1 p2 = P $ \input -> case runParser p1 input of
    Just r -> Just r
    Nothing -> runParser p2 input

manyP :: Parser a -> Parser [a]
manyP p = (pure (:) <*> p <*> manyP p) `orElseP` pure []

many1P :: Parser a -> Parser [a]
many1P p = pure (:) <*> p <*> manyP p

sepByP :: Parser a -> Parser () -> Parser [a]
sepByP p1 p2 = (pure (:) <*> p1 <*> (manyP (p2 *> p1))) `orElseP` pure []

A parser using this library for, for example, CSV files could take this form:

parseCSVP :: Parser [[String]]
parseCSVP = manyP parseLine
  where
    parseLine = parseCell `sepByP` charP ',' <* charP '\n'
    parseCell = do
        charP '"'
        content <- manyP (anyCharButP '"')
        charP '"'
        return content

We want EBNF

Often when we write a parser for a file format, we might also want to have a formal specification of the format. A common form for such a specification is EBNF. This might look as follows, for a CSV file:

cell = '"', {not-quote}, '"'; line = (cell, {',', cell} | ''), newline; csv = {line};

It is straightforward to create a Haskell data type to represent an EBNF syntax description. Here is a simple EBNF library (data type and pretty-printer) for your convenience:

data RHS
    = Terminal String
    | NonTerminal String
    | Choice RHS RHS
    | Sequence RHS RHS
    | Optional RHS
    | Repetition RHS
    deriving (Show, Eq)

ppRHS :: RHS -> String
ppRHS = go 0
  where
    go _ (Terminal s)     = surround "'" "'" $ concatMap quote s
    go _ (NonTerminal s)  = s
    go a (Choice x1 x2)   = p a 1 $ go 1 x1 ++ " | " ++ go 1 x2
    go a (Sequence x1 x2) = p a 2 $ go 2 x1 ++ ", " ++ go 2 x2
    go _ (Optional x)     = surround "[" "]" $ go 0 x
    go _ (Repetition x)   = surround "{" "}" $ go 0 x

    surround c1 c2 x = c1 ++ x ++ c2

    p a n | a > n     = surround "(" ")"
          | otherwise = id

    quote '\'' = "\\'"
    quote '\\' = "\\\\"
    quote c    = [c]

type Production = (String, RHS)
type BNF = [Production]

ppBNF :: BNF -> String
ppBNF = unlines . map (\(i,rhs) -> i ++ " = " ++ ppRHS rhs ++ ";")

Code to produce EBNF

We had a good time writing combinators that create complex parsers from primitive pieces. Let us do the same for EBNF grammars. We could simply work on the RHS type directly, but we can do something more nifty: We create a data type that keeps track, via a phantom type parameter, of what Haskell type the given EBNF syntax is the specification:

newtype Grammar a = G RHS

ppGrammar :: Grammar a -> String
ppGrammar (G rhs) = ppRHS rhs

So a value of type Grammar t is a description of the textual representation of the Haskell type t.

Here is one simple example:

anyCharG :: Grammar Char
anyCharG = G (NonTerminal "char")

Here is another one. This one does not describe any interesting Haskell type, but is useful when spelling out the special characters in the syntax described by the grammar:

charG :: Char -> Grammar ()
charG c = G (Terminal [c])

A combinator that creates new grammar from two existing grammars:

orElseG :: Grammar a -> Grammar a -> Grammar a
orElseG (G rhs1) (G rhs2) = G (Choice rhs1 rhs2)

We want the convenience of our well-known type classes in order to combine these values some more:

instance Functor Grammar where
    fmap _ (G rhs) = G rhs

instance Applicative Grammar where
    pure x = G (Terminal "")
    (G rhs1) <*> (G rhs2) = G (Sequence rhs1 rhs2)

Note how the Functor instance does not actually use the function. How should it? There are no values inside a Grammar!

We cannot define a Monad instance for Grammar: We would start with (G rhs1) >>= k = …, but there is simply no way of getting a value of type a that we can feed to k. So we will do without a Monad instance. This is interesting, and we will come back to that later.

Like with the parser, we can now begin to build on the primitive example to build more complicated combinators:

manyG :: Grammar a -> Grammar [a]
manyG p = (pure (:) <*> p <*> manyG p) `orElseG` pure []

many1G :: Grammar a -> Grammar [a]
many1G p = pure (:) <*> p <*> manyG p

sepByG :: Grammar a -> Grammar () -> Grammar [a]
sepByG p1 p2 = ((:) <$> p1 <*> (manyG (p2 *> p1))) `orElseG` pure []

Let us run a small example:

dottedWordsG :: Grammar [String]
dottedWordsG = many1G (manyG anyCharG <* charG '.')

*Main> putStrLn $ ppGrammar dottedWordsG
'', ('', char, ('', char, ('', char, ('', char, ('', char, ('', …

Oh my, that is not good. Looks like the recursion in manyG does not work well, so we need to avoid that. But anyways we want to be explicit in the EBNF grammars about where something can be repeated, so let us just make many a primitive:

manyG :: Grammar a -> Grammar [a]
manyG (G rhs) = G (Repetition rhs)

With this definition, we already get a simple grammar for dottedWordsG:

*Main> putStrLn $ ppGrammar dottedWordsG
'', {char}, '.', {{char}, '.'}

This already looks like a proper EBNF grammar. One thing that is not nice about it is that there is an empty string ('') in a sequence (…,…). We do not want that.

Why is it there in the first place? Because our Applicative instance is not lawful! Remember that pure id <*> g == g should hold. One way to achieve that is to improve the Applicative instance to optimize this case away:

instance Applicative Grammar where
    pure x = G (Terminal "")
    G (Terminal "") <*> G rhs2 = G rhs2
    G rhs1 <*> G (Terminal "") = G rhs1
    (G rhs1) <*> (G rhs2) = G (Sequence rhs1 rhs2)

Now we get what we want:

*Main> putStrLn $ ppGrammar dottedWordsG
{char}, '.', {{char}, '.'}

Remember our parser for CSV files above? Let me repeat it here, this time using only Applicative combinators, i.e. avoiding (>>=), (>>), return and do-notation:

parseCSVP :: Parser [[String]]
parseCSVP = manyP parseLine
  where
    parseLine = parseCell `sepByP` charP ',' <* charP '\n'
    parseCell = charP '"' *> manyP (anyCharButP '"') <* charP '"'

And now we try to rewrite the code to produce Grammar instead of Parser. This is straightforward with the exception of anyCharButP. The parser code for that is inherently monadic, and we just do not have a monad instance. So we work around the issue by making that a “primitive” grammar, i.e. introducing a non-terminal in the EBNF without a production rule – pretty much like we did for anyCharG:

primitiveG :: String -> Grammar a
primitiveG s = G (NonTerminal s)

parseCSVG :: Grammar [[String]]
parseCSVG = manyG parseLine
  where
    parseLine = parseCell `sepByG` charG ',' <* charG '\n'
    parseCell = charG '"' *> manyG (primitiveG "not-quote") <* charG '"'

Of course the names parse… are not quite right any more, but let us just leave that for now.

Here is the result:

*Main> putStrLn $ ppGrammar parseCSVG
{('"', {not-quote}, '"', {',', '"', {not-quote}, '"'} | ''), '
'}

The line break is weird. We do not really want newlines in the grammar. So let us make that primitive as well, and replace charG '\n' with newlineG:

newlineG :: Grammar ()
newlineG = primitiveG "newline"

Now we get

*Main> putStrLn $ ppGrammar parseCSVG
{('"', {not-quote}, '"', {',', '"', {not-quote}, '"'} | ''), newline}

which is nice and correct, but still not quite the easily readable EBNF that we saw further up.

Code to produce EBNF, with productions

We currently let our grammars produce only the right-hand side of one EBNF production, but really, we want to produce a RHS that may refer to other productions. So let us change the type accordingly:

newtype Grammar a = G (BNF, RHS)

runGrammer :: String -> Grammar a -> BNF
runGrammer main (G (prods, rhs)) = prods ++ [(main, rhs)]

ppGrammar :: String -> Grammar a -> String
ppGrammar main g = ppBNF $ runGrammer main g

Now we have to adjust all our primitive combinators (but not the derived ones!):

charG :: Char -> Grammar ()
charG c = G ([], Terminal [c])

anyCharG :: Grammar Char
anyCharG = G ([], NonTerminal "char")

manyG :: Grammar a -> Grammar [a]
manyG (G (prods, rhs)) = G (prods, Repetition rhs)

mergeProds :: [Production] -> [Production] -> [Production]
mergeProds prods1 prods2 = nub $ prods1 ++ prods2

orElseG :: Grammar a -> Grammar a -> Grammar a
orElseG (G (prods1, rhs1)) (G (prods2, rhs2))
    = G (mergeProds prods1 prods2, Choice rhs1 rhs2)

instance Functor Grammar where
    fmap _ (G bnf) = G bnf

instance Applicative Grammar where
    pure x = G ([], Terminal "")
    G (prods1, Terminal "") <*> G (prods2, rhs2)
        = G (mergeProds prods1 prods2, rhs2)
    G (prods1, rhs1) <*> G (prods2, Terminal "")
        = G (mergeProds prods1 prods2, rhs1)
    G (prods1, rhs1) <*> G (prods2, rhs2)
        = G (mergeProds prods1 prods2, Sequence rhs1 rhs2)

primitiveG :: String -> Grammar a
primitiveG s = G ([], NonTerminal s)

The use of nub when combining productions removes duplicates that might be used in different parts of the grammar. Not efficient, but good enough for now.

Did we gain anything? Not yet:

*Main> putStr $ ppGrammar "csv" (parseCSVG)
csv = {('"', {not-quote}, '"', {',', '"', {not-quote}, '"'} | ''), newline};

But we can now introduce a function that lets us tell the system where to give names to a piece of grammar:

nonTerminal :: String -> Grammar a -> Grammar a
nonTerminal name (G (prods, rhs))
    = G (prods ++ [(name, rhs)], NonTerminal name)

Ample use of this in parseCSVG yields the desired result:

parseCSVG :: Grammar [[String]]
parseCSVG = manyG parseLine
  where
    parseLine = nonTerminal "line" $
        parseCell `sepByG` charG ',' <* newlineG
    parseCell = nonTerminal "cell" $
        charG '"' *> manyG (primitiveG "not-quote") <* charG '"'

*Main> putStr $ ppGrammar "csv" (parseCSVG)
cell = '"', {not-quote}, '"';
line = (cell, {',', cell} | ''), newline;
csv = {line};

This is great!

Unifying parsing and grammar-generating

Note how similar parseCSVG and parseCSVP are! Would it not be great if we could implement that functionality only once, and get both a parser and a grammar description out of it? This way, the two would never be out of sync!

And surely this must be possible. The tool to reach for is of course to define a type class that abstracts over the parts where Parser and Grammar differ. So we have to identify all functions that are primitive in one of the two worlds, and turn them into type class methods. This includes char and orElse. It includes many, too: Although manyP is not primitive, manyG is. It also includes nonTerminal, which does not exist in the world of parsers (yet), but we need it for the grammars.

The primitiveG function is tricky. We use it in grammars when the code that we might use while parsing is not expressible as a grammar. So the solution is to let it take two arguments: A String, when used as a descriptive non-terminal in a grammar, and a Parser a, used in the parsing code.

Finally, the type classes that we expect, Applicative (and thus Functor), are added as constraints on our type class:

class Applicative f => Descr f where
    char :: Char -> f ()
    many :: f a -> f [a]
    orElse :: f a -> f a -> f a
    primitive :: String -> Parser a -> f a
    nonTerminal :: String -> f a -> f a

The instances are easily written:

instance Descr Parser where
    char = charP
    many = manyP
    orElse = orElseP
    primitive _ p = p
    nonTerminal _ p = p

instance Descr Grammar where
    char = charG
    many = manyG
    orElse = orElseG
    primitive s _ = primitiveG s
    nonTerminal s g = nonTerminalG s g

And we can now take the derived definitions, of which so far we had two copies, and define them once and for all:

many1 :: Descr f => f a -> f [a]
many1 p = pure (:) <*> p <*> many p

anyChar :: Descr f => f Char
anyChar = primitive "char" anyCharP

dottedWords :: Descr f => f [String]
dottedWords = many1 (many anyChar <* char '.')

sepBy :: Descr f => f a -> f () -> f [a]
sepBy p1 p2 = ((:) <$> p1 <*> (many (p2 *> p1))) `orElse` pure []

newline :: Descr f => f ()
newline = primitive "newline" (charP '\n')

And thus we now have our CSV parser/grammar generator:

parseCSV :: Descr f => f [[String]]
parseCSV = many parseLine
  where
    parseLine = nonTerminal "line" $
        parseCell `sepBy` char ',' <* newline
    parseCell = nonTerminal "cell" $
        char '"' *> many (primitive "not-quote" (anyCharButP '"')) <* char '"'

We can now use this definition both to parse and to generate grammars:

*Main> putStr $ ppGrammar2 "csv" (parseCSV) cell = '"', {not-quote}, '"'; line = (cell, {',', cell} | ''), newline; csv = {line}; *Main> parse parseCSV "\"ab\",\"cd\"\n\"\",\"de\"\n\n" Just [["ab","cd"],["","de"],[]] The INI file parser and grammar

As a final exercise, let us transform the INI file parser into a combined thing. Here is the parser (another artifact of last week’s homework) again using applicative style2:

parseINIP :: Parser INIFile
parseINIP = many1P parseSection
  where
    parseSection =
        (,) <$ charP '['
            <*> parseIdent
            <* charP ']'
            <* charP '\n'
            <*> (catMaybes <$> manyP parseLine)
    parseIdent = many1P letterOrDigitP
    parseLine = parseDecl `orElseP` parseComment `orElseP` parseEmpty
    parseDecl = Just <$> (
        (,) <$> parseIdent
            <*  manyP (charP ' ')
            <*  charP '='
            <*  manyP (charP ' ')
            <*> many1P (anyCharButP '\n')
            <*  charP '\n')
    parseComment =
        Nothing <$ charP '#'
                <* many1P (anyCharButP '\n')
                <* charP '\n'
    parseEmpty = Nothing <$ charP '\n'

Transforming that to a generic description is quite straightforward. We use primitive again to wrap letterOrDigitP:

descrINI :: Descr f => f INIFile
descrINI = many1 parseSection
  where
    parseSection =
        (,) <$ char '['
            <*> parseIdent
            <* char ']'
            <* newline
            <*> (catMaybes <$> many parseLine)
    parseIdent = many1 (primitive "alphanum" letterOrDigitP)
    parseLine = parseDecl `orElse` parseComment `orElse` parseEmpty
    parseDecl = Just <$> (
        (,) <$> parseIdent
            <*  many (char ' ')
            <*  char '='
            <*  many (char ' ')
            <*> many1 (primitive "non-newline" (anyCharButP '\n'))
            <* newline)
    parseComment =
        Nothing <$ char '#'
                <* many1 (primitive "non-newline" (anyCharButP '\n'))
                <* newline
    parseEmpty = Nothing <$ newline

This yields this not very helpful grammar (abbreviated here):

*Main> putStr $ ppGrammar2 "ini" descrINI ini = '[', alphanum, {alphanum}, ']', newline, {alphanum, {alphanum}, {' '}…

But with a few uses of nonTerminal, we get something really nice:

descrINI :: Descr f => f INIFile
descrINI = many1 parseSection
  where
    parseSection = nonTerminal "section" $
        (,) <$ char '['
            <*> parseIdent
            <* char ']'
            <* newline
            <*> (catMaybes <$> many parseLine)
    parseIdent = nonTerminal "identifier" $
        many1 (primitive "alphanum" letterOrDigitP)
    parseLine = nonTerminal "line" $
        parseDecl `orElse` parseComment `orElse` parseEmpty
    parseDecl = nonTerminal "declaration" $ Just <$> (
        (,) <$> parseIdent
            <*  spaces
            <*  char '='
            <*  spaces
            <*> remainder)
    parseComment = nonTerminal "comment" $
        Nothing <$ char '#' <* remainder
    remainder = nonTerminal "line-remainder" $
        many1 (primitive "non-newline" (anyCharButP '\n')) <* newline
    parseEmpty = Nothing <$ newline
    spaces = nonTerminal "spaces" $ many (char ' ')

*Main> putStr $ ppGrammar "ini" descrINI
identifier = alphanum, {alphanum};
spaces = {' '};
line-remainder = non-newline, {non-newline}, newline;
declaration = identifier, spaces, '=', spaces, line-remainder;
comment = '#', line-remainder;
line = declaration | comment | newline;
section = '[', identifier, ']', newline, {line};
ini = section, {section};

Recursion (variant 1)

What if we want to write a parser/grammar-generator that is able to generate the following grammar, which describes terms that are additions and multiplications of natural numbers:

const = digit, {digit};
spaces = {' ' | newline};
atom = const | '(', spaces, expr, spaces, ')', spaces;
mult = atom, {spaces, '*', spaces, atom}, spaces;
plus = mult, {spaces, '+', spaces, mult}, spaces;
expr = plus;

The production of expr is recursive (via plus, mult, atom). We have seen above that simply defining a Grammar a recursively does not go well.

One solution is to add a new combinator for explicit recursion, which replaces nonTerminal as the primitive method of the class:

class Applicative f => Descr f where
    …
    recNonTerminal :: String -> (f a -> f a) -> f a

instance Descr Parser where
    …
    recNonTerminal _ p = let r = p r in r

instance Descr Grammar where
    …
    recNonTerminal = recNonTerminalG

recNonTerminalG :: String -> (Grammar a -> Grammar a) -> Grammar a
recNonTerminalG name f =
    let G (prods, rhs) = f (G ([], NonTerminal name))
    in G (prods ++ [(name, rhs)], NonTerminal name)

nonTerminal :: Descr f => String -> f a -> f a
nonTerminal name p = recNonTerminal name (const p)

runGrammer :: String -> Grammar a -> BNF
runGrammer main (G (prods, NonTerminal nt)) | main == nt = prods
runGrammer main (G (prods, rhs)) = prods ++ [(main, rhs)]

The change in runGrammer avoids adding a pointless expr = expr production to the output.

This lets us define a parser/grammar-generator for the arithmetic expressions given above:

data Expr = Plus Expr Expr | Mult Expr Expr | Const Integer
    deriving Show

mkPlus :: Expr -> [Expr] -> Expr
mkPlus = foldl Plus

mkMult :: Expr -> [Expr] -> Expr
mkMult = foldl Mult

parseExpr :: Descr f => f Expr
parseExpr = recNonTerminal "expr" $ \ exp -> ePlus exp

ePlus :: Descr f => f Expr -> f Expr
ePlus exp = nonTerminal "plus" $
    mkPlus <$> eMult exp
           <*> many (spaces *> char '+' *> spaces *> eMult exp)
           <* spaces

eMult :: Descr f => f Expr -> f Expr
eMult exp = nonTerminal "mult" $
    mkMult <$> eAtom exp
           <*> many (spaces *> char '*' *> spaces *> eAtom exp)
           <* spaces

eAtom :: Descr f => f Expr -> f Expr
eAtom exp = nonTerminal "atom" $
    aConst `orElse` eParens exp

aConst :: Descr f => f Expr
aConst = nonTerminal "const" $ Const . read <$> many1 digit

eParens :: Descr f => f a -> f a
eParens inner =
    id <$ char '('
       <* spaces
       <*> inner
       <* spaces
       <* char ')'
       <* spaces

And indeed, this works:

*Main> putStr $ ppGrammar "expr" parseExpr const = digit, {digit}; spaces = {' ' | newline}; atom = const | '(', spaces, expr, spaces, ')', spaces; mult = atom, {spaces, '*', spaces, atom}, spaces; plus = mult, {spaces, '+', spaces, mult}, spaces; expr = plus; Recursion (variant 2)

Interestingly, there is another solution to this problem, which avoids introducing recNonTerminal and explicitly passing around the recursive call (i.e. the exp in the example). To implement that we have to adjust our Grammar type as follows:

newtype Grammar a = G ([String] -> (BNF, RHS))

The idea is that the list of strings is those non-terminals that we are currently defining. So in nonTerminal, we check if the non-terminal to be introduced is currently in the process of being defined, and then simply ignore the body. This way, the recursion is stopped automatically:

nonTerminalG :: String -> (Grammar a) -> Grammar a
nonTerminalG name (G g) = G $ \seen ->
    if name `elem` seen
    then ([], NonTerminal name)
    else let (prods, rhs) = g (name : seen)
         in (prods ++ [(name, rhs)], NonTerminal name)
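For example, charG would become something like this under the new representation (my sketch; the post does not spell out the adjusted primitives for this variant):

-- The primitives simply ignore the list of non-terminals being defined:
charG :: Char -> Grammar ()
charG c = G (\_seen -> ([], Terminal [c]))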

After adjusting the other primitives of Grammar (including the Functor and Applicative instances, which now have to thread the list of seen non-terminals through) to type-check again, we observe that this parser/grammar generator for expressions, with genuine recursion, works now:

parseExp :: Descr f => f Expr
parseExp = nonTerminal "expr" $ ePlus

ePlus :: Descr f => f Expr
ePlus = nonTerminal "plus" $
    mkPlus <$> eMult
           <*> many (spaces *> char '+' *> spaces *> eMult)
           <* spaces

eMult :: Descr f => f Expr
eMult = nonTerminal "mult" $
    mkMult <$> eAtom
           <*> many (spaces *> char '*' *> spaces *> eAtom)
           <* spaces

eAtom :: Descr f => f Expr
eAtom = nonTerminal "atom" $
    aConst `orElse` eParens parseExp

Note that the recursion is only going to work if there is at least one call to nonTerminal somewhere around the recursive calls. We still cannot implement many as naively as above.

Homework

If you want to play more with this: The homework is to define a parser/grammar-generator for EBNF itself, as specified in this variant:

identifier = letter, {letter | digit | '-'};
spaces = {' ' | newline};
quoted-char = non-quote-or-backslash | '\\', '\\' | '\\', '\'';
terminal = '\'', {quoted-char}, '\'', spaces;
non-terminal = identifier, spaces;
option = '[', spaces, rhs, spaces, ']', spaces;
repetition = '{', spaces, rhs, spaces, '}', spaces;
group = '(', spaces, rhs, spaces, ')', spaces;
atom = terminal | non-terminal | option | repetition | group;
sequence = atom, {spaces, ',', spaces, atom}, spaces;
choice = sequence, {spaces, '|', spaces, sequence}, spaces;
rhs = choice;
production = identifier, spaces, '=', spaces, rhs, ';', spaces;
bnf = production, {production};

This grammar is set up so that the precedence of , and | is correctly implemented: a , b | c will parse as (a, b) | c.

In this syntax for BNF, terminal characters are quoted, i.e. inside '…', a ' is replaced by \' and a \ is replaced by \\ – this is done by the function quote in ppRHS.

If you do this, you should be able to round-trip with the pretty-printer, i.e. parse back what it wrote:

*Main> let bnf1 = runGrammer "expr" parseExpr
*Main> let bnf2 = runGrammer "expr" parseBNF
*Main> let f = Data.Maybe.fromJust . parse parseBNF . ppBNF
*Main> f bnf1 == bnf1
True
*Main> f bnf2 == bnf2
True

The last line is quite meta: We are using parseBNF as a parser on the pretty-printed grammar produced from interpreting parseBNF as a grammar.

Conclusion

We have again seen an example of the excellent support for abstraction in Haskell: Being able to define so very different things such as a parser and a grammar description with the same code is great. Type classes helped us here.

Note that it was crucial that our combined parser/grammars are only able to use the methods of Applicative, and not Monad. Applicative is less powerful, so by giving less power to the user of our Descr interface, the other side, i.e. the implementation, can be more powerful.

The reason why Applicative is ok, but Monad is not, is that in Applicative, the results do not affect the shape of the computation, whereas in Monad, the whole point of the bind operator (>>=) is that the result of the computation is used to decide the next computation. And while this is perfectly fine for a parser, it just makes no sense for a grammar generator, where there simply are no values around!
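To make that concrete, here is a hypothetical parser (not from the post) that is inherently monadic: a length-prefixed string, where how much we read depends on a previously parsed value. It needs (>>=) and therefore can have no Grammar counterpart:

-- Read one digit, then read that many further characters.
lengthPrefixedP :: Parser String
lengthPrefixedP = do
    c <- anyCharP
    let n = digitToInt c  -- digitToInt is from Data.Char, imported above
    sequenceA (replicate n anyCharP)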

We have also seen that a phantom type, namely the parameter of Grammar, can be useful, as it lets the type system make sure we do not write nonsense. For example, the type of orElseG ensures that both grammars that are combined here indeed describe something of the same type.

  1. It seems to be the week of applicative-appraising blog posts: Brent has posted a nice piece about enumerations using Applicative yesterday.

  2. I like how in this alignment of <*> and <* the > point out where the arguments are that are being passed to the function on the left.

Categories: Offsite Blogs

Brent Yorgey: Adventures in enumerating balanced brackets

Planet Haskell - Mon, 10/24/2016 - 10:42pm

Since I’ve been coaching my school’s ACM ICPC programming team, I’ve been spending a bit of time solving programming contest problems, partly to stay sharp and be able to coach them better, but also just for fun.

I recently solved a problem (using Haskell) that ended up being tougher than I thought, but I learned a lot along the way. Rather than just presenting a solution, I’d like to take you through my thought process, crazy detours and all.

Of course, I should preface this with a big spoiler alert: if you want to try solving the problem yourself, you should stop reading now!

> {-# LANGUAGE GADTs #-}
> {-# LANGUAGE DeriveFunctor #-}
>
> module Brackets where
>
> import Data.List (sort, genericLength)
> import Data.MemoTrie (memo, memo2)
> import Prelude hiding ((++))

The problem

There’s a lot of extra verbiage at the official problem description, but what it boils down to is this:

Find the kth element of the lexicographically ordered sequence of all balanced bracketings of a given length.

There is a longer description at the problem page, but hopefully a few examples will suffice. A balanced bracketing is a string consisting solely of parentheses, in which opening and closing parens can be matched up in a one-to-one, properly nested way. For example, there are five balanced bracketings of length 6:

((())), (()()), (())(), ()(()), ()()()

By lexicographically ordered we just mean that the bracketings should be in “dictionary order” where ( comes before ), that is, bracketing x comes before bracketing y if and only if, in the first position where they differ, x has ( and y has ). As you can verify, the list of length-6 bracketings above is, in fact, lexicographically ordered.

A first try

Oh, this is easy, I thought, especially if we consider the well-known isomorphism between balanced bracketings and binary trees. In particular, the empty string corresponds to a leaf, and (L)R (where L and R are themselves balanced bracketings) corresponds to a node with subtrees L and R. So the five balanced bracketings of length 6 correspond to the five binary trees with three nodes:

We can easily generate all the binary trees of a given size with a simple recursive algorithm. If n = 0, generate a Leaf; otherwise, decide how many of the remaining n - 1 nodes to put on the left and how many on the right, and for each such distribution recursively generate all possible trees on the left and right.

> data Tree where
>   Leaf :: Tree
>   Node :: Tree -> Tree -> Tree
>   deriving (Show, Eq, Ord)
>
> allTrees :: Int -> [Tree]
> allTrees 0 = [Leaf]
> allTrees n =
>   [ Node l r
>   | k <- [0 .. n-1]
>   , l <- allTrees ((n-1) - k)
>   , r <- allTrees k
>   ]

We generate the trees in “left-biased” order, where we first choose to put all n - 1 remaining nodes on the left, then n - 2 on the left and 1 on the right, and so on. Since a subtree on the left will result in another opening paren, but a subtree on the right will result in a closing paren followed by an open paren, it makes intuitive sense that this corresponds to generating bracketings in sorted order. You can see that the size-3 trees above, generated in left-biased order, indeed have their bracketings sorted.

Writing allTrees is easy enough, but it’s definitely not going to cut it: the problem states that we could have up to n = 1000. The number of trees with 1000 nodes has 598 digits (!!), so we can’t possibly generate the entire list and then index into it. Instead we need a function that can more efficiently generate the tree with a given index, without having to generate all the other trees before it.

So I immediately launched into writing such a function, but it’s tricky to get right. It involves computing Catalan numbers, and cumulative sums of products of Catalan numbers, and divMod, and… I never did get that function working properly.

The first epiphany

But I never should have written that function in the first place! What I should have done first was to do some simple tests just to confirm my intuition that left-biased tree order corresponds to sorted bracketing order. Because if I had, I would have found this:

> brackets :: Tree -> String
> brackets Leaf = ""
> brackets (Node l r) = mconcat ["(", brackets l, ")", brackets r]
>
> sorted :: Ord a => [a] -> Bool
> sorted xs = xs == sort xs

ghci> sorted (map brackets (allTrees 3))
True
ghci> sorted (map brackets (allTrees 4))
False

As you can see, my intuition actually led me astray! n = 3 is a small enough case that left-biased order just happens to be the same as sorted bracketing order, but for n = 4 this breaks down. Let’s see what goes wrong:

In the top row are the size-4 trees in “left-biased” order, i.e. the order generated by allTrees. You can see it is nice and symmetric: reflecting the list across a vertical line leaves it unchanged. On the bottom row are the same trees, but sorted lexicographically by their bracketings. You can see that the lists are almost the same except the red tree is in a different place. The issue is the length of the left spine: the red tree has a left spine of three nodes, which means its bracketing will begin with (((, so it should come before any trees with a left spine of length 2, even if they have all their nodes in the left subtree (whereas the red tree has one of its nodes in the right subtree).

My next idea was to try to somehow enumerate trees in order by the length of their left spine. But since I hadn’t even gotten indexing into the original left-biased order to work, it seemed hopeless to get this to work by implementing it directly. I needed some bigger guns.

Building enumerations

At this point I had the good idea to introduce some abstraction. I defined a type of enumerations (a la FEAT or data/enumerate):

> data Enumeration a = Enumeration
>   { fromNat :: Integer -> a
>   , size    :: Integer
>   }
>   deriving Functor
>
> enumerate :: Enumeration a -> [a]
> enumerate (Enumeration f n) = map f [0..n-1]

An Enumeration consists of a size along with a function Integer -> a, which we think of as being defined on [0 .. size-1]. That is, an Enumeration is isomorphic to a finite list of a given length, where instead of explicitly storing the elements, we have a function which can compute the element at a given index on demand. If the enumeration has some nice combinatorial structure, then we expect that this on-demand indexing can be done much more efficiently than simply listing all the elements. The enumerate function simply turns an Enumeration into the corresponding finite list, by mapping the indexing function over all possible indices.

Note that Enumeration has a natural Functor instance, which GHC can automatically derive for us. Namely, if e is an Enumeration, then fmap f e is the Enumeration which first computes the element of e for a given index, and then applies f to it before returning.
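For reference, the instance GHC derives is equivalent to the following hand-written one (shown here only for illustration; the module above relies on the derived version):

instance Functor Enumeration where
  fmap f (Enumeration g n) = Enumeration (f . g) n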

Now, let’s define some combinators for building Enumerations. We expect them to have all the nice algebraic flavor of finite lists, aka free monoids.

First, we can create empty or singleton enumerations, or convert any finite list into an enumeration:

> empty :: Enumeration a
> empty = Enumeration (const undefined) 0
>
> singleton :: a -> Enumeration a
> singleton a = Enumeration (\_ -> a) 1
>
> list :: [a] -> Enumeration a
> list as = Enumeration (\n -> as !! fromIntegral n) (genericLength as)

ghci> enumerate (empty :: Enumeration Int)
[]
ghci> enumerate (singleton 3)
[3]
ghci> enumerate (list [4,6,7])
[4,6,7]

We can form the concatenation of two enumerations. The indexing function compares the given index against the size of the first enumeration, and then indexes into the first or second enumeration appropriately. For convenience we can also define union, which is just an iterated version of (++).

> (++) :: Enumeration a -> Enumeration a -> Enumeration a
> e1 ++ e2 = Enumeration
>   (\n -> if n < size e1 then fromNat e1 n else fromNat e2 (n - size e1))
>   (size e1 + size e2)
>
> union :: [Enumeration a] -> Enumeration a
> union = foldr (++) empty

ghci> enumerate (list [3, 5, 6] ++ empty ++ singleton 8)
[3,5,6,8]

Finally, we can form a Cartesian product: e1 >< e2 is the enumeration of all possible pairs of elements from e1 and e2, ordered so that all the pairs formed from the first element of e1 come first, followed by all the pairs with the second element of e1, and so on. The indexing function divides the given index by the size of e2, and uses the quotient to index into e1, and the remainder to index into e2.

> (><) :: Enumeration a -> Enumeration b -> Enumeration (a,b)
> e1 >< e2 = Enumeration
>   (\n -> let (l,r) = n `divMod` size e2 in (fromNat e1 l, fromNat e2 r))
>   (size e1 * size e2)

ghci> enumerate (list [1,2,3] >< list [10,20])
[(1,10),(1,20),(2,10),(2,20),(3,10),(3,20)]
ghci> let big = list [0..999] >< list [0..999] >< list [0..999] >< list [0..999]
ghci> fromNat big 2973428654
(((2,973),428),654)

Notice in particular how the fourfold product of list [0..999] has 10^12 elements, but indexing into it with fromNat is basically instantaneous.

Since Enumerations are isomorphic to finite lists, we expect them to have Applicative and Monad instances, too. First, the Applicative instance is fairly straightforward:

> instance Applicative Enumeration where
>   pure    = singleton
>   f <*> x = uncurry ($) <$> (f >< x)

ghci> enumerate $ (*) <$> list [1,2,3] <*> list [10, 100]
[10,100,20,200,30,300]

pure creates a singleton enumeration, and applying an enumeration of functions to an enumeration of arguments works by taking a Cartesian product and then applying each pair.

The Monad instance works by substitution: in e >>= k, the continuation k is applied to each element of the enumeration e, and the resulting enumerations are unioned together in order.

> instance Monad Enumeration where
>   return  = pure
>   e >>= f = union (map f (enumerate e))

ghci> enumerate $ list [1,2,3] >>= \i -> list (replicate i i)
[1,2,2,3,3,3]

Having to actually enumerate the elements of e is a bit unsatisfying, but there is really no way around it: we otherwise have no way to know how big the resulting enumerations are going to be.

Now, that function I tried (and failed) to write before that generates the tree at a particular index in left-biased order? Using these enumeration combinators, it’s a piece of cake. Basically, since we built up combinators that mirror those available for lists, it’s just as easy to write this indexing version as it is to write the original allTrees function (which I’ve copied below for comparison):

allTrees :: Int -> [Tree]
allTrees 0 = [Leaf]
allTrees n = [ Node l r
             | k <- [0 .. n-1]
             , l <- allTrees ((n-1) - k)
             , r <- allTrees k
             ]

> enumTrees :: Int -> Enumeration Tree
> enumTrees 0 = singleton Leaf
> enumTrees n = union
>   [ Node <$> enumTrees (n-k-1) <*> enumTrees k
>   | k <- [0 .. n-1]
>   ]

(enumTrees and allTrees look a bit different, but actually allTrees can be rewritten in a very similar style:

allTrees :: Int -> [Tree]
allTrees 0 = [Leaf]
allTrees n = concat
  [ Node <$> allTrees ((n-1) - k) <*> allTrees k
  | k <- [0 .. n-1]
  ]

Doing as much as possible using the Applicative interface gives us added “parallelism”, which in this case means the ability to index directly into a product with divMod, rather than scanning through the results of calling a function on enumerate until we have accumulated the right size. See the paper on the GHC ApplicativeDo extension.)

Let’s try it out:

ghci> enumerate (enumTrees 3)
[Node (Node (Node Leaf Leaf) Leaf) Leaf,Node (Node Leaf (Node Leaf Leaf)) Leaf,Node (Node Leaf Leaf) (Node Leaf Leaf),Node Leaf (Node (Node Leaf Leaf) Leaf),Node Leaf (Node Leaf (Node Leaf Leaf))]
ghci> enumerate (enumTrees 3) == allTrees 3
True
ghci> enumerate (enumTrees 7) == allTrees 7
True
ghci> brackets $ fromNat (enumTrees 7) 43
"((((()())))())"

It seems to work! Though actually, if we try larger values of n, enumTrees just seems to hang. The problem is that it ends up making many redundant recursive calls. Well… nothing a bit of memoization can’t fix! (Here I’m using Conal Elliott’s nice MemoTrie package.)

> enumTreesMemo :: Int -> Enumeration Tree
> enumTreesMemo = memo enumTreesMemo'
>   where
>     enumTreesMemo' 0 = singleton Leaf
>     enumTreesMemo' n = union
>       [ Node <$> enumTreesMemo (n-k-1) <*> enumTreesMemo k
>       | k <- [0 .. n-1]
>       ]

ghci> size (enumTreesMemo 10)
16796
ghci> size (enumTreesMemo 100)
896519947090131496687170070074100632420837521538745909320
ghci> size (enumTreesMemo 1000)
2046105521468021692642519982997827217179245642339057975844538099572176010191891863964968026156453752449015750569428595097318163634370154637380666882886375203359653243390929717431080443509007504772912973142253209352126946839844796747697638537600100637918819326569730982083021538057087711176285777909275869648636874856805956580057673173655666887003493944650164153396910927037406301799052584663611016897272893305532116292143271037140718751625839812072682464343153792956281748582435751481498598087586998603921577523657477775758899987954012641033870640665444651660246024318184109046864244732001962029120
ghci> brackets $ fromNat (enumTreesMemo 1000) 8234587623904872309875907638475639485792863458726398487590287348957628934765
"((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((()(((()((((()))())(()()()))()(())(())((()((()))(((())()(((((()(((()()))(((()((((()()(())()())(((()))))(((()()()(()()))))(((()((()))(((()())())))())(()()(())(())()(()())))()))((()()))()))()))()(((()))(()))))))())()()()))((())((()))((((())(())))((())))))()))()(())))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))"

That’s better!

A second try

At this point, I thought that I needed to enumerate trees in order by the length of their left spine. Given a tree of size n with a left spine of length s, we enumerate all the ways to partition the remaining n - s elements among the right children of the spine nodes, preferring to first put elements as far to the left as possible. As you’ll see, this turns out to be wrong, but it’s fun to see how easy it is to write this using the enumeration framework.

First, we need an enumeration of the partitions of a given n into exactly k parts, in lexicographic order.

> kPartitions :: Int -> Int -> Enumeration [Int]

There is exactly one way to partition 0 into zero parts.

> kPartitions 0 0 = singleton []

We can’t partition anything other than 0 into zero parts.

> kPartitions _ 0 = empty

Otherwise, pick a number i from n down to 0 to go in the first spot, and then recursively enumerate partitions of n - i into exactly k - 1 parts.

> kPartitions n k = do
>   i <- list [n, n-1 .. 0]
>   (i:) <$> kPartitions (n-i) (k-1)

Let’s try it:

ghci> let p43 = enumerate $ kPartitions 4 3
ghci> p43
[[4,0,0],[3,1,0],[3,0,1],[2,2,0],[2,1,1],[2,0,2],[1,3,0],[1,2,1],[1,1,2],[1,0,3],[0,4,0],[0,3,1],[0,2,2],[0,1,3],[0,0,4]]
ghci> all ((==3) . length) p43
True
ghci> all ((==4) . sum) p43
True
ghci> sorted (reverse p43)
True

Now we can use kPartitions to build our enumeration of trees:

> spinyTrees :: Int -> Enumeration Tree
> spinyTrees = memo spinyTrees'
>   where
>     spinyTrees' 0 = singleton Leaf
>     spinyTrees' n = do
>
>       -- Pick the length of the left spine
>       spineLen <- list [n, n-1 .. 1]
>
>       -- Partition the remaining elements among the spine nodes
>       bushSizes <- kPartitions (n - spineLen) spineLen
>       bushes <- traverse spinyTrees bushSizes
>       return $ buildSpine (reverse bushes)
>
> buildSpine :: [Tree] -> Tree
> buildSpine []     = Leaf
> buildSpine (b:bs) = Node (buildSpine bs) b

This appears to give us something reasonable:

ghci> size (spinyTrees 7) == size (enumTreesMemo 7)
True

But it’s pretty slow—which is to be expected with all those monadic operations required. And there’s more:

ghci> sorted . map brackets . enumerate $ spinyTrees 3
True
ghci> sorted . map brackets . enumerate $ spinyTrees 4
True
ghci> sorted . map brackets . enumerate $ spinyTrees 5
False

Foiled again! All we did was stave off failure a bit, until n = 5. I won’t draw all the trees of size 5 for you, but the failure mode is pretty similar: picking subtrees for the spine based just on how many elements they have doesn’t work, because there are cases where we want to first shift some elements to a later subtree, keeping the left spine of a subtree, before moving the elements back and having a shorter left spine.

The solution: just forget about trees, already

It finally occurred to me that there was nothing in the problem statement that said anything about trees. That was just something my overexcited combinatorial brain imposed on it: obviously, since there is a bijection between balanced bracketings and binary trees, we should think about binary trees, right? …well, there is also a bijection between balanced bracketings and permutations avoiding (231), and lattice paths that stay above the main diagonal, and hundreds of other things, so… not necessarily.

In this case, I think trees just end up making things harder. Let’s think instead about enumerating balanced bracket sequences directly. To do it recursively, we need to know how to enumerate possible endings to the start of any balanced bracket sequence. That is, we need to enumerate sequences containing n opening brackets and c extra closing brackets (so n + c closing brackets in total), which can be appended to a sequence of brackets with c more opening brackets than closing brackets.

Given this idea, the code is fairly straightforward:

> enumBrackets :: Int -> Enumeration String
> enumBrackets n = enumBracketsTail n 0
>
> enumBracketsTail :: Int -> Int -> Enumeration String
> enumBracketsTail = memo2 enumBracketsTail'
>   where

To enumerate a sequence with no opening brackets, just generate c closing brackets.

>     enumBracketsTail' 0 c = singleton (replicate c ')')

To enumerate balanced sequences with n opening brackets and an exactly matching number of closing brackets, start by generating an opening bracket and then continue by generating sequences with n-1 opening brackets and one extra closing bracket to match the opening bracket we started with.

>     enumBracketsTail' n 0 = ('(':) <$> enumBracketsTail (n-1) 1

In general, a sequence with n opening and c extra closing brackets is either an opening bracket followed by an (n-1, c+1)-sequence, or a closing bracket followed by an (n, c-1)-sequence.

>     enumBracketsTail' n c =
>         (('(':) <$> enumBracketsTail (n-1) (c+1))
>         ++
>         ((')':) <$> enumBracketsTail n (c-1))

This is quite fast, and as a quick check, it does indeed seem to give us the same size enumerations as the other tree enumerations:

ghci> fromNat (enumBrackets 40) 16221270422764920820
"((((((((()((())()(()()()())(()))((()()()()(()((()())))((()())))))))()))()())()))"
ghci> size (enumBrackets 100) == size (enumTreesMemo 100)
True

But, are they sorted? It would seem so!

ghci> all sorted (map (enumerate . enumBrackets) [1..10])
True

At this point, you might notice that this can be easily de-abstracted into a fairly simple dynamic programming solution, using a 2D array to keep track of the size of the enumeration for each (n,c) pair. I’ll leave the details to interested readers.
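For instance, here is one way that de-abstraction might look. This is my own sketch, not code from the post: it tabulates the size of each (n,c) enumeration in a 2D array, and then walks the table to index directly into the same order that enumBrackets generates.

import Data.Array

-- table ! (n, c) should agree with size (enumBracketsTail n c), for n + c <= m.
bracketCounts :: Int -> Array (Int, Int) Integer
bracketCounts m = table
  where
    table = array ((0, 0), (m, m))
              [ ((n, c), count n c) | n <- [0 .. m], c <- [0 .. m] ]
    count n c
      | n + c > m = 0                  -- never needed; keeps the lookups in bounds
    count 0 _ = 1                      -- only choice: emit the c closing brackets
    count n 0 = table ! (n - 1, 1)     -- must start with an opening bracket
    count n c = table ! (n - 1, c + 1) + table ! (n, c - 1)

-- The bracket sequence at a given index, without building any Enumerations.
bracketAt :: Int -> Integer -> String
bracketAt m = go m 0
  where
    table = bracketCounts m
    go 0 c _ = replicate c ')'
    go n 0 i = '(' : go (n - 1) 1 i
    go n c i
      | i < opens = '(' : go (n - 1) (c + 1) i
      | otherwise = ')' : go n (c - 1) (i - opens)
      where
        opens = table ! (n - 1, c + 1)

A quick check: bracketAt 3 0 yields "((()))", matching fromNat (enumBrackets 3) 0.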


Categories: Offsite Blogs

Douglas M. Auclair (geophf): September 2016 1HaskellADay 1Liners Problems and Solutions

Planet Haskell - Sat, 10/22/2016 - 8:00am
  • September 15th, 2016:
    Given [1..n], create an infinite list of lists [[1.. n], [n+1 ... n+n], [n+n+1 ... 3n], ...]
    counting :: [Integer] -> [[Integer]]
    • joomy @cattheory
      counting = (map . (+) . fromIntegral . length) >>= iterate
  • September 30th, 2016: The reverse of August's one-liner:
    f :: (Maybe a, b) -> Maybe (a,b)
    define f. Snaps for elegance.
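    One way to define it (my own sketch, not one of the submitted answers): push the Maybe outward by mapping the pair constructor over it.

    f :: (Maybe a, b) -> Maybe (a, b)
    f (ma, b) = fmap (\a -> (a, b)) ma

    Here fmap does all the work, and f (Nothing, b) = Nothing falls out for free.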
Categories: Offsite Blogs

Edwin Brady: State Machines All The Way Down

Planet Haskell - Fri, 10/21/2016 - 5:48pm

A new draft paper, State Machines All The Way Down, which describes an architecture for dependently typed functional programs. Abstract:

A useful pattern in dependently typed programming is to define a state transition system, for example the states and operations in a network protocol, as a parameterised monad. We index each operation by its input and output states, thus guaranteeing that operations satisfy pre- and post-conditions, by typechecking. However, what if we want to write a program using several systems at once? What if we want to define a high level state transition system, such as a network application protocol, in terms of lower level states, such as network sockets and mutable variables? In this paper, I present an architecture for dependently typed applications based on a hierarchy of state transition systems, implemented as a library called states. Using states, I show: how to implement a state transition system as a dependent type, with type level guarantees on its operations; how to account for operations which could fail; how to combine state transition systems into a larger system; and, how to implement larger systems as a hierarchy of state transition systems. As an example, I implement a simple high level network application protocol.

Comments welcome! You can get the draft here.


Categories: Offsite Blogs

Roman Cheplyaka: Mean-variance ceiling

Planet Haskell - Thu, 10/20/2016 - 2:00pm

Today I was playing with the count data from a small RNA-Seq experiment performed in Arabidopsis thaliana.

At some point, I decided to look at the mean-variance relationship for the fragment counts. As I said, the dataset is small; there are only 3 replicates per condition from which to estimate the variance. Moreover, each sample is from a different batch. I wasn’t expecting to see much.

But there was a pattern in the mean-variance plot that was impossible to miss.

[Figure: Mean-variance plot of counts per million, log-log scale]

It is a nice straight line that many points lie on, but none dare to cross. A ceiling.

The ceiling looked mysterious at first, but then I found a simple explanation. The sample variance of \(n\) numbers \(a_1,\ldots,a_n\) can be written as

\[\sigma^2=\frac{n}{n-1}\left(\frac1n\sum_{i=1}^n a_i^2-\mu^2\right),\]

where \(\mu\) is the sample mean. Thus,

\[\frac{\sigma^2}{\mu^2}=\frac{\sum a_i^2}{(n-1)\mu^2}-\frac{n}{n-1}.\]

For non-negative numbers, \(n^2\mu^2=(\sum a_i)^2\geq \sum a_i^2\), and

\[\frac{\sigma^2}{\mu^2}\leq\frac{n^2}{n-1}-\frac{n}{n-1}=n.\]

This means that on a log-log plot, all points \((\mu,\sigma^2)\) lie on or below the line \(y=2x+\log n\).

Moreover, the points that lie exactly on the line correspond to the samples where all \(a_i\) but one are zero. In other words, those are gene-condition combinations where the gene’s transcripts were registered in a single replicate for that condition.
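As a quick numerical sanity check (my own snippet, not part of the original analysis), we can compute the ratio \(\sigma^2/\mu^2\) directly and watch it hit the ceiling exactly when all but one value are zero:

-- Ratio of sample variance to squared sample mean, for n numbers.
ratio :: [Double] -> Double
ratio as = var / (mu * mu)
  where
    n   = fromIntegral (length as)
    mu  = sum as / n
    var = n / (n - 1) * (sum (map (^ 2) as) / n - mu * mu)

-- ratio [7, 0, 0] == 3.0   -- on the ceiling (n = 3)
-- ratio [5, 1, 0] == 1.75  -- strictly below it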

Categories: Offsite Blogs

Roman Cheplyaka: The rule of 17 in volleyball

Planet Haskell - Wed, 10/19/2016 - 2:00pm

Scott Adams, the author of Dilbert, writes in his book “How to Fail at Almost Everything and Still Win Big”:

Recently I noticed that the high-school volleyball games I attended in my role as stepdad were almost always won by the team that reached seventeen first, even though the winning score is twenty-five and you have to win by two.

It’s common for the lead to change often during a volleyball match, and the team that first reaches seventeen might fall behind a few more times before winning, which makes the pattern extra strange.

Good observation, Scott! But why could it be so?

Scott offers two possible explanations. One is psychological: the leading team has a higher morale while the losing team feels defeated. The other is that perhaps the coach of the losing team sees this as an opportunity to let his bench players on the court.

While these reasons sound plausible to me, there is a simpler logical explanation. It would hold even if the players and coaches were robots.

Imagine that you enter a gym where a game is being played. You see the current score: 15:17. If you know nothing else about the teams except their current score, which one do you think is more likely to win the set?

There are two reasons to think it is the leading team:

  1. The score by itself doesn’t offer much evidence that the leading team is stronger or in a better shape. However, if one of the teams is stronger, it is more likely to be the leading team.
  2. Even without assuming anything about how good the teams are, the leading team at this moment is up for an easier task: it needs only 8 points to win, whereas the team behind needs 10 points.

To quantify the reliability of Scott Adams’s “rule of 17”, I wrote a simple simulation in R:

sim.one <- function(prob, threshold) {
  score <- c(0,0)
  leader <- NA
  serving <- 1
  while (all(score < 25) || abs(diff(score)) < 2) {
    winner <- if (as.logical(rbinom(1,1,prob[[serving]]))) serving else 3 - serving
    score[[winner]] <- score[[winner]] + 1
    serving <- winner
    if (is.na(leader) && any(score == threshold)) {
      leader <- which.max(score)
    }
  }
  return(c(leader, which.max(score)))
}

Here prob is a 2-dimensional vector \((p_1,p_2)\), where \(p_i\) is the probability of team \(i\) to win their serve against the opposing team. The function simulates a single set and returns two numbers: which team first scored threshold (e.g. 17) points and which team eventually won. If the two numbers are equal, the rule worked in this game.

Then I simulated a game 1000 times for each of many combinations of \(p_1\) and \(p_2\) and calculated the fraction of the games where the rule worked. Here’s the result:


When \(p_1=p_2\), the reliability of the rule is independent of the values of \(p_1\) and \(p_2\) (within the tested limits of \(0.3\) and \(0.7\)) and equals approximately \(81\%\). This is entirely due to reason 2: all else being equal, the leading team has a head start.

When teams are unequal, reason 1 kicks in, and for large inequalities, the reliability of the rule approaches \(1\). For instance, when \(p_1=0.3\) and \(p_2=0.7\), the rule works about \(99\%\) of the time.

Is there anything magical about the number 17? No, we would expect the rule to work for any threshold at least to some extent. The reliability would grow from somewhere around \(50\%\) for the threshold of \(1\) to almost \(100\%\) for the threshold of \(25\).

And indeed, this is what we observe (for \(p_1=p_2\)):


This reminds me of men’s gold medal match at the 2012 London Olympics, where Russia played against Brazil. Russia loses the first two sets. A game lasts until one of the teams wins 3 sets in total, so Russia cannot afford to lose a single set now. In the third set, Brazil continues to lead, reaching 17 (and then 18) points while Russia has 15. Several minutes later, Brazil leads 22:19.

And then, against all odds, the Russian team wins that set 29:27, then the two following sets, and gets the gold.

[Figure: Dmitriy Muserskiy is about to score the gold medal point]
Categories: Offsite Blogs

Philip Wadler: Papers We Love Remote Meetup: John Reynolds, Definitional Interpreters for Higher-Order Languages

Planet Haskell - Tue, 10/18/2016 - 8:34am

I will reprise my June presentation to Papers We Love London at Papers We Love Remote Meetup 2, today at 7pm UK time, with the subject John Reynolds, Definitional Interpreters for Higher-Order Languages. Learn the origins of denotational semantics and continuations. Additional citations here. See you there!

Categories: Offsite Blogs

Ken T Takusagawa: [uitadwod] Stackage

Planet Haskell - Sun, 10/16/2016 - 11:33pm

Stackage for Haskell packages has the curious behavior that packages can disappear from it even though they were perfectly fine.  The cause of such a disappearance of say a package B is as follows: package B was originally pulled in as a dependency of another package A, and the maintainer of package A quit, so package A and all its dependencies, including package B, are candidates to be removed from Stackage.  Package B survives only if it has a direct maintainer in Stackage or is a dependency of another maintained package.

Inspired by the many packages that got dropped when lambdabot got removed from Stackage nightly, e.g., brainfuck.

Although the stated goal of Stackage is a curated collection of Haskell packages, each with an explicit maintainer willing to fix bugs and compilation problems (e.g., with new versions of GHC), I have found that a side feature is more useful: the identification of a large mutually compatible collection of packages without version dependency problems.  Such a side feature -- such a collection -- could be computed automatically without having to have a direct or indirect maintainer for each package in the collection.  I wish such a larger collection existed.

Start with, say, Stackage Nightly and expand it to include every package in Hackage that compiles cleanly and is compatible with Stackage Nightly and with every other package in the expanded collection.  There may be tricky cases of mutually incompatible packages in a potential expanded set which will need to be resolved, e.g., the newest version of A requires an old version of B, and the newest version of B requires an old version of A.  Perhaps resolve such conflicts in favor of the choice which causes the expanded set to be as large as possible.
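In code, the expansion could be a simple fixpoint computation. Here is a rough sketch of mine (entirely hypothetical; `compatible` stands in for whatever solver and build logic decides that a candidate compiles cleanly against the current set):

import Data.Set (Set)
import qualified Data.Set as Set

type Package = String

-- Grow a package set until no further candidates can be added.
expand :: (Set Package -> Package -> Bool)  -- does this candidate fit the set?
       -> Set Package                        -- e.g. Stackage Nightly
       -> [Package]                          -- all of Hackage
       -> Set Package
expand compatible base hackage = go base
  where
    go current =
      let additions = [ p | p <- hackage
                          , not (p `Set.member` current)
                          , compatible current p ]
      in if null additions
           then current
           else go (foldr Set.insert current additions)

Note that adding all candidates in one round ignores mutual conflicts among the additions themselves; resolving those, as the paragraph above says, is the genuinely tricky part.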

Tangentially, how can one safely build a package (to test whether it compiles cleanly) if one is not sure whether a package's build script is evil?  Probably some kind of operating system container or sandbox.  Identify packages which use simple, presumably safe, build mechanisms, probably pure Haskell, versus packages which do something unusual, e.g., call a Makefile, which ought to be scrutinized before building.  (Inspired by a build script of software, I think maxima computer algebra, which creepily attempted to send email back to the author every time it was compiled.)

Can compiling a carefully crafted source file with GHC allow the author of the source file to perform arbitrary user-level actions within the operating system?
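For what it's worth, the answer is yes whenever Template Haskell is enabled: a splice can run arbitrary IO at compile time. A minimal illustration (the file path is of course made up):

{-# LANGUAGE TemplateHaskell #-}
module Evil where

import Language.Haskell.TH

-- This IO action runs when the module is *compiled*, not when it is executed.
evil :: ()
evil = $(runIO (writeFile "/tmp/compiled-at.txt" "GHC was here") >> [| () |])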

Categories: Offsite Blogs

Tweag I/O: A new ecosystem for Haskell: the JVM

Planet Haskell - Sun, 10/16/2016 - 6:00pm
Mathieu Boespflug, Alp Mestanogullari   |   17 October 2016

By now, Haskell has first class support for seamlessly embedding foreign code into source files and casually calling anything in C (via inline-c) or R (via inline-r), not to mention that whole programs can also be compiled down to in-browser JavaScript, thanks to GHCJS. Today the interoperability story for Haskell is getting better still: we’re announcing the addition of a new set of languages into the mix. With jvm, you can call any method known to the JVM from Haskell. With inline-java, you can moreover call these methods in Java syntax, embedded in your source files. Not that this was particularly our intention - promise! inline-java and friends just fell out naturally from our work on sparkle.

To give you a taste of what it’s like to program this way, here’s an obligatory “Hello World!” in Haskell, but with a twist: we call the Swing GUI toolkit to display our message to the world in a graphical dialog box.

A Swing GUI application in Haskell

{-# LANGUAGE DataKinds #-}
{-# LANGUAGE QuasiQuotes #-}
{-# LANGUAGE ScopedTypeVariables #-}
module Main where

import Data.Int
import Language.Java
import Language.Java.Inline

main :: IO Int32
main = withJVM [] $ do
    message <- reflect "Hello World!"
    [java| {
      javax.swing.JOptionPane.showMessageDialog(null, $message);
      return 0;
      } |]

In short, it’s now possible to write Java programs that call into Haskell with near trivial overhead, as we demonstrated previously with sparkle, or indeed Haskell programs that call into any of the hundreds of thousands of JVM packages available publicly and boundless custom enterprise developments.

How it works

The key enabler to talking to Java and all the other JVM languages from Haskell is that Haskell speaks C and the JVM speaks C. That is, both Haskell and the JVM make it possible to call C functions and have C functions call them. In both cases, this is done as part of their respective foreign function interfaces (FFI). So we have a lingua franca for both languages: to move from Haskell to Java or vice versa, go through C first. Each language has their own custom calling convention anyways, so some small amount of glue code to mediate between the two is inevitable.

In fact, in the case of the JVM, bytecode is compiled “just-in-time” or perhaps even not at all. Fortunately, that’s not something we have to worry about: the JVM’s standard interface from C, called the Java Native Interface (JNI), encapsulates all the nitty-gritty detail of invoking methods behind a simple interface. As a first step, we wrote near complete bindings to all of the JNI, using inline-c under the hood for better safety.

Calling into the JVM, the hard way

We could just expose the raw JNI API in Haskell and call it a day. Using raw JNI calls to invoke say a static Java method called foo, which takes an int and a boolean and returns some object, goes something like this:

import Foreign.JNI

callFoo = do
    klass <- findClass "some/Java/Class" -- JNI uses '/' instead of '.'...
    method <- getStaticMethodID klass "foo" "(IZ)Ljava/lang/Object;"
    callStaticObjectMethod klass method [JInt 0, JBoolean 1]

Because the JVM allows overloaded method names, when grabbing a handle to invoke a method, you’ll need to specify a type signature to disambiguate which method you really want to call. But the JNI was purposefully designed independently of Java’s syntax, to the point where even class names are written differently. The JNI syntax for type signatures is optimized for speed of parsing and compactness, not legibility. So constructing these type signatures by hand to invoke JVM methods via raw JNI calls is rather error prone. That’s why we wrote the jvm package, a toolkit for invoking JVM methods more conveniently and robustly.

Using Haskell types for safer JVM calls

There are two downsides to the raw JNI calls we saw above:

  • performance: getting class and method handles is expensive. Ideally, we’d only ever lookup classes and methods by name at most once throughout the lifetime of the program, assuming that loaded classes exist for all time and are never redefined.
  • stringly typing: we pass signatures explicitly, but these are literally strings, typos and all. If you mistype the signature, no compiler will call that out. Ideally ill-formed signatures would be caught at compile-time, rather than at runtime when it’s far too late and your program will simply crash.

The performance issue is easily dispensed with. The trick is to write wrappers that tell Haskell that findClass and getStaticMethodID are really pure, in the sense that calling either of them multiple times and in any context always yields equivalent results. So we could in principle ascribe pure types to them. The argument goes something like the following. Compare the following snippet with the one above:

callFoo = do
    let pureStuff@(klass, method) = unsafePerformIO $ do
          (,) <$> findClass "some/Java/Class"
              <*> getStaticMethodID klass "foo" "(IZ)Ljava/lang/Object;"
    callStaticObjectMethod klass method [JInt 0, JBoolean 1]

The expression for pureStuff is a closed expression (no free variables occur). And because its type is not IO, the compiler is free to float it to top-level, effectively turning it into a CAF, which are always evaluated at most once thanks to laziness:

(klass, method) = unsafePerformIO $ do
    (,) <$> findClass "some/Java/Class"
        <*> getStaticMethodID klass "foo" "(IZ)Ljava/lang/Object;"

callFoo = do
    callStaticObjectMethod klass method [JInt 0, JBoolean 1]

As for the stringly typing problem, we’ll need some tools first. First, we need to reflect in Haskell enough type information. To that end, we’ll index the type of Java objects by their Java type:

newtype J (a :: JType) = J (Ptr (J a))

Java types can either be primitives (int,boolean, etc) or reference types (classes, arrays, interfaces, generics etc). So our definition of JType goes something like this:

data JType = Prim Symbol | Class Symbol | Array JType | ...

genSingletons ['JType]

Thus equipped, we can write types like,

  • the type of Swing option panes, J ('Class "javax.swing.JOptionPane")
  • the type of boxed Java integers, J ('Class "java.lang.Integer"),
  • the type of primitive integer arrays, J ('Array ('Prim "int")),
  • etc.

What’s more, thanks to the family of singleton types and instances created by genSingletons above, we can reflect on the type of any Java object at runtime to get a representation of the type at the value level. This is helpful to auto compute JNI type signatures from the types alone. No more stringly typing will all those typos in tow: JNI type signatures are now correct by construction.
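To give a flavour of what that signature computation might look like, here is an illustrative value-level sketch of my own (not the library's actual code; the real thing works against the type-level JType via its singletons):

-- A value-level analogue of the promoted JType, with String in place of Symbol.
data JTy = PrimTy String | ClassTy String | ArrayTy JTy

-- Render the JNI signature of a type, e.g.
--   ClassTy "java.lang.Object"  ==>  "Ljava/lang/Object;"
--   ArrayTy (PrimTy "int")      ==>  "[I"
jniSignature :: JTy -> String
jniSignature (PrimTy "int")     = "I"
jniSignature (PrimTy "boolean") = "Z"
jniSignature (PrimTy p)         = error ("not handled in this sketch: " ++ p)
jniSignature (ClassTy name)     = "L" ++ map slash name ++ ";"
  where
    slash '.' = '/'
    slash c   = c
jniSignature (ArrayTy ty)       = "[" ++ jniSignature ty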

In particular, we can define a family of variants of callStaticObjectMethod:

module Language.Java where

callStatic1
  :: (SingI ty1, SingI tyr)
  => Sing (klass :: Symbol) -> JNI.String -> J ty1 -> IO (J tyr)

callStatic2
  :: (SingI ty1, SingI ty2, SingI tyr)
  => Sing (klass :: Symbol) -> JNI.String -> J ty1 -> J ty2 -> IO (J tyr)

callStatic3
  :: (SingI ty1, SingI ty2, SingI ty3, SingI tyr)
  => Sing (klass :: Symbol) -> JNI.String -> J ty1 -> J ty2 -> J ty3 -> IO (J tyr)

...

The types of these functions are expressive enough to infer a type signature for the called method. Thanks to the type reflection provided by the singletons package, we can reify types as values and produce JNI type signatures from that. Of course, a fixed number of callStatic* functions, one per arity, is rather limiting (what about arbitrary arities?), so in reality the Language.Java module provides a single such function, to whom arguments are passed packed into a homogeneous list:

callStatic
  :: SingI tyr
  => Sing (klass :: Symbol) -> JNI.String -> [JValue] -> IO (J tyr)

where JValue is defined as

data JValue
  = forall a. SingI a => JObject (J a)
  | JBoolean Word8
  | JInt Int32
  | ...

In this way, values of primitive type can be passed to Java without penalty: no need to box them into tiny objects first. It turns out we can extend the same idea to obtain unboxed return values, but the technical details get a bit more intricate, so we’ll have to defer that to the module’s documentation.

Calling a non-static method is achieved in much the same way:

call :: (SingI ty, SingI tyr)
     => J ty -> JNI.String -> [JValue] -> IO (J tyr)

JVM calls the Java way

call and callStatic are surprisingly versatile facilities for calling arbitrary JVM methods with an arbitrary number of boxed or unboxed arguments and return values, but sometimes one might still get the types wrong. For example, there’s nothing stopping us from attempting to call a java.lang.Integer constructor with a boolean typed argument. No such constructor exists, so we’ll get a method lookup exception at runtime. After all, we don’t know in Haskell what methods really do exist, and what their signatures are. But if we call the java.lang.Integer constructor using Java syntax, we can hope to get the Java compiler to perform full scope checking and type checking, thus ruling out common errors such as calling non-existent methods or supplying arguments of the wrong type.

To achieve that, we use GHC’s quasiquotation extension. This extension allows us to embed syntax from arbitrary foreign languages in Haskell source files, in between special brackets. Better yet, we are free to extend the foreign syntax to express antiquotation variables, i.e. variables that refer to the enclosing context in Haskell. Take for example our earlier “hello world” code snippet, simplified:

do message <- reflect "Hello World!"
   [java| javax.swing.JOptionPane.showMessageDialog(null, $message) |]

Using reflect, also provided by inline-java, we create a J "java.lang.String" from a Haskell String. We can then refer to this Java object, bound to a Haskell variable, from inside the Java code snippet. The $ sigil is there to disambiguate between variables bound in the Haskell context (aka antiquotation) and in the Java context.

You might have noticed a difference with inline-c: in inline-java we don’t need to annotate quasiquotations with the return type nor each antiquote with their types, which can get quite verbose. Instead, we just about manage to infer which foreign types are intended based on the types of the Haskell variables. To pull this off required a journey in compiler hacking, ghc-heap-view and a novel use of static pointers. A journey best told next time…

The road ahead

There are plenty of solutions out there for lightweight interop across languages. You can start by swapping JSON messages between separate processes and take it from there. But for a truly universal solution fit for all situations, our experience is that keeping any overheads low or perhaps even nonexistent is the key enabler to seamlessly mixing multiple languages and blithely crossing language boundaries without guilt. In this post we presented a suite of packages for high-speed Java/Haskell interop, which together ensure:

  • box-free foreign calls: because we infer precise JVM types from Haskell types, arguments passed to JVM methods are boxed only if they need to be. Small values of primitive type can be passed to/from the JVM with no allocation at all on the heap.
  • marshalling-free argument passing: Java objects can be manipulated as easily from Haskell as from Java. This means that you can stick to representing all your data as Java objects if you find yourself calling into Java very frequently, hence avoiding any marshalling costs when transferring control to/from the JVM.
  • type safe Java calls: when calls are made in Java syntax, this syntax is supplied to an embedded instance of javac at compile-time for scope checking and type checking. That way we have a static guarantee that the types on the Haskell side match up with the types on the Java side, without having to resort to FFI stub generators and preprocessors.

We were fortunate enough to be able to stand on excellent libraries to get here. Take parsing of Java syntax: that came straight from Niklas Broberg and Vincent Hanquez’s venerable language-java library.

What we haven’t addressed yet with inline-java is the perennial issue when interoperating two garbage collected languages of automatic memory management. Since we have two heaps (the GHC heap and the JVM heap), with two garbage collectors, neither of which able to traverse objects in the other heap, we are forced to pin in memory objects shared across the language boundary. In the case of JVM objects, the JNI does this for us implicitly, provided object references are kept thread-local. It would be nice if we could make these object references safe across threads and get both garbage collectors to agree to dispose of them safely when dead. You can get a fair amount of mileage the way things are: we managed to run topic analysis on all of Wikipedia concurrently on 16 machines and hours of machine time without tinkering with object lifetimes and GC’s.

So plenty more to do still! Make sure to check out the project’s GitHub repository to follow progress and contribute.

Copyright 2015-2016 Tweag I/O.
Categories: Offsite Blogs

Dan Piponi (sigfpe): Expectation-Maximization with Less Arbitrariness

Planet Haskell - Sun, 10/16/2016 - 5:04pm
Introduction

There are many introductions to the Expectation-Maximisation algorithm. Unfortunately every one I could find uses arbitrary seeming tricks that seem to be plucked out of a hat by magic. They can all be justified in retrospect, but I find it more useful to learn from reusable techniques that you can apply to further problems. Examples of tricks I've seen used are:

  1. Using Jensen's inequality. It's easy to find inequalities that apply in any situation. But there are often many ways to apply them. Why apply it to this way of writing this expression and not that one which is equal?
  2. Substituting \(1=\frac{q(z)}{q(z)}\) in the middle of an expression. Again, you can multiply by \(1\) just about anywhere. Why choose this \(q\) at this time? Similarly I found derivations that insert a conveniently chosen factor into an expression.
  3. Majorisation-Minimisation. This is a great technique, but involves choosing a function that majorises another. There are so many ways to do this, it's hard to imagine any general purpose method that tells you how to narrow down the choice.
My goal is to fill in the details of one key step in the derivation of the EM algorithm in a way that makes it inevitable rather than arbitrary. There's nothing original here, I'm merely expanding on a stackexchange answer.


Generalities about EM

The EM algorithm seeks to construct a maximum likelihood estimator (MLE) with a twist: there are some variables in the system that we can't observe.


First assume no hidden variables. We assume there is a vector of parameters \(\theta\) that defines some model. We make some observations \(x\). We have a probability density \(p(x|\theta)\) that depends on \(\theta\). The likelihood of \(\theta\) given the observations is \(L(\theta)=p(x|\theta)\). The maximum likelihood estimator for \(\theta\) is the choice of \(\theta\) that maximises \(L(\theta)\) for the \(x\) we have observed.


Now suppose there are also some variables \(z\) that we didn't get to observe. We assume a density \(p(x,z|\theta)\). We now have

\[L(\theta)=\sum_z p(x,z|\theta),\]

where we sum over all possible values of \(z\). The MLE approach says we now need to maximise

\[\sum_z p(x,z|\theta).\]

One of the things that is a challenge here is that the components of \(\theta\) might be mixed up among the terms in the sum. If, instead, each term only referred to its own unique block of \(\theta\), then the maximisation would be easier as we could maximise each term independently of the others. Here's how we might move in that direction. Consider instead the log-likelihood

\[\log\sum_z p(x,z|\theta).\]

Now imagine that by magic we could commute the logarithm with the sum. We'd need to maximise

\[\sum_z \log p(x,z|\theta).\]

One reason this would be to our advantage is that \(p(x,z|\theta)\) often takes the form \(e^{f(x,z,\theta)}\) where \(f\) is a simple function to optimise. In addition, \(f\) may break up as a sum of terms, each with its own block of \(\theta\)'s. Moving the logarithm inside the sum would give us something we could easily maximise term by term. What's more, the \(p(x,z|\theta)\) for each \(z\) is often a standard probability distribution whose likelihood we already know how to maximise. But, of course, we can't just move that logarithm in.


Maximisation by proxy

Sometimes a function is too hard to optimise directly. But if we have a guess for an optimum, we can replace our function with a proxy function that approximates it in the neighbourhood of our guess and optimise that instead. That will give us a new guess and we can continue from there. This is the basis of gradient descent. Suppose \(f\) is a differentiable function in a neighbourhood of \(x_0\). Then around \(x_0\) we have

\[f(x)\approx f(x_0)+(x-x_0)\cdot\nabla f(x_0).\]

We can try optimising \(f(x_0)+(x-x_0)\cdot\nabla f(x_0)\) with respect to \(x\) within a neighbourhood of \(x_0\). If we pick a small circular neighbourhood then the optimal value will be in the direction of steepest descent. (Note that picking a circular neighbourhood is itself a somewhat arbitrary step, but that's another story.) For gradient descent we're choosing the proxy \(f(x_0)+(x-x_0)\cdot\nabla f(x_0)\) because it matches both the value and derivatives of \(f\) at \(x_0\). We could go further and optimise a proxy that shares second derivatives too, and that leads to methods based on Newton-Raphson iteration.


We want our logarithm of a sum to be a sum of logarithms. But instead we'll settle for a proxy function that is a sum of logarithms. We'll make the derivatives of the proxy match those of the original function precisely so we're not making an arbitrary choice.


Write

\[\log\sum_i p_i(\theta)\approx\sum_i \lambda_i \log p_i(\theta),\]

where, in our application, \(p_i(\theta)\) stands for the \(i\)-th term \(p(x,z_i|\theta)\) of the likelihood sum. The \(\lambda_i\) are constants we'll determine. We want to match the derivatives on either side of the \(\approx\) at \(\theta=\theta_0\):

\[\left.\frac{\partial}{\partial\theta}\log\sum_i p_i(\theta)\right|_{\theta_0}=\frac{\sum_i p_i'(\theta_0)}{\sum_j p_j(\theta_0)}.\]

On the other hand we have

\[\left.\frac{\partial}{\partial\theta}\sum_i\lambda_i\log p_i(\theta)\right|_{\theta_0}=\sum_i\lambda_i\frac{p_i'(\theta_0)}{p_i(\theta_0)}.\]


To achieve equality we want to make these expressions match. We choose

\[\lambda_i=\frac{p_i(\theta_0)}{\sum_j p_j(\theta_0)}.\]

Our desired proxy function is:

\[\theta\mapsto\sum_i\frac{p_i(\theta_0)}{\sum_j p_j(\theta_0)}\log p_i(\theta).\]


So the procedure is to take an estimated \(\theta_0\) and obtain a new estimate \(\theta_1\) by optimising this proxy function with respect to \(\theta\). This is the standard EM algorithm.


It turns out that this proxy has some other useful properties. For example, because of the concavity of the logarithm, the proxy is always smaller than the original likelihood. This means that when we optimise it we never optimise “too far” and that progress optimising the proxy is always progress optimising the original likelihood. But I don't need to say anything about this as it's all part of the standard literature.


Afterword

As a side effect we have a general purpose optimisation algorithm that has nothing to do with statistics. If your goal is to compute

\[\mathop{\mathrm{argmax}}_\theta\sum_i f_i(\theta),\]

you can iterate, at each step computing

\[\theta_{n+1}=\mathop{\mathrm{argmax}}_\theta\sum_i\frac{f_i(\theta_n)}{\sum_j f_j(\theta_n)}\log f_i(\theta),\]

where \(\theta_n\) is the previous iteration. If the \(f_i\) take a convenient form then this may turn out to be much easier.
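As a concrete toy illustration of this iteration (my own example, not from the original note), take \(f_1(x)=e^{-(x-1)^2}\) and \(f_2(x)=e^{-(x-3)^2}\). The weighted proxy is then a single quadratic in \(x\), so each step has a closed form:

-- Maximise f1 + f2 (a sum of two Gaussian bumps) by the scheme above.
f1, f2 :: Double -> Double
f1 x = exp (negate ((x - 1) ^ 2))
f2 x = exp (negate ((x - 3) ^ 2))

-- One step: maximise w1 * log f1 + w2 * log f2, a quadratic whose
-- argmax is the weighted mean of the two bump centres.
step :: Double -> Double
step x = (w1 * 1 + w2 * 3) / (w1 + w2)
  where
    w1 = f1 x / (f1 x + f2 x)
    w2 = f2 x / (f1 x + f2 x)

-- take 5 (iterate step 0.8) settles quickly near the local maximum close to x = 1.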


Note

This was originally written as a PDF using LaTeX. It'll be available here for a while. Some fidelity was lost when converting it to HTML.

Categories: Offsite Blogs

Michael Snoyman: New Conduit Tutorial

Planet Haskell - Wed, 10/12/2016 - 6:00pm

A few weeks back I proposed a reskin of conduit to make it easier to learn and use. The proposal overall got broad support, and therefore I went ahead with it. I then spent some time (quite a bit more than expected) updating the conduit tutorial to use this new reskin. If you're interested in conduit or streaming data in Haskell, please take a look at the new version.

Thanks to all who provided feedback. Also, if you want to provide some more feedback, there's one more Github issue open up: RFC: Stop using the type synonyms in library type signatures. Please feel free to share your opinions/add a reaction/start a flame war.

And yes, the flame war comment is a joke. Please don't take that one literally.

Categories: Offsite Blogs

Philip Wadler: Lambdaman (and Lambdawoman) supporting Bootstrap - Last Three Days!

Planet Haskell - Wed, 10/12/2016 - 3:42am
You have just three more days to order your own Lambdaman or Lambdawoman t-shirt, as featured in the video of Propositions as Types. Now available in unisex, children's, and women's shirts. Profits go to Bootstrap, an organisation run by Shriram Krishnamurthi, Matthias Felleisen, and the PLT group that teaches functional programming to middle and high school students. Orders will be printed on October 15.
Categories: Offsite Blogs

Edward Z. Yang: Try Backpack: ghc --backpack

Planet Haskell - Mon, 10/10/2016 - 2:39am

Backpack, a new system for mix-in packages in Haskell, has landed in GHC HEAD. This means that it has become a lot easier to try Backpack out: you just need a nightly build of GHC. Here is a step-by-step guide to get you started.

Download a GHC nightly

Get a nightly build of GHC. If you run Ubuntu, this step is very easy: add Herbert V. Riedel's PPA to your system and install ghc-head:

sudo add-apt-repository ppa:hvr/ghc
sudo apt-get update
sudo aptitude install ghc-head

This will place a Backpack-ready GHC in /opt/ghc/head/bin/ghc. My recommendation is you create a symlink named ghc-head to this binary from a directory that is in your PATH.

If you are not running Ubuntu, you'll have to download a nightly or build GHC yourself.

Hello World

GHC supports a new file format, bkp files, which let you easily define multiple modules and packages in a single source file, making it easy to experiment with Backpack. This format is not suitable for large scale programming, but we will use it for our tutorial. Here is a simple "Hello World" program:

unit main where
  module Main where
    main = putStrLn "Hello world!"

We define a unit (think package) with the special name main, and in it define a Main module (also specially named) which contains our main function. Place this in a file named hello.bkp, and then run ghc --backpack hello.bkp (using your GHC nightly). This will produce an executable at main/Main which you can run; you can also explicitly specify the desired output filename using -o filename. Note that by default, ghc --backpack creates a directory with the same name as every unit, so -o main won't work (it'll give you a linker error; use a different name!)

A Play on Regular Expressions

Let's write some nontrivial code that actually uses Backpack. For this tutorial, we will write a simple matcher for regular expressions as described in A Play on Regular Expressions (Sebastian Fischer, Frank Huch, Thomas Wilke). The matcher itself is inefficient (it checks for a match by testing all exponentially many decompositions of a string), but it will be sufficient to illustrate many key concepts of Backpack.

To start things off, let's go ahead and write a traditional implementation of the matcher by copy-pasting the code from this Functional Pearl into a Regex module in the Backpack file and writing a little test program to run it:

unit regex where
  module Regex where
    -- | A type of regular expressions.
    data Reg = Eps
             | Sym Char
             | Alt Reg Reg
             | Seq Reg Reg
             | Rep Reg

    -- | Check if a regular expression 'Reg' matches a 'String'
    accept :: Reg -> String -> Bool
    accept Eps u       = null u
    accept (Sym c) u   = u == [c]
    accept (Alt p q) u = accept p u || accept q u
    accept (Seq p q) u = or [accept p u1 && accept q u2 | (u1, u2) <- splits u]
    accept (Rep r) u   = or [and [accept r ui | ui <- ps] | ps <- parts u]

    -- | Given a string, compute all splits of the string.
    -- E.g., splits "ab" == [("","ab"), ("a","b"), ("ab","")]
    splits :: String -> [(String, String)]
    splits []     = [([], [])]
    splits (c:cs) = ([], c:cs):[(c:s1,s2) | (s1,s2) <- splits cs]

    -- | Given a string, compute all possible partitions of
    -- the string (where all partitions are non-empty).
    -- E.g., partitions "ab" == [["ab"],["a","b"]]
    parts :: String -> [[String]]
    parts []     = [[]]
    parts [c]    = [[[c]]]
    parts (c:cs) = concat [[(c:p):ps, [c]:p:ps] | p:ps <- parts cs]

unit main where
  dependency regex
  module Main where
    import Regex

    nocs = Rep (Alt (Sym 'a') (Sym 'b'))
    onec = Seq nocs (Sym 'c')
    -- | The regular expression which tests for an even number of cs
    evencs = Seq (Rep (Seq onec onec)) nocs

    main = print (accept evencs "acc")

If you put this in regex.bkp, you can once again compile it using ghc --backpack regex.bkp and invoke the resulting executable at main/Main. It should print True.

Functorizing the matcher

The previously shown code isn't great because it hardcodes String as the type to do regular expression matching over. A reasonable generalization (which you can see in the original paper) is to match over arbitrary lists of symbols; however, we might also reasonably want to match over non-list types like ByteString. To support all of these cases, we will instead use Backpack to "functorize" (in ML parlance) our matcher.

We'll do this by creating a new unit, regex-indef, and writing a signature which provides a string type (we've decided to call it Str, to avoid confusion with String) and all of the operations which need to be supported on it. Here are the steps I took:

  1. First, I copy-pasted the old Regex implementation into the new unit. I replaced all occurrences of String with Str, and deleted splits and parts: we will require these to be implemented in our signature.

  2. Next, we create a new Str signature, which is imported by Regex, and defines our type and operations (splits and parts) which it needs to support:

signature Str where
  data Str
  splits :: Str -> [(Str, Str)]
  parts :: Str -> [[Str]]
  3. At this point, I ran ghc --backpack to typecheck the new unit. But I got two errors!

regex.bkp:90:35: error:
    • Couldn't match expected type ‘t0 a0’ with actual type ‘Str’
    • In the first argument of ‘null’, namely ‘u’
      In the expression: null u
      In an equation for ‘accept’: accept Eps u = null u

regex.bkp:91:35: error:
    • Couldn't match expected type ‘Str’ with actual type ‘[Char]’
    • In the second argument of ‘(==)’, namely ‘[c]’
      In the expression: u == [c]
      In an equation for ‘accept’: accept (Sym c) u = u == [c]

    Traversable null nonsense aside, the errors are quite clear: Str is a completely abstract data type: we cannot assume that it is a list, nor do we know what instances it has. To solve these type errors, I introduced the combinators null and singleton, an instance Eq Str, and rewrote Regex to use these combinators (a very modest change.) (Notice we can't write instance Traversable Str; it's a kind mismatch.)

Here is our final indefinite version of the regex unit:

unit regex-indef where
  signature Str where
    data Str
    instance Eq Str
    null :: Str -> Bool
    singleton :: Char -> Str
    splits :: Str -> [(Str, Str)]
    parts :: Str -> [[Str]]
  module Regex where
    import Prelude hiding (null)
    import Str

    data Reg = Eps | Sym Char | Alt Reg Reg | Seq Reg Reg | Rep Reg

    accept :: Reg -> Str -> Bool
    accept Eps u       = null u
    accept (Sym c) u   = u == singleton c
    accept (Alt p q) u = accept p u || accept q u
    accept (Seq p q) u = or [accept p u1 && accept q u2 | (u1, u2) <- splits u]
    accept (Rep r) u   = or [and [accept r ui | ui <- ps] | ps <- parts u]

(To keep things simple for now, I haven't parametrized Char.)

Instantiating the functor (String)

This is all very nice but we can't actually run this code, since there is no implementation of Str. Let's write a new unit which provides a module which implements all of these types and functions with String, copy pasting in the old implementations of splits and parts:

unit str-string where
  module Str where
    import Prelude hiding (null)
    import qualified Prelude as P

    type Str = String

    null :: Str -> Bool
    null = P.null

    singleton :: Char -> Str
    singleton c = [c]

    splits :: Str -> [(Str, Str)]
    splits []     = [([], [])]
    splits (c:cs) = ([], c:cs):[(c:s1,s2) | (s1,s2) <- splits cs]

    parts :: Str -> [[Str]]
    parts []     = [[]]
    parts [c]    = [[[c]]]
    parts (c:cs) = concat [[(c:p):ps, [c]:p:ps] | p:ps <- parts cs]

One quirk when writing Backpack implementations for functions is that Backpack does no subtype matching on polymorphic functions, so you can't implement Str -> Bool with a polymorphic function Traversable t => t a -> Bool (adding this would be an interesting extension, and not altogether trivial). So we have to write a little impedance matching binding which monomorphizes null to the expected type.

To instantiate regex-indef with str-string:Str, we modify the dependency in main:

-- dependency regex -- old
dependency regex-indef[Str=str-string:Str]

Backpack files require instantiations to be explicitly specified (this is as opposed to Cabal files, which do mix-in linking to determine instantiations). In this case, the instantiation specifies that regex-indef's signature named Str should be filled with the Str module from str-string.

After making these changes, give ghc --backpack a run; you should get out an identical looking result.

Instantiating the functor (ByteString)

The whole point of parametrizing regex was to enable us to have a second implementation of Str. So let's go ahead and write a bytestring implementation. After a little bit of work, you might end up with this:

unit str-bytestring where
  module Str(module Data.ByteString.Char8, module Str) where
    import Prelude hiding (length, null, splitAt)
    import Data.ByteString.Char8
    import Data.ByteString

    type Str = ByteString

    splits :: Str -> [(Str, Str)]
    splits s = fmap (\n -> splitAt n s) [0..length s]

    parts :: Str -> [[Str]]
    parts s | null s = [[]]
            | otherwise = do
                n <- [1..length s]
                let (l, r) = splitAt n s
                fmap (l:) (parts r)

There are two things to note about this implementation:

  1. Unlike str-string, which explicitly defined every needed method in its module body, str-bytestring provides null and singleton simply by reexporting all of the entities from Data.ByteString.Char8 (which are appropriately monomorphic). We've cleverly picked our names to abide by the existing naming conventions of existing string packages!
  2. Our implementations of splits and parts are substantially more optimized than if we had done a straight up transcription of the consing and unconsing from the original String implementation. I often hear people say that String and ByteString have very different performance characteristics, and thus you shouldn't mix them up in the same implementation. I think this example shows that as long as you have sufficiently high-level operations on your strings, these performance changes smooth out in the end; and there is still a decent chunk of code that can be reused across implementations.

To instantiate regex-indef with bytestring-string:Str, we once again modify the dependency in main:

-- dependency regex -- oldest
-- dependency regex-indef[Str=str-string:Str] -- old
dependency regex-indef[Str=str-bytestring:Str]

We also need to stick an {-# LANGUAGE OverloadedStrings #-} pragma so that "acc" gets interpreted as a ByteString (unfortunately, the bkp file format only supports language pragmas that get applied to all modules defined; so put this pragma at the top of the file). But otherwise, everything works as it should!

Using both instantiations at once

There is nothing stopping us from using both instantiations of regex-indef at the same time, simply by uncommenting both dependency declarations, except that the module names provided by each dependency conflict with each other and are thus ambiguous. Backpack files thus provide a renaming syntax for modules which let you give each exported module a different name:

dependency regex-indef[Str=str-string:Str] (Regex as Regex.String)
dependency regex-indef[Str=str-bytestring:Str] (Regex as Regex.ByteString)

How should we modify Main to run our regex on both a String and a ByteString? But is Regex.String.Reg the same as Regex.ByteString.Reg? A quick query to the compiler will reveal that they are not the same. The reason for this is Backpack's type identity rule: the identity of all types defined in a unit depends on how all signatures are instantiated, even if the type doesn't actually depend on any types from the signature. If we want there to be only one Reg type, we will have to extract it from reg-indef and give it its own unit, with no signatures.

After the refactoring, here is the full final program:

{-# LANGUAGE OverloadedStrings #-}

unit str-bytestring where
  module Str(module Data.ByteString.Char8, module Str) where
    import Prelude hiding (length, null, splitAt)
    import Data.ByteString.Char8
    import Data.ByteString

    type Str = ByteString

    splits :: Str -> [(Str, Str)]
    splits s = fmap (\n -> splitAt n s) [0..length s]

    parts :: Str -> [[Str]]
    parts s | null s = [[]]
            | otherwise = do
                n <- [1..length s]
                let (l, r) = splitAt n s
                fmap (l:) (parts r)

unit str-string where
  module Str where
    import Prelude hiding (null)
    import qualified Prelude as P

    type Str = String

    null :: Str -> Bool
    null = P.null

    singleton :: Char -> Str
    singleton c = [c]

    splits :: Str -> [(Str, Str)]
    splits []     = [([], [])]
    splits (c:cs) = ([], c:cs):[(c:s1,s2) | (s1,s2) <- splits cs]

    parts :: Str -> [[Str]]
    parts []     = [[]]
    parts [c]    = [[[c]]]
    parts (c:cs) = concat [[(c:p):ps, [c]:p:ps] | p:ps <- parts cs]

unit regex-types where
  module Regex.Types where
    data Reg = Eps | Sym Char | Alt Reg Reg | Seq Reg Reg | Rep Reg

unit regex-indef where
  dependency regex-types
  signature Str where
    data Str
    instance Eq Str
    null :: Str -> Bool
    singleton :: Char -> Str
    splits :: Str -> [(Str, Str)]
    parts :: Str -> [[Str]]
  module Regex where
    import Prelude hiding (null)
    import Str
    import Regex.Types

    accept :: Reg -> Str -> Bool
    accept Eps u       = null u
    accept (Sym c) u   = u == singleton c
    accept (Alt p q) u = accept p u || accept q u
    accept (Seq p q) u = or [accept p u1 && accept q u2 | (u1, u2) <- splits u]
    accept (Rep r) u   = or [and [accept r ui | ui <- ps] | ps <- parts u]

unit main where
  dependency regex-types
  dependency regex-indef[Str=str-string:Str] (Regex as Regex.String)
  dependency regex-indef[Str=str-bytestring:Str] (Regex as Regex.ByteString)
  module Main where
    import Regex.Types
    import qualified Regex.String
    import qualified Regex.ByteString

    nocs = Rep (Alt (Sym 'a') (Sym 'b'))
    onec = Seq nocs (Sym 'c')
    evencs = Seq (Rep (Seq onec onec)) nocs

    main = print (Regex.String.accept evencs "acc")
        >> print (Regex.ByteString.accept evencs "acc")

And beyond!

Next time, I will tell you how to take this prototype in a bkp file, and scale it up into a set of Cabal packages. Stay tuned!

Postscript. If you are feeling adventurous, try further parametrizing regex-types so that it no longer hard-codes Char as the element type, but some arbitrary element type Elem. It may be useful to know that you can instantiate multiple signatures using the syntax dependency regex-indef[Str=str-string:Str,Elem=str-string:Elem] and that if you depend on a package with a signature, you must thread the signature through using the syntax dependency regex-types[Elem=<Elem>]. If this sounds user-unfriendly, it is! That is why in the Cabal package universe, instantiation is done implicitly, using mix-in linking.
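To give a feel for what that exercise involves, here is a rough, untested sketch of how the parametrized regex-types and the signature threading might look; the exact factoring is an assumption on my part, not a worked solution:

unit regex-types where
    signature Elem where
        data Elem
    module Regex.Types where
        import Elem
        data Reg = Eps | Sym Elem | Alt Reg Reg | Seq Reg Reg | Rep Reg

unit regex-indef where
    -- thread the Elem signature through, per the syntax above
    dependency regex-types[Elem=<Elem>]
    signature Elem where
        data Elem
    -- ... Str signature and Regex module as before ...

The Str signature would then mention Elem as well (presumably singleton :: Elem -> Str), and main would instantiate both holes at once with something like dependency regex-indef[Str=str-string:Str,Elem=str-string:Elem].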

Categories: Offsite Blogs

Joachim Breitner: T430s → T460s

Planet Haskell - Sat, 10/08/2016 - 3:22pm

Earlier this week, I finally got my new machine that came with my new position at the University of Pennsylvania: a shiny Thinkpad T460s that now replaces my T430s. (Yes, there is a pattern. It continues with T400 and T41p.) I decided to re-install my Debian system from scratch and copy over only the home directory – a bit of purification does not hurt. This blog post contains some random notes that might be useful to someone, or where I hope someone can tell me how to fix and improve things.

Installation

The installation (using debian-installer from a USB drive) went mostly smoothly, including LVM on an encrypted partition. Unfortunately, it did not set up grub correctly for the UEFI system to boot, so I had to jump through some hoops (using the grub on the USB drive to manually boot into the installed system, and installing grub-efi from there) until the system actually came up.

High-resolution display

This laptop has a 2560×1440 high resolution display. Modern desktop environments like GNOME supposedly handle that quite nicely, but for reasons explained in an earlier post, I do not use a desktop environment but have a minimalistic setup based on Xmonad. I managed to get a decent setup now, by turning lots of manual knobs:

  • For the linux console, setting

    FONTFACE="Terminus" FONTSIZE="12x24"

    in /etc/default/console-setup yielded good results.

  • For the few GTK-2 applications that I am still running, I set

    gtk-font-name="Sans 16"

    in ~/.gtkrc-2.0. Similarly, for GTK-3 I have

    [Settings]
    gtk-font-name = Sans 16

    in ~/.config/gtk-3.0/settings.ini.

  • Programs like gnome-terminal, Evolution and hexchat refer to the “System default document font” and “System default monospace font”. I remember that it was possible to configure these in the GNOME control center, but I could not find any way of configuring these using command line tools, so I resorted to manually setting the font for these. With help from Alexandre Franke I figured out that the magic incantation here is:

    gsettings set org.gnome.desktop.interface monospace-font-name 'Monospace 16'
    gsettings set org.gnome.desktop.interface document-font-name 'Serif 16'
    gsettings set org.gnome.desktop.interface font-name 'Sans 16'
  • Firefox seemed to have picked up these settings for the UI, so that was good. To make web pages readable, I set layout.css.devPixelsPerPx to 1.5 in about:config.

  • GVim has set guifont=Monospace\ 16 in ~/.vimrc. The toolbar is tiny, but I hardly use it anyways.

  • Setting the font of Xmonad prompts requires the syntax

    , font = "xft:Sans:size=16"

    Speaking about Xmonad prompts: Check out the XMonad.Prompt.Unicode module that I have been using for years and recently submitted upstream.

  • I launch Chromium (or rather the desktop applications that I use that happen to be Chrome apps) with the parameter --force-device-scale-factor=1.5.

  • Libreoffice seems to be best configured by running xrandr --dpi 194 beforehand. This seems also to be read by Firefox, doubling the effect of the font size in the gtk settings, which is annoying. Luckily I do not work with Libreoffice often, so for now I’ll just set that manually when needed.

I am not quite satisfied. I have the impression that the 16 point size font, e.g. in Evolution, is not really pretty, so I am happy to take suggestions here.

I found the ArchWiki page on HiDPI very useful here.

Trackpoint and Touchpad

One reason for me to stick with Thinkpads is their trackpoint, which I use exclusively. In previous models, I disabled the touchpad in the BIOS, but this did not seem to have an effect here, so I added the following section to /etc/X11/xorg.conf.d/30-touchpad.conf

Section "InputClass" Identifier "SynPS/2 Synaptics TouchPad" MatchProduct "SynPS/2 Synaptics TouchPad" Option "ignore" "on" EndSection

At one point I left out the MatchProduct line, disabling all input in the X server. Had to boot into recovery mode to fix that.

Unfortunately, there is something wrong with the trackpoint and the buttons: When I am moving the trackpoint (and maybe if there is actual load on the machine), mouse button press and release events sometimes get lost. This is quite annoying – I try to open a folder in Evolution and accidentally move it.

I installed the latest Kernel from Debian experimental (4.8.0-rc8), but it did not help.

I filed a bug report against libinput although I am not fully sure that that’s the culprit.

Update: According to Benjamin Tissoires it is a known firmware bug and the appropriate people are working on a work-around. Until then I am advised to keep my palm off the touchpad.

Also, I found the trackpoint too slow. I am not sure if it is simply because of the large resolution of the screen, or because some movement events are also swallowed. For now, I simply changed the speed by writing

SUBSYSTEM=="serio", DRIVERS=="psmouse", ATTRS{speed}="120"

to /etc/udev/rules.d/10-trackpoint.rules.

Brightness control

The system would not automatically react to pressing Fn-F5 and Fn-F6, which are the keys to adjust the brightness. I am unsure about how and by what software component it “should” be handled, but the solution that I found was to set

Section "Device" Identifier "card0" Driver "intel" Option "Backlight" "intel_backlight" BusID "PCI:0:2:0" EndSection

so that the command line tool xbacklight would work, and then use Xmonad keybinds to perform the action, just as I already do for sound control:

, ((0, xF86XK_Sleep), spawn "dbus-send --system --print-reply --dest=org.freedesktop.UPower /org/freedesktop/UPower org.freedesktop.UPower.Suspend")
, ((0, xF86XK_AudioMute), spawn "ponymix toggle")
, ((0, 0x1008ffb2 {- xF86XK_AudioMicMute -}), spawn "ponymix --source toggle")
, ((0, xF86XK_AudioRaiseVolume), spawn "ponymix increase 5")
, ((0, xF86XK_AudioLowerVolume), spawn "ponymix decrease 5")
, ((shiftMask, xF86XK_AudioRaiseVolume), spawn "ponymix increase 5 --max-volume 200")
, ((shiftMask, xF86XK_AudioLowerVolume), spawn "ponymix decrease 5")
, ((0, xF86XK_MonBrightnessUp), spawn "xbacklight +10")
, ((0, xF86XK_MonBrightnessDown), spawn "xbacklight -10")

The T460s does not actually have a sleep button, that line is a reminiscence from my T430s. I suspend the machine by pressing the power button now, thanks to HandlePowerKey=suspend in /etc/systemd/logind.conf.

Profile Weirdness

Something strange happened to my environment variables after the move. It is clearly not hardware related, but I simply cannot explain what has changed: All relevant files in /etc look similar enough.

I use ~/.profile to extend the PATH and set some other variables. Previously, these settings were in effect in my whole X session, which is started by lightdm with auto-login, followed by xmonad-session. I could find no better way to fix that than stating . ~/.profile early in my ~/.xmonad/xmonad-session-rc. Very strange.

Categories: Offsite Blogs

FP Complete: Static compilation with Stack

Planet Haskell - Thu, 10/06/2016 - 8:00pm

In our last blog post we showed you the new docker init executable pid1. What if we wanted to use our shiny new pid1 binary on a CentOS Docker image but we compiled it on Ubuntu? The answer is that it wouldn't likely work. All Linux flavors package things up a little differently and with different versions and flags.

If we were to compile pid1 completely statically, it could be portable (within a given range of Linux kernel versions). Let's explore different ways to compile a GHC executable with Stack. Maybe we can come up with a way to create portable binaries.

Base Image for Experiments

First let's create a base image since we are going to be trying many different compilation scenarios.

Here's a Dockerfile for Alpine Linux & GHC 8.0 with Stack.

# USE ALPINE LINUX
FROM alpine
RUN apk update

# INSTALL BASIC DEV TOOLS, GHC, GMP & ZLIB
RUN echo "https://s3-us-west-2.amazonaws.com/alpine-ghc/8.0" >> /etc/apk/repositories
ADD https://raw.githubusercontent.com/mitchty/alpine-ghc/master/mitch.tishmack%40gmail.com-55881c97.rsa.pub \
    /etc/apk/keys/mitch.tishmack@gmail.com-55881c97.rsa.pub
RUN apk update
RUN apk add alpine-sdk git ca-certificates ghc gmp-dev zlib-dev

# GRAB A RECENT BINARY OF STACK
ADD https://s3.amazonaws.com/static-stack/stack-1.1.2-x86_64 /usr/local/bin/stack
RUN chmod 755 /usr/local/bin/stack

Let's build it and give it a tag.

docker build --no-cache=true --tag fpco/pid1:0.1.0-base .

Default GHC Compilation

Next let's compile pid1 with default Stack & GHC settings.

Here's our minimalist stack.yaml file.

resolver: lts-7.1

Here's our project Dockerfile that extends our test base image above.

FROM fpco/pid1:0.1.0-base

# COMPILE PID1
ADD ./ /usr/src/pid1
WORKDIR /usr/src/pid1
RUN stack --local-bin-path /sbin install --test

# SHOW INFORMATION ABOUT PID1
RUN ldd /sbin/pid1 || true
RUN du -hs /sbin/pid1

Let's compile this default configuration using Docker and give it a label.

docker build --no-cache=true --tag fpco/pid1:0.1.0-default .

A snippet from the Docker build showing the results.

Step 6 : RUN ldd /sbin/pid1 || true
 ---> Running in fcc138c199d0
        /lib/ld-musl-x86_64.so.1 (0x559fe5aaf000)
        libgmp.so.10 => /usr/lib/libgmp.so.10 (0x7faff710b000)
        libc.musl-x86_64.so.1 => /lib/ld-musl-x86_64.so.1 (0x559fe5aaf000)
 ---> 70836a2538e2
Removing intermediate container fcc138c199d0
Step 7 : RUN du -hs /sbin/pid1
 ---> Running in 699876efeb1b
956.0K  /sbin/pid1

You can see that this build results in a semi-static binary with a link to MUSL (libc) and GMP. This is not extremely portable. We will always have to be concerned about the dynamic linkage happening at run-time. This binary would probably not run on Ubuntu as is.

100% Static

Let's try compiling our binary as a 100% static Linux ELF binary without any link to another dynamic library. Note that our open source license must be compatible with MUSL and GMP in order to do this.

Let's try a first run with static linkage. Here's another Dockerfile that shows a new ghc-option to link statically.

FROM fpco/pid1:0.1.0-base

# TRY TO COMPILE
ADD ./ /usr/src/pid1
WORKDIR /usr/src/pid1
RUN stack --local-bin-path /sbin install --test --ghc-options '-optl-static'

Let's give it a go.

docker build --no-cache=true --tag fpco/pid1:0.1.0-static .

Oh no. It didn't work. Looks like there's some problem with linking. :|

[1 of 1] Compiling System.Process.PID1 ( src/System/Process/PID1.hs, .stack-work/dist/x86_64-linux/Cabal-1.24.0.0/build/System/Process/PID1.o )
/usr/lib/gcc/x86_64-alpine-linux-musl/5.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: /usr/lib/gcc/x86_64-alpine-linux-musl/5.3.0/crtbeginT.o: relocation R_X86_64_32 against `__TMC_END__' can not be used when making a shared object; recompile with -fPIC
/usr/lib/gcc/x86_64-alpine-linux-musl/5.3.0/crtbeginT.o: error adding symbols: Bad value
collect2: error: ld returned 1 exit status
`gcc' failed in phase `Linker'. (Exit code: 1)

--  While building package pid1-0.1.0 using:
      /root/.stack/setup-exe-cache/x86_64-linux/setup-Simple-Cabal-1.24.0.0-ghc-8.0.1 --builddir=.stack-work/dist/x86_64-linux/Cabal-1.24.0.0 build lib:pid1 exe:pid1 --ghc-options " -ddump-hi -ddump-to-file"
    Process exited with code: ExitFailure 1

PIC flag

OK that last error said we should recompile with -fPIC. Let's try that. Once again, here's a Dockerfile with the static linkage flag & the new -fPIC flag.

FROM fpco/pid1:0.1.0-base

# TRY TO COMPILE
ADD ./ /usr/src/pid1
WORKDIR /usr/src/pid1
RUN stack --local-bin-path /sbin install --test --ghc-options '-optl-static -fPIC'

Let's give it a try.

docker build --no-cache=true --tag fpco/pid1:0.1.0-static-fpic .

But we still get the error again.

[1 of 1] Compiling System.Process.PID1 ( src/System/Process/PID1.hs, .stack-work/dist/x86_64-linux/Cabal-1.24.0.0/build/System/Process/PID1.o )
/usr/lib/gcc/x86_64-alpine-linux-musl/5.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: /usr/lib/gcc/x86_64-alpine-linux-musl/5.3.0/crtbeginT.o: relocation R_X86_64_32 against `__TMC_END__' can not be used when making a shared object; recompile with -fPIC
/usr/lib/gcc/x86_64-alpine-linux-musl/5.3.0/crtbeginT.o: error adding symbols: Bad value
collect2: error: ld returned 1 exit status
`gcc' failed in phase `Linker'. (Exit code: 1)

--  While building package pid1-0.1.0 using:
      /root/.stack/setup-exe-cache/x86_64-linux/setup-Simple-Cabal-1.24.0.0-ghc-8.0.1 --builddir=.stack-work/dist/x86_64-linux/Cabal-1.24.0.0 build lib:pid1 exe:pid1 --ghc-options " -ddump-hi -ddump-to-file"
    Process exited with code: ExitFailure 1

crtbeginT swap

Searching around for this crtbeginT linkage problem, we find that a small hack makes it work correctly. Here's the Dockerfile with the hack.

FROM fpco/pid1:0.1.0-base

# FIX https://bugs.launchpad.net/ubuntu/+source/gcc-4.4/+bug/640734
WORKDIR /usr/lib/gcc/x86_64-alpine-linux-musl/5.3.0/
RUN cp crtbeginT.o crtbeginT.o.orig
RUN cp crtbeginS.o crtbeginT.o

# COMPILE PID1
ADD ./ /usr/src/pid1
WORKDIR /usr/src/pid1
RUN stack --local-bin-path /sbin install --test --ghc-options '-optl-static -fPIC'

# SHOW INFORMATION ABOUT PID1
RUN ldd /sbin/pid1 || true
RUN du -hs /sbin/pid1

When we try it again

docker build --no-cache=true --tag fpco/pid1:0.1.0-static-fpic-crtbegint .

It works this time!

Step 8 : RUN ldd /sbin/pid1 || true
 ---> Running in 8b3c737c2a8d
ldd: /sbin/pid1: Not a valid dynamic program
 ---> 899f06885c71
Removing intermediate container 8b3c737c2a8d
Step 9 : RUN du -hs /sbin/pid1
 ---> Running in d641697cb2a8
1.1M    /sbin/pid1
 ---> aa17945f5bc4

Nice. 1.1M isn't too bad for a binary that's portable. Let's see if we can make it smaller, though. On larger executables, especially with other linked external libraries, this static output can be 50MB(!).

Optimal Size

GCC Optimization

The GCC manpage says that -Os optimizes for size. Let's try it.

Specify -optc-Os to optimize for size.

FROM fpco/pid1:0.1.0-base

# FIX https://bugs.launchpad.net/ubuntu/+source/gcc-4.4/+bug/640734
WORKDIR /usr/lib/gcc/x86_64-alpine-linux-musl/5.3.0/
RUN cp crtbeginT.o crtbeginT.o.orig
RUN cp crtbeginS.o crtbeginT.o

# COMPILE PID1
ADD ./ /usr/src/pid1
WORKDIR /usr/src/pid1
RUN stack --local-bin-path /sbin install --test --ghc-options '-optl-static -fPIC -optc-Os'

# SHOW INFORMATION ABOUT PID1
RUN ldd /sbin/pid1 || true
RUN du -hs /sbin/pid1

docker build --no-cache=true --tag fpco/pid1:0.1.0-static-fpic-crtbegint-optcos .

Step 9 : RUN ldd /sbin/pid1 || true
 ---> Running in 8e28314924d0
ldd: /sbin/pid1: Not a valid dynamic program
 ---> c977f078eb24
Removing intermediate container 8e28314924d0
Step 10 : RUN du -hs /sbin/pid1
 ---> Running in 4e6b5c4d87aa
1.1M    /sbin/pid1
 ---> 66d459e3fcc1

There isn't any difference in output size with this flag. You may want to try it on a larger or more complex executable to see if it makes a difference for you.

Split Objects

GHC allows us to "split objects" when we compile Haskell code. That means each Haskell module is broken up into its own native library. In this scenario, when we import a module, our final executable is linked against smaller split modules instead of the entire package. This helps reduce the size of the executable. The trade-off is that it takes more time for GHC to compile.

resolver: lts-7.1
build: { split-objs: true }

docker build --no-cache=true --tag fpco/pid1:0.1.0-static-fpic-crtbegint-optcos-split .

Step 9 : RUN ldd /sbin/pid1 || true
 ---> Running in 8e28314924d0
ldd: /sbin/pid1: Not a valid dynamic program
 ---> c977f078eb24
Removing intermediate container 8e28314924d0
Step 10 : RUN du -hs /sbin/pid1
 ---> Running in 4e6b5c4d87aa
1.1M    /sbin/pid1
 ---> 66d459e3fcc1

There isn't any difference in output size with this flag in this case. On some executables this really makes a big difference. Try it yourself.

UPX Compression

Let's try compressing our static executable with UPX. Here's a Dockerfile.

FROM fpco/pid1:0.1.0-base

# FIX https://bugs.launchpad.net/ubuntu/+source/gcc-4.4/+bug/640734
WORKDIR /usr/lib/gcc/x86_64-alpine-linux-musl/5.3.0/
RUN cp crtbeginT.o crtbeginT.o.orig
RUN cp crtbeginS.o crtbeginT.o

# COMPILE PID1
ADD ./ /usr/src/pid1
WORKDIR /usr/src/pid1
RUN stack --local-bin-path /sbin install --test --ghc-options '-optl-static -fPIC -optc-Os'

# COMPRESS WITH UPX
ADD https://github.com/lalyos/docker-upx/releases/download/v3.91/upx /usr/local/bin/upx
RUN chmod 755 /usr/local/bin/upx
RUN upx --best --ultra-brute /sbin/pid1

# SHOW INFORMATION ABOUT PID1
RUN ldd /sbin/pid1 || true
RUN du -hs /sbin/pid1

Build an image that includes UPX compression.

docker build --no-cache=true --tag fpco/pid1:0.1.0-static-fpic-crtbegint-optcos-split-upx .

And, wow, that's some magic.

Step 11 : RUN ldd /sbin/pid1 || true
 ---> Running in 69f86bd03d01
ldd: /sbin/pid1: Not a valid dynamic program
 ---> c01d54dca5ac
Removing intermediate container 69f86bd03d01
Step 12 : RUN du -hs /sbin/pid1
 ---> Running in 01bbed565de0
364.0K  /sbin/pid1
 ---> b94c11bafd95

This makes a huge difference: the resulting executable is 1/3 the original size. There is a small price to pay in extracting the executable on execution, but for a pid1 that just runs for the lifetime of the container, this is not noticeable.

Slackware Support

Here's a Slackware example running the pid1 that was compiled on Alpine Linux:

FROM vbatts/slackware
ADD https://s3.amazonaws.com/download.fpcomplete.com/pid1/pid1-0.1.0-amd64 /sbin/pid1
RUN chmod 755 /sbin/pid1
ENTRYPOINT [ "/sbin/pid1" ]
CMD bash -c 'while(true); do sleep 1; echo alive; done'

Build the image and run it.

docker build -t fpco/pid1:0.1.0-example-slackware .
docker run --rm -i -t fpco/pid1:0.1.0-example-slackware

It works!

alive
alive
alive
^C
Categories: Offsite Blogs

Brent Yorgey: ICFP roundup

Planet Haskell - Tue, 10/04/2016 - 9:36pm

ICFP 2016 in Nara, Japan was a blast. Here are a few of my recollections.

The Place

Although I was a coauthor on an ICFP paper in 2011, when it was in Tokyo, I did not go since my son was born the same week. So this was my first time in Japan, or anywhere in Asia, for that matter. (Of course, this time I missed my son’s fifth birthday…)

I’ve been to Europe multiple times, and although it is definitely foreign, the culture is similar enough that I feel like I basically know how to behave. I did not feel that way in Japan. I’m pretty sure I was constantly being offensive without realizing it, but most of the time people were polite and accommodating.

…EXCEPT for that one time I was sitting in a chair chatting with folks during a break between sessions, with my feet up on a (low, plain) table, and an old Japanese guy WHACKED his walking stick on the table and shouted angrily at me in Japanese. That sure got my adrenaline going. Apparently putting your feet on the table is a big no-no, lesson learned.

The food was amazing even though I didn’t know what half of it was. I was grateful that I (a) am not vegetarian, (b) know how to use chopsticks decently well, and (c) am an adventurous eater. If any one of those were otherwise, things might have been more difficult!

On my last day in Japan I had the whole morning before I needed to head to the airport, so Ryan Yates and I wandered around Nara and saw a bunch of temples, climbed the hill, and such. It’s a stunningly beautiful place with a rich history.

The People

As usual, it’s all about the people. I enjoyed meeting some new people, including (but not limited to):

  • Pablo Buiras and Marco Vassena were my hotel breakfast buddies, it was fun getting to know them a bit.
  • I finally met Dominic Orchard, though I feel like I’ve known his name and known about some of his work for a while.
  • I don’t think I had met Max New before but we had a nice chat about the Scheme enumerations library he helped develop and combinatorial species. I hope to be able to follow up that line of inquiry.
  • As promised, I met everyone who commented on my blog post, including Jürgen Peters (unfortunately we did not get a chance to play go), Andrey Mokhov (who nerd-sniped me with a cool semiring-ish thing with some extra structure — perhaps that will be another blog post), and Jay McCarthy (whom I had actually met before, but we had some nice chats, including one in the airport while waiting for our flight to LAX).
  • I don’t think I had met José Manuel Calderón Trilla before; we had a great conversation over a meal together (along with Ryan Yates) in the Osaka airport while waiting for our flights.
  • I met Diogenes Nunez, who went to my alma mater Williams College. When I taught at Williams a couple years ago I’m pretty sure I heard Diogenes mentioned by the other faculty, so it was fun to get to meet him.
  • Last but certainly not least, I met my coauthor, Piyush Kurur. We collaborated on a paper through the magic of the Internet (Github in particular), and I actually met him in person for the first time just hours before he presented our paper!

My student Ollie Kwizera came for PLMW—it was fun having him there. I only crossed paths with him three or four times, but I think that was all for the best, since he made his own friends and had his own experiences.

Other people who I enjoyed seeing and remember having interesting conversations with include (but I am probably forgetting someone!) Michael Adams, Daniel Bergey, Jan Bracker, Joachim Breitner, David Christiansen, David Darais, Stephen Dolan, Richard Eisenberg, Kenny Foner, Marco Gaboardi, Jeremy Gibbons, John Hughes, David Janin, Neel Krishnaswami, Dan Licata, Andres Löh, Simon Marlow, Tom Murphy, Peter-Michael Osera, Jennifer Paykin, Simon Peyton Jones, Ryan Scott, Mary Sheeran, Mike Sperber, Luite Stegeman, Wouter Swierstra, David Terei, Ryan Trinkle, Tarmo Uustalu, Stephanie Weirich, Nick Wu, Edward Yang, and Ryan Yates. My apologies if I forgot you, just remind me and I’ll add you to the list! I’m amazed and grateful I get to know all these cool people.

The Content

Here are just a few of my favorite talks:

  • I’m a sucker for anything involving geometry and/or random testing and/or pretty pictures, and Ilya Sergey’s talk Growing and Shrinking Polygons for Random Testing of Computational Geometry had them all. In my experience, doing effective random testing in any domain beyond basic functions usually requires some interesting domain-specific insights, and Ilya had some cool insights about ways to generate and shrink polygons in ways that were much more likely to generate small counterexamples for computational geometry algorithms.

  • Idris gets more impressive by the day, and I always enjoy David Christiansen’s talks.

  • Sandra Dylus gave a fun talk, All Sorts of Permutations, with the cute observation that a sorting algorithm equipped with a nondeterministic comparison operator generates permutations (though it goes deeper than that). During the question period someone asked whether there is a way to generate all partitions, and someone sitting next to me suggested using the group function—and indeed, I think this works. I wonder what other sorts of combinatorial objects can be enumerated by this method. In particular I wonder if quicksort with nondeterministic comparisons can be adapted to generate not just all permutations, but all binary trees.

  • I greatly enjoyed TyDe, especially Jeremy Gibbons’ talk on APLicative Programming with Naperian Functors (I don’t think the video is online yet, if there is one). I’ll be serving as co-chair of the TyDe program committee next year, so start thinking about what you would like to submit!

  • There were also some fun talks at FARM, for example, Jay McCarthy’s talk on Bithoven. But I don’t think the FARM videos are uploaded yet. Speaking of FARM, the performance evening was incredible. It will be hard to live up to next year.


Categories: Offsite Blogs

FP Complete: Docker demons: PID-1, orphans, zombies, and signals

Planet Haskell - Tue, 10/04/2016 - 8:00pm

There are a number of corner cases to consider when dealing with Docker, multiple processes, and signals. Probably the most famous post on this matter is from the Phusion blog. Here, we'll see some examples of how to see these problems first hand, and one way to work around it: fpco/pid1.

The Phusion blog post recommends using their baseimage-docker. This image provides a my_init entrypoint which handles the problems described here, as well as introducing some extra OS features, such as syslog handling. Unfortunately, we ran into problems with Phusion's usage of syslog-ng, in particular with it creating unkillable processes pegged at 100% CPU usage. We're still investigating the root cause, but in practice we have found that the syslog usage is a far less motivating case than simply a good init process, which is why we've created the pid1 Haskell package together with a simple fpco/pid1 Docker image.

This blog post is intended to be interactive: you'll get the most bang for your buck by opening up your terminal and running commands along with reading the text. It will be far more motivating to see your Ctrl-C completely fail to kill a process.

NOTE The primary reason we wrote our own implementation in Haskell was to be able to embed it within the Stack build tool. There are other lightweight init processes already available, such as dumb-init. I've also blogged about using dumb-init. While this post uses pid1, there's nothing specific to it versus other init processes.

Playing with entrypoints

Docker has a concept of entrypoints, which provide a default wrapping command for the commands you provide to docker run. For example, consider this interaction with Docker:

$ docker run --entrypoint /usr/bin/env ubuntu:16.04 FOO=BAR bash -c 'echo $FOO'
BAR

This works because the above is equivalent to:

$ docker run ubuntu:16.04 /usr/bin/env FOO=BAR bash -c 'echo $FOO'

Entrypoints can be overridden on the command line (as we just did), but can also be specified in the Dockerfile (which we'll do later). The default entrypoint for the ubuntu Docker image is a null entrypoint, meaning that the provided command will be run directly without any wrapping. We're going to simulate that experience by using /usr/bin/env as an entrypoint, since switching entrypoint back to null isn't yet supported in released Docker. When you run /usr/bin/env foo bar baz, the env process will exec the foo command, making foo the new PID 1, which for our purposes gives it the same behavior as a null entrypoint.

Both the fpco/pid1 and snoyberg/docker-testing images we'll use below set /sbin/pid1 as the default entrypoint. In the example commands, we're explicitly including --entrypoint /sbin/pid1. This is just to be clear on which entrypoint is being used; if you exclude that option, the same behavior will persist.

Sending TERM signal to process

We'll start with our sigterm.hs program, which runs ps (we'll see why soon), then sends itself a SIGTERM and then loops forever. On a Unix system, the default process behavior when receiving a SIGTERM is to exit. Therefore, we'd expect that our process will just exit when run. Let's see:

$ docker run --rm --entrypoint /usr/bin/env snoyberg/docker-testing sigterm
  PID TTY          TIME CMD
    1 ?        00:00:00 sigterm
    9 ?        00:00:00 ps
Still alive!
Still alive!
Still alive!
^C
$

The process ignored the SIGTERM and kept running, until I hit Ctrl-C (we'll see what that does later). Another feature in the sigterm code base, though, is that if you give it the command line argument install-handler, it will explicitly install a SIGTERM handler which will kill the process. Perhaps surprisingly, this has a significant impact on our application:

$ docker run --rm --entrypoint /usr/bin/env snoyberg/docker-testing sigterm install-handler
  PID TTY          TIME CMD
    1 ?        00:00:00 sigterm
    8 ?        00:00:00 ps
Still alive!
$

The reason for this is some Linux kernel magic: the kernel treats a process with PID 1 specially, and does not, by default, kill the process when receiving the SIGTERM or SIGINT signals. This can be very surprising behavior. For a simpler example, try running the following commands in two different terminals:

$ docker run --rm --name sleeper ubuntu:16.04 sleep 100
$ docker kill -s TERM sleeper

Notice how the docker run command does not exit, and if you check your ps aux output, you'll see that the process is still running. That's because the sleep process was not designed to be PID 1, and does not install a special signal handler. To work around this problem, you've got two choices:

  1. Ensure every command you run from docker run has explicit handling of SIGTERM (a minimal sketch of this follows the list).
  2. Make sure the command you run isn't PID 1, but instead use a process that is designed to handle SIGTERM correctly.
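For the first option, here is a minimal Haskell sketch of installing an explicit SIGTERM handler. This is in the spirit of what sigterm's install-handler mode does, not its actual source, and it assumes the unix package:

import Control.Concurrent (threadDelay)
import Control.Monad (forever)
import System.Exit (ExitCode (ExitFailure))
import System.Posix.Process (exitImmediately)
import System.Posix.Signals (Handler (Catch), installHandler, sigTERM)

main :: IO ()
main = do
  -- Running as PID 1, we get no default SIGTERM disposition from the
  -- kernel, so install a handler ourselves. Exit code 143 follows the
  -- 128 + signal-number (15) convention for death-by-SIGTERM.
  _ <- installHandler sigTERM (Catch (exitImmediately (ExitFailure 143))) Nothing
  forever (threadDelay 1000000 >> putStrLn "Still alive!")

The second option is what pid1 provides, as we'll see next.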

Let's see how the sigterm program works with our /sbin/pid1 entrypoint:

$ docker run --rm --entrypoint /sbin/pid1 snoyberg/docker-testing sigterm
  PID TTY          TIME CMD
    1 ?        00:00:00 pid1
    8 ?        00:00:00 sigterm
   12 ?        00:00:00 ps

The program exits immediately, as we'd like. But look at the ps output: our first process is now pid1 instead of sigterm. Since sigterm is being launched as a different PID (8 in this case), the special casing from the Linux kernel does not come into play, and default SIGTERM handling is active. To step through exactly what happens in our case:

  1. Our container is created, and the command /usr/sbin/pid1 sigterm is run inside of it.
  2. pid1 starts as PID-1, does its business, and then fork/execs the sigterm executable.
  3. sigterm raises the SIGTERM signal to itself, causing it to die.
  4. pid1 sees that its child died from SIGTERM (== signal 15) and exits with exit code 143 (== 128 + 15).
  5. Since our PID1 is dead, our container dies too.

This isn't just some magic with sigterm, you can do the same thing with sleep:

$ docker run --rm --name sleeper fpco/pid1 sleep 100
$ docker kill -s TERM sleeper

Unlike with the ubuntu image, this will kill the container immediately, due to the /sbin/pid1 entrypoint used by fpco/pid1.

NOTE In the case of sigterm, which sends the TERM signal to itself, it turns out you don't need a special PID1 process with signal handling, anything will do. For example, try docker run --rm --entrypoint /usr/bin/env snoyberg/docker-testing /bin/bash -c "sigterm;echo bye". But playing with sleep will demonstrate the need for a real signal-aware PID1 process.

Ctrl-C: sigterm vs sleep

There's a slight difference between sigterm and sleep when it comes to the behavior of hitting Ctrl-C. When you use Ctrl-C, it sends a SIGINT to the docker run process, which proxies that signal to the process inside the container. sleep will ignore it, just as it ignores SIGTERM, due to the default signal handlers for PID1 in the Linux kernel. However, the sigterm executable is written in Haskell, and the Haskell runtime itself installs a signal handler that converts SIGINT into a user interrupt exception, overriding the PID1 default behavior. For more on signal proxying, see the docker attach documentation.
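To see that conversion concretely, here is a small sketch (again, not the actual sigterm source) of how a Haskell program observes Ctrl-C as a UserInterrupt exception:

import Control.Concurrent (threadDelay)
import Control.Exception (AsyncException (UserInterrupt), catch, throwIO)
import Control.Monad (forever)

main :: IO ()
main =
  forever (threadDelay 1000000 >> putStrLn "Still alive!")
    `catch` \e -> case e of
      -- The GHC runtime turns SIGINT into this asynchronous exception,
      -- even when the process is PID 1 inside a container.
      UserInterrupt -> putStrLn "Caught Ctrl-C, cleaning up"
      _             -> throwIO e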

Reaping orphans

Suppose you have process A, which fork/execs process B. When process B dies, process A must call waitpid to get its exit status from the kernel, and until it does so, process B will be dead but with an entry in the system process table. This is known as being a zombie.

But what happens if process B outlives process A? In this case, process B is known as an orphan, and needs to be adopted by the init process, aka PID1. It is the init process's job to reap orphans so they do not remain as zombies.
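The heart of that reaping duty is a waitpid loop. Here is a minimal Haskell sketch using the unix package; pid1's real loop is more careful (for instance, it must handle the ECHILD error once no children remain, which is omitted here):

import System.Posix.Process (getAnyProcessStatus)

-- Block until any child changes state, collect its exit status
-- (which removes the zombie's process-table entry), and repeat.
reapLoop :: IO ()
reapLoop = do
  mstatus <- getAnyProcessStatus True False
  case mstatus of
    Nothing            -> reapLoop
    Just (pid, status) -> do
      putStrLn ("Reaped " ++ show pid ++ ": " ++ show status)
      reapLoop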

The orphans.hs program will:

  • Spawn a child process, and then loop forever calling ps
  • In the child process: run the echo command a few times, without calling waitpid, and then exit

As you can see, none of the processes involved will reap the zombie echo processes. The output from the process confirms that we have, in fact, created zombies:

$ docker run --rm --entrypoint /usr/bin/env snoyberg/docker-testing orphans
1
2
3
4
Still alive!
  PID TTY          TIME CMD
    1 ?        00:00:00 orphans
    8 ?        00:00:00 orphans
   13 ?        00:00:00 echo <defunct>
   14 ?        00:00:00 echo <defunct>
   15 ?        00:00:00 echo <defunct>
   16 ?        00:00:00 echo <defunct>
   17 ?        00:00:00 ps
Still alive!
  PID TTY          TIME CMD
    1 ?        00:00:00 orphans
   13 ?        00:00:00 echo <defunct>
   14 ?        00:00:00 echo <defunct>
   15 ?        00:00:00 echo <defunct>
   16 ?        00:00:00 echo <defunct>
   18 ?        00:00:00 ps
Still alive!

And so on until we kill the container. That <defunct> indicates a zombie process. The issue is that our PID 1, orphans, doesn't do reaping. As you probably guessed, we can solve this by just using the /sbin/pid1 entrypoint:

$ docker run --rm --entrypoint /sbin/pid1 snoyberg/docker-testing orphans
1
2
3
4
Still alive!
  PID TTY          TIME CMD
    1 ?        00:00:00 pid1
   10 ?        00:00:00 orphans
   14 ?        00:00:00 orphans
   19 ?        00:00:00 echo <defunct>
   20 ?        00:00:00 echo <defunct>
   21 ?        00:00:00 echo <defunct>
   22 ?        00:00:00 echo <defunct>
   23 ?        00:00:00 ps
Still alive!
  PID TTY          TIME CMD
    1 ?        00:00:00 pid1
   10 ?        00:00:00 orphans
   24 ?        00:00:00 ps
Still alive!

pid1 now adopts the echo processes when the child orphans process dies, and reaps accordingly.

Surviving children

Let's try out something else: process A is the primary command for the Docker container, and it spawns process B. Before process B exits, process A exits, causing the Docker container to exit. In this case, the running process B will be forcibly closed by the kernel (see this Stack Overflow question for details). We can see this with our surviving.hs program:

$ docker run --rm --entrypoint /usr/bin/env snoyberg/docker-testing surviving
Parent sleeping
Child: 1
Child: 2
Child: 4
Child: 3
Child: 1
Child: 2
Child: 3
Child: 4
Parent exiting

Unfortunately this doesn't give our child processes a chance to do any cleanup. Instead, we would rather send them a SIGTERM, and after a grace period send them a SIGKILL.
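The essence of that shutdown sequence fits in a few lines of Haskell. This is a sketch only: the function name and the five-second grace period are assumptions for illustration, not pid1's actual implementation:

import Control.Concurrent (threadDelay)
import System.Posix.Signals (sigKILL, sigTERM, signalProcessGroup)
import System.Posix.Types (ProcessGroupID)

-- Ask the children nicely, wait out a grace period, then force the issue.
shutdownChildren :: ProcessGroupID -> IO ()
shutdownChildren pgid = do
  signalProcessGroup sigTERM pgid
  threadDelay (5 * 1000 * 1000)  -- assumed 5-second grace period
  signalProcessGroup sigKILL pgid

This is exactly what pid1 does: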

$ docker run --rm --entrypoint /sbin/pid1 snoyberg/docker-testing surviving
Parent sleeping
Child: 2
Child: 3
Child: 1
Child: 4
Child: 2
Child: 1
Child: 4
Child: 3
Parent exiting
Got a TERM
Got a TERM
Got a TERM
Got a TERM

Signaling docker run vs PID1

When you run sleep 60 and then hit Ctrl-C, the sleep process itself receives a SIGINT. When you instead run docker run --rm fpco/pid1 sleep 60 and hit Ctrl-C, you may think that the same thing is happening. However, in reality, it's not at all the same. Your docker run call creates a docker run process, which sends a command to the Docker daemon on your machine, and that daemon creates the actual sleep process (inside a container). When you hit Ctrl-C on your terminal, you're sending SIGINT to docker run, which is in fact sending a command to the Docker daemon, which in turn sends a SIGINT to your sleep process.

Want proof? Try out the following:

$ docker run --rm fpco/pid1 sleep 60&
[1] 417
$ kill -KILL $!
$ docker ps
CONTAINER ID        IMAGE               COMMAND                 CREATED             STATUS              PORTS               NAMES
69fbc70e95e2        fpco/pid1           "/sbin/pid1 sleep 60"   11 seconds ago      Up 11 seconds                           hopeful_mayer
[1]+  Killed                  docker run --rm fpco/pid1 sleep 60

In this case, we sent a SIGKILL to the docker run command. Unlike SIGINT or SIGTERM, SIGKILL cannot be handled, and therefore docker run is unable to delegate signal handling to a different process. As a result, the docker run command itself dies, but the sleep process (and its container) continues running.

Some takeaways from this:

  • Make sure you use something like pid1 so that a SIGINT or SIGTERM to the docker run process actually gets your container to shut down reliably
  • If you must send a SIGKILL to your process, use the docker kill command instead
Alternative to entrypoint

We've used --entrypoint /sbin/pid1 a lot here. In fact, each usage of that has been superfluous, since the fpco/pid1 and snoyberg/docker-testing images both use /sbin/pid1 as their default entrypoint anyway. I included it for explicitness. To prove it to you:

$ docker run --rm fpco/pid1 sleep 60
^C$

But if you don't want to muck with entrypoints, you can always just include /sbin/pid1 at the beginning of your command, e.g.:

$ docker run --rm --entrypoint /usr/bin/env fpco/pid1 /sbin/pid1 sleep 60
^C$

And if you have your own Docker image and you'd just like to include the pid1 executable, you can download it from the Github releases page.

Dockerfiles, command vs exec form

You may be tempted to put something like ENTRYPOINT /sbin/pid1 in your Dockerfile. Let's see why that won't work:

$ cat Dockerfile
FROM fpco/pid1
ENTRYPOINT /sbin/pid1
$ docker build --tag test .
Sending build context to Docker daemon 2.048 kB
Step 1 : FROM fpco/pid1
 ---> aef1f7b702b9
Step 2 : ENTRYPOINT /sbin/pid1
 ---> Using cache
 ---> f875b43a9e40
Successfully built f875b43a9e40
$ docker run --rm test ps
pid1: No arguments provided

The issue here is that we specified /sbin/pid1 in what Docker calls command form. This is just a raw string which is interpreted by the shell. It is unable to be passed an additional command (like ps), and therefore pid1 itself complains that it hasn't been told what to run. The correct way to specify your entrypoint is ENTRYPOINT ["/sbin/pid1"], e.g.:

$ cat Dockerfile
FROM fpco/pid1
ENTRYPOINT ["/sbin/pid1"]
$ docker build --tag test .
Sending build context to Docker daemon 2.048 kB
Step 1 : FROM fpco/pid1
 ---> aef1f7b702b9
Step 2 : ENTRYPOINT ["/sbin/pid1"]
 ---> Running in ba0fa8c5bd41
 ---> 4835dec4aae6
Removing intermediate container ba0fa8c5bd41
Successfully built 4835dec4aae6
$ docker run --rm test ps
  PID TTY          TIME CMD
    1 ?        00:00:00 pid1
    8 ?        00:00:00 ps

Generally speaking, you should stick with exec form in your Dockerfiles at all times. It is explicit about whitespace handling, and avoids the need to use a shell as an interpreter.

Takeaways

The main takeaway here is: unless you have a good reason to do otherwise, you should use a minimal init process like pid1. The Phusion/my_init approach works, but may be too heavyweight for some. If you don't need syslog and other add-on features of Phusion, you're probably best off with a minimal init instead.

As a separate but somewhat related comment: we're going to have a follow-up post on this blog in the coming days explaining how we compiled the pid1 executable as a static executable to make it compatible with various Linux flavors, and how you can do the same for your Haskell executables. Stay tuned!

Categories: Offsite Blogs