News aggregator

Functional Jobs: Software Engineer (Haskell/Clojure) at Capital Match (Full-time)

Planet Haskell - Fri, 08/12/2016 - 5:05am

Overview

Capital Match is a leading marketplace lending and invoice financing platform in Singapore. Our in-house platform, mostly developed in Haskell, has in the last year processed more than USD 10 million in business loans, with strong monthly growth (currently USD 1.5-2.5 million per month). We are also eyeing expansion into new geographies and product categories. Very exciting times!

We have just secured another funding round to build world-class technology as the key business differentiator. The key components include a credit risk engine, seamless banking integration and end-to-end product automation from loan origination to debt collection.

Responsibilities

We are looking to hire a software engineer with a minimum of 2-3 years of coding experience. The current tech team includes a product manager and 3 software engineers. We are also in the process of hiring a CTO.

The candidate should have been involved in the development of multiple web-based products from scratch, and should be interested in all aspects of the creation, growth and operations of a secure web-based platform: front-to-back feature development, distributed deployment and automation in the cloud, build and test automation, etc.

A background in fintech, especially the lending / invoice financing space, would be a great advantage.

Requirements

Our platform is primarily developed in Haskell with an Om/ClojureScript frontend. We expect our candidate to have experience working with a functional programming language, e.g. Haskell/Scala/OCaml/F#/Clojure/Lisp/Erlang.

Deployment and production are managed with Docker containers using standard cloud infrastructure, so familiarity with Linux systems, the command-line environment and cloud-based deployment is mandatory. Minimum exposure to and understanding of XP practices (TDD, CI, Emergent Design, Refactoring, Peer review and programming, Continuous improvement) is expected.

We are looking for candidates who are living in, or are willing to relocate to, Singapore.

Offer

We offer a combination of salary and equity depending on experience and skills of the candidate.

Most expats who relocate to Singapore do not have to pay their home-country taxes, and the local tax rate in Singapore is roughly 5% (effective on the proposed salary range).

Visa sponsorship will be provided.

Singapore is a great place to live, a vibrant city rich with diverse cultures, a very strong financial sector and a central location in Southeast Asia.

Get information on how to apply for this position.

Categories: Offsite Blogs

FP Complete: Practical Haskell: Bitrot-free Scripts

Planet Haskell - Thu, 08/11/2016 - 8:15am

Sneak peek: Run docker run --rm -p 8080:8080 snoyberg/file-server-demo and open http://localhost:8080.

We've all been there. We need to write some non-trivial piece of functionality, and end up doing it in bash or perl because that's what we have on the server we'll be deploying to. Or because it's the language we can most easily rely on being present at a consistent version on our coworkers' machines. We'd rather use a different language and leverage more advanced, non-standard libraries, but we can't do that reliably.

One option is to create static executables or to ship around Docker images. This is great for many use cases, and we are going to have a follow-up blog post about using Docker and Alpine Linux to make such static executables. But there are at least two downsides to this approach:

  • It's not possible to modify a static executable directly. You need to have access to the source code and the tool chain used to produce it.
  • The executable is tied to a single operating system; good luck getting your Linux executable to run on your OS X machine.

Said another way: there are good reasons why people like to use scripting languages. This blog post is going to demonstrate doing some non-trivial work with Haskell, and do so with a fully reproducible and trivially installed toolchain, supported on multiple operating systems.

Why Haskell?

Haskell is a functional programming language with high performance, great safety features, and a large ecosystem of open source libraries to choose from. Haskell programs are high level enough to be readable and modifiable by non-experts, making it ideal for these kinds of shared scripts. If you're new to Haskell, learn more on haskell-lang.org.

The task

We're going to put together a simple file server with upload capability. We're going to assume a non-hostile environment (like a corporate LAN with no external network access), and therefore not put in security precautions like upload size limits. We're going to use the relatively low-level Web Application Interface instead of a web framework. While it makes the code a bit longer, there's no magic involved. Common frameworks in Haskell include Yesod and Servant. We're going to host this all with the blazingly fast Warp web server.
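
To give a concrete sense of what coding directly against WAI looks like, here is a minimal, self-contained example (my own sketch, separate from the file server) that answers every request with plain text:

{-# LANGUAGE OverloadedStrings #-}
import Network.HTTP.Types (status200)
import Network.Wai (responseLBS)
import Network.Wai.Handler.Warp (run)

-- Answer every request with a constant plain-text response on port 8080.
main :: IO ()
main = run 8080 $ \_req send ->
    send $ responseLBS status200 [("Content-Type", "text/plain")] "Hello from WAI!"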

Get Stack

Stack is a cross-platform program for developing Haskell projects. While it has many features, in our case the most important bit is that it can:

  • Download a complete Haskell toolchain for your OS
  • Install Haskell libraries from a curated package set
  • Run Haskell source files directly as a script (we'll show how below)

Check out the Get Started page on haskell-lang.org to get Stack on your system.

The code

You can see the full source code on Github. Let's step through the important parts here.

Script interpreter

We start off our file with something that is distinctly not Haskell code:

#!/usr/bin/env stack
{- stack
     --resolver lts-6.11
     --install-ghc
     runghc
     --package shakespeare
     --package wai-app-static
     --package wai-extra
     --package warp
-}

With this header, we've made our file executable from the shell. If you chmod +x the source file, you can run ./FileServer.hs. The first line is a standard shebang. After that, we have a comment that provides Stack with the relevant command line options. These options tell it to:

  • Use the Haskell Long Term Support (LTS) 6.11 package set. From now through the rest of time, you'll be running against the same set of packages, so no worries about your code bitrotting!
  • Install GHC, the Glasgow Haskell Compiler. LTS 6.11 indicates what version of GHC is needed (GHC 7.10.3). Once again: no bitrot concerns!
  • runghc says we'd like to run a script with GHC
  • The rest of the lines specify which Haskell library packages we depend on. You can see a full list of available libraries in LTS 6.11 on the Stackage server

For more information on Stack's script interpreter support, see the Stack user guide.

Command line argument parsing

Very often with these kinds of tools, we need to handle command line arguments. Haskell has some great libraries for doing this in an elegant way. For example, see the optparse-applicative library tutorial. However, if you want to go simple, you can also just use the getArgs function to get a list of arguments. We're going to add support for a sanity argument, which will allow us to sanity-check that running our application works:

main :: IO ()
main = do
    args <- getArgs
    case args of
        ["sanity"] -> putStrLn "Sanity check passed, ready to roll!"
        [] -> do
            putStrLn "Launching application"
            -- Run our application (defined below) on port 8080
            run 8080 app
        _ -> error $ "Unknown arguments: " ++ show args

Routing

We're going to support three different routes in our application:

  • The /browse/... tree should allow you to get a directory listing of files in the current directory, and view/download individual files.
  • The /upload page accepts a file upload and writes the uploaded content to the current directory.
  • The homepage (/) should display an HTML page with a link to /browse and provide an HTML upload form targeting /upload.

Thanks to pattern matching in Haskell, getting this to work is very straightforward:

app :: Application
app req send =
    -- Route the request based on the path requested
    case pathInfo req of
        -- "/": send the HTML homepage contents
        [] -> send $ responseBuilder
                  status200
                  [("Content-Type", "text/html; charset=utf-8")]
                  (runIdentity $ execHtmlT homepage)

        -- "/browse/...": use the file server to allow directory
        -- listings and downloading files
        ("browse":rest) ->
            -- We create a modified request that strips off the
            -- "browse" component of the path, so that the file server
            -- does not need to look inside a /browse/ directory
            let req' = req { pathInfo = rest }
             in fileServer req' send

        -- "/upload": handle a file upload
        ["upload"] -> upload req send

        -- anything else: 404
        _ -> send $ responseLBS
                 status404
                 [("Content-Type", "text/plain; charset=utf-8")]
                 "Not found"

The most complicated bit above is the path modification for the /browse tree, which is something a web framework would handle for us automatically. Remember: we're working at this low level to avoid extra concepts; real-world code is typically even easier than this!

Homepage content

An area where Haskell really excels is Domain Specific Languages (DSLs). We're going to use Hamlet for HTML templating. There are many other options in the Haskell world favoring other syntaxes, such as the Lucid library (which provides a Haskell-based DSL), plus implementations of language-agnostic templates, like mustache.

Here's what our HTML page looks like in Hamlet:

homepage :: Html ()
homepage = [shamlet|
$doctype 5
<html>
    <head>
        <title>File server
    <body>
        <h1>File server
        <p>
            <a href=/browse/>Browse available files
        <form method=POST action=/upload enctype=multipart/form-data>
            <p>Upload a new file
            <input type=file name=file>
            <input type=submit>
|]

Note that Hamlet - like Haskell itself - uses significant whitespace and indentation to denote nesting.

The rest

We're not going to cover the rest of the code in the Haskell file. If you're interested in the details, please read the comments there, and feel free to ask questions about any ambiguous bits (hopefully the inline comments give enough clarity on what's going on).
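
For a taste of what's in there, an upload handler along the following lines would fit the /upload route above. This is my own sketch using wai-extra's Network.Wai.Parse module, not the post's actual code, so the real FileServer.hs may differ in the details:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as S8
import qualified Data.ByteString.Lazy as L
import Network.HTTP.Types (status303, status400)
import Network.Wai (Application, responseLBS)
import Network.Wai.Parse (fileContent, fileName, lbsBackEnd, parseRequestBody)

-- Accept a multipart/form-data POST, write the uploaded file to the
-- current directory, and redirect back to the homepage.
upload :: Application
upload req send = do
    (_params, files) <- parseRequestBody lbsBackEnd req
    case lookup "file" files of
        Just info -> do
            L.writeFile (S8.unpack (fileName info)) (fileContent info)
            send $ responseLBS status303 [("Location", "/")] ""
        Nothing -> send $ responseLBS status400
            [("Content-Type", "text/plain; charset=utf-8")]
            "Missing file parameter"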

Running

Download the FileServer.hs file contents (or copy-paste, or clone the repo), make sure the file is executable (chmod +x FileServer.hs), and then run:

$ ./FileServer.hs

If you're on Windows, you can instead run:

> stack FileServer.hs

That's correct: the same source file will work on POSIX systems and Windows as well. The only requirement is Stack and GHC support. Again, to get Stack on your system, please see the Get Started page.

The first time you run this program, it will take a while to complete. This is because Stack will need to download and install GHC and necessary libraries to a user-local directory. Once complete, the results are kept on your system, so subsequent runs will be almost instantaneous.

Once running, you can view the app on localhost:8080.

Dockerizing

Generally, I wouldn't recommend Dockerizing a source file like this; it makes more sense to Dockerize a compiled executable. We'll cover how to do that another time (though sneak preview: Stack has built-in support for generating Docker images). For now, let's actually Dockerize the source file itself, complete with Stack and the GHC toolchain.

You can check out the Dockerfile on Github. That file may be slightly different from what I cover here.

FROM ubuntu:16.04
MAINTAINER Michael Snoyman

Nothing too interesting...

ADD https://github.com/Yelp/dumb-init/releases/download/v1.1.3/dumb-init_1.1.3_amd64 /usr/local/bin/dumb-init
RUN chmod +x /usr/local/bin/dumb-init

While interesting, this isn't Haskell-specific. We're just using an init process to get proper handling for signals. For more information, see dumb-init's announcement blog post.

ADD https://get.haskellstack.org/get-stack.sh /usr/local/bin/
RUN sh /usr/local/bin/get-stack.sh

Stack has a shell script available to automatically install it on POSIX systems. We just download that script and then run it. This is all it takes to have a Haskell-ready system set up: we're now ready to run script interpreter based files like our FileServer.hs!

COPY FileServer.hs /usr/local/bin/file-server
RUN chmod +x /usr/local/bin/file-server

We're copying over the source file we wrote and then ensuring it is executable. Interestingly, we can rename it to not include a .hs file extension. There is plenty of debate in the world around whether scripts should or should not include an extension indicating their source language; Haskell is allowing that debate to perpetuate :).

RUN useradd -m www && mkdir -p /workdir && chown www /workdir
USER www

While not strictly necessary, we'd rather not run our executable as the root user, for security purposes. Let's create a new user, create a working directory to store files in, and run all subsequent commands as the new user.

RUN /usr/local/bin/file-server sanity

As I mentioned above, that initial run of the server takes a long time. We'd like to do the heavy lifting of downloading and installing during the Docker image build rather than at runtime. To make this happen, we run our program once with the sanity command line argument, so that it immediately exits after successfully starting up.

CMD ["/usr/local/bin/dumb-init", "/usr/local/bin/file-server"]
WORKDIR /workdir
EXPOSE 8080

Finally, we use CMD, WORKDIR, and EXPOSE to make it easier to run. This Docker image is available on Docker Hub, so if you'd like to try it out without doing a full build on your local machine:

docker run --rm -p 8080:8080 snoyberg/file-server-demo

You should be able to play with the application on http://localhost:8080.

What's next

As you can see, getting started with Haskell as a scripting language is easy. You may be interested in checking out the turtle library, which is a shell scripting DSL written in Haskell.

If you're ready to get deeper into Haskell, I'd recommend the learning resources on haskell-lang.org.

FP Complete both supports the open source Haskell ecosystem and provides commercial support for those seeking it. If you're interested in learning more about how FP Complete can help you and your team be more successful in your development and devops work, you can learn about what services we offer or contact us for a free consultation.

Categories: Offsite Blogs

Functional Jobs: (Senior) Scala Developer at SAP SE (Full-time)

Planet Haskell - Thu, 08/11/2016 - 3:56am

About SAP

SAP is a market leader in enterprise application software, helping companies of all sizes and industries run better. SAP empowers people and organizations to work together more efficiently and use business insight more effectively. SAP applications and services enable our customers to operate profitably, adapt continuously, and grow sustainably.

What you'll do:

You will be a member of the newly formed Scala development experience team. You will support us with the design and development of a Scala- and cloud-based business application development and runtime platform. The goal of the platform is to make cloud-based business application development in the context of S/4 HANA as straightforward as possible. The team will be distributed over Berlin, Potsdam, Walldorf and Bangalore.

Your tasks as a (Senior) Scala Developer will include:

  • Design and development of libraries and tools for business application development
  • Design and development of tools for operating business applications
  • Explore, understand, and implement the most recent technologies
  • Contribute to open source software (in particular within the Scala ecosystem)

Required skills:

  • Master’s degree in computer science, mathematics, or related field
  • Excellent programming skills and a solid foundation in computer science with strong competencies in data structures, algorithms, databases, and software design
  • Solid understanding of object-oriented concepts and a basic understanding of functional programming concepts
  • Good knowledge of Scala, Java, C++, or similar object-oriented programming languages
  • Strong analytical skills
  • Reliable and open-minded, with strong teamwork skills, the determination to reach goals on time, and the ability to work independently and to prioritize
  • Ability to get quickly up-to-speed in a complex, new environment
  • Proficiency in spoken and written English

Beneficial skills

  • Ph.D. in computer science
  • Solid understanding of functional programming concepts
  • Good knowledge of Scala, OCaml, SML, or Haskell
  • Experience with Scala and Scala.js
  • Experience with metaprogramming in Scala, e.g., using Scala’s macro system
  • Knowledge of SAP technologies and products
  • Experience with the design of distributed systems, e.g., using Akka

What we offer

  • Modern and innovative office locations
  • Free lunch and free coffee
  • Flexible working hours
  • Training opportunities and conference visits
  • Fitness room with a climbing wall
  • Gaming room with table tennis, foosball tables and a PlayStation
  • Friendly colleagues and the opportunity to work within a highly diverse team which has expert knowledge in a wide range of technologies

Get information on how to apply for this position.

Categories: Offsite Blogs

Mark Watson: Some Haskell hacks: SPARQL queries to DBPedia and using OpenCalais web service

Planet Haskell - Tue, 08/09/2016 - 11:49am
For various personal (and a few consulting) projects I need to access DBPedia and other SPARQL endpoints. I use the hsparql Haskell library, written by Jeff Wheeler and maintained by Rob Stewart. The following code snippet runs a DESCRIBE query against DBpedia:


{-# LANGUAGE ScopedTypeVariables, OverloadedStrings #-}

module Sparql2 where

import Database.HSparql.Connection
import Database.HSparql.QueryGenerator

import Data.RDF hiding (triple)
import Data.RDF.TriplesGraph

simpleDescribe :: Query DescribeQuery
simpleDescribe = do
  resource <- prefix "dbpedia" (iriRef "http://dbpedia.org/resource/")
  uri <- describeIRI (resource .:. "Sedona_Arizona")
  return DescribeQuery { queryDescribe = uri }

doit = do
  (rdfGraph :: TriplesGraph) <- describeQuery "http://dbpedia.org/sparql" simpleDescribe
  --mapM_ print (triplesOf rdfGraph)
  --print "\n\n\n"
  --print rdfGraph
  mapM (\(Triple s p o) ->
          case [s, p, o] of
            [UNode(s), UNode(p), UNode(o)]            -> return (s, p, o)
            [UNode(s), UNode(p), LNode(PlainLL o2 l)] -> return (s, p, o2)
            [UNode(s), UNode(p), LNode(TypedL o2 l)]  -> return (s, p, o2)
            _                                         -> return ("no match", "no match", "no match"))
       (triplesOf rdfGraph)

main = do
  results <- doit
  print $ results !! 0
  mapM_ print results

I find the OpenCalais web service for finding entities in text and categorizing text to be very useful. This code snippet uses the same hacks for processing the RDF returned by OpenCalais that I used in my last semantic web book:

NOTE: August 9, 2016: the following example no longer works because of API changes:


module OpenCalais (calaisResults) where

import Network.HTTP
import Network.HTTP.Base (urlEncode)

import qualified Data.Map as M
import qualified Data.Set as S

import Control.Monad.Trans.Class (lift)

import Data.String.Utils (replace)
import Data.List (lines, isInfixOf)
import Data.List.Split (splitOn)
import Data.Maybe (maybe)

import System.Environment (getEnv)

calaisKey = getEnv "OPEN_CALAIS_KEY"

escape s = urlEncode s

baseParams = "<c:params xmlns:c=\"http://s.opencalais.com/1/pred/\" xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"><c:processingDirectives c:contentType=\"text/txt\" c:outputFormat=\"xml/rdf\"></c:processingDirectives><c:userDirectives c:allowDistribution=\"true\" c:allowSearch=\"true\" c:externalID=\"17cabs901\" c:submitter=\"ABC\"></c:userDirectives><c:externalMetadata></c:externalMetadata></c:params>"

calaisResults s = do
  key <- calaisKey
  let baseUrl = "http://api.opencalais.com/enlighten/calais.asmx/Enlighten?licenseID="
                ++ key ++ "&content=" ++ (escape s) ++ "&paramsXML="
                ++ (escape baseParams)
  ret <- simpleHTTP (getRequest baseUrl) >>=
         fmap (take 10000) . getResponseBody
  return $ map (\z -> splitOn ": " z) $
           filter (\x -> isInfixOf ": " x && length x < 40)
                  (lines (replace "\r" "" ret))

main = do
  r <- calaisResults "Berlin Germany visited by George W. Bush to see IBM plant. Bush met with President Clinton. Bush said \"felt it important to step it up\""
  print r

You need to have your free OpenCalais developer key in the environment variable OPEN_CALAIS_KEY. The key is free and allows you to make 50K API calls a day (throttled to four per second).
I have been trying to learn Haskell for about four years, so if anyone has any useful critiques of these code examples, please speak up :-)
Categories: Offsite Blogs

Functional Jobs: Head of Data Science at Capital Match (Full-time)

Planet Haskell - Tue, 08/09/2016 - 9:04am

Overview

Capital Match is a leading marketplace lending and invoice financing platform in Singapore. Our in-house platform, mostly developed in Haskell, has in the last year processed more than USD 10 million in business loans, with strong monthly growth (currently USD 1.5-2.5 million per month). We are also eyeing expansion into new geographies and product categories. Very exciting times!

We have just secured another funding round to build world-class technology as the key business differentiator. The key components include a credit risk engine, seamless banking integration and end-to-end product automation from loan origination to debt collection.

Complementing our technology, we aspire to put data science at the core of everything we do.

Responsibilities

We are looking to hire an experienced data scientist / software engineer with a passion for data for the role of Head of Data Science. Data science needs to impact every stage of our workflow, including customer acquisition, operational automation, risk and underwriting, portfolio servicing, marketing and product development.

We have functional teams in every major area (sales, credit, tech, product) and the Head of Data Science would be a cross-functional role improving decision-making processes and outcomes across the company. The person would report directly to the CEO and the Board of Directors.

The candidate would be expected to:

  • Lead a bold agenda around the use of transaction data in new creative ways
  • Work with multiple, complex data sources at large scale
  • Utilize big data and machine learning to build predictive models for areas including, but not limited to, customer acquisition, credit risk, fraud and marketing
  • Perform thorough testing and validation of models and support various aspects of the business with data analytics
  • Identify new data sources / patterns that add significant lift to predictive modeling capabilities

We are looking for an individual who is eager to use data to come up with new ideas to improve decisions, who is driven by making impact through actionable insights and improvements, and who is not afraid to take risks and try new things.

A background in fintech, especially the lending / invoice financing space, would be a great advantage.

Requirements

A minimum of 5 years of coding, machine learning and large-scale data analysis experience.

Our platform is primarily developed in Haskell with an Om/ClojureScript frontend. We expect our candidate to have experience working with a functional programming language, e.g. Haskell/Scala/OCaml/F#/Clojure/Lisp/Erlang.

Deployment and production are managed with Docker containers using standard cloud infrastructure, so familiarity with Linux systems and the command-line environment would be helpful. Minimum exposure to and understanding of XP practices (TDD, CI, Emergent Design, Refactoring, Peer review and programming, Continuous improvement) is expected.

We are looking for candidates who are living in, or are willing to relocate to, Singapore.

Offer

We offer a combination of salary and equity that strongly depends on the candidate's experience and skills:

Salary: USD 5,000-10,000 / month

Equity: 0.5-1.5% (subject to vesting)

Most expats who relocate to Singapore do not have to pay their home-country taxes, and the local tax rate in Singapore is roughly 5% (effective on the proposed salary range).

Visa sponsorship will be provided.

Singapore is a great place to live, a vibrant city rich with diverse cultures, a very strong financial sector and a central location in Southeast Asia.

Get information on how to apply for this position.

Categories: Offsite Blogs

Manuel M T Chakravarty: Learning Haskell

Planet Haskell - Sun, 08/07/2016 - 10:04pm

We just published the seventh chapter of our new tutorial for Learning Haskell. The tutorial combines clear explanations with screencasts reinforcing new concepts with live coding. Several chapters use graphics programming to make for more engaging coding examples.

Categories: Offsite Blogs

Stefan Jacholke: Go on till you come to the end

Planet Haskell - Sun, 08/07/2016 - 4:00pm

A conclusion on the FunBlocks HSOC project

With the conclusion of Haskell Summer of Code it’s time to wrap up everything and talk about the result.

With this post, I’ll go through my Haskell Summer of Code 2016 project, what it entailed and maybe a few thoughts on the subject.

A quick review

The project entails the creation of a functional block-based programming language for CodeWorld. The aim is to create a user-friendly educational environment where students new to the concept of programming can familiarize themselves. Once students reach a certain maturity they graduate to the text-based CodeWorld editor.

The project roughly comprises the following:

  • A block based user interface in order to drag and drop blocks. (We use a modified Blockly)
  • Code generation in order to see the result of dragging and dropping blocks.
  • Integration with CodeWorld in order to create CodeWorld applications and run them.

The CodeWorld language is embedded within Haskell and for this project we generate a subset of it.

In order to showcase the project, I’ll go through some of the features.

Polymorphic blocks

Polymorphism is quite important and was required, for even a basic if statement requires it.

For an if we have the type Bool -> a -> a -> a.

Inputs are represented on the right side (external) or otherwise within the block (internal). The output type is given on the left side of the block.

Internal style block:

External style block:

Taking the if block as an example:

We have a polymorphic type indicator for the types. The color being the same indicates all of the types are the same type variable.

Once a block gets connected, the most general type might be monomorphic, which then updates the shape of the block:

We can see that this is constraining the type variable a to Number

Hindley-Milner Type Inference

I think I should’ve implemented this at the start. Throughout the project, unification was a constant problem, with some blocks not always taking on the correct type in the given context. Much roundabout effort was spent on something that was replaced at the end.

What is Hindley-Milner?

It is a simple type system for lambda calculus. For this project, algorithm W was implemented in order to infer the types. A simple system of tagging was used. We tag the connection types as the algorithm goes along, and at the end apply the required substitutions.
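
To make the flavor of this concrete, here is a minimal sketch of syntactic unification, the core step of algorithm W. This is my own illustration, not FunBlocks code, and the occurs check is omitted for brevity:

import qualified Data.Map as M

-- A tiny type language: variables, constants, and function arrows.
data Type = TVar String | TCon String | TFun Type Type
    deriving (Eq, Show)

type Subst = M.Map String Type

-- Apply a substitution throughout a type.
apply :: Subst -> Type -> Type
apply s (TVar v)   = M.findWithDefault (TVar v) v s
apply s (TFun a b) = TFun (apply s a) (apply s b)
apply _ t          = t

-- Compute the most general unifier of two types, if one exists.
unify :: Type -> Type -> Maybe Subst
unify (TVar v) t = bind v t
unify t (TVar v) = bind v t
unify (TCon a) (TCon b)
    | a == b = Just M.empty
unify (TFun a b) (TFun c d) = do
    s1 <- unify a c
    s2 <- unify (apply s1 b) (apply s1 d)
    return (M.map (apply s2) s1 `M.union` s2)
unify _ _ = Nothing

-- Bind a variable to a type (occurs check omitted for brevity).
bind :: String -> Type -> Maybe Subst
bind v t
    | t == TVar v = Just M.empty
    | otherwise   = Just (M.singleton v t)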

Each block is given a lambda expression. This is usually just function application to the base CodeWorld function. For example we might have a circle block with the radius being 3:

(circle)(3)

with circle having the type circle : Number -> Picture. Some blocks have more advanced expressions, such as the list block:

a:(b:(c:[]))

with a, b, and c being tagged with the correct connector block, and [] simply being the empty list, which acts as a placeholder.

In the above list block, the polymorphic connectors in light blue would correspond to a, b and c.

When we have an unconnected input we fill it with a placeholder undefined, with type forall a. a. This forces the expression to take its type as the output.

With the type system in place, it is much simpler to support new blocks, as we simply have to define each block's expression. Previously, an ad-hoc approach based on most general unification was used. This involved specific block logic being scattered all over the place in order to determine types correctly.

CodeWorld Blocks

Since this project is for CodeWorld we do a few things differently from Haskell:

  • We add syntax symbols to blocks
  • We follow the CodeWorld convention of having function arguments tupled; for example, a function call takes the following form: f(10, 20)

CodeWorld makes learning programming easier by making it more visual. Students can create pictures and animations by combining various functions. We provide the same set of functions in the Blocks UI.

In order to build a simple tree the user snaps together some blocks:

In order to produce:

An advantage of using a block-based environment to do this is that it visually indicates to the user what is valid and what the program expects. Using different connector shapes, the user can see that a picture is required for the drawingOf block, and that the colored block takes a picture and a color and outputs a picture. This also makes it much more difficult to build invalid programs.

Some more tricky blocks to implement were the function block and the list comprehension block, as they sometimes had odd edge cases (especially in the old type system).

Local Variables

Another thing we do differently from Blockly is to only allow instantiation of variables from the defining parent.

In order to create the x variable in the picture above, the user can click and drag the white shaded x (next to var). These variables also can't be connected anywhere outside of their defining scope.

User-defined data types

Users are able to create their own data types. They are based on algebraic data types (they have to be, since it’s Haskell).

In order to create these, the user drags a data block with which he can add sum components. A product component can then be attached.

A simple example would be

data Temperature = Degrees Number | Fahrenheit Number

which is given in the Blocks environment as:

Ideally, a single block would have captured the idea of an algebraic data type. More work is needed on Blockly in order to support mixing inline input styles with external input styles.

Once a user data type is created, corresponding constructor blocks and case blocks are added dynamically to the toolbox.

Currently, there is no support for higher kinded types.

Code generation

The overall flow of the program: events from the Blockly workspace are handled by a Haskell backend, which builds an AST and then generates code.

These backend components are implemented in Haskell and run in the browser thanks to GHCJS.

Once a change occurs on the Blockly workspace we update on the event and build a simple AST, which is something similar to:

data Expr = LiteralS T.Text
          | LiteralN Float
          | LocalVar T.Text
          | CallFun T.Text [Expr]
          | FuncDef T.Text [T.Text] Expr
          ...

Once we have this, the code generation is simple and is mostly a matter of formatting.
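
As a rough illustration of "mostly a matter of formatting", here is a minimal sketch of a printer over that AST. This is my own sketch, not the project's actual generator (the name emit is made up), but it follows the tupled-argument convention described earlier:

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T

-- The simplified AST from above.
data Expr = LiteralS T.Text
          | LiteralN Float
          | LocalVar T.Text
          | CallFun T.Text [Expr]
          | FuncDef T.Text [T.Text] Expr

-- Turn an expression into CodeWorld source text.
emit :: Expr -> T.Text
emit (LiteralS s)     = T.concat ["\"", s, "\""]
emit (LiteralN n)     = T.pack (show n)
emit (LocalVar v)     = v
emit (CallFun f args) = T.concat [f, "(", T.intercalate ", " (map emit args), ")"]
emit (FuncDef f ps body)
    | null ps   = T.concat [f, " = ", emit body]
    | otherwise = T.concat [f, "(", T.intercalate ", " ps, ") = ", emit body]

For example, emit (CallFun "circle" [LiteralN 3]) yields circle(3.0).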

Miscellaneous

Some other things that were implemented:

  • Some support for first class functions. Functions can be used as inputs. The most important use case here is that the animationOf, interactionOf, simulationOf program blocks require functions that specify the new state of the world, event handling, drawing the world and so on.
  • Project management - Projects can be saved and loaded. Programs can also be shared by a URL. A really simple example.
  • Some basic help, which allows users to click on programs and load them into the workspace.

Some work also went into unseen things:

  • Improving performance
  • Improving usability, stability, user-friendliness and error messages
  • Ensuring that only (mostly) valid programs can be built.
  • The gazillion bug fixes

HSOC

This project was part of this years Haskell Summer of Code initiative, which encourages contribution to the Haskell community.

A big thanks to Edward Kmett, Ryan Trinkle and others who helped organize this year’s program.

Another big thanks to my mentor Chris Smith, for his thorough bug tracking and reporting and overall guidance. CodeWorld is his educational environment and I'm glad to have contributed a component to it.

Conclusion

Quite a lot of work went into the functional Blockly fork. The CodeWorld functionality and blocks are kept apart from the Blockly code, with the Blockly fork mostly covering the functional features.

I’m not really sure how to measure the amount of work done. Some metrics at the time of writing:

Lines of code would have been inflated for Blockly, as it would have included the build changes. Haskell also produces fewer lines of code.

The project was great and I enjoyed it. It has a few issues and bugs that still need work. Overall I think it is in a good state for future work and can be used as a foundation.

Some further work might include:

  • Finishing up user data types
  • A guard block or guard functionality. Nesting ifs can get tedious.
  • Mixing internal and external inputs
  • Live previewing of values
  • Pattern matching

I suggest you check out the result for yourself at: code.world/blocks. If you find any bugs or missing features, kindly help us out and file an issue on the GitHub tracker.

Categories: Offsite Blogs

Brent Yorgey: POGIL workshop

Planet Haskell - Sun, 08/07/2016 - 2:54pm

A few weeks ago I attended a 3-day training workshop in St. Louis, put on by the POGIL project. I attended a short POGIL session at the SIGCSE CS education conference in March and was sufficiently impressed to sign up for a training workshop (it didn’t hurt that Clif Kussmaul has an NSF grant that paid for my registration and travel).

POGIL is an acronym for “Process Oriented Guided Inquiry Learning”. Process-oriented refers to the fact that in addition to learning content, an explicit goal is for students to learn process skills like analytic thinking, communication, and teamwork. Guided inquiry refers to the fact that students are responsible for constructing their own knowledge, guided by carefully designed questions. The entire framework is really well thought-out and is informed by concrete research in pedagogical methods. I really enjoyed how the workshop used the POGIL method to teach us about POGIL (though of course it would be rather suspect to do anything else!). It gave me not just an intellectual appreciation for the benefits of the approach, but also a concrete understanding of the POGIL experience for a student.

The basic idea is to put students in groups of 3 or 4 and have them work through an activity or set of questions together. So far this sounds just like standard “group work”, but it’s much more carefully thought out than that:

  • Each student is assigned a role with specific responsibilities within their group. Roles typically rotate from day to day so each student gets a chance to play each role. Roles can vary but common ones include things like “manager”, “recorder”, “reporter”, and so on. I didn’t appreciate how important the roles are until attending the workshop, but they are really crucial. They help ensure every student is engaged, forestall some of the otherwise inevitable social awkwardness as students figure out how to relate to their group members, and also play an important part in helping students develop process skills.

  • The activities are carefully constructed to take students through one or more learning cycles: beginning with some data, diagrams, text, etc. (a “model”), students are guided through a process starting with simple observations, then synthesis and discovering underlying concepts, and finally more open ended/application questions.

The teacher is a facilitator: giving aid and suggestions as needed, managing difficulties that arise, giving space and time for groups to report on their progress and share with other groups, and so on. Of course, a lot of work goes into constructing the activities themselves.

In some areas, there is already a wealth of POGIL activities to choose from; unfortunately, existing materials are a bit thinner in CS (though there is a growing collection). I won’t be able to use POGIL much this coming semester, but I hope to use it quite a bit when I teach algorithms again in the spring.


Categories: Offsite Blogs

Roman Cheplyaka: Does it matter if Hask is (not) a category?

Planet Haskell - Sun, 08/07/2016 - 2:00pm

Andrej Bauer raises a question whether Hask is a real category. I think it’s a legitimate question to ask, especially by a mathematician or programming languages researcher. But I want to look closer at how a (probably negative) answer to this question would affect Haskell and its community.

To illustrate the fallacy of assuming blindly that Hask is a category, Andrej tells an anecdote (which I find very funny):

I recall a story from one of my math professors: when she was still a doctoral student she participated as “math support” in the construction of a small experimental nuclear reactor in Slovenia. One of the physicsts asked her to estimate the value of the harmonic series \(1+1/2+1/3+\cdots\) to four decimals. When she tried to explain the series diverged, he said “that’s ok, let’s just pretend it converges”.

Presumably here is what happened:

  1. The physicists came up with a mathematical model of a nuclear reactor.
  2. The model involved the sum of the harmonic series.
  3. Andrej’s math professor tried to explain that the series diverged and therefore something was wrong with the model.

When we try to model a phenomenon, we should watch out for two types of problems:

  1. The model itself is erroneous.
  2. The model itself is fine; but the phenomenon we are describing does not meet all of the model’s assumptions.

The first type of problem means that the people who built the model couldn't get their math right. That's too bad. We let mathematicians gloss over the messy real world and impose whatever assumptions they want, but in return we expect a mathematically rigorous model upon which we can build. In Andrej's story, hopefully the math support lady helped the physicists build a better model that didn't rely on the convergence of the harmonic series.

But at some point the model has to meet the real world; and here, the issues are all but inevitable. We know that all models are wrong (meaning that they don’t describe the phenomenon ideally, not that they are erroneous) — but some are useful.

Physicists, for example, often assume that they are dealing with isolated systems, while being perfectly aware that no such system exists (except, perhaps, for the whole universe, which would be impossible to model accurately). Fortunately, they still manage to design working and safe nuclear reactors!

Consider Hask. Here, the abstraction is the notion of a category, and the phenomenon is the programming language Haskell. If types and functions of Haskell do not form a proper category, we have the second type of modelling problem. The foundation — the category theory — is, to the best of my knowledge, widely accepted among mathematicians as a solid theory.

Since category theory is often used to model other purely mathematical objects, such as groups or vector spaces, mathematicians may get used to a perfect match between the abstraction and the phenomenon being described. Other scientists (including computer scientists!) can rarely afford such luxury.

Usefulness is the ultimate criterion by which we should judge a model. We use monads in Haskell not because they are a cool CT concept, but because we tried them and found that they solve many practical problems. Comonads, which from the CT standpoint are “just” the dual of monads, have found far fewer applications, not because we found some kind of theoretical problems with them — we simply didn't find that many problems that they help address. (To be fair, we tried hard, and we did manage to find a few.)

There are people who, inspired by some category theory constructions, come up with novel algorithms, data structures, or abstractions for Haskell. For these discoveries to work, it is neither necessary nor sufficient that they correspond perfectly to the original categorical abstractions they were derived from. And as long as playing with the “Hask category” yields helpful intuition and working programming ideas, we are going to embrace it.

Categories: Offsite Blogs

Snap Framework: Announcing: Snap 1.0

Planet Haskell - Sun, 08/07/2016 - 12:22pm

The Snap team is delighted to announce the anxiously awaited release of version 1.0 of the Snap Web Framework for Haskell. Snap has been used in stable production applications for years now, and with this release we’re updating our version number to reflect the stability and commitment to backwards compatibility that our users depend on. Here is a summary of the major changes:

The Details

Now backed by io-streams

Snap’s web server has been overhauled, replacing the enumerator package with the newer, leaner, faster, and easier to use io-streams. If you were using of Snap’s low-level enumerator functions, those will need to be migrated to io-streams. Otherwise there should be few interface changes.

More modular project template infrastructure

The snap executable that generates project templates has been moved from the snap package to snap-templates. Your snap applications depending on snap will continue to do so, but with a slightly lighter set of transitive dependencies. If you want to run snap init to generate a project template, you will now need to do cabal install snap-templates first instead of cabal install snap.

Migration Guide
  • Change your cabal files to depend on monad-control instead of MonadCatchIO-transformers.

  • Instead of deriving the MonadCatchIO type class, you should now make MonadBaseControl instances. Depending on your monad, this may require MonadBase and MonadTransControl instances as well. For examples of how to do that for common monad structures, look at Heist and snap (here, here, and here).

  • Any exception handling functions like try, catch, etc. you were using from Control.Monad.CatchIO should now come from Control.Exception.Lifted, which is provided by the lifted-base package (see the sketch after this list).

  • initCookieSessionManager takes an additional Maybe ByteString argument representing an optional cookie domain. Passing Nothing as the new argument will give the same behavior as you had before.
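
As a concrete illustration of the exception-handling change, here is a minimal sketch (my own; safeHandler and riskyAction are made-up names, not part of Snap's API):

{-# LANGUAGE OverloadedStrings, ScopedTypeVariables #-}

-- Before: import Control.Monad.CatchIO (catch)
import Control.Exception.Lifted (SomeException, catch)
import Snap.Core (Snap, writeBS)

-- catch now comes from lifted-base, relying on Snap's
-- MonadBaseControl instance rather than MonadCatchIO.
safeHandler :: Snap ()
safeHandler = riskyAction `catch` \(_ :: SomeException) ->
    writeBS "something went wrong"
  where
    riskyAction = writeBS "ok"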

Outline

The Snap Framework is composed of five major packages:

  • snap-core - A simple and stable web server API.

  • snap-server - A robust and well tested web server implementing the snap-core API.

  • heist - An HTML 5 template system allowing designers to make changes to markup without needing to have a Haskell toolchain installed and recompile the app.

  • snap - Umbrella project that integrates the above three packages, provides a snaplet system for building reusable web components, and includes built-in snaplets for common things like sessions, auth, templating, etc.

  • snap-templates - Provides an executable for generating Snap project templates.

Acknowledgments

We would like to thank the dozens of contributors who have helped over the years to get Snap to this milestone. Particular thanks go to Greg Hale who has been instrumental in getting us across the finish line for this release.

Categories: Offsite Blogs

Philip Wadler: Category Theory for the Working Hacker

Lambda the Ultimate - Sun, 08/07/2016 - 11:26am

Nothing you don't already know, if you are into this sort of thing (and many if not most LtU-ers are), but a quick way to get the basic idea if you are not. Wadler has papers that explain Curry-Howard better, and the category theory content here is very basic -- but it's an easy listen that will give you the fundamental points if you still wonder what this category thing is all about.

To make this a bit more fun for those already in the know: what is totally missing from the talk (understandable given time constraints) is why this should interest the "working hacker". So how about pointing out a few cool uses/ideas that discerning hackers will appreciate? Go for it!

Categories: Offsite Discussion

Dan Piponi (sigfpe): Dimensionful Matrices

Planet Haskell - Sat, 08/06/2016 - 8:23pm
Introduction

Programming languages and libraries for numerical work tend not to place a lot of emphasis on the types of their data. For example Matlab, R, Octave, Fortran, and Numpy (but not the now-defunct Fortress) all tend to treat their data as plain numbers, meaning that any time you have a temperature and a mass, say, there is nothing to prevent you adding them.


I've been wondering how much dimensions (in the sense of dimensional analysis) and units could help with numerical programming. As I pointed out on G+ recently (which is where I post shorter stuff these days), you don't have to limit dimensions to the standard ones of length, mass, time, dollars and so on. Any scale invariance in the equations you're working with can be exploited as a dimension giving you a property that can be statically checked by a compiler.


There are quite a few libraries to statically check dimensions and units now. For example Boost.Units for C++, units for Haskell and even quantities for Idris.
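
As a tiny illustration of the basic idea behind such libraries (my own sketch, not the API of any of them), one can tag values with type-level exponents so that dimensionally mismatched additions become compile errors. Real libraries use integer exponents so that, e.g., velocities work; Nat keeps the sketch short:

{-# LANGUAGE DataKinds, KindSignatures, TypeOperators #-}
import GHC.TypeLits (Nat, type (+))

-- A Double tagged with exponents for time and length:
-- Quantity 1 0 is a time, Quantity 0 1 a length, Quantity 0 2 an area.
newtype Quantity (t :: Nat) (l :: Nat) = Q Double
    deriving Show

-- Addition is only defined when the dimensions match exactly.
add :: Quantity t l -> Quantity t l -> Quantity t l
add (Q x) (Q y) = Q (x + y)

-- Multiplication adds the exponents at the type level.
mul :: Quantity t l -> Quantity t' l' -> Quantity (t + t') (l + l')
mul (Q x) (Q y) = Q (x * y)

-- add (Q 1 :: Quantity 1 0) (Q 2 :: Quantity 0 1) is rejected by the compiler.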


A matrix that breaks things

Even if a language supports dimensions, it's typical to define objects like vectors and matrices as homogeneous containers of quantities. But have a look at the Wikipedia page on the metric tensor. There is a matrix

$$g = \begin{pmatrix} -c^2 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

which has the curious property that 3 entries on the diagonal seem to be dimensionless while the first entry is a squared velocity with dimension $L^2T^{-2}$. This will break many libraries that support units. An obvious workaround is to switch to use natural units, which is much the same as abandoning the usefulness of dimensions. But there's another way, even if it may be tricky to set up with existing languages.


Heterogeneous vectors and matrices

According to a common convention in physics, a 4-vector $dx$ has dimensions $(T, L, L, L)$ where I'm using the convention that we can represent the units of a vector or matrix simply as a vector or matrix of dimensions, and here $T$ is time and $L$ is length. The metric tensor is used like this: $ds^2 = g_{ij}\,dx^i dx^j$ (where I'm using the Einstein summation convention so the $i$'s and $j$'s are summed over). If we think of $ds^2$ having units of length squared (it is a pseudo-Riemannian metric after all) then it makes sense to think of $g$ having dimensions given by

$$\begin{pmatrix} L^2T^{-2} & LT^{-1} & LT^{-1} & LT^{-1} \\ LT^{-1} & 1 & 1 & 1 \\ LT^{-1} & 1 & 1 & 1 \\ LT^{-1} & 1 & 1 & 1 \end{pmatrix}$$

We can write this more succinctly as

$$L^2\,(T, L, L, L)^{-1} \otimes (T, L, L, L)^{-1}$$

where $\otimes$ is the usual outer product.


I'll use the notation $a : A$ to mean $a$ is of type $A$. So, for example, $c : LT^{-1}$. I'll also use pointwise notation for types such as $(A, B, C)^{-1} = (A^{-1}, B^{-1}, C^{-1})$ and $(A, B, C)(D, E, F) = (AD, BE, CF)$.


Now I can give some general rules. If $m : M$ is a matrix, $u : U$ and $v : V$ are vectors, and $s : S$ is a scalar, then $v = mu$ only makes sense if $M = V \otimes U^{-1}$. Similarly the "inner product" $s = u^T m v$ only makes sense if $M = S\,(U^{-1} \otimes V^{-1})$.
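
As a quick sanity check (my own worked instance, not from the original post), the metric example fits the second rule: taking $u = v = dx$ so that $U = V = (T, L, L, L)$, and $s = ds^2$ so that $S = L^2$, gives

$$M = S\,(U^{-1} \otimes V^{-1}) = L^2\,(T, L, L, L)^{-1} \otimes (T, L, L, L)^{-1},$$

which is exactly the dimension matrix for $g$ written down above.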


Generic vectors and matrices

Although these kinds of types might be useful if you're dealing with the kind of heterogeneous matrices that appear in relativity, there's another reason they might be useful. If you write code (in the imaginary language that supports these structures and understands dimensions and units) to be as generic as possible in the types of the vector and matrix entries, failures to type check will point out parts of the code where there are hidden assumptions, or even errors, about scaling. For example, consider a routine to find the inverse of a 3 by 3 matrix. Writing this as generically as possible means we should write it to operate on a matrix of type $(A, B, C) \otimes (D, E, F)$, say. The result should have type $(D, E, F)^{-1} \otimes (A, B, C)^{-1}$. If this type checks when used with a suitably powerful type checker then it means that if we replace the units for type $A$, say, with units twice as large, it should have no effect on the result, taking into account those units. In this case, it means that if we multiply the numbers of the first row of the input by 0.5 then the numbers of the first column of the output should get multiplied by 2. In fact this is a basic property of matrix inverses. In other words, this mathematical property of matrix inverses is guaranteed by a type system that can handle units and heterogeneous matrices. It would be impossible to write a matrix inverter that type checks and fails to have this property. Unfortunately it's still possible to write a matrix inverter that type checks and is incorrect some other way. Nonetheless this kind of type system would put a very big constraint on the code and is likely to eliminate many sources of error.


An example, briefly sketched

I thought I'd look at an actual example of a matrix inverter to see what would happen if I used a type checker like the one I've described. I looked at the conjugate gradient method. At the Wikipedia page, note the line

$$\alpha_k := \frac{r_k^T r_k}{p_k^T A p_k}$$

This would immediately fail to type check because if $r_k$ is of generic vector type $R$ then $r_k^T r_k$ isn't the same type as $p_k^T A p_k$, so the quotient doesn't make sense. I won't go into any of the details but the easiest way to patch up this code to make it type check is to introduce a new matrix $E$ (of the same type as the inverse we're trying to compute) and besides using it to make this inner product work (replacing the numerator by $r_k^T E\, r_k$) we also use $E$ anywhere in the code we need to convert a vector of the residual's type to a vector of the solution's type. If you try to do this as sparingly as possible you'll end up with a modified algorithm. But at first this seems weird. Why should this matrix inverse routine rely on someone passing in a second matrix to make it type check? And what is this new algorithm anyway? Well scroll down the Wikipedia page and you get to the preconditioned conjugate gradient algorithm. The extra matrix we need to pass in is the preconditioner. This second algorithm would type check. Preconditioned conjugate gradient, with a suitable preconditioner, generally performs better than pure conjugate gradient. So in this case we're getting slightly more than a check on our code's correctness. The type checker for our imaginary language would give a hint on how to make the code perform better. There's a reason for this. The original conjugate gradient algorithm is implicitly making a choice of units that sets scales along the axes. These determine the course taken by the algorithm. It's not at all clear that picking these scalings randomly (which is in effect what you're doing if you throw a random problem at the algorithm) is any good. It's better to pick a preconditioner adapted to the scale of the problem and the type checker is hinting (or would be if it existed) that you need to do this. Compare with the gradient descent algorithm whose scaling problems are better known.


But which language?

I guess both Agda and Idris could be made to implement what I've described. However, I've a hunch it might not be easy to use in practice.

Categories: Offsite Blogs

Jan Stolarek: First impression of “Real World OCaml”

Planet Haskell - Sat, 08/06/2016 - 10:02am

Tomorrow I will be flying to Cambridge to attend the International Summer School on Metaprogramming. One of the prerequisites required from the participants is basic knowledge of OCaml, roughly the first nine chapters of “Real World OCaml” (RWO for short). I finished reading them several days ago and thought I would share my impressions of the book.

RWO was written by Yaron Minsky, Anil Madhavapeddy and Jason Hickey. It is one of a handful of books on OCaml. Other titles out there are “OCaml from the Very Beginning” and “More OCaml: Algorithms, Methods and Diversions” by John Whitington and “Practical OCaml” by Joshua Smith. I decided to go with RWO because when I asked “what is the best book on OCaml” on the #ocaml IRC channel, RWO was the unanimous response from several users. The title itself is obviously piggybacking on the earlier “Real World Haskell” released in the same series by O’Reilly, which was in general a good book (though it had its flaws).

The first nine chapters comprise about 40% of the book (190 pages out of 470 total) and cover the basics of OCaml: various data types (lists, records, variants), error handling, imperative programming (eg. mutable variables and data structures, I/O) and basics of the module system. Chapters 10 through 12 present advanced features of the module system and introduce object-oriented aspects of OCaml. Language ecosystem (ie. tools and libraries) is discussed in chapters 13 through 18. The remaining chapters 19 through 23 go into details of OCaml compiler like garbage collector or Foreign Function Interface.

When I think back about reading “Real World Haskell” I recall that quite a lot of space was dedicated to explaining in detail various basic functional programming concepts. “Real World OCaml” is much more dense. It approaches teaching OCaml just as if it was another programming language, without making a big deal of the functional programming model. I am much more experienced now than when reading RWH four years ago and this is exactly what I wanted. I wonder, however, how this approach will work for people new to functional programming. It reminds me of my early days as a functional programmer. I began learning Scala having previously learned Scheme and Erlang (both unusual for functional languages in lacking a static type system). Both Scala and OCaml are not pure functional languages: they allow free mixing of functional and imperative (side-effecting) code. They also support object-oriented programming. My plan in learning Scala was to learn functional programming and I quickly realized that I was failing. Scala simply offered too many back-doors that allowed escaping into the imperative world. So instead of forcing me to learn a new way of thinking it allowed me to do things the old way. OCaml seems to be exactly the same in this regard and RWO offers beginners little guidance in thinking functionally. Instead, it gives them a full arsenal of imperative features early on in the book. I am not entirely convinced that this approach will work well for people new to FP.

“Real World OCaml” was published less than three years ago so it is a fairly recent book. Quite surprisingly then, several sections have already gone out of date. The code does not work with the latest version of the OCaml compiler and requires non-obvious changes to work. (You can of course solve the problem by working with an older version of the OCaml compiler.) I was told on IRC that the authors are already working on the second edition of the book to bring it up to date with today’s OCaml implementation.

Given all the above, my verdict on “Real World OCaml” is that it is a really good book about OCaml itself (despite being slightly outdated) but not necessarily the best book on the basics of functional programming.

Categories: Offsite Blogs

Brent Yorgey: New Haskell Symposium paper on “twisted functors”

Planet Haskell - Thu, 08/04/2016 - 2:08pm

Satvik Chauhan, Piyush Kurur and I have a new paper which will appear at the 2016 Haskell Symposium in Japan:

How to Twist Pointers without Breaking Them

Although pointer manipulations are used as a primary motivating example, at heart the paper is really about “twisted functors”, a class of applicative functors which arise as a natural generalization of the semi-direct product of two monoids where one acts on the other. It’s a really cute idea [1], one of those ideas which seems “obvious” in retrospect, but really hadn’t been explored before.
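
For a flavour of the underlying construction, here is a minimal sketch of the semi-direct product of monoids as I understand it from the abstract (the names are mine, not the paper's):

    {-# LANGUAGE FlexibleInstances, MultiParamTypeClasses #-}

    import Data.Monoid (Product (..), Sum (..))

    -- b acts on a; the action should distribute over a's operation.
    class (Monoid a, Monoid b) => Action b a where
        act :: b -> a -> a

    -- The semi-direct product: combining twists the second pair's
    -- a-component by the first pair's b-component.
    newtype Semi a b = Semi (a, b) deriving Show

    instance Action b a => Semigroup (Semi a b) where
        Semi (a1, b1) <> Semi (a2, b2) = Semi (a1 <> act b1 a2, b1 <> b2)

    instance Action b a => Monoid (Semi a b) where
        mempty = Semi (mempty, mempty)

    -- Example: scales acting on offsets, so that Semi composes the
    -- affine maps x -> s*x + a.
    instance Action (Product Double) (Sum Double) where
        act (Product s) (Sum x) = Sum (s * x)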

We give some examples of applications in the paper but I’m quite certain there are many other examples of applications out there. If you find any, let us know!

  [1] I can say that since it wasn’t actually my idea!


Categories: Offsite Blogs

Neil Mitchell: Upcoming talk: Writing build systems with Shake, 16 Aug 2016, London

Planet Haskell - Wed, 08/03/2016 - 5:22pm
Summary: I'm giving a talk on Shake.

I'm delighted to announce that I'll be giving a talk/hack session on Shake as part of the relatively new "Haskell Hacking London" meetup.

Title: Writing build systems with Shake

Date: Tuesday, August 16, 2016. 6:30 PM

Location: Pusher Office, 28 Scrutton Street, London

Abstract: Shake is a general purpose library for expressing build systems - forms of computation, with caching, dependencies and more besides. Like all the best stuff in Haskell, Shake is generic, with details such as "files" written on top of the generic library. Of course, the real world doesn't just have "files", but specifically has "C files that need to be compiled with gcc". In this hacking session we'll look at how to write Shake rules, what existing functions people have already layered on top of Shake for compiling with specific compilers, and consider which rules are missing. Hopefully by the end we'll have a rule that people can use out-of-the-box for compiling C++ and Haskell.

To put it another way, it's all about layering up. Haskell is a programming language. Shake is a Haskell library for dependencies, minimal recomputation, parallelism etc. Shake also provides a layer on top (but inside the same library) for writing rules about files, and ways to run command line tools. Shake doesn't yet provide a layer that compiles C files, but it does provide the tools with which you can write your own. The aim of this talk/hack session is to figure out what the next layer should be, and write it. It is definitely an attempt to move into the SCons territory of build systems, which knows how to build C etc. out of the box.
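
For readers who haven't met Shake before, a minimal sketch of the kind of file rule the session will build on (written against the current API; the gcc command line and file names are just examples):

    import Development.Shake
    import Development.Shake.FilePath

    main :: IO ()
    main = shakeArgs shakeOptions $ do
        want ["main.o"]

        -- Rule: any .o file is built from the .c file of the same name.
        "*.o" %> \out -> do
            let src = out -<.> "c"
            need [src]
            cmd_ "gcc -c" src "-o" out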
Categories: Offsite Blogs

JP Moresmau: Scratching an itch: generating Cabal data-files field automatically

Planet Haskell - Wed, 08/03/2016 - 12:26pm
Maybe I didn't look hard enough, but I'm not aware of a tool to generate the contents of the Cabal data-files field automatically when you have loads of folders to include. Cabal has very simple wildcard matching for files references in this field, by design (to avoid including too much data in a source distribution). So it only supports wildcards to replace the file name inside a directory for a given extension, and doesn't support sub directories.

For the reload project - first release on Hackage! - I had to include loads of files, all the Polymer web components the UI depends on, which are all in different sub directories, with a bunch of different extensions. So I wrote a little tool to generate the field automatically, and put it on Hackage too.

You pass it a directory name, possibly some subdirectories and extensions to ignore, and it generates all the required entries. Saved me loads of time, and scratched my own itch!
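
For illustration (a hypothetical static/ tree, not reload's actual layout), the generated field ends up with one wildcard entry per subdirectory/extension pair, since the wildcards cannot recurse:

    data-files: static/index.html
                static/bower_components/polymer/*.html
                static/bower_components/polymer/*.js
                static/css/*.css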
Categories: Offsite Blogs

mightybyte: How to Get a Haskell Job

Planet Haskell - Wed, 08/03/2016 - 8:02am

Over and over again I have seen people ask how to get a full time job programming in Haskell. So I thought I would write a blog post with tips that have worked for me as well as others I know who write Haskell professionally. For the impatient, here's the tl;dr in order from easiest to hardest:

  1. IRC
  2. Local meetups
  3. Regional gatherings/hackathons
  4. Open source contributions
  5. Work where Haskell people work

First, you need to at least start learning Haskell on your own time. You had already started learning how to program before you got your first programming job. The same is true of Haskell programming. You have to show some initiative. I understand that for people with families this can be hard. But you at least need to start. After that, far and away the most important thing is to interact with other Haskell developers so you can learn from them. That point is so important it bears repeating: interacting with experienced Haskell programmers is by far the most important thing to do. Doing this at a job would be the best, but there are other things you can do.

1. IRC. Join the #haskell channel on Freenode. Lurk for awhile and follow some of the conversations. Try to participate in discussions when topics come up that interest you. Don't be afraid to ask what might seem to be stupid questions. In my experience the people in #haskell are massively patient and willing to help anyone who is genuinely trying to learn.

2. Local meetups. Check meetup.com to see if there is a Haskell meetup in a city near you. I had trouble finding a local meetup when I was first learning Haskell, but there are a lot more of them now. Don't just go to listen to the talks. Talk to people, make friends. See if there's any way you can collaborate with some of the people there.

3. Larger regional Haskell events. Find larger weekend gatherings of Haskell developers and go to them. Here are a few upcoming events that I know of off the top of my head:

The first event like this that I went to was Hac Phi a few years back. Going there majorly upped my game because I got to be around brilliant people, pair program with some of them, and ultimately ended up starting the Snap Web Framework with someone I met there. You might not have a local meetup that you can go to, but you can definitely travel to go to one of these bigger weekend events. I lived a few hours away from Hac Phi, but I know a number of people who travel further to come. If you're really interested in improving your Haskell, it is well worth the time and money. I cannot emphasize this enough.

4. Start contributing to an open source Haskell project. Find a project that interests you and dive in. Don't ask permission, just decide that you're going to learn enough to contribute to this thing no matter what. Join their project-specific IRC channel if they have one and ask questions. Find out how you can contribute. Submit pull requests. This is by far the best way to get feedback on the code that you're writing. I have actually seen multiple people (including some who didn't strike me as unusually talented at first) start Haskell and work their way up to a full-time Haskell job this way. It takes time and dedication, but it works.

5. Try to get a non-Haskell job at a place where lots of Haskell people are known to work. Standard Chartered uses Haskell but is big enough to have non-Haskell jobs that you might be able to fill. S&P Capital IQ doesn't use Haskell but has a significant number of Haskell people who are coding in Scala.

Categories: Offsite Blogs

Ketil Malde: CAS-based generic data store

Planet Haskell - Wed, 08/03/2016 - 4:00am

Bioinformatics projects routinely generate terabytes of sequencing data, and the inevitable analysis that follows can easily increase this by an order of magnitude or more. Not everything is worth keeping, but in order to ensure reproducibility and to be able to reuse data in new projects, it is important to store what needs to be kept in a structured way.

I have previously described and implemented a generic data store, called medusa. Following the eXtreme Programming principle of always starting with the simplest implementation that could possibly work, the system was designed around a storage based on files and directories. This has worked reasonably well, and makes data discoverable and accessible both directly in the file system, and through web-based services providing browsing, metadata search (with free text and keyword based indexes), BLAST search, and so forth.

Here, I explore the concept of content addressable storage (CAS), which derives unique names for data objects from their content.

The CAS principle

The essence of any storage system is being able to store objects with some kind of key (or label, or ID), and being able to retrieve them based on the same key. What distinguishes a content addressable storage from other storage systems is that the key is generated from the entire data object, typically using a cryptographic hash function like MD5 or SHA1.

This means that a given object will always be stored under the same key, and that modifications to an object will also change its key, essentially creating a new object.

A layered model

Using CAS more clearly separates the storage model from the semantics of data sets. This gives us a layered architecture for the complete system, and services are implemented on top of these layers as independent and modular programs.

The object store

The object store is conceptually simple. It provides a simple interface that consists of the following primitive operations:

  • put a data object into the store
  • list the keys that refer to data objects
  • get a data object using its key

The storage itself is completely oblivious to the actual contents of data objects, and it has no concept of hierarchy or other relationships between objects.
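
A minimal sketch of this interface (my own Haskell illustration, assuming the cryptohash-sha1 and base16-bytestring packages; the real system is built from shell scripts):

    import qualified Crypto.Hash.SHA1 as SHA1
    import qualified Data.ByteString as BS
    import qualified Data.ByteString.Base16 as B16
    import qualified Data.ByteString.Char8 as BC
    import System.Directory (createDirectoryIfMissing, listDirectory)
    import System.FilePath ((</>))

    storeDir :: FilePath
    storeDir = "objects"

    -- put: the key is the hex-encoded SHA1 of the object's contents.
    put :: BS.ByteString -> IO String
    put obj = do
        createDirectoryIfMissing True storeDir
        let key = BC.unpack (B16.encode (SHA1.hash obj))
        BS.writeFile (storeDir </> key) obj
        return key

    -- list: every file name in the store directory is a key.
    list :: IO [String]
    list = listDirectory storeDir

    -- get: retrieve an object by its key.
    get :: String -> IO BS.ByteString
    get key = BS.readFile (storeDir </> key)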

Metadata semantics

When we organize data, we do of course want to include relationships between objects, and also between data objects and external entities and concepts. This is the province of metadata. Metadata semantics are provided by special metadata objects which live in the object store like any other objects. Each metadata object defines and describes a specific data set. As in former incarnations of the system, metadata is structured as XML documents, and provides information about (and the identity of) the data objects constituting the data set. It also describes the relationship between data sets, for instance allowing new versions to obsolete older ones.

The metadata objects are primarily free-form text objects, allowing users to include whatever information they deem relevant and important. The purpose of using XML is to make specific parts of the information computationally accessible, unambiguous, and standardized. For instance, structured references (i.e. specific XML elements) to data objects with their key allow automatic retrieval of the complete dataset. In addition to referencing objects in the object store, similar structures allow unambiguous references to external entities, for instance species, citation of scientific works, and uniform formatting of dates and geographic locations.

A command line interface to the metadata is provided through the `mdz` command; this allows a variety of operations on data sets, including listing, importing, exporting, and synchronizing with other repositories. In addition, the system implements a web-based front end to the data, as well as metadata indexing via xapian.

Data objects and services

As shown in the previous sections, the system can be conceptually divided into three levels: the object store, the metadata level, and the data semantic level. A service typically accesses data on one or more of these levels. For instance, a (hypothetical) service to ensure distributed redundancy may only need to access the object store, oblivious to the contents of the objects. Other services, like the (existing) functionality to import data sets, or transfer data sets between different servers, need to understand the metadata format. And even more specific services may also need to understand the format of data objects - e.g. the BLAST service scans metadata to find FASTA-formatted sequence data, and integrates them into its own database. The important principles that services adhere to are: 1) a service can ignore anything that is irrelevant to it, and 2) a service can reconstruct its entire state from the contents of the object store.

Discussion

CAS Advantages

Perhaps the primary advantage of using the hash value as the ID for data objects, is that it allows the system to be entirely distributed. The crucial advantage is that keys (by definition) are unique to the data. With user-selected keys, the user must somehow ensure the uniqueness of the key, and this requires a central authority or at the very least an agreed-upon naming scheme. In contrast, names for objects in CAS depend only on the contents, and the system can be implemented with no central oversight.

That keys depend on contents further means that data are immutable - storing a modified data object results in a different key. Immutability is central to reproducibility (you won't get the same results if you run your analysis with different data), and previously this was maintained by keeping a separate registry of metadata checksums, and also including checksums for data objects in the metadata. This made it possible to verify correctness (as long as the registry was available and correct); with CAS, this becomes even easier, since the checksum is the same as the name you use to retrieve the data object.

Another benefit is deduplication of data objects. Objects with the same contents will always be stored under the same key, so this is automatic. This also makes it easier to track files across renames (analyses tend to produce output files with generic names like "contigs.fasta", it is often useful to give these files a more descriptive name), with CAS it becomes trivial to check if any file exists in the storage.

Decoupling the data from a fixed filesystem layout introduces another level of abstraction, and this makes it easier to change the underlying storage model. In later years, key-value storage models have replaced relational databases in many applications, in particular where high scalability is more important than structured data. Consequently, we have seen a plethora of so-called "NoSQL" databases emerge, including CouchDB, Cassandra, and many others, which could be plugged in as an alternative back-end storage. Storage "in the cloud", like Amazon's S3 or Google's Cloud Storage are also good alternatives.

The added opacity makes it less likely (but still technically possible) for users with sufficient privileges to perform "illegal" operations on data (for instance, modification or removal).

Disadvantages

The implicit assumption for CAS is that different data objects hash to different hash values. In an absolute sense, this is trivially false (since there only exist 2^160 possible hash values, and an infinity of possible data objects). But it is true in a probabilistic sense, and we can calculate the probability of collisions from the birthday paradox. For practical purposes, any collision is extremely unlikely, and like the revision control system git (which is also CAS-based), collisions are checked for by the system, and can be dealt with manually if they should occur.
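
For concreteness (my own back-of-the-envelope calculation, not from the original text): with a $b$-bit hash and $n$ stored objects, the birthday bound gives

$$P(\text{collision}) \approx 1 - e^{-n(n-1)/2^{b+1}} \approx \frac{n^2}{2^{b+1}},$$

so even storing $n = 10^{12}$ objects under SHA1 ($b = 160$) gives a collision probability of roughly $10^{24}/2^{161} \approx 3 \times 10^{-25}$.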

Abstracting out the storage layer can be an advantage, but it also makes the system more opaque. And although the ability of humans to select misleading or confusing names can hardly be underestimated, even a poorly chosen name is usually more informative than the hexadecimal key representing a hash value.

Mixed blessings

Previous versions used a fixed directory structure, where each data set included a metadata file, and an arbitrary set of data files. Using a content addressable object store is more flexible, and there is nothing preventing the implementation of a parallel metadata scheme sharing the same data store, and even referring to the same data objects. One could also create metadata objects that refer to other metadata objects. As always, fewer restrictions also mean more opportunities for confusion and increased complexity.

Perhaps the most drastic change is how datasets can have their status changed - e.g. be marked as obsolete or invalid. Previously, metadata was versioned, meaning there could exist a (linear) sequence of metadata for the same dataset. This was enforced by convention only, and also required a central synchronization of metadata updates to avoid name and version collisions. Since the object store only allows the addition of new objects, and in particular, not modification, status updates can only be achieved by adding new objects. Metadata objects can refer to other datasets, and specify a context, for instance, a data set containing analysis results can specify being based on a data set containing input data for the analysis. Status changes are now implemented using this mechanism, and datasets can refer to other data sets as "invalidated" or "obsoleted".

Current Status and Availability

The system is currently working on my internal systems, it is based on standard components (mostly shell scripts), and although one might expect some rough edges, it should be fairly easy to deploy.

Do let me know if you are interested.

Categories: Offsite Blogs

mightybyte: Measuring Software Fragility

Planet Haskell - Tue, 08/02/2016 - 3:43pm

While writing this comment on reddit I came up with an interesting question that I think might be a useful way of thinking about programming languages. What percentage of single non-whitespace characters in your source code could be changed to a different character such that the change would pass your CI build system but would result in a runtime bug? Let's call this the software fragility number because I think that metric gives a potentially useful measure of how bug prone your software is.

At the end of the day software is a mountain of bytes and you're trying to get them into a particular configuration. Whether you're writing a new app from scratch, fixing bugs, or adding new features, the number of bytes of source code you have (similar to LOC, SLOC, or maybe the compressed number of bytes) is a rough indication of the complexity of your project. If we model programmer actions as random byte mutations over all of a project's source and we're trying to predict the project's defect rate, this software fragility number is exactly the thing we need to know.

Now I'm sure many people will be quick to point out that this random mutation model is not accurate. Of course that's true. But I would argue that in this way it's similar to the efficient markets hypothesis in finance. Real world markets are obviously not efficient (Google didn't become $26 billion less valuable because the UK voted for brexit). But the efficient markets model is still really useful--and good luck finding a better one that everybody will agree on.

What this model lacks in real world fidelity, it makes up for in practicality. We can actually build an automated system to calculate a reasonable approximation of the fragility number. All that has to be done is take a project, randomly mutate a character, run the project's whole CI build, and see if the result fails the build. Repeat this for every non-whitespace character in the project and count how many characters pass the build. Since the character was generated at random, I think it's reasonable to assume that any mutation that passes the build is almost definitely a bug.

Performing this process for every character in a large project would obviously require a lot of CPU time. We could make this more tractable by picking characters at random to mutate. Repeat this until you have done it for a large enough number of characters and then see what percentage of them made it through the build. Alternatively, instead of choosing random characters you could choose whole modules at random to get more uniform coverage over different parts of the language's grammar. There are probably a number of different algorithms that could be tried for picking random subsets of characters to test. Similar to numerical approximation algorithms such as Newton's method, any of these algorithms could track the convergence of the estimate and stop when the value gets to a sufficient level of stability.
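
To make the procedure concrete, here is a toy sketch of the estimation loop (all names are mine; runBuild is a hypothetical hook into the project's CI system, stubbed out here):

    import Control.Monad (replicateM)
    import Data.Char (isAlpha, isDigit, isSpace)
    import System.Random (randomRIO)

    -- Hypothetical hook: run the full CI build on the mutated source,
    -- returning True if it still passes (i.e. the mutation survived).
    runBuild :: String -> IO Bool
    runBuild _ = return False  -- stub for illustration

    -- Replace the character at position i with a different random
    -- character from the same class (alpha/digit/symbol).
    mutateAt :: Int -> String -> IO String
    mutateAt i src = do
        let (pre, c:post) = splitAt i src
        c' <- pick (filter (/= c) (classOf c))
        return (pre ++ c' : post)
      where
        classOf c
            | isAlpha c = ['a' .. 'z'] ++ ['A' .. 'Z']
            | isDigit c = ['0' .. '9']
            | otherwise = "+-*/<>=!&|%^~"
        pick xs = (xs !!) <$> randomRIO (0, length xs - 1)

    -- Estimate fragility as the fraction of n random single-character
    -- mutations (at non-whitespace positions) that still pass the build.
    fragility :: Int -> String -> IO Double
    fragility n src = do
        let candidates = [i | (i, c) <- zip [0 ..] src, not (isSpace c)]
        survivors <- replicateM n $ do
            i <- (candidates !!) <$> randomRIO (0, length candidates - 1)
            runBuild =<< mutateAt i src
        return (fromIntegral (length (filter id survivors)) / fromIntegral n)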

Now let's investigate actual fragility numbers for some simple bits of example code to see how this notion behaves. First let's look at some JavaScript examples.

It's worth noting that comment characters should not be allowed to be chosen for mutation since they obviously don't affect the correctness of the program. So the comments you see here have not been included in the calculations. Fragile characters are highlighted in orange.

    // Fragility 12 / 48 = 0.25
    function f(n) {
      if ( n < 2 )
        return 1;
      else
        return n * f(n-1);
    }

    // Fragility 14 / 56 = 0.25
    function g(n) {
      var p = 1;
      for (var i = 2; i <= n; i++ ) {
        p *= i;
      }
      return p;
    }

First I should say that I didn't write an actual program to calculate these. I just eyeballed it and thought about what things would fail. I easily could have made mistakes here. In some cases it may even be subjective, so I'm open to corrections or different views.

Since JavaScript is not statically typed, every character of every identifier is fragile--mutating them will not cause a build error because there isn't much of a build. JavaScript won't complain, you'll just start getting undefined values. If you've done a significant amount of JavaScript development, you've almost definitely encountered bugs from mistyped identifier names like this. I think it's mildly interesting that the recursive and iterative formulations of this function both have the same fragility. I expected them to be different. But maybe that's just luck.

Numerical constants as well as comparison and arithmetic operators will also cause runtime bugs. These, however, are more debatable because if you use the random procedure I outlined above, you'll probably get a build failure because the character would have probably changed to something syntactically incorrect. In my experience, it seems like when you mistype an alpha character, it's likely that the wrong character will also be an alpha character. The same seems to be true for the classes of numeric characters as well as symbols. The method I'm proposing is that the random mutation should preserve the character class. Alpha characters should remain alpha, numeric should remain numeric, and symbols should remain symbols. In fact, my original intuition goes even further than that by only replacing comparison operators with other comparison operators--you want to maximize the chance that the new mutated character will cause a successful build so the metric will give you a worst-case estimate of fragility. There's certainly room for research into what patterns tend to come up in the real world and other algorithms that might describe that better.

Now let's go to the other end of the programming language spectrum and see what the fragility number might look like for Haskell.

    -- Fragility 7 / 38 = 0.18
    f :: Int -> Int
    f n | n < 2     = 1
        | otherwise = n * f (n-1)

Haskell's much more substantial compile time checks mean that mutations to identifier names can't cause bugs in this example. The fragile characters here are clearly essential parts of the algorithm we're implementing. Maybe we could relate this idea to information theory and think of it as a measure of how much information is contained in the algorithm.

One interesting thing to note here is the effect of the length of identifier names on the fragility number. In JavaScript, long identifier names will increase the fragility because all identifier characters can be mutated and will cause a bug. But in Haskell, since identifier characters are not fragile, longer names will lower the fragility score. Choosing to use single character identifier names everywhere makes these Haskell fragility numbers the worst case and makes JavaScript fragility numbers the best case.

Another point is that since I've used single letter identifier names it is possible for a random identifier mutation in Haskell to not cause a build failure but still cause a bug. Take for instance a function that takes two Int parameters x and y. If y was mutated to x, the program would still compile, but it would cause a bug. My set of highlighted fragile characters above does not take this into account because it's trivially avoidable by using longer identifier names. Maybe this is an argument against one letter identifier names, something that Haskell gets criticism for.

Here's the snippet of Haskell code I was talking about in the above reddit comment that got me thinking about all this in the first place:

    -- Fragility 31 / 277 = 0.11
    data MetadataInfo = MetadataInfo
        { title       :: Text
        , description :: Text
        }

    pageMetadataWidget :: MonadWidget t m => Dynamic t MetadataInfo -> m ()
    pageMetadataWidget i = do
        el "title" $ dynText $ title <$> i
        elDynAttr "meta" (mkDescAttrs . description <$> i) blank
      where
        mkDescAttrs desc = "name"    =: "description"
                        <> "content" =: desc

In this snippet, the fragility number is probably close to 31 characters--the number of characters in string literals. This is out of a total of 277 non-whitespace characters, so the software fragility number for this bit of code is 11%. This is half the fragility of the JS code we saw above! And as I've pointed out, larger real world JS examples are likely to have even higher fragility. I'm not sure how much we can conclude about the actual ratios of these fragility numbers, but at the very least it matches my experience that JS programs are significantly more buggy than Haskell programs.

The TDD people are probably thinking that my JS examples aren't very realistic because none of them have tests, and that tests would catch most of the identifier name mutations, bringing the fragility down closer to Haskell territory. It is true that tests will probably catch some of these things. But you have to write code to make that happen! It doesn't happen by default. Also, you need to take into account the fact that the tests themselves will have some fragility. Tests require time and effort to maintain. This is an area where this notion of the fragility number becomes less accurate. I suspect that since the metric only considers single character mutations it will underestimate the fragility of tests since mutating single characters in tests will automatically cause a build failure.

There seems to be a slightly paradoxical relationship between the fragility number and DRY. Imagine our above JS factorial functions had a test that completely reimplemented factorial and then tried a bunch of random values QuickCheck-style. This would yield a fragility number of zero! Any single character change in the code would cause a test failure. And any single character change in the tests would also cause a test failure. Single character changes can no longer be classified as fragile because we've violated DRY. You might say that the test suite shouldn't reimplement the algorithm--you should just test specific cases like f(5) == 120. But in an information theory sense this is still violating DRY.

Does this mean that the fragility number is not very useful? Maybe. I don't know. But I don't think it means that we should just throw away the idea. Maybe we should just keep in mind that this particular formulation doesn't have much to tell us about fragility with respect to more complex, coordinated multi-character changes. I could see the usefulness of this metric going either way. It could simplify down to something not very profound. Or it could be that measurements of the fragility of real world software projects end up revealing some interesting insights that are not immediately obvious even from my analysis here.

Whatever the usefulness of this fragility metric, I think the concept gets us thinking about software defects in a different way than we might be used to. If it turns out that my single character mutation model isn't very useful, perhaps the extension to multi-character changes could be useful. Hopefully this will inspire more people to think about these issues and play with the ideas in a way that will help us progress towards more reliable software and tools to build it with.

EDIT: Unsurprisingly, I'm not the first person to have come up with this idea. It looks like it's commonly known as mutation testing. That Wikipedia article makes it sound like mutation testing is commonly thought of as a way to assess your project's test suite. I'm particularly interested in what it might tell us about programming languages...i.e. how much "testing" we get out of the box because of our choice of programming language and implementation.

Categories: Offsite Blogs