News aggregator

How to determine the physical hard drive a file is on?

Haskell on Reddit - Thu, 03/26/2015 - 4:27am

Hello there, I had some Haskell in university but have a strong Scala background and decided to give Haskell a try again. My first Haskell post ever...

As a useful exercise, I want to write a duplicate scanner, which shall run in parallel. However, the parallelization factor should match the number of physical drives that have to be scanned: parallelizing across logical drives / partitions on the same disk would likely decrease performance, because the drive head would have to jump around uselessly. For that, I need a way to determine the drive a file is on and, in particular, to distinguish logical from physical drives. In Scala / on the JVM this doesn't seem to be possible, so I resorted to an external bash command. Is this possible in Haskell directly?
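
Not a full answer from the thread, but a minimal, Linux-only sketch of what is available from Haskell directly, assuming the unix and process packages; the findmnt call is an assumed external helper, so that last step is morally the same as the bash workaround:

-- Hedged sketch: deviceID tells you which block device a file lives on,
-- which is enough to group paths by filesystem. Mapping a device to the
-- physical disk still goes through the OS (/sys, lsblk, findmnt, ...).
import System.Posix.Files (deviceID, getFileStatus)
import System.Posix.Types (DeviceID)
import System.Process     (readProcess)

-- Two paths with the same DeviceID are on the same (possibly logical) device.
fileDevice :: FilePath -> IO DeviceID
fileDevice path = deviceID <$> getFileStatus path

-- The partition backing a path, e.g. "/dev/sda2" (assumes findmnt is
-- installed; lsblk -no pkname could then resolve the parent physical disk).
backingPartition :: FilePath -> IO String
backingPartition path =
  readProcess "findmnt" ["-n", "-o", "SOURCE", "--target", path] ""

Two paths whose device IDs differ are on different filesystems; telling whether two devices share the same spinning disk still needs information from the OS, which is why the sketch shells out for that part.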

submitted by ib84
Categories: Incoming News

FP Complete: Our composable community infrastructure

Planet Haskell - Wed, 03/25/2015 - 11:50pm

TL;DR: we propose to factor Hackage into a separate, very simple service serving a stash of Haskell packages, with the rest of Hackage built on top of that, in the name of availability, reliability and extensibility.

One of the main strengths of Haskell is just how much it encourages composable code. As programmers, we are goaded along a good path for composability by the strictures of purity, which force us to be honest about the side effects we might use, but first and foremost by first-class functions and lazy evaluation, which afford us the freedom to decompose solutions into orthogonal components and recompose them elsewhere. In the words of John Hughes, Haskell provides the necessary glue to build composable programs, ultimately enabling robust code reuse. Perhaps we ought to build our shared community infrastructure along the same principles: freedom to build awesome new services by assembling together existing ones, made possible by the discipline to write these basic building blocks as stateless, scalable, essentially pure services. Let's think about how, taking package hosting as an example, with a view towards solving three concrete problems:

  • Availability of package metadata and source code (currently these are no longer available when hackage.haskell.org goes down).
  • Long cabal update download times.
  • The difficulty for third-party services and other community tools to interoperate with hackage.haskell.org and extend it in any direction the community deems fit.
Haskell packages

Today Haskell packages are sets of files with a distinguished *.cabal file containing the package metadata. We host these files on a central package repository called Hackage, a community supported service. Hackage is a large service that has by and large served the community well, and has done so since 2007. The repository has grown tremendously, by now hosting no fewer than 5,600 packages. It implements many features, including package management. In particular, Hackage allows registered users to:

  • Upload a new package: either from the browser or via cabal upload.

  • Download an index of all packages available: this index includes the full content of all *.cabal files for all packages and all versions.

  • Query the package database via a web interface: from listing all packages available by category, to searching packages by name. Hackage maintains additional metadata for each package not stored in the package itself, such as download counts and package availability in various popular Linux distributions. Perhaps in the future this metadata will also include social features such as the number of "stars", à la GitHub.

Some of the above constitute the defining features of a central package repository. Of course, Hackage is much more than just that today - it is a portal for exploring what packages are out there through a full-blown web interface, running nightly builds on all packages to make sure they compile and putting together build reports, generating package API documentation and providing access to the resulting HTML files, maintaining RSS feeds for new package uploads, generating activity graphs, integrating with Hoogle and Hayoo, etc.

In the rest of this blog post, we'll explore why it's important to tease out the package repository from the rest, and build the Hackage portal on top of that. That is to say, talk separately about Hackage-the-repository and Hackage-the-portal.

A central hub is the cement of the community

A tell-tale sign of a thriving development community is that a number of services pop up independently to address the needs of niche segments of the community or indeed the community as a whole. Over time, these community resources together form an ecosystem, or perhaps even a market, in much the same way that the set of all Haskell packages forms an ecosystem. There is no central authority deciding which package ought to be the unique consecrated package for e.g. manipulating filesystem paths: on Hackage today there are at least 5, each exploring different parts of the design space.

However, we do need common infrastructure in place, because we do need consensus about what package names refer to what code and where to find it. People often refer to Hackage as the "wild west" of Haskell, due to its very permissive policies about what content makes it on Hackage. But that's not to say that it's an entirely chaotic free-for-all: package names are unique, only designated maintainers can upload new versions of some given package and version numbers are bound to a specific set of source files and content for all time.

The core value of Hackage-the-repository, then, is to establish consensus about who maintains what package, what versions are available and the metadata associated with each version. If Alice has created a package called foolib, then Bob can't claim foolib for any of his own packages; he must instead choose another name. There is therefore agreement across the community about what foolib means. Agreement makes life much easier for users, tools and developers talking about these packages.

What doesn't need consensus is anything outside of package metadata and authorization: we may want multiple portals to Haskell code, or indeed have some portals dedicated to particular views (a particular subset of the full package set) of the central repository. For example, stackage.org today is one such portal, dedicated to LTS Haskell and Stackage Nightly, two popular views of consistent package sets maintained by FP Complete. We fully anticipate that others will over time contribute other views - general-purpose or niche (e.g. specialized for a particular web framework) - or indeed alternative portals - ranging from small, fast and stable to experimental and replete with social features aplenty. Think powerful new search functionality, querying reverse dependencies, pre-built package sets for Windows, OS X and Linux, package reviews, package voting ... you name it!

Finally, by carving out the central package repository into its own simple and reliable service, we limit the impact of bugs on both availability and reliability, and thus preserve one of our most valuable assets: the code that we together as a community have written. Complex solutions invariably affect reliability. Keeping the core infrastructure small and making it easy to build on top is how we manage that.

The next section details one way to carve out the central package repository, to illustrate my point. Alternative designs are possible of course - I merely wish to seek agreement that a modular architecture, with a set of very small and simple services at its core as our community commons, would be beneficial to the community.

Paring down the central hub to its bare essence

Before we proceed, let's first introduce a little bit of terminology:

  • A persistent data structure is a data structure that is never destructively updated: when modified, all previous versions of the data structure are still available. Data.Map from the containers package is persistent in this sense, as are lists and most other data structures in Haskell (a tiny illustration follows this list).
  • A service is stateless if its response is a function of the state of other services and the content of the request. Stateless services are trivially scaled horizontally - limited only by the scalability of the services they depend on.
  • A persistent service is a service that maintains its only state as a persistent data structure. Most resources served by a persistent service are immutable. Persistent services share many of the same properties as stateless services: keeping complexity down and scaling them is easy because concurrent access and modification of a persistent data structure requires little to no coordination (think locks, critical sections, etc).
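
As a tiny illustration of the first definition (a sketch of my own, not part of the post), "modifying" a Data.Map value yields a new map while the old version remains usable:

import qualified Data.Map as Map

main :: IO ()
main = do
  let m0 = Map.fromList [("foolib", "1.0")]
      m1 = Map.insert "barlib" "0.1" m0   -- "modification" builds a new version
  print m0  -- the original is untouched: fromList [("foolib","1.0")]
  print m1  -- fromList [("barlib","0.1"),("foolib","1.0")]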

A central hub for all open source Haskell packages might look something like this:

  • A persistent read-only directory of metadata for all versions of all packages (i.e. the content of the .cabal file). Only the upload service may modify this directory, and even then, only in a persistent way.

  • A persistent read-only directory of the packages themselves, that is to say the content of the archives produced by cabal sdist.

  • An upload service, for uploading new revisions of the metadata directory. This service maintains no state of its own, therefore multiple upload services can be spawned if necessary.

  • An authentication service, granting access tokens to users for adding a new package or modifying the metadata for their own existing packages via the upload service(s).

The metadata and package directories together form a central repository of all open source Haskell packages. Just as is the case with Hackage today, anyone is allowed to upload any package they like via the upload service. We might call these directories collectively The Haskell Stash. End-user command-line tools, such as cabal-install, need only interact with the Stash to get the latest list of packages and versions. If the upload or authentication services go down, existing packages can still be downloaded without any issue.

Availability is a crucial property for such a core piece of infrastructure: users from around the world rely on it today to locate the dependencies necessary for building and deploying Haskell code. The strategy for maintaining high availability can be worked out independently for each service. A tried and tested approach is to reinvent as little of the wheel as possible, reusing existing protocols and infrastructure where we can. I envision the following implementation:

  • Serve the metadata directory as a simple Git repository. A Git repository is persistent (objects are immutable and live forever), easy to add new content to, easy to back up, easy to mirror and easy to mine for insights on how packages change over time. Advanced features such as package candidates fall out nearly for free. Rather than serving in its entirety a whole new static tarball of all package metadata (totalling close to 9MB of compressed data) as we do today, we can leverage the existing Git wire protocol to transfer new versions to end users much more efficiently. In short, a faster cabal update!

    The point here is very much not to use Git as a favoured version control system (VCS), fine as it may be for that purpose, at the expense of any other such tool. Git is first and foremost an efficient persistent object store, with a VCS layered on top. The idea is to not reinvent our own object store. It features a simple disk format that has remained incredibly stable over the years. Hosting all our metadata as a simple Git repository means we can leverage any number of existing Git hosting providers to serve our community content with high uptime guarantees.

  • Serve package source archives (produced by cabal sdist) via S3, a de facto standard API for file storage, supported by a large array of cloud providers. These archives can be large, but unlike package metadata, their content is fixed for all time. Uploading a new version of a package means uploading a new source archive with a different name. Serving our package content via a standard API means we can have that content hosted on a reliable cloud platform. In short, better uptime and a higher chance that cabal install will not randomly fail. (A client-side sketch of both of these ideas follows this list.)
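
As a rough illustration of how thin the client side could be - a sketch only, with an invented repository URL, bucket layout and paths, not a proposed design - metadata sync becomes an incremental git fetch, and fetching an immutable source archive is a plain HTTP GET against an S3-style endpoint:

-- Sketch of a client-side "cabal update"-like step against the Stash.
-- metadataRepo and packageBucket below are hypothetical placeholders.
import System.Directory (doesDirectoryExist)
import System.Process   (callProcess)
import Network.HTTP.Conduit (simpleHttp)
import qualified Data.ByteString.Lazy as L

metadataRepo :: String
metadataRepo = "https://git.example.org/haskell-stash-metadata.git"  -- hypothetical

packageBucket :: String
packageBucket = "https://haskell-stash.s3.amazonaws.com"             -- hypothetical

-- Incrementally sync the metadata directory: clone once, then fetch deltas.
updateMetadata :: FilePath -> IO ()
updateMetadata dir = do
  exists <- doesDirectoryExist dir
  if exists
    then callProcess "git" ["-C", dir, "pull", "--ff-only"]
    else callProcess "git" ["clone", metadataRepo, dir]

-- Fetch an (immutable) source archive straight from the S3-backed store.
fetchPackage :: String -> String -> IO ()
fetchPackage name version = do
  let archive = name ++ "-" ++ version ++ ".tar.gz"
      url     = packageBucket ++ "/package/" ++ archive
  simpleHttp url >>= L.writeFile archive

main :: IO ()
main = do
  updateMetadata "stash-metadata"
  fetchPackage "foolib" "0.1.0.0"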

Conclusion

The Haskell Stash is a repository in which to store our community's shared code assets in as simple, highly available and composable a manner as possible. Reduced to its bare essence, it is easily consumable by all manner of downstream services - most notably Hackage itself, packdeps.haskellers.org, hdiff.luite.com, stackage.org, etc. It is by enabling people to extend core infrastructure in arbitrary directions that we can hope to build a thriving community that meets not just the needs of those who happened to seed it, but that furthermore embraces new uses, new needs, new people.

Provided there is community interest in this approach, the next steps would be:

  1. implement the Haskell Stash;
  2. implement support for the Haskell Stash in Hackage Server;
  3. in the interim, if needed, mirror Hackage content in the Haskell Stash.

In the next post in this series, we'll explore ways to apply the same principles of composability to our command-line tooling, in the interest of making our tools more hackable and more powerful, and shipping them with fewer bugs.

Categories: Offsite Blogs

Pure for free.

Haskell on Reddit - Wed, 03/25/2015 - 4:29pm

The construction below allows functors to be turned into applicatives without a definition for pure. This is useful for making applicatives out of datatypes where pure would be difficult to define.

So I wanted to make images an instance of Applicative in a similar way to ZipLists.

Looking specifically at the array type, there is no obvious way to instantiate pure for a zip array. With the following type, arrays can be made into applicative zip arrays.

data Ap f a = Pure a | Ap (f a) deriving (Show)

instance Functor f => Functor (Ap f) where
  fmap f (Pure a) = Pure (f a)
  fmap f (Ap x)   = Ap (fmap f x)

class Applicable f where
  apply :: f (a -> b) -> f a -> f b

instance (Functor f, Applicable f) => Applicative (Ap f) where
  pure = Pure
  (<*>) (Pure f) g        = fmap f g
  (<*>) (Ap f)   (Pure g) = Ap (fmap ($ g) f)
  (<*>) (Ap f)   (Ap g)   = Ap (apply f g)

Given only that the type f is a functor, this already satisfies the applicative identity, homomorphism, and interchange laws. The composition law for applicative places some constraints on the Applicable instance. (I'm still working out what those are).

With this, we can make a zipList with:

instance Applicable [] where
  apply = zipWith ($)

And the normal list applicative as:

instance Applicable [] where
  apply = (<*>)

Finally, we can use this to make zip arrays, with the following. Here, the zip clamps the dimensions to the area where the two arrays overlap.

specialZip :: Ix i => (a -> b -> c) -> Array (i,i) a -> Array (i,i) b -> Array (i,i) c
specialZip f xs ys = array bnds vals
  where
    bnds       = newBounds (bounds xs) (bounds ys)
    allindex   = range bnds
    valAt indx = f (xs ! indx) (ys ! indx)
    vals       = fmap (\i -> (i, valAt i)) allindex

newBounds :: Ix i => ((i,i), (i,i)) -> ((i,i), (i,i)) -> ((i,i), (i,i))
newBounds (lower, upper) (lower', upper') = (max' lower lower', min' upper upper')

max' :: (Ord t1, Ord t) => (t, t1) -> (t, t1) -> (t, t1)
max' (a, b) (x, y) = (max a x, max b y)

min' :: (Ord t1, Ord t) => (t, t1) -> (t, t1) -> (t, t1)
min' (a, b) (x, y) = (min a x, min b y)

newtype Grid i e = Grid { arr :: Array (i, i) e }

instance (Ix i) => Functor (Grid i) where
  fmap f g = Grid $ fmap f (arr g)

instance (Ix i) => Applicable (Grid i) where
  apply a b = Grid $ specialZip ($) (arr a) (arr b)
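
As a quick usage sketch (not from the post, and assuming the zipWith-based Applicable [] instance above is the one in scope), the Ap wrapper then gives zippy application, with pure coming for free as the Pure constructor:

example :: Ap [] Int
example = (+) <$> Ap [1, 2, 3] <*> Ap [10, 20, 30]
-- example                  evaluates to Ap [11,22,33]
-- pure (+1) <*> Ap [1,2,3] evaluates to Ap [2,3,4]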

Intuitively, it seems like anything that can be zipped is also an applicative, but I haven't checked if the laws hold up.

submitted by antiquemilkshake
Categories: Incoming News

indexed writer monad

haskell-cafe - Wed, 03/25/2015 - 3:32pm
Anyone? I can handle monads, but I have something (actually in F#) that feels like it should be an indexed writer monad (which F# probably wouldn't support), so I thought I'd do some research in Haskell. I know little or nothing about indexed monads (though I have built the indexed state monad in C#). I would assume there would be an indexed monoid (that looks a bit like a tuple?), e.g.

(a,b) ++ (c,d) = (a,b,c,d)
(a,b,c,d) ++ (e) = (a,b,c,d,e)

There seems to be some stuff about "update monads", but that doesn't really look like a writer. I could do with playing around with an indexed writer, in order to get my head around what I'm doing... then try to capture what I'm doing... then try (and fail) to port it back.
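
Not an answer from the thread, but one minimal sketch of how such a thing can look in Haskell, using a type-level list as the index so the accumulated output grows in the types, roughly matching the (a,b) ++ (c,d) = (a,b,c,d) intuition above. All names are invented, and ireturn/ibind are given as plain functions rather than instances of any particular indexed-monad class:

{-# LANGUAGE DataKinds, GADTs, KindSignatures, TypeFamilies, TypeOperators #-}

import Data.Kind (Type)

-- A heterogeneous list: the type-level index records what has been written.
data HList (ts :: [Type]) where
  HNil  :: HList '[]
  (:::) :: t -> HList ts -> HList (t ': ts)
infixr 5 :::

-- Type-level append: the "indexed monoid" operation on the index.
type family (ts :: [Type]) ++ (us :: [Type]) :: [Type] where
  '[]       ++ us = us
  (t ': ts) ++ us = t ': (ts ++ us)

-- Appending the logs themselves.
happend :: HList ts -> HList us -> HList (ts ++ us)
happend HNil       us = us
happend (t ::: ts) us = t ::: happend ts us

-- An indexed writer: the index i records the types written so far.
newtype IWriter (i :: [Type]) a = IWriter { runIWriter :: (a, HList i) }

ireturn :: a -> IWriter '[] a
ireturn a = IWriter (a, HNil)

itell :: t -> IWriter '[t] ()
itell t = IWriter ((), t ::: HNil)

ibind :: IWriter i a -> (a -> IWriter j b) -> IWriter (i ++ j) b
ibind (IWriter (a, w)) k =
  let IWriter (b, w') = k a
  in  IWriter (b, happend w w')

-- The final index is '[String, Int], mirroring what was written and in what order.
example :: IWriter '[String, Int] ()
example = itell "hello" `ibind` \_ -> itell (42 :: Int)
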
Categories: Offsite Discussion

Ketil Malde: Can you trust science?

Planet Haskell - Wed, 03/25/2015 - 2:00pm

Hardly a week goes by without newspapers writing about new and exciting results from science. Perhaps scientists have discovered a new wonderful drug for cancer treatment, or maybe they have found a physiological cause for CFS. Or perhaps this time they finally proved that homeopathy works? And in spite of these bold announcements, we still don't seem to have cured cancer. Science is supposed to be the method that enables us to answer questions about how the world works, but one could be forgiven for wondering whether it, in fact, works at all.

As my latest contribution to my local journal club, I presented a paper by Ioannidis, titled "Why most published research findings are false" [1]. This created something of a stir when it was published in 2005, because it points out some simple mathematical reasons why science isn't as accurate as we would like to believe.

The ubiquitous p-value

Science is about finding out what is true. For instance, is there a relationship between treatment with some drug and the progress of some disease - or is there not? There are several ways to go about finding out, but in essence, it boils down to making some measurements, and doing some statistical calculations. Usually, the result will be reported along with a p-value, which is a by-product of the statistical calculations saying something about how certain we are of the results.

Specifically, if we claim there is a relationship, the associated p-value is the probability we would make such a claim even if there is no relationship in reality.

We would like this probability to be low, of course, and since we are free to select the p-value threshold, it is usually chosen to be 0.05 (or 0.01), meaning that if the claim is false, we will only accept it 5% (or 1%) of the time.

The positive predictive value

Now, the p-value is often interpreted as the probability of our (positive) claim being wrong. This is incorrect! There is a subtle difference here which it is important to be aware of. What you must realize is that the probability α relies on the assumption that the hypothesis is wrong - which may or may not be the case; we don't know (which is precisely why we want to find out).

The probability of a positive claim being correct after the fact is called the positive predictive value (PPV). In order to say something about this, we also need to take into account the probability of claiming there exists a relationship when the claim is true. Our methods aren't perfect, and even if a claim is true, we might not have sufficient evidence to say for sure.

So, let's take one step back and look at our options. Our hypothesis (e.g., drug X works against disease Y) can be true or false. In either case, our experiment and analysis can lead us to reject or accept it with some probability. This gives us the following 2-by-2 table:

           True    False
  Accept   1-β     α
  Reject   β       1-α

Here, α is the probability of accepting a false relationship by accident (i.e., the p-value), and β is the probability of missing a true relationship -- we reject a hypothesis, even when it is true.

To see why β matters, consider a hypothetical really really poor method, which has no chance of identifying a true relationship, in other words, β = 1. Then, every accepted hypothesis must come from the False column, as long as α is at all positive. Even if the p-value threshold only accepts 1 in 20 false relationships, that's all you will get, and as such, they constitute 100% of the accepted relationships.

But looking at β is not sufficient either. Let's say a team of researchers tests hundreds of hypotheses, all of which happen to be false. Then again, some of them will get accepted anyway (sneaking in under the p-value threshold α), and since there are no hypotheses in the True column, again every positive claim is false.

A β of 1 or a field of research with 100% false hypotheses are extreme cases [2], and in reality, things are not quite so terrible. The Economist had a good article with a nice illustration showing how this might work in practice with more reasonable numbers. It should still be clear that the ratio of true to false hypotheses being tested, as well as the power of the analysis to identify true hypotheses, are important factors. And if these numbers approach their limits, things can get quite bad enough.
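
To make that dependence concrete, here is a small numeric sketch (mine, not the post's), computing the PPV from α, β and the pre-study ratio R of true to false hypotheses, following Ioannidis' formula PPV = (1 - β)R / ((1 - β)R + α):

-- PPV as a function of the significance threshold alpha, the
-- false-negative rate beta, and the ratio r of true to false hypotheses.
ppv :: Double -> Double -> Double -> Double
ppv alpha beta r = (1 - beta) * r / ((1 - beta) * r + alpha)

main :: IO ()
main = do
  print (ppv 0.05 0.2 1.0)    -- well-powered study, 1:1 true/false ratio: ~0.94
  print (ppv 0.05 0.8 0.01)   -- weak power, 1 true per 100 false hypotheses: ~0.04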

More elaborate models

Other factors also influence the PPV. Try as we might to be objective, scientists often try hard to find a relationship -- that's what you can publish, after all [3]. Perhaps in combination with a less firm grasp of statistics than one could wish for (and scientists who think they know enough statistics are few and far between - I'm certainly no exception there), this introduces bias towards acceptance.

Multiple teams pursuing the same challenges in a hot and rapidly developing field also decrease the chance of results being correct, and there's a whole cottage industry of scientists reporting spectacular and surprising results in high-ranking journals, followed by a trickle of failures to replicate.

Solving this

One option is to be stricter - this is the default response when you do multiple hypothesis testing: you require a lower p-value threshold in order to reduce α. The problem with this is that if you are stricter with what you accept as true, you will also reject more actually true hypotheses. In other words, you can reduce α, but only at the cost of increasing β.

On the other hand, you can reduce β by running a larger experiment. One obvious problem with this is cost: for many problems, a cohort of a hundred thousand or more is necessary, and not everybody can afford to run that kind of study. Perhaps even worse, a large cohort means that almost any systematic difference will be found significant. Biases that normally are negligible will show up as glowing bonfires in your data.

In practice?

Modern biology has changed a lot in recent years, and today we are routinely using high-throughput methods to test the expression of tens of thousands of genes, or the value of hundreds of thousands of genetic markers.

In other words, we simultaneously test an extreme number of hypotheses, where we expect a vast majority of them to be false, and in many cases, the effect size and the cohort are both small. It's often a new and exciting field, and we usually strive to use the latest version of the latest technology, always looking for new and improved analysis tools.

To put it bluntly, it is extremely unlikely that any result from this kind of study will be correct. Some people will claim these methods are still good for "hypothesis generation", but Ioannidis shows a hypothetical example where a positive result increases the likelihood that a hypothesis is correct by 50%. This doesn't sound so bad, perhaps, but in reality, the likelihood is only improved from 1 in 10000 to 1 in 7000 or so. I guess three thousand fewer trials to run in the lab is something, but you're still going to spend the rest of your life running the remaining ones.

You might expect scientists to be on guard for this kind of thing, and I think most scientists will claim they desire to publish correct results. But what counts for your career is publications and citations, and incorrect results are no less publishable than correct ones - they might even get cited more, as people fail to replicate them. And as you climb the academic ladder, publications in high-ranking journals are what counts, and for that you need spectacular results. And it is much easier to get spectacular incorrect results than spectacular correct ones. So the academic system rewards and encourages bad science.

Consequences

The bottom line is to be skeptical of any reported scientific results. The ability of the experiment and analysis to discover true relationships is critical, and one should always ask what the effect size is, and what the statistical power -- the probability of detecting a real effect -- is.

In addition, the prior probability of the hypothesis being true is crucial. Apparently solid empirical evidence of people getting cancer from cell phone radiation, or of homeopathic treatment of disease working, can almost be dismissed out of hand - there simply is no plausible explanation for how that would work.

A third thing to look out for, is how well studied a problem is, and how the results add up. For health effects of GMO foods, there is a large body of scientific publications, and an overwhelming majority of them find no ill effects. If this was really dangerous, wouldn't some of these investigations show it conclusively? For other things, like the decline of honey bees, or the cause of CFS, there is a large body of contradictory material. Again - if there was a simple explanation, wouldn't we know it by now?

  1. And since you ask: No, the irony of substantiating this claim with a scientific paper is not lost on me.

  2. Actually, I would suggest that research in paranormal phenomena is such a field. They still manage to publish rigorous scientific works, see this Less Wrong article for a really interesting take.

  3. I think the problem is not so much that you can't publish a result claiming no effect, but that you can rarely claim it with any confidence. Most likely, you just didn't design your study well enough to tell.

Categories: Offsite Blogs

Confusion regarding the differences between ByteString types

Haskell on Reddit - Wed, 03/25/2015 - 12:34pm

Hello /r/haskell. Currently I'm working on a server program that needs to parse incoming data from sockets. I'm a bit confused as to which data types I should be using. I believe I need to use strict ByteStrings, and some of the incoming data might be Unicode.

Basically, I parse the first N bytes following an established protocol, and a payload that might represent valid Unicode follows. From what I understand, it makes sense to interpret the socket data as a ByteString, and convert the relevant Unicode portion to Text.

When I hGet the data from my socket handle, which ByteString should I be choosing?

Data.ByteString (ByteString)
Data.ByteString.Lazy (ByteString)
Data.ByteString.Char8 (ByteString)
... etc.

In addition, what difference will there be between strict and lazy bytestrings, and in what situations should I choose one over the other? Some libraries return lazy bytestrings when I am using strict ones, and vice-versa.
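
Not from the thread, but a minimal sketch of one common arrangement (the 16-byte header size and the hard-coded payload length are invented placeholders): read with the strict API, then decode the Unicode portion to Text with decodeUtf8'. Note that Data.ByteString.Char8 exports the same strict ByteString type, just with Char8-oriented functions:

import qualified Data.ByteString as BS        -- strict ByteString
import           Data.Text (Text)
import qualified Data.Text.Encoding as TE
import           System.IO (Handle)

headerSize :: Int
headerSize = 16                               -- hypothetical, per your protocol

readMessage :: Handle -> IO (BS.ByteString, Either String Text)
readMessage h = do
  header  <- BS.hGet h headerSize             -- strict read of up to N bytes
  -- ... parse the payload length out of the header here ...
  let payloadLen = 128                        -- hypothetical placeholder
  payload <- BS.hGet h payloadLen
  -- decodeUtf8' is total: invalid UTF-8 comes back as Left, not an exception.
  return (header, either (Left . show) Right (TE.decodeUtf8' payload))

Strict ByteString is usually the right default for fixed-size socket reads like this; lazy ByteString mainly earns its keep when streaming large bodies, and Data.ByteString.Lazy.toStrict / fromStrict convert when a library hands you the other flavour.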

Thank you in advance for helping clear my confusion.

submitted by pythonista_barista
Categories: Incoming News

Where does GHC spend most of its time during compilation?

Haskell on Reddit - Wed, 03/25/2015 - 8:04am

I'm just wondering what contributing factors result in long compile times for GHC (with the exception of Template Haskell). Is it things like type checking and analysis, the codegen process, or something else?

submitted by gaymenonaboat
Categories: Incoming News

FP Complete: FP Complete's Hackage mirror

Planet Haskell - Wed, 03/25/2015 - 8:00am

We have been running a mirror of the Hackage package repository, which we use internally for the FP Complete Haskell Centre's IDE, building Stackage, and other purposes. This has been an open secret, but now we're making it official.

To use it, replace the remote-repo line in your ~/.cabal/config with the following:

remote-repo: hackage.fpcomplete.com:http://hackage.fpcomplete.com/

Then run cabal update, and you're all set.

This mirror is updated every half-hour. It is statically hosted on Amazon S3 so downtime should be very rare (Amazon claims 99.99% availability).

The mirror does not include the HTML documentation. However, Stackage hosts documentation for a large set of packages.

We have also released our hackage-mirror tool. It takes care of efficiently updating a static mirror of Hackage on S3, should anyone wish to host their own. While the official hackage-server has its own support for mirroring, our tool differs in that it does not require running a hackage-server process to host the mirror.

HTTPS for Stackage

On a tangentially related note, we have enabled TLS for www.stackage.org. Since cabal-install does not support TLS at this time, we have not set up an automatic redirect from insecure connections to the https:// URL.

Categories: Offsite Blogs

What are your most persuasive examples of using Quickcheck?

Haskell on Reddit - Wed, 03/25/2015 - 5:32am

I'm writing documentation for my Python QuickCheck-like library, Hypothesis, and I'm looking for examples of using QuickCheck that are a little more persuasive than reversing a list or checking commutativity of numbers.

In particular I'm looking for examples that make people go "Oh, I could totally use this in $DAYJOB". I find most QuickCheck examples seem to start from the assumption that you're writing a library, which most people aren't.

The examples I have so far are:

But I'd really like more, particularly ones from domains where you wouldn't necessarily think to have used Quickcheck, or ones with a certain "wow!" factor.
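
For what it's worth, one property that tends to land with working programmers is the serialise/deserialise round trip over a domain type - sketched here in Haskell with an invented Invoice type and a toy show/reads codec standing in for whatever the application really uses:

import Test.QuickCheck

data Invoice = Invoice
  { invoiceId   :: Int
  , amountCents :: Int
  , customer    :: String
  } deriving (Eq, Show)

instance Arbitrary Invoice where
  arbitrary = Invoice <$> arbitrary <*> arbitrary <*> arbitrary

-- Hypothetical codec; in practice this would be your JSON/binary instance.
encode :: Invoice -> String
encode (Invoice i a c) = show (i, a, c)

decode :: String -> Maybe Invoice
decode s = case reads s of
  [((i, a, c), "")] -> Just (Invoice i a c)
  _                 -> Nothing

-- Decoding an encoded value must give back exactly what we started with.
prop_roundTrip :: Invoice -> Bool
prop_roundTrip inv = decode (encode inv) == Just inv

main :: IO ()
main = quickCheck prop_roundTrip

The same shape works for JSON encoders, config parsers, database marshalling, normalisation functions (idempotence), and "optimised vs. reference implementation" comparisons.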

submitted by DRMacIver
Categories: Incoming News

Command Line Args passed to wxHaskell

haskell-cafe - Wed, 03/25/2015 - 4:20am
I am seeing strange behavior with a wxHaskell app compiled on Windows 7. On Linux, all is well: I can call my app like

app +RTS -N4 -RTS myArg

and in the app I can process myArg and start a wxHaskell frame.

When I compile the same application on Windows 7, I get an error dialog box that says “Unexpected parameter `+RTS`”, and a second usage dialog that looks like it comes from wxHaskell. I am not sure why Windows is different, but perhaps it is the fact that on Windows 7 I compiled up wxHaskell 0.92, while on Linux I used 0.91 from a cabal update. I used 0.92 on Windows because I could not get 0.91 to compile, due to some type problems where the wxPack version was incompatible with a header file and the Haskell compiler, related to the 64-bit type long long. There is some noise about this on the web, but no solutions.

Nonetheless, I assume that args are grabbed directly by wxHaskell and that Environment.getArgs does not consume them, such that they are still available to wxHaskell. Is there some way to
Categories: Offsite Discussion