# News aggregator

### Ian Ross: Non-diffusive atmospheric flow #6: principal components analysis

The pre-processing that we’ve done hasn’t really got us anywhere in terms of the main analysis we want to do – it’s just organised the data a little and removed the main source of variability (the seasonal cycle) that we’re not interested in. Although we’ve subsetted the original geopotential height data both spatially and temporally, there is still a *lot* of data: 66 years of 181-day winters, each day of which has 72×15 Z500 values. This is a very common situation to find yourself in if you’re dealing with climate, meteorological, oceanographic or remote sensing data. One approach to this glut of data is something called *dimensionality reduction*, a term that refers to a range of techniques for extracting “interesting” or “important” patterns from data so that we can then talk about the data in terms of how strong these patterns are instead of what data values we have at each point in space and time.

I’ve put the words “interesting” and “important” in quotes here because what’s interesting or important is up to us to define, and determines the dimensionality reduction method we use. Here, we’re going to side-step the question of determining what’s interesting or important by using the de facto default dimensionality reduction method, *principal components analysis* (PCA). We’ll take a look in detail at what kind of “interesting” and “important” PCA give us a little later.

PCA is, in principle, quite a simple method, but it causes many people endless problems. There are some very good reasons for this:

PCA is in some sense nothing more than a generic change of basis operation (with the basis we change to chosen in a special way). The result of this is that a lot of the terminology used about PCA is also very generic, and hence very confusing (words like “basis”, “component”, “eigenvector”, “projection” and so on could mean more or less anything in this context!).

PCA is used in nearly every field where multivariate data is analysed, and is the archetypical “unsupervised learning” method. This means that it has been invented, reinvented, discovered and rediscovered many times, under many different names. Some other names for it are: empirical orthogonal function (EOF) analysis, the Karhunen-Loève decomposition, proper orthogonal decomposition (POD), and there are many others. Each of these different fields also uses different terms for the different outputs from PCA. This is

*very*confusing: some people talk about principal components, some about empirical orthogonal functions and principal component time series, some about basis functions, and so on. Here, we’re going to try to be very clear and careful about the names that we use for things to try to alleviate some of the confusion.There is a bit of a conceptual leap that’s necessary to go from very basic examples of using PCA to using PCA to analyse the kind of spatio-temporal data we have here. I used to say something like: “Well, there’s a nice two-dimensional example, and it works just the same in 100 dimensions, so let’s just apply it to our atmospheric data!” A perfectly reasonable reponse to that is: “WHAT?! Are you an idiot?”. Here, we’re going to take that conceptual leap slowly, and describe exactly how the “change of basis” view of PCA works for spatio-temporal data.

There are some aspects of the scaling of the different outputs from PCA that are

*really*confusing. In simple terms, PCA breaks your data down into two parts, and you could choose to put the units of your data on either one of those parts, normalising the other part. Which one you put the units on isn’t always an obvious choice and it’s really easy to screw things up if you do it wrong. We’ll look at this carefully here.

So, there’s quite a bit to cover in the next couple of articles. In this article, we will: explain the basic idea of PCA with a very simple (two-dimensional!) example; give a recipe for how to perform PCA on a data set; talk about why PCA works from an algebraic standpoint; talk about how to do these calculations in Haskell. Then in the next article, we will: describe exactly how we do PCA on spatio-temporal data; demonstrate how to perform PCA on the Z500 anomaly data; show how to visualise the Z500 PCA results and save them for later use. What we will end up with from this stage of our analysis is a set of “important” spatial patterns (we’ll see what “important” means for PCA) and time series of how strong each of those spatial patterns is at a particular point in time. The clever thing about this decomposition is that we can restrict our attention to the few most “important” patterns and discard all the rest of the variability in the data. That makes the subsequent exploration of the data much simpler.

The basic idea of PCAWe’re going to take our first look at PCA using a very simple example. It might not be immediately obvious how the technique we’re going to develop here will be applicable to the spatio-temporal Z500 data we really want to analyse, but we’ll get to that a little later, after we’ve seen how PCA works in this simple example and we’ve done a little algebra to get a clearer understanding of just why the “recipe” we’re going to use works the way that it does.

Suppose we go to the seaside and measure the shells of mussels1. We’ll measure the length and width of each shell and record the data for each mussel as a two-dimensional (length, width) vector. There will be variation in the sizes and shapes of the mussels, some longer, some shorter, some fatter, some skinnier. We might end up with data that looks something like what’s shown below, where there’s a spread of length in the shells around a mean of about 5 cm, a spread in the width of shells around a mean of about 3 cm, and there’s a clear correlation between shell length and width (see Figure 1 below). Just from eyeballing this picture, it seems apparent that maybe measuring shell length and width might not be the best way to represent this data – it looks as though it could be better to think of some combination of length and width as measuring the overall “size” of a mussel, and some other combination of length and width as measuring the “fatness” or “skinniness” of a mussel. We’ll see how a principal components analysis of this data extracts these two combinations in a clear way.

*The code for this post is available in a Gist. The Gist contains a Cabal file as well as the Haskell source, to make it easy to build. Just do something like this to build and run the code in a sandbox:*

Just for a slight change, I’m going to produce all the plots in this section using Haskell, specifically using the Chart library. We’ll use the hmatrix library for linear algebra, so the imports we end up needing are:

import Control.Monad import Numeric.LinearAlgebra.HMatrix import Graphics.Rendering.Chart.Easy hiding (Matrix, Vector, (|>), scale) import Graphics.Rendering.Chart.Backend.CairoThere are some name overlaps between the monadic plot interface provided by the Graphics.Rendering.Chart.Easy module and hmatrix, so we just hide the overlapping ones.

We generate 500 synthetic data points:

-- Number of test data points. n :: Int n = 500 -- Mean, standard deviation and correlation for two dimensions of test -- data. meanx, meany, sdx, sdy, rho :: Double meanx = 5.0 ; meany = 3.0 ; sdx = 1.2 ; sdy = 0.6 ; rho = 0.75 -- Generate test data. generateTestData :: Matrix Double generateTestData = let seed = 1023 mean = 2 |> [ meanx, meany ] cov = matrix 2 [ sdx^2 , rho*sdx*sdy , rho*sdx*sdy , sdy^2 ] samples = gaussianSample seed n mean cov in fromRows $ filter ((> 0) . minElement) $ toRows samplesThe mussel shell length and width values are generated from a two-dimensional Gaussian distribution, where we specify mean and standard deviation for both shell length and width, and the correlation between the length and width (as the usual Pearson correlation coefficient). Given this information, we can generate samples from the Gaussian distribution using hmatrix’s gaussianSample function. (If we didn’t have this function, we would calculate the Cholesky decomposition of the covariance matrix we wanted, generate samples from a pair of standard one-dimensional Gaussian distributions and multiple two-dimensional vectors of these samples by one of the Cholesky factors of the covariance matrix – this is just what the gaussianSample function does for us.) We do a little filtering in generateTestData to make sure that we don’t generate any negative values2.

The main program that drives the generation of the plots we’ll look at below is:

main :: IO () main = do let dat = generateTestData (varexp, evecs, projs) = pca dat (mean, cov) = meanCov dat cdat = fromRows $ map (subtract mean) $ toRows dat forM_ [(PNG, "png"), (PDF, "pdf"), (SVG, "svg")] $ \(ptype, suffix) -> do doPlot ptype suffix dat evecs projs 0 doPlot ptype suffix cdat evecs projs 1 doPlot ptype suffix cdat evecs projs 2 doPlot ptype suffix cdat evecs projs 3 putStrLn $ "FRACTIONAL VARIANCE EXPLAINED: " ++ show varexpand you can see the doPlot function that generates the individual plots in the Gist. I won’t say a great deal about the plotting code, except to observe that the new monadic API to the Chart library makes generating this kind of simple plot in Haskell no harder than it would be using Gnuplot or something similar. The plot code produces one of four plots depending on an integer parameter, which ranges from zero (the first plot above) to three. Because we’re using the Cairo backend to the Chart library, we can generate image output in any of the formats that Cairo supports – here we generate PDF (to insert into LaTeX documents), SVG (to insert into web pages) and PNG (for a quick look while we’re playing with the code).

The main program above is pretty simple: generate test data, do the PCA calculation (by calling the pca function, which we’ll look at in detail in a minute), do a little bit of data transformation to help with plotting, then call the doPlot function for each of the plots we want. Here are the plots we produce, which we’ll refer to below as we work through the PCA calculation:

Synthetic mussel shell test data for two-dimensional PCA example.

Centred synthetic mussel shell test data for two-dimensional PCA example.

PCA eigenvectors for two-dimensional PCA example.

Data projection onto PCA eigenvectors for two-dimensional PCA example.

Let’s now run through the “recipe” for performing PCA, looking at the figures above in parallel with the code for the pca function:

1 2 3 4 5 6 7 8 9 10 11 pca :: Matrix Double -> (Vector Double, Matrix Double, Matrix Double) pca xs = (varexp, evecs, projs) where (mean, cov) = meanCov xs (_, evals, evecCols) = svd cov evecs = fromRows $ map evecSigns $ toColumns evecCols evecSigns ev = let maxelidx = maxIndex $ cmap abs ev sign = signum (ev ! maxelidx) in cmap (sign *) ev varexp = scale (1.0 / sumElements evals) evals projs = fromRows $ map project $ toRows xs project x = evecs #> (x - mean)We’ll look at just why this recipe works in the next section, but for the moment, let’s just see what happens:

We start with our original mussel shell data (Figure 1 above).

We calculate the mean and covariance of our data (line 3 of the pca function listing). PCA analyses the deviations of our data from the mean, so we effectively look at “centred” data, as shown in Figure 2, where we’ve just removed the mean from each coordinate in our data. The mean and covariance calculation is conveniently done using hmatrix’s meanCov function.

Then we calculate the eigendecomposition of the covariance matrix. Because the covariance matrix is a real symmetric matrix, by construction, we know that the eigenvectors will form a complete set that we can use as a basis to represent our data. (We’re going to blithely ignore all questions of possible degeneracy here – for real data, “almost surely” means always!) Here, we do the eigendecomposition using a singular value decomposition (line 4 in the listing of the pca function). The singular values give us the eigenvalues and the right singular vectors give us the eigenvectors. The choice here to use SVD (via hmatrix’s svd function) rather than some other means of calculating an eigendecomposition is based primarily on the perhaps slightly prejudiced idea that SVD has the best and most stable implementations – here, hmatrix calls out to LAPACK to do this sort of thing, so there’s probably not much to choose, since the other eigendecomposition implementations in LAPACK are also good, but my prejudice in favour of SVD remains! If you want some better justification for why SVD is “the” matrix eigendecomposition, take a look at this very interesting historical review of the development of SVD: G. W. Stewart (1993). On the early history of the singular-value decomposition.

*SIAM Rev.***35**(4), 551-566.We do a little manipulation of the directions of the eigenvectors (lines 5-8 in the listing), flipping the signs of them to make the largest components point in the positive direction – this is mostly just to make the eigenvectors look good for plotting. The eigenvectors are shown in the Figure 3: we’ll call them e1 (the one pointing to the upper right) and e2 (the one pointing to the upper left). Note that these are unit vectors. We’ll talk about this again when we look at using PCA for spatio-temporal data.

Once we have unit eigenvectors, we can project our (centred) data points onto these eigenvectors (lines 10 and 11 of the listing: the project function centres a data point by taking off the mean, then projects onto each eigenvector using hmatrix’s matrix-vector product operator #>). Figure 4 shows in schematic form how this works – we pick out one data point in green and draw lines parallel and orthogonal to the eigenvectors showing how we project the data point onto the eigenvectors. Doing this for each data point is effectively just a change of basis: instead of representing our centred data value by measurements along the x- and y-axes, we represent it by measurements in the directions of e1 and e2. We’ll talk more about this below as well.

Finally, the eigenvalues from the eigendecomposition of the covariance matrix tell us something about how much of the total variance in our input data is “explained” by the projections onto each of the eigenvectors. I’ve put the word “explained” in quotes because I don’t think it’s a very good word to use, but it’s what everyone says. Really, we’re just saying how much of the data variance lies in the direction of each eigenvector. Just as you can calculate the variance of the mussel length and width individually, you can calculate the variance of the projections onto the eigenvectors. The eigenvalues from the PCA eigendecomposition tell you how much variance there is in each direction, and we calculate the “fraction of variance explained” for each eigenvector and return it from the pca function.

So, the pca function returns three things: eigenvalues (actually fractional explained variance calculated from the eigenvalues) and eigenvectors from the PCA eigendecomposition, plus projections of each of the (centred) data points onto each of the eigenvectors. The terminology for all these different things is very variable between different fields. We’re going to sidestep the question of what these things are called by always explicitly referring to *PCA eigenvectors* (or, later on when we’re dealing with spatio-temporal data, *PCA eigenpatterns*), *PCA explained variance fractions* and *PCA projected components*. These terms are a bit awkward, but there’s no chance of getting confused this way. We could choose terminology from one of the fields where PCA is commonly used, but that could be confusing for people working in other fields, since the terminology in a lot of cases is *not* very well chosen.

Together, the PCA eigenvectors and PCA projected components constitute nothing more than a change of orthonormal basis for representing our input data – the PCA output contains *exactly* the same information as the input data. (Remember that the PCA eigenvectors are returned as *unit vectors* from the pca function, so we really are just looking at a simple change of basis.) So it may seem as though we haven’t really done anything much interesting with our data. The interesting thing comes from the fact that we can order the PCA eigenvectors in decreasing order of the explained variance fraction. If we find that data projected onto the first three (say) eigenvectors explains 80% of the total variance in our data, then we may be justified in considering *only* those three components. In this way, PCA can be used as a *dimensionality reduction* method, allowing us to use low-dimensional data analysis and visualisation techniques to deal with input data that has high dimensionality.

This is exactly what we’re going to do with the Z500 data: we’re going to perform PCA, and take only the leading PCA eigenvectors and components, throwing some information away. The way that PCA works guarantees that the set of orthogonal patterns we keep are the “best” patterns in terms of explaining the variance in our data. We’ll have more to say about this in the next section when we look at why our centre/calculate covariance/eigendecomposition recipe works.

The algebra of PCAIn the last section, we presented a “recipe” for PCA (at least for two-dimensional data): centre the data; calculate the covariance matrix; calculate the eigendecomposition of the covariance matrix; project your centred data points onto the eigenvectors. The eigenvalues give you a measure of the proportion of the variance in your data in the direction of the corresponding eigenvector. And the projection of the data points onto the PCA eigenvectors is just a change of basis, from whatever original basis your data was measured in (mussel shell length and width as the two components of each data point in the example) to a basis with the PCA eigenvectors as basis vectors.

So why does this work? Obviously, you can use whatever basis you like to describe your data, but why is the PCA eigenbasis useful and interesting? I’ll explain this quite quickly, since it’s mostly fairly basic linear algebra, and you can read about it in more detail in more or less any linear algebra textbook3.

To start with, let’s review some facts about eigenvectors and eigenvalues. For a matrix A, an eigenvector u and its associated eigenvalue λ satisfy

Au=λu.

The first thing to note is that any scalar multiple of u is also an eigenvector, so an eigenvector really refers to a “direction”, not to a specific vector with a fixed magnitude. If we multiply both sides of this by uT and rearrange a little, we get

λ=uTAuuTu.

The denominator of the fraction on the right hand side is just the length of the vector u. Now, we can find the largest eigenvalue λ1 and corresponding eigenvector u1 by solving the optimisation problem

u1=arg maxuTu=1uTAu,

where for convenience, we’ve restricted the optimisation to find a *unit* eigenvector, and we find λ1 directly from the fact that Au1=λ1u1.

We can find next largest (in magnitude) eigenvalue and corresponding eigenvector of the matrix A by projecting the rows of A into the subspace orthogonal to u1 to give a new matrix A1 and solving the optimisation problem

u2=arg maxuTu=1uTA1u,

finding the second largest eigenvalue λ2 from A1u2=λ2u2. Further eigenvectors and eigenvalues can be found in order of decreasing eigenvalue magnitude by projecting into subspaces orthogonal to all the eigenvectors found so far and solving further optimisation problems.

This link between this type of optimisation problem and the eigenvectors and eigenvalues of a matrix is the key to understanding why PCA works the way that it does. Suppose that we have centred our (K-dimensional) data, and that we call the N centred data vectors xi, i=1,2,…N. If we now construct an N×K matrix X whose rows are the xi, then the sample covariance of the data is

C=1N−1XTX.

Now, given a direction represented as a unit vector u, we can calculate the data variance in that direction as ∣∣Xu∣∣2, so that if we want to know the direction in which the data has the greatest variance, we solve an optimisation problem of the form

u1=arg maxuTu=1(Xu)TXu=arg maxuTu=1uTCu.

But this optimisation problem is just the eigendecomposition optimisation problem for the covariance matrix C. This demonstrates we can find the directions of maximum variance in our data by looking at the eigendecomposition of the covariance matrix C in decreasing order of eigenvalue magnitude.

There are a couple of things to add to this. First, the covariance matrix is, by construction, a real symmetric matrix, so its eigenvectors form a complete basis – this means that we really can perform a change of basis from our original data to the PCA basis with no loss of information. Second, because the eigenvectors of the covariance matrix are orthogonal, the projections of our data items onto the eigenvector directions (what we’re going to call the PCA projected components) are *uncorrelated*. We’ll see some consequences of this when we look at performing PCA on the Z500 data. Finally, and related to this point, it’s worth noting that PCA is a *linear* operation – the projected components are linearly uncorrelated, but that doesn’t mean that there can’t be some nonlinear relationship between them. There are generalisations of PCA to deal with this case, but we won’t be talking about them for the purposes of this analysis.

Everything we’ve done here is pretty straightforward, but you might be wondering why we would want to change to this PCA basis at all? What’s the point? As I noted above, but is worth reiterating, the most common use for PCA, and the way that we’re going to use it with the Z500 data, is as a *dimensionality reduction* method. For the Z500 data, we have, for each day we’re looking at, 72×15=1080 spatial points, which is a lot of data to look at and analyse. What we usually do is to perform PCA, then *ignore* all but the first few leading PCA eigenvectors and projected components. Because of the way the optimisation problems described above are set up, we can guarantee that the leading m PCA eigenvectors span the m-dimensional subspace of the original data space containing the most data variance, and we can thus convince ourselves that we aren’t missing interesting features of our data by taking only those leading components. We’ll see how this works in some detail when we do the PCA analysis of the Z500 data, but in the mussel measurement case, this would correspond to thinking just of the projection of the mussel length and width data along the leading e1 eigendirection, so reducing the measurements to a single “size” parameter that neglects the variation in fatness or skinniness of the mussels. (This two-dimensional case is a bit artificial. Things will make more sense when we look at the 1080-dimensional case for the Z500 data.)

Well, not really, since I live in the mountains of Austria and there aren’t too many mussels around here, so I’ll generate some synthetic data!↩

Obviously, a Gaussian distribution is

*not right*for quantities like lengths that are known to be positive, but here we’re just generating some data for illustrative purposes, so we don’t care all that much. If we were trying to*model*this kind of data though, we’d have to be more careful.↩I like Gilbert Strang’s

*Linear Algebra and its Applications*, although I’ve heard from some people that they think it’s a bit hard for a first textbook on the subject – if you’ve had any exposure to this stuff before though, it’s good.↩

### cryptography and homomorphic library support?

### Haskell Weekly News: Issue 306

### optional ghc-options based on GHC version?

### Boolean formula: part2

### DSL implementation

### Keegan McAllister: Raw system calls for Rust

I wrote a small library for making raw system calls from Rust programs. It provides a macro that expands into in-line system call instructions, with no run-time dependencies at all. Here's an example:

#![feature(phase)]#[phase(plugin, link)]

extern crate syscall;

fn write(fd: uint, buf: &[u8]) {

unsafe {

syscall!(WRITE, fd, buf.as_ptr(), buf.len());

}

}

fn main() {

write(1, "Hello, world!\n".as_bytes());

}

Right now it only supports x86-64 Linux, but I'd love to add other platforms. Pull requests are much appreciated. :)

### Constructors on left and right hand sides of equation

### I have created... dumb sort.

Efficient on lists of small numbers!

submitted by Undo_all[link] [2 comments]

### Philip Wadler: The military and the referendum.

Many readers will have heard about Lord Dannat in the Telegraph arguing a vote for independence will dishonour Scotland's war dead.

Perhaps not as many will have heard that Jimmy Sinclair (the oldest surviving Desert Rat, aged 102), Colin May (Lieutenant Commander, Faslane), and sixteen others have written a letter slamming Dannat; at least, I didn't hear until this morning. "How dare he take their sacrifice in vain and try to turn it to political advantage?"

Both sides are reported by the BBC, though the headline mentions only one. (More #bbcbias?)

### Eric Kidd: Deploying Rust applications to Heroku, with example code for Iron

*Now with support for Iron, Cargo and Cargo.lock!*

You can deploy an example Rust application to Heroku using this button:

If you'd prefer to use the command line, you'll need git and the Heroku toolbelt. Once these are installed, run:

git clone https://github.com/emk/heroku-rust-cargo-hello.git cd heroku-rust-cargo-hello heroku create --buildpack https://github.com/emk/heroku-buildpack-rust.git git push heroku masterThis will download the example application, create a new Heroku project, and deploy the code to Heroku. That's it!

How it worksOur server is based on the Iron middleware library. We parse URL parameters and dispatch HTTP requests to the appropriate handler routine using Iron's router module:

fn main() { let mut router = Router::new(); router.get("/", hello); router.get("/:name", hello_name); Iron::new(router).listen(Ipv4Addr(0, 0, 0, 0), get_server_port()); }The hello function returns an HTTP status code and the content to send to the user:

fn hello(_: &mut Request) -> IronResult<Response> { Ok(Response::with(status::Ok, "Hello world!")) }The hello_name function is similar, but we look up the value of the :name parameter that we declared to the router, above.

fn hello_name(req: &mut Request) -> IronResult<Response> { let params = req.extensions.find::<Router,Params>().unwrap(); let name = params.find("name").unwrap(); Ok(Response::with(status::Ok, format!("Hello, {}!", name))) }The final piece needed to run on Heroku is a function to look up up our port number:

fn get_server_port() -> Port { getenv("PORT") .and_then(|s| from_str::<Port>(s.as_slice())) .unwrap_or(8080) }The full source code is available on GitHub. To learn more about Rust, see the excellent Rust guide.

Keep reading for notes on building the program locally and on configuring your own programs to run on Heroku.

### Philip Wadler: The case for Europe

Readers of this list will know that I don't always see eye-to-eye with Alex Salmond. Nonetheless, I think he put the case for Europe well this morning on the Today Program.In a BBC interview just the other night, a spanish minister being

interviewed by Kirsty Wark despite being invited three or four times

refused to say that Spain would attempt to veto Scottish

membership. And the reason for that of course is that the Spanish

government have already said that in the circumstance of a consented

democratic referendum, as they put it, Spain would have nothing to say

about it.

We can go through legal opinion and expert opinion as much as we like.

I think the answer is in four figures: 1, 20, 25, and 60.

1% is the percentage of Scotland's population compared to the European Union.

20% is the percentage of the fish stocks of the entire European Union.

25% is the percentage of the renewable energy of the entire European Union offshore.

And 60% is the oil reserves that Scotland has.

Anyone who believes that a country [with these resources] is not

going to be welcome in the wider Europe doesn't understand the process

by which Europe accepts democratic results and that Scotland has a

huge amount of attractiveness to the rest of the European continent.

You can hear the original here, the relevant segment starts at around 8:00.

### ICFP 2014 Workshops (Videos) : haskell

### Fairly small projects (~1 to 7 days) to do and learn from?

I recently started using haskell and it's quickly become my favorite language. Unfortuantly, I have nothing to apply it to. What are some small projects I can do for fun and to practice my skills?

submitted by Undo_all[link] [18 comments]