Principal Components Analysis, Eigenfaces, and Karhunen-Loève Expansions

Principal Components Analysis

Instead of photographs, let’s suppose we’re just looking at vectors, i.e. points in space, randomly drawn from some unknown distribution.

Part of the crabs dataset in R's MASS package.

Part of the crabs dataset in R’s MASS package.

In the image above, the points clearly vary most along some diagonal line. There are a few things we might want to do with this:

1. We might want to plot the data so that the axes represent key features. In this case, we’d like to rotate the plot, so that this diagonal line is along the horizontal axis.

2. We might want to plot the data so that a point’s position on one axis is linearly independent of its position along the other ones. In this way, we could generate new points from the distribution without having to overly worry about covariance issues, because we can generate the position along each axis separately.

Happily, PCA does both of these.

The new axes. Note one appears to go along some sort of central line in the data.

The new perpendicular axes. Note one appears to go along some sort of central line in the data.

The data plotted with respect to the new axes. The data's variance clearly increases from left to right, but the covariance is zero.

The data plotted with respect to the new axes. The data’s variance clearly increases from left to right, but the covariance is zero.

I might get round to looking at the mathematics behind how PCA does this in a later post. For now, let’s see how PCA works for our photographs.

Eigenface Decomposition

The first step is to find the mean face, which we did last time. If we now subtract the red, blue and green intensities for pixels of this mean face from those of the originals, we get this shifty group.

The “difference faces”, with thresholding.

These faces look very dark. The reason for this is fairly simple: when we subtract the mean face, a large number of the colour intensities will end up being negative, rather than in [0,1]. R refuses to plot anything when this happens, so I’ve simply told it to count negative values as being zero — pitch black. Later on, I’ll also count values greater than one as being one — pitch white. Jeremy Kun notices the same tendency to darkness, but doesn’t really explain why: I suspect Mathematica, the software Kun uses in his post, does this “thresholding” automatically.

As an alternative to thresholding, I’ll often be linearly “rescaling” images, so that the smallest intensity goes to zero, and the largest goes to one. For the “difference faces”, since subtracting the mean face is also a linear transformation, rescaling means we get an image that’s pretty similar to the original.

The difference faces with rescaling.

The difference faces with rescaling.

So far, so good. Now for the eigenvectors, or “eigenfaces”. Remember, these represent the directions in which we expect variation to be independent, with highest-variance directions first. The photographs are 66 by 58 pixels in three colours, so these vectors have 66\times58\times3=11484 elements each, and are each normalised so that the sum of the squares of their intensities is equal to one, so it shouldn’t be too surprising that their intensities are all close to zero. This makes thresholding useless, so we just show the rescaled version.

The eigenfaces, rescaled. The rescaling was done ignoring the last face, as this is essentially random noise, with a far larger intensity range, but no informative value.

The eigenfaces, rescaled. The rescaling was done ignoring the last face, as this is essentially random noise, with a far larger intensity range, but no informative value.

These are certainly less sinister than those in Kun’s article. I assume this is the consequence of looking at such a homogenous set of people, and Türk doing a good job of positioning everyone in the same place in the frame. Such a small and cooperative data set makes this blogger a happy little theorist.

We wouldn’t really expect these eigenfaces to exactly align with introducing or removing different face features, but looking at them shows some features a lot more obviously than others. For example: the first eigenface appears to be determining asymmetry in lighting on the two sides of the face; the second one partially accounts for glasses, and how far down the hair comes over the forehead; eigenface nine accounts for asymmetry in how far the hair comes down on each side of the forehead, and also looks rather similar to the original face nine, since that is the only face with such a drastic asymmetry; and so on.

So, what can we do with these eigenfaces? Well, we can easily decompose a face into the mean face, plus a linear sum of the eigenfaces — twenty-five of them, here. If we want to reduce the size of the data, we can start throwing out some of the eigenfaces. In particular, since we’ve calculated how much variability there is along each eigenface, we can throw out the least variable ones first. This way, we minimise how much of the variance between the faces is thrown out, and so keep the faces as distinct as possible.

To illustrate this, we can introduce the eigenfaces for a face one at a time, to observe the effect on the image:

Progression for face one.

Progression for face one.

Progression for face two.

Progression for the more distinctive face nine.

Some faces will become clear faster than others, but it seems like both faces become close to the originals by, say, eighteen components. Indeed, if we only use eighteen components for each face, we get the following:

Class Average I, reconstructed using only eighteen of twenty-five components.

Class Average I, reconstructed using only eighteen of twenty-five components.

Faces like number ten are a bit vague, but that’s mostly pretty good.

So, what else can we do? Well, since we know how faces vary along each eigenface, we can try generating sets of new faces. The results can be, well, rather mixed. Sometimes the results look OK, sometimes they don’t look convincing at all.

This set doesn't look too convincing to me.

This set doesn’t look too convincing to me.

This one looks better.

This one looks better.

The one on the left's been in the wars. The one on the right looks shocked to be here.

The one on the left’s been in the wars. The one on the right looks shocked to be here.

This is probability mostly due to my sampling the valukeepf the eigenface components as independent normal distributions, which makes no sense in the context of the problem.

That’s about it for now. There are a few diagnostic plots I can show off once I find them again, allowing you to do things like assigning a quantity to how distinctive each face is from the rest of the set (nine and twenty-one stand out the most), and a more quantitive assessment of how many eigenfaces to throw out while keeping the faces distinguishable.

Principal Components Analysis, Eigenfaces, and Karhunen-Loève Expansions: Part One

Last time, we were talking about interpolation and orthogonal polynomials. For this series, we’re also going to end up talking about orthogonal vectors and orthogonal functions. But first, we’ll look at some old photos from the Seventies.

Principal Components Analysis
The Budapest National Gallery has its floors arranged in chronological order, so the first floor is given over to an overwhelming sea of medieval paintings of Bible stories and prophets. Well-painted as they are, when the founder of a country is a King Saint, you know you’re in for the long haul on this floor. However, the visitor who hasn’t run out of patience before they reach the floors for the 20th Century will find a cute pair of photographs by Péter Türk.

One is a collation of shots of Türk’s class. For the second, he chopped each of the shots into a grid of squares. The top-left squares from each shot have then been placed together in one large square. The squares of the next grid place along have then been placed alongside, and so on. The result is something that, from a distance, looks like a new face.

Class Average I, Péter Türk, 1979.

Class Average I, Péter Türk, 1979.

Class Average II, Péter Türk, 1979.

Class Average II, Péter Türk, 1979.

Today, this would be done by computer — a mechanical Türk, so to speak — and the squares would be blended together instead of being placed in blocks, but what we have here is some idea of a “mean face”, the average of all the individual photos. It’s not exactly like any of the individual photos, but clearly shows a lot of their shared features. Especially the hair. If we let the computer take the mean over each pixel position, rather than placing them next to each other, we obtain something that looks surprisingly human.

The true mean face, or as close as we can get with low-quality online scans of the originals.

The true mean face, or as close as we can get with low-quality online scans of the originals.

Class Average II rescaled, for comparison.

Class Average II rescaled, for comparison.

It’s worth mentioning — not for the last time — how good a job Türk did at keeping his photographs consistent, with regards to face positioning. This is a small group of people, yes, but if we take the eyes as an example, the eye positions are so similar that the mean face still has pupils!

This tinkering is well and good, but consider what we might want to do with a large collection of such photos.
1. Say we wanted to make a digital database with thousands of these photos. Maybe we want to set up some sort of photographic identification system, where we compare a new photo, of someone requesting access to something, against the photos we have on file. How would we decide if the new photo was similar enough to one on the database to grant them access?
2. Along similar lines, suppose we’re not interested specifically in the people in the photos, but we are interested in finding some rules of thumb for telling between different people in the same population. How would we do so?
3. Now suppose we’d also like to compress the photos somehow, to save on memory. How should we do so while keeping the photos as distinguishable as possible?

One method we can use for this is Principal Components Analysis, which I’ll be talking about over the next few posts. However, here’s a brief taste of what it allows us to do, statistically rather than by guesswork:
1. PCA gives us a way to take a new photo, and make some measure of its “distance” from our originals. We can then decide that it’s a photo of the person in the closest original, or that it’s a new person if all the “distances” are too large.
2. The most important features for distinguishing between people in the above set of photos are the side of their face the light source is on, and how far down their fringe comes.
3. We can compress the photos in such a way that we know how much of the original variance, or distinctiveness between the photos, that we keep. If we don’t mind compressing by different amounts for different photos, we could also keep a set amount of distinctiveness for each photo, rather than across the whole group.
4. We can try — with variable levels of success — to generate new faces from scratch.

Also worth noting is that, apart from 4., all of this can be done with only a covariance estimate: we make no assumptions about the distribution the photos are drawn from.

We’ll come back to these photos, and these applications, later in the series. Next time, we’ll look at something a bit simpler first.