Principal Components Analysis, Eigenfaces, and Karhunen-Loève Expansions

Principal Components Analysis

Instead of photographs, let’s suppose we’re just looking at vectors, i.e. points in space, randomly drawn from some unknown distribution.

Part of the crabs dataset in R’s MASS package.

In the image above, the points clearly vary most along some diagonal line. There are a few things we might want to do with this:

1. We might want to plot the data so that the axes represent key features. In this case, we’d like to rotate the plot, so that this diagonal line is along the horizontal axis.

2. We might want to plot the data so that a point’s position on one axis is linearly independent of its position along the other ones. In this way, we could generate new points from the distribution without having to overly worry about covariance issues, because we can generate the position along each axis separately.

Happily, PCA does both of these.

The new perpendicular axes. Note one appears to go along some sort of central line in the data.

The data plotted with respect to the new axes. The data’s variance clearly increases from left to right, but the covariance is zero.

I might get round to looking at the mathematics behind how PCA does this in a later post. For now, let’s see how PCA works for our photographs.

Eigenface Decomposition

The first step is to find the mean face, which we did last time. If we now subtract the red, blue and green intensities for pixels of this mean face from those of the originals, we get this shifty group.

The “difference faces”, with thresholding.

These faces look very dark. The reason for this is fairly simple: when we subtract the mean face, a large number of the colour intensities will end up being negative, rather than in [0,1]. R refuses to plot anything when this happens, so I’ve simply told it to count negative values as being zero — pitch black. Later on, I’ll also count values greater than one as being one — pitch white. Jeremy Kun notices the same tendency to darkness, but doesn’t really explain why: I suspect Mathematica, the software Kun uses in his post, does this “thresholding” automatically.

As an alternative to thresholding, I’ll often be linearly “rescaling” images, so that the smallest intensity goes to zero, and the largest goes to one. For the “difference faces”, since subtracting the mean face is also a linear transformation, rescaling means we get an image that’s pretty similar to the original.

The difference faces with rescaling.

So far, so good. Now for the eigenvectors, or “eigenfaces”. Remember, these represent the directions in which we expect variation to be independent, with highest-variance directions first. The photographs are $66$ by $58$ pixels in three colours, so these vectors have $66\times58\times3=11484$ elements each, and are each normalised so that the sum of the squares of their intensities is equal to one, so it shouldn’t be too surprising that their intensities are all close to zero. This makes thresholding useless, so we just show the rescaled version.

The eigenfaces, rescaled. The rescaling was done ignoring the last face, as this is essentially random noise, with a far larger intensity range, but no informative value.

These are certainly less sinister than those in Kun’s article. I assume this is the consequence of looking at such a homogenous set of people, and Türk doing a good job of positioning everyone in the same place in the frame. Such a small and cooperative data set makes this blogger a happy little theorist.

We wouldn’t really expect these eigenfaces to exactly align with introducing or removing different face features, but looking at them shows some features a lot more obviously than others. For example: the first eigenface appears to be determining asymmetry in lighting on the two sides of the face; the second one partially accounts for glasses, and how far down the hair comes over the forehead; eigenface nine accounts for asymmetry in how far the hair comes down on each side of the forehead, and also looks rather similar to the original face nine, since that is the only face with such a drastic asymmetry; and so on.

So, what can we do with these eigenfaces? Well, we can easily decompose a face into the mean face, plus a linear sum of the eigenfaces — twenty-five of them, here. If we want to reduce the size of the data, we can start throwing out some of the eigenfaces. In particular, since we’ve calculated how much variability there is along each eigenface, we can throw out the least variable ones first. This way, we minimise how much of the variance between the faces is thrown out, and so keep the faces as distinct as possible.

To illustrate this, we can introduce the eigenfaces for a face one at a time, to observe the effect on the image:

Progression for face one.

Progression for the more distinctive face nine.

Some faces will become clear faster than others, but it seems like both faces become close to the originals by, say, eighteen components. Indeed, if we only use eighteen components for each face, we get the following:

Class Average I, reconstructed using only eighteen of twenty-five components.

Faces like number ten are a bit vague, but that’s mostly pretty good.

So, what else can we do? Well, since we know how faces vary along each eigenface, we can try generating sets of new faces. The results can be, well, rather mixed. Sometimes the results look OK, sometimes they don’t look convincing at all.

This set doesn’t look too convincing to me.

This one looks better.

The one on the left’s been in the wars. The one on the right looks shocked to be here.

This is probability mostly due to my sampling the value of the eigenface components as independent normal distributions, which makes no sense in the context of the problem.

That’s about it for now. There are a few diagnostic plots I can show off once I find them again, allowing you to do things like assigning a quantity to how distinctive each face is from the rest of the set (nine and twenty-one stand out the most), and a more quantitive assessment of how many eigenfaces to throw out while keeping the faces distinguishable.

Principal Components Analysis, Eigenfaces, and Karhunen-Loève Expansions: Part One

Last time, we were talking about interpolation and orthogonal polynomials. For this series, we’re also going to end up talking about orthogonal vectors and orthogonal functions. But first, we’ll look at some old photos from the Seventies.

Principal Components Analysis
The Budapest National Gallery has its floors arranged in chronological order, so the first floor is given over to an overwhelming sea of medieval paintings of Bible stories and prophets. Well-painted as they are, when the founder of a country is a King Saint, you know you’re in for the long haul on this floor. However, the visitor who hasn’t run out of patience before they reach the floors for the 20th Century will find a cute pair of photographs by Péter Türk.

One is a collation of shots of Türk’s class. For the second, he chopped each of the shots into a grid of squares. The top-left squares from each shot have then been placed together in one large square. The squares of the next grid place along have then been placed alongside, and so on. The result is something that, from a distance, looks like a new face.

Class Average I, Péter Türk, 1979.

Class Average II, Péter Türk, 1979.

Today, this would be done by computer — a mechanical Türk, so to speak — and the squares would be blended together instead of being placed in blocks, but what we have here is some idea of a “mean face”, the average of all the individual photos. It’s not exactly like any of the individual photos, but clearly shows a lot of their shared features. Especially the hair. If we let the computer take the mean over each pixel position, rather than placing them next to each other, we obtain something that looks surprisingly human.

The true mean face, or as close as we can get with low-quality online scans of the originals.

Class Average II rescaled, for comparison.

It’s worth mentioning — not for the last time — how good a job Türk did at keeping his photographs consistent, with regards to face positioning. This is a small group of people, yes, but if we take the eyes as an example, the eye positions are so similar that the mean face still has pupils!

This tinkering is well and good, but consider what we might want to do with a large collection of such photos.
1. Say we wanted to make a digital database with thousands of these photos. Maybe we want to set up some sort of photographic identification system, where we compare a new photo, of someone requesting access to something, against the photos we have on file. How would we decide if the new photo was similar enough to one on the database to grant them access?
2. Along similar lines, suppose we’re not interested specifically in the people in the photos, but we are interested in finding some rules of thumb for telling between different people in the same population. How would we do so?
3. Now suppose we’d also like to compress the photos somehow, to save on memory. How should we do so while keeping the photos as distinguishable as possible?

One method we can use for this is Principal Components Analysis, which I’ll be talking about over the next few posts. However, here’s a brief taste of what it allows us to do, statistically rather than by guesswork:
1. PCA gives us a way to take a new photo, and make some measure of its “distance” from our originals. We can then decide that it’s a photo of the person in the closest original, or that it’s a new person if all the “distances” are too large.
2. The most important features for distinguishing between people in the above set of photos are the side of their face the light source is on, and how far down their fringe comes.
3. We can compress the photos in such a way that we know how much of the original variance, or distinctiveness between the photos, that we keep. If we don’t mind compressing by different amounts for different photos, we could also keep a set amount of distinctiveness for each photo, rather than across the whole group.
4. We can try — with variable levels of success — to generate new faces from scratch.

Also worth noting is that, apart from 4., all of this can be done with only a covariance estimate: we make no assumptions about the distribution the photos are drawn from.

We’ll come back to these photos, and these applications, later in the series. Next time, we’ll look at something a bit simpler first.

It’s been about five months since I said this post was in draft, so it’s about time I reined in my perfectionism and published the damn thing.

Since this is a graduate student blog at the moment, it seems reasonable I should write a bit more about what I’m learning at any particular time. Late last year, our department had a talk by Erik van Doorn, from the University of Twente, which looked at birth-death processes, and asymptotic results for how quickly they can be expected to converge to their equilibrium distribution. It was an overview of his paper Representations for the decay parameter of a birth-death process based on the Courant-Fischer theorem, to appear in the Journal of Applied Probability.

A good deal of the talk introduced and used basic results on orthogonal polynomials, so I went to see if any of my books mentioned the subject. It turned out there was a chapter on them in Approximation Theory and Methods by Michael Powell – a book that’s been on my bookshelf for about five years but hardly been read – regarding their use in Gaussian quadratures. The following is mostly spliced together from that chapter, and my undergraduate notes on Numerical Analysis.

Interpolation

Before we talk about quadratures, it’s best if we start with interpolation. Say we have some function over some interval, where we can take a few sample values, with no measurement error, but we have no explicit formula and can’t afford to sample it everywhere. We thus would like to use our sample values to fit an approximating function to the whole interval. One simple way to do this is to try to fit a polynomial through the sample points. We can do this by assigning each sample point a Lagrange polynomial

$l_k(x) = \prod_{n \neq k} \frac{x-x_n} {x_k-x_n} ,$

with value one at that sample point and zero at all the others. For example, if we take our sample points at -1,-0.5,0,0.5, and 1, then the Lagrange polynomials are those in the plot below. There’s a light grey line at one to help check they are equal to one or zero at the sample points.

Our fitted curve will then just be a sum of these Lagrange polynomials, multiplied by their corresponding sample value, so we get a polynomial passing through all the sample points, and estimate the function $f(x)$ as

$\hat{f}(x) = \sum_k f(x_k) l_k(x) .$

This gives a curve that passes through all the interpolation points with the smallest-order polynomial possible. It works well for estimating functions that are, indeed, polynomials, but for other functions it can run into problems. In particular, there are cases where the difference between the interpolation curve and the true function at certain points increases when we increase the number of sample points, so we can’t necessarily improve the approximation by adding points. There’s also the question of where to sample the original function, if we have control over that. I’ll pass over these issues, and move on to integration.

Now say that, instead of approximating a function with some samples, we want to approximate a function’s integral by sampling its value at a few points, or

$\int_a^b f(x) \, \textrm{d} x \simeq \omega_0f(x_0) +\omega_1f(x_1) +\ldots +\omega_nf(x_n) .$

If we want to focus on making the integration accurate when $f$ is a low-order polynomial, the quadrature with $n+1$ sample points is exact for polynomials up to order $n$ if we set the weights as

$\omega_k = \int_a^b l_k(x) \, \textrm{d} x.$

In other words, a quadrature is equivalent to fitting an interpolation curve, and integrating over it. For example, if we’re integrating a function over the interval $[-1,1] ,$ we could simply take one sample, with weight one. This would give a quadrature of $2f(x_0) ,$ which is exact for any zero-order, constant function, regardless of the position of $x_0.$

We could take samples at the endpoints, to get the quadrature $\frac{1} {2} f(-1) +\frac{1} {2} f(1) ,$ and we can set $f(x)$ to be constant, or proportional to $x,$ to see the result for first-order polynomials is exact.

We could also take the endpoints and the midpoint. Then we have $\frac{1} {3} f(-1) +\frac{4} {3} f(0) +\frac{1} {3} f(1) ,$ which is exact for polynomials up to order two.

However, occasionally we stumble onto a quadrature that does a little better than expected. For the first quadrature above, since our interval is symmetric around zero, if we let $x_0=0$ any first-order term will be antisymmetric around this midpoint, so this quadrature is exact for first-order polynomials too. Similarly, the second quadrature is exact for quadratic terms, but the third quadrature can still only deal with quadratics, and can’t handle cubics.

Considering what happened when we placed the sample points for the first quadrature at zero, we might guess this is something to do with where we place our sample points. If so, how should we place our sample points, and what’s the highest-order function we can exactly integrate with any set number of samples? To answer this, we can use orthogonal polynomials.

Orthogonal polynomials

We say two vectors are orthogonal when their inner product is equal to zero. For example, if the inner product is simply the dot product, then

$\langle x,y \rangle = \sum_{k=1}^n x_ky_k =0,$

and so vectors are orthogonal if they are perpendicular to each other.

We have a similar definition and example for orthogonal polynomials, but now we choose an inner product that integrates over an interval instead of summing over two vectors’ dimensions:

$\langle f,g \rangle = \int_a^b f(x)g(x)\,\textrm{d} x =0.$

We can then choose a sequence of polynomials with increasing order that are all orthogonal to each other. For example, we can start the sequence with $f_0(x)=1,$ or some multiple of it. We then seek a first-order polynomial $f_1(x)=Ax+B$ such that

$\int_a^b 1\times(Ax+B) \,\textrm{d} x =\frac{A} {2} (b^2-a^2) +B(b-a) =0.$

This can be any multiple of $x-(b+a) /2 .$ In many cases we wish the orthogonal polynomials to have be orthonormal, i.e. $\langle f_k,f_k\rangle =1,$ so for the above we require

$\int_a^b C_0^2 \,\textrm{d}x = C_0^2 (b-a) = 1,$

\begin{aligned} \int_a^b C_1^2(x-(b+a)/2)^2 \,\textrm{d}x &= C_1^2 \frac{1}{3} \left[\left(b-\frac{b+a}{2}\right)^3 -\left(a-\frac{b+a}{2}\right)^3\right] \\&= C_1^2 \frac{2} {3} \left(\frac{b-a} {2}\right)^3 \\&=1,\end{aligned}

and so on, giving a condition for the value of each scaling factor $C_k.$ We can then find the next term by looking for a second-order polynomial that is orthogonal to $1$ and $x,$ and so on. In the case where $a=-1$ and $b=1$ this gives a simple sequence of polynomials that begins with

$f_0(x)=\frac{1}{\sqrt{2}} ,\, f_1(x)=\sqrt{\frac{3}{2}} x,$

$f_2(x)=\sqrt{\frac{5}{2}} (\frac{3}{2} x^2-\frac{1}{2}) ,\, f_3(x) =\sqrt{\frac{7}{2}} (\frac{5}{2}x^3-\frac{3}{2}x),\ldots$

This is an orthonormal version of the Legendre polynomials.

Since any polynomial can then be expressed as a linear combination of members of this sequence, each polynomial in the sequence is also orthogonal to any polynomial with lower order. So, for example, $f_3$ is orthogonal to all constant, linear, and quadratic polynomials.

To be continued
The next post will explain why these orthogonal polynomials help us decide on interpolation points.

Approximate Bayesian Computation: Variance-Bias Decomposition

Now I’ve rambled about how to measure error, let’s relate it back to ABC. I mentioned previously that using ABC with a non-zero tolerance $\delta$ means our samples are taken from the density $p(\theta \,|\, \|S-s^*\| \leq \delta)$, instead of the true posterior $p(\theta \,|\, S=s^*)$ for a sufficient statistic S.

Say we write our estimate as $\hat{\theta} =\frac{1}{n} \sum_{i=1}^n \phi_i$, where each $\phi_i$ is an accepted sample. If we measure error as mean square error, then we can decompose the error as we did in the case of sampling from the wrong distribution:

$\mathbb{E}(L(\theta,\hat{\theta} ) \,|\, x^*) =\underbrace{\mathrm{Var} (\theta\,|\,x^*)}_{\textrm{True uncertainty} } +\underbrace{\frac{1}{n} \mathrm{Var} (\phi \,|\, x^*) }_{\textrm{Monte Carlo error} } +\underbrace{\mathbb{E} ((\mathbb{E} (\phi) -\mathbb{E} (\theta) )^2 \,|\, x^*) }_{\textrm{Square sampling bias} } .$

This is now conditional on the observed data, but this only changes the equation in the obvious way. For a graphical example, say the true posterior, and the ABC posterior our samples come from, look like this:

The true posterior density is, of course, a density with a non-zero variance rather than a single point. This describes the true uncertainty, i.e. what our estimate’s mean square error would be if our estimate was the optimal value $\mathbb{E} (\theta \,|\, S=s^*)$.

Next, imagine we could somehow calculate the ABC posterior, and so get its expectation $\mathbb{E} (\theta \,|\, \|S-s^*\| \leq \delta)$. Since the two expectations – the peaks, in the case shown in the picture above – are likely to not overlap, this estimate would have a slight bias. This introduces a sampling bias.

Finally, take the full case where we average over n samples from the ABC posterior. This now introduces the Monte Carlo error, since sampling like this will introduce more error due to the randomness involved. Note that $\mathrm{Var} (\phi \,|\, x^*) =\mathrm{Var} (\theta \,|\, \|S-s^*\| \leq\delta)$ will probably be larger than $\mathrm{Var} (\theta \,|\, x^*) =\mathrm{Var} (\theta \,|\, S=s^*)$, since $\|S-s^*\| \leq \delta$ provides less information than $S=s^*$.

A Quick Look at the Bias
Since the true uncertainty is not affected by our choice of $\delta$, I’m going to ignore it. In the paper, we never mention it, defining the MSE to be $\mathbb{E} ((\hat{\theta} -\mathbb{E} (\theta \,|\, S=s^*) )^2 \,|\, x^*)$, the sum of the other two error terms above.

We then have variance and square-bias terms, that we can consider separately. The bias is easier, so let’s start with that. First, note that the bias doesn’t depend on the number of samples we take, so we only need to calculate the bias of a single sample $\phi$. After a bit of thought, and denoting the acceptance region as the ball $B_{\delta} (s^*)$ and the prior total density for $\theta$ and $S$ as $p(\cdot,\cdot)$, we can write the bias as

$\mathbb{E} (\phi \,|\, s^*) -\mathbb{E} (\theta \,|\, s^*) =\dfrac{\iint_{s\in B_{\delta} (s^*) } t \, p(t,s) \, \textrm{d}s \, \textrm{d}t} {\iint_{s\in B_{\delta} (s^*) } p(t,s) \, \textrm{d}s \, \textrm{d}t} -\dfrac{\int t \, p(t,s^*) \, \textrm{d}t} {\int p(t,s^*) \, \textrm{d}t} .$

Unless we look at specific cases for the form of $(t,s)$, this is about as far as we can get exactly. To get any further, we need to work in terms of asymptotic behaviour, which I’ll introduce next time.

Variance-Bias, or The Decomposition Trick for Quadratic Loss

Say we’ve decided to judge our estimator $\hat{\theta}$ for some parameter $\theta$ by determining the mean square error $\mathbb{E} \left((\theta-\hat{\theta} )^2 \right) ,$ i.e. we are using a quadratic loss function. The nice thing about using mean square error, or MSE, to determine optimality of an estimator is that it lends itself well to being split into different components.

Variance and Bias
For example, we can expand the MSE as

$\mathbb{E} \left(L(\theta,\hat{\theta} ) \right) =\mathbb{E} \left((\theta-\hat{\theta} )^2 \right) =\mathbb{E} \left((\theta-\mathbb{E} (\theta) +\mathbb{E} (\theta) -\hat{\theta} )^2 \right) .$

Why add more terms? Because it leads to a useful intuition about the nature of the loss. Say we now split the expression into two, each with two terms, i.e.

$\mathbb{E} \left(L(\theta,\hat{\theta} ) \right) =\mathbb{E} \left((\theta-\mathbb{E} (\theta) )^2 +2(\theta-\mathbb{E} (\theta) ) (\mathbb{E} (\theta) -\hat{\theta} ) +(\mathbb{E} (\theta) -\hat{\theta} )^2 \right) .$

Since $\theta$ is the only random variable in the expression, the interaction term in the middle is zero, so the MSE splits into

$\mathbb{E} \left(L(\theta,\hat{\theta} ) \right) =\mathbb{E} \left((\theta -\mathbb{E} (\theta) )^2 \right) +\big(\hat{\theta} -\mathbb{E} (\theta) \big)^2 =\mathrm{Var} (\theta) +\mathrm{bias} (\hat{\theta} )^2 .$

Our expected loss is thus a combination of the uncertainty of our knowledge of $\theta,$ which we cannot do anything about, and the square of the bias of our estimator. Our optimal estimator, the mean, is thus the estimator that makes the bias equal to zero.

The nice thing about having an unbiased estimator like this one is that it is correct on average, i.e. it doesn’t have a tendency to either over- or under-estimate.

Imagine you’re firing a gun at a target. Assume, for the moment, that your aim is perfect! However, you’re testing a new gun whose performance is unknown. If your shots are tightly packed, i.e. have a small spread, then the variance of the shots is small. If they’re sprayed all over the place, the variance is high. If the cluster of shots is off-centre, they’re biased. If they’re on-target, or at least clustered around it, the bias is small, or even zero.

Having a small bias seems like a good thing. In fact, it seems like such a good thing that people often try to get unbiased estimators. This can turn out to be a bad idea, if it increases the variance too much.

Say we are at the firing range again. Suppose you had two guns to test. One has a tight spread, but shots are off-centre. The other’s shots are centred, but they’re scattered all over the place. If we were interested only in being unbiased, the second gun would be deemed superior, but this goes completely against how most people would evaluate the guns’ performances. If we could look at how the gun did, and adjust it for next time, The bias in the first gun can be compensated for by adjusting the sights, but the second gun is barely usable. So, we still need to take account of both variance and bias.

Monte Carlo Error
However, we’re not done yet! Say we don’t know what the expectation of $\theta$ is. Then we need to decide on some other choice of estimate $\hat{\theta} .$ Let’s say, for example, that while we don’t know the expectation, we can draw samples from the whole distribution. How about if we generated a few samples, and took their average as our estimate? Well, this estimator is random, so the MSE is now an expectation over the estimate as well as $\theta$ itself.

However, we can still split the error as we did above. We can even still get rid of the interaction term, since the estimator and the parameter are independent. So, we get

$\mathbb{E} (L(\theta,\hat{\theta} ) ) =\mathrm{Var} (\theta) +\mathbb{E} ((\hat{\theta} -\mathbb{E} (\theta) )^2 ) .$

Now what? Well, the second term is the expected square difference between something random and something constant, as we originally had in the simple case before. So, let’s try splitting again! Inserting the expectation of the random variable worked well last time, so lets try that.

$\mathbb{E} (L(\theta,\hat{\theta} ) ) =\mathrm{Var} (\theta) +\mathbb{E} ((\hat{\theta} -\mathbb{E} (\hat{\theta} ) )^2 ) +(\mathbb{E} (\hat{\theta} ) -\mathbb{E} (\theta) )^2 .$

We get a variance term and a bias term again, fancy that. So, what is $\mathbb{E} (\hat{\theta} ) ?$ Well, it’s the expectation for an average of independent samples, so it’s equal to the expectation for one of them, which is just $\mathbb{E} (\theta) .$ The bias term disappears.

Similarly, the variance of an average is the variance of a sample, over the number of samples. So, if we write the estimator as $\hat{\theta} =\frac{1} {n} \sum_{i=1}^n \phi_n,$ the MSE is

$\mathbb{E} (L(\theta,\hat{\theta} ) ) =\mathrm{Var} (\theta) +\frac{1} {n} \mathrm{Var} (\phi) .$

So we get closer to the optimal MSE as we take more samples. Makes sense. There are also variations used to reduce the MC error, such as using non-independent samples, but I’ll leave off for now.

Sampling from the Wrong Distribution
We’re still not done. Say that the sampling estimator we used above is taking samples from the wrong distribution. How does this affect the error? Well, the variance of each sample might change, but, more importantly, the bias term probably won’t disappear:

$\mathbb{E} (L(\theta,\hat{\theta} ) ) =\mathrm{Var} (\theta) +\frac{1} {n} \mathrm{Var} (\phi) +(\mathbb{E} (\phi) -\mathbb{E} (\theta) )^2 .$

One thing to note from this is that if we sample from a distribution with the same expectation, but with lower variance, we get a smaller MSE. The logical extreme is taking a distribution with zero variance. Then every sample is equal to the expectation, and we are just left with the natural parameter uncertainty.

So, we now have three different sources of error. One is the inherent uncertainty of what we’re trying to estimate. Another is Monte Carlo error, introduced by averaging over samples instead of using the expectation directly. Finally, there is sampling bias, introduced by taking our samples from a distribution different to the one we want.

That’s about as far as we can go for this example, but this technique can also be used for other problems. Just try the same tactic of splitting the MSE into independent sources of error, by adding and subtracting a term in the middle. Then we can find what the different sources of error are, which we have control over, and so on.

The good news, though, is that the above is all we need to talk about the error introduced by using ABC, so I’ll get back to that next time.

Measuring Error: What Are Loss Functions?

Last time I finished on the question of how we measure the error of an estimate. Let’s say we trying to estimate a parameter, whose true value is $\theta$, and our estimate is $\hat{\theta}$. If there were to be a difference between the two, how much would we regret it? We’d like some way to quantify the graveness of the error in our estimate. Specifically, we’d like to create some loss function $L(\theta,\hat{\theta})$. We could then determine how good an estimator is by calculating the resulting loss: a better estimator would have less loss, so the smaller the value of $L(\theta,\hat{\theta} )$ the better.

Now, there are some situations where our choice of loss function is obvious. An example would be if we’re selling a certain good, and we’d like to know how many of them to order in. We are then estimating the number of orders we’ll get before the next opportunity to restock. The loss function is then either proportional to the number of unfulfilled orders, if we understock, and the cost of storing the surplus, if we overstock.

In the more abstract case, where we’re estimating a parameter we will never observe, the choice of loss function isn’t as obvious. We’re not exactly charged money for making an inaccurate model. Instead, I’m going to suggest some properties we might want for the loss function, and then give a few examples.

If our estimate is exactly correct, obviously we wouldn’t regret it at all. In other words,

$L(\theta,\theta) =0 .$

Next, we’ll make some statements about symmetry, i.e. that we only care about the distance between the estimator and the true value, and not about the direction.

Say the empty circle in the middle of this number line is the true value. I propose that one property we’d like for our loss function is that the loss of the estimators at the two filled circles is the same, and that the loss of the estimators at the two empty squares is the same.

This is not a required property, and may not be desirable, depending on the problem. For instance, in the goods restocking example I mentioned above, the penalty for underestimating is often not the same as overestimating. One loses business, one just requires paying for longer storage for the surplus. Still, for the purposes of estimating some abstract model parameter on an arbitrary scale, I’d say assuming symmetry of loss is a reasonable property to assume.

I’d also say we’d like to depend on the distance, but not on the values, so the loss is some function of $\theta-\hat{\theta}$. Think of the loss function like a generalised voltmeter: it can measure the difference between a pair of points, but a single point has no meaning.

How about if we make two different estimates, and one is further from the truth? We’d want to penalise it at least as much as the other. In other words, if we have two estimates $\hat{\theta}_1$ and $\hat{\theta}_2$, and the distance $|\theta-\hat{\theta}_1|$ of the true value from the first estimator is smaller than the distance $|\theta-\hat{\theta}_2|$ from the second estimator, we’d like

$L(\theta,\hat{\theta} _1) \leq L(\theta,\hat{\theta} _2) .$

Of course, in practice we don’t know what $\theta$ is, so we try to minimise our expected loss $\mathbb{E} (L(\theta,\hat{\theta} ) )$. Usually we’d be minimising this expected loss based on some observations, but I’m keeping that out of the notation here for simplicity. Just assume the distribution we have on the parameter uses all our usable knowledge.

These properties leave a lot of options. Here are some of the more common ones.

0-1 Loss
Here the loss is simply equal to 1 if the estimator is different from the truth, and 0 if it’s not. This is pretty hard-line as loss functions go, because it considers being wrong to be so heinous that it makes no differentiation between different amounts of wrongness. Our expected loss is then simply the probability $\mathbb{P} (\theta\neq\hat{\theta} )$ of being wrong. Our optimal choice of estimator is then simply the most likely value of $\theta$. In other words, the mode is the optimal estimator for 0-1 loss.

There is also a similar case where the loss is 0 in a small region around the truth, and 1 outside it. The optimal estimator is determined by finding the point with the most chance of the truth being nearby, i.e. the middle of a highest-density region.

Absolute Difference
Here we take the loss function

$L(\hat{\theta} ,\theta) =|\theta-\hat{\theta} | .$

The seriousness of an error is thus proportional to the size of the error. In this case, the optimal estimator is the median.

In the case of there being several parameters, the median is also the optimal estimator when the expected loss is the expected Manhattan distance from the truth, i.e. the sum of the absolute differences for each parameter.

This is the most common loss function. For true value $\theta$ and estimate $\hat{\theta}$, the loss is

$L(\hat{\theta} ,\theta) =(\theta-\hat{\theta} )^2 .$

Large errors are considered far more serious here than in the case of absolute difference. This may, or may not, be a good idea. More on that in a minute. The expected loss, also called the mean square error, can be expanded as

$\mathbb{E} (L(\hat{\theta} ,\theta) )=\mathbb{E} (\theta^2-2\hat{\theta} \theta+\hat{\theta}^2 ) =\mathbb{E} (\theta^2) -2\hat{\theta} \mathbb{E} (\theta) +\hat{\theta}^2 .$

We want to choose our estimator $\hat{\theta}$ to minimise this expected loss. This is easily achieved by $\hat{\theta} =\mathbb{E} (\theta) .$ In other words, the (arithmetic) mean is the optimal estimator for quadratic loss.

In the case of several parameters, the mean is also the optimal estimator when the expected loss is the expected Euclidean distance from the truth.

This is the loss function I’ll be using from hereon. A few more comments before I finish.

Note that we have the “Big Three” of averages as the optimal estimators for the loss functions given above. The mode isn’t used that much, but absolute and quadratic loss can be useful for intuition about the difference between the mean and the median. Specifically, the median is less influenced by outliers than the mean. That can be important, because you might not want the outliers to count for much, especially if they’re suspected to be due to some observational error. This answer on Cross Validated addresses a good example.

We should also consider what we’re doing by choosing a loss function.

The obvious issue is that we’re making point estimates of a parameter, rather than making distributions or making predictions about future observables. I’ve briefly mentioned this before.

The other issue is that choosing a loss function can be subjective, to put it mildly. I suspect the main reason that the quadratic loss is the most common loss function is simply because means are easier to calculate, and it has nice properties in general. The same thing goes for how we decide what the optimal estimator is. I was describing the optimality of loss functions in terms of minimising the expected loss, i.e. the mean loss. But if we think absolute error is the better loss function, why would we would to think in terms of mean loss in the first place, rather than median loss? There is theory out there that considers the error of point estimates in terms of medians, but I have no experience with it whatsoever. Perhaps another time, this post is long enough already.

For now I’ll follow the idea that the mean is good enough in general. It’s easy, everyone knows how to calculate it, and quadratic loss has nice properties. Next post will look one of them, the variance-bias decomposition. It will also look at what happens when we can’t directly use the mean as our estimator, as is the case in Monte Carlo methods like ABC.

Approximate Bayesian Computation: Summary Statistics and Tolerance

Last time I introduced a basic example of the Approximate Bayesian Computation methodology for estimating posterior expectations in the case of unavailable likelihoods. One of the main issues I mentioned was that it could take a long time to accept proposals. This is caused by three factors.

1. The algorithm is being overly picky about its criteria for accepting proposals. In our Lucky Dip example, we saw that proposals were accepted when the generated play record for the players exactly matched the observed record. All we require is that the frequency of each possible outcome is the same between the two records. In other words, the algorithm is worried about the order in which the outcomes occurred when it doesn’t need to.

2. The outcome, even when stripping out the extraneous information mentioned in the point above, is so unlikely that generating a matching dataset will still take a long time. This is especially true if any element of the data is continuous, because the probability of generating an exactly matching dataset will be zero.

3. It takes a long time to generate datasets due to the complexity of the model. There is little we can do about this, outside of making sure our model is no more complicated than is necessary.

We thus need to consider stripping out superfluous information, and accepting proposals whose data is merely close to the observed data, to mitigate factors 1 and 2 respectively.

1. Summary Statistics

If we flip a coin several times to decide whether it’s fair, we obviously don’t care about the order in which the heads and tails appear. We just care about how often they each turn up. We could also just look at the proportion of flips that come up heads, without recording how many flips we observed. These are both summary statistics, that take the original data and express it in a different, usually smaller, form. Formally, we take the data in the form of the random variable $X$, and calculate the summary statistic $S=S(X)$, which is also a random variable.

Some summaries are better than others. A higher number of flips increases the accuracy of the result, but our second summary above doesn’t keep a record of this. In other words, only recording the proportion of flips that come up heads loses relevant information, so we should be using a summary such as the first one, which does not.

Ideal summary statistics, that don’t lose relevant information, are referred to as sufficient statistics. Formally, the posterior of the parameters given a sufficient statistic is equal to the posterior given the original data, for any possible values of the data and statistic:

$p(\theta | S=s^*=S(x^*) ) =p(\theta | X=x^*) \textrm{ for all } \theta,x^*: p(\theta,x^*) >0.$

The original data, of course, is itself a sufficient statistic. The best-case choice would be what is called a minimal sufficient statistic, which is a sufficient statistic with the smallest possible number of dimensions.

Minimal sufficient statistics are desirable, because, as the number of dimensions increases, the chance of hitting any region inside the space will generally decrease. This is known as the curse of dimensionality, and is why we often want to decrease the dimensions as much as possible.

Here’s a lazy analogy. Consider a square, that contains a circle touching the sides of the square. Now say we choose a random point inside the square, with each point equally likely. The probability of the point being inside the circle is equal to the proportion of the square’s area it takes up, which is $\pi/4$.

Now add another dimension, so that we are choosing a point inside a cube, and seeing if it’s inside a sphere. That sphere now takes up less of the available space: the probability of choosing a point inside it is $\pi/6$. The probability decreases as the number of dimensions increases: for $q$ dimensions the probability is $\frac{\pi^{q/2} } {2^q\Gamma(q/2+1) }$. Obviously, we’re seldom trying to hit a region so large compared to the entire space, but hopefully you get the idea that more dimensions means less chance of success.

This leads us to the following ABC algorithm.

ABC2
1. Decide on the acceptance number $n$.
2. Sample a proposal $\hat{\theta}$ from the prior $p(\theta|M)$.
3. Calculate the observed statistic $s^*=S(x^*)$.
4. Generate a dataset $\hat{x}$ from the likelihood $p(X|M,\hat{\theta} )$.
5. Calculate the statistic $\hat{s} =S(\hat{x} )$ for the generated dataset.
6. Accept $\hat{\theta}$ if $\hat{s} =s^*$, else reject.
7. Repeat steps 4-6 until $n$ proposals have been accepted.
8. Estimate the posterior expectation of $\theta$ as $\mathbb{E} (\theta|M,X=x^*) \simeq\frac{1} {n} \sum_{k=1}^n \hat{\theta}_k$, the mean of the accepted proposals.

Since the proposals are only accepted when the generated statistic is equal to the observed statistic, the accepted proposals are taken exactly from the distribution $p(\theta|S=s^*)$. This is equal to the posterior distribution $p(\theta|X=x^*)$ if the statistic is sufficient.

Let us consider the Lucky Dip example again. The observed player record is $(3,0,0,2,3)$, and we can show that counting the frequency of each outcome – in this case $(2,0,1,2)$ – is sufficient. In fact, we can show it’s minimal sufficient. More on this later. For a tank of $f$ fish, the ABC method for approximating the number $r$ of red fish is the following.

ABC2FISH
1. Decide on the acceptance number $n$.
2. Sample a proposal $\hat{r} \in \{0,1,2,\ldots,f\}$, with each possibility equally likely.
3. Generate a play record of five players, where each player starts with $\hat{r}$ red fish and $f-\hat{r}$ blue fish in the tank.
4. Count the number of losers, and the number of players that win on the first, second, and third draws, to calculate the generated statistic.
5. Accept $\hat{r}$ if the generated statistic matches the observed statistic $(2,0,1,2)$, else reject.
6. Repeat steps 2-5 until $n$ proposals have been accepted.
7. Estimate the posterior expectation of $r$ as $\mathbb{E} (r|M,X=(3,0,0,2,3)) \simeq\frac{1} {n} \sum_{k=1}^n \hat{r}_k$, the mean of the accepted proposals.

This is like the ABC1FISH algorithm we used before for the same problem, but uses a summary statistic with four dimensions instead of the five-dimensional data. That isn’t a great reduction, but consider that, if the play record was longer, the statistic wouldn’t increase in size. For large amounts of observed data, this statistic will thus effect a large reduction in dimensionality.

Again, I took a hundred ABC estimates for each value of $n$. The results are given below.

1. Stripping away irrelevant information should give similar results in less time. Comparing the above boxplot with the one for ABC1FISH will show the results to be roughly the same. Since we haven’t lost any relevant information, this is to be expected. However, where ABC1FISH took a day or two to run, ABC2FISH took a few hours.

2. It is simple to find a sufficient statistic for such a simple problem. In fact, if the likelihood is known we can also find the minimal sufficient statistic, since this is simply a matter of listing all the expressions in the likelihood in which the data elements appear.

For the example above, the likelihood of a single play, for a given parameter value, depends on the outcome. Thus, the total likelihood of all the plays is the product of the single plays, so will be equal to the product of likelihoods associated with each outcome, each to the power of the number of plays with that outcome:

$\mathbb{P}(x_0 \textrm{ losers, } x_1 \textrm{wins on first draw, etc.} | \theta) =\prod_{i=0}^3 \mathbb{P}(\textrm{Result } i | \theta)^{x_i} .$

Without the likelihood, this is less simple, and often impossible. There’s a growing amount of research on optimal choice of summary statistic, but the chosen statistic will usually not be sufficient, and the ABC estimate will converge to the wrong value as the accept number increases. However, the increase in acceptance probability is considered worth it, as having a higher acceptance rate allows more accepts – a larger $n$ – for the same running time, which will decrease the error up to a point.

3. If any relevant element of the data is continuous, this still isn’t good enough, because our chance of getting any value for the statistic is still zero. This requires other measures, such as the concept of tolerance introduced below. Given such measures, summary statistics are still useful for the same reasons, so they are worth explaining first.

2. Tolerance

If the data is continuous, the probability of getting an exact match between datasets is zero, so at some point you have to say “close enough”. One way to do this is to decide on some way to measure the distance between two datasets, and accept the proposal if this distance is less than a certain value.

For example, say we define the distance between two datasets as the Euclidean distance, i.e. calculate the difference between each pair of respective elements, and take the square root of the sum of their squares. Then we’d accept any simulated data that lies within a ball around the observed data, with a radius equal to the tolerance. In one dimension this would be a symmetric interval, in two a circle, in three a sphere, and so on. Perhaps this would be a better place to explain the curse of dimensionality in terms of balls in boxes, but no matter.

Introducing tolerance leads to the following algorithm.

ABC3
1. Decide on the acceptance number $n$, the distance metric $\|\cdot\|$ , and the tolerance $\delta$.
2. Sample a proposal $\hat{\theta}$ from the prior $p(\theta|M)$.
3. Calculate the observed statistic $s^*=S(x^*)$.
4. Generate a dataset $\hat{x}$ from the likelihood $p(X|M,\hat{\theta} )$.
5. Calculate the statistic $\hat{s} =S(\hat{x} )$ for the generated dataset.
6. Accept $\hat{\theta}$ if $\left\|\hat{s} -s^*\right\| \leq \delta$, else reject.
7. Repeat steps 4-6 until $n$ proposals have been accepted.
8. Estimate the posterior expectation of $\theta$ as $\mathbb{E} (\theta|M,X=x^*) \simeq\frac{1} {n} \sum_{k=1}^n \hat{\theta}_k$, the mean of the accepted proposals.

The accepted proposals are now taken from the distribution $p(\theta|\|S-s^*\|\leq\delta)$, and we hope this is close enough to the true posterior to not introduce too much error.

Again, we consider the Lucky Dip example again. We choose the Euclidean distance metric, and the ABC method for approximating the number $r$ of red fish is now the following.

ABC3FISH
1. Decide on the acceptance number $n$ and the tolerance $\delta$.
2. Sample a proposal $\hat{r} \in \{0,1,2,\ldots,f\}$, with each possibility equally likely.
3. Generate a play record $\hat{x}$ of five players, where each player starts with $\hat{r}$ red fish and $f-\hat{r}$ blue fish in the tank.
4. Count the number of losers, and the number of players that win on the first, second, and third draws, to calculate the generated statistic $\hat{s}$.
5. Accept $\hat{r}$ if $\|\hat{s} -(2,0,1,2) \|\leq\delta$, else reject.
6. Repeat steps 2-5 until $n$ proposals have been accepted.
7. Estimate the posterior expectation of $r$ as $\mathbb{E} (r|M,X=(3,0,0,2,3)) \simeq\frac{1} {n} \sum_{k=1}^n \hat{r}_k$, the mean of the accepted proposals.

Here are some results, with the tolerance set to one. This means a proposal is accepted if the simulated data is equal to the observed data, or one element is off by one. The acceptance region is technically a ball, but since the data is discrete on a uniform grid, a tolerance of one results in acceptable datasets being in the shape of a cross.

Here’s another with the tolerance set to 12, which I’ll call ABC3aFISH.

1. Since the data in the example is discrete, and the space of possible statistics is small, using a non-zero tolerance is probably not justified. Still, it’s good enough to illustrate the idea.

2. ABC3FISH took an hour and three quarters, not much less than the zero-tolerance ABC2FISH algorithm. ABC3aFISH, on the other hand, took a minute and a half.

3. While ABC3FISH converges roughly to the same answer, ABC3aFISH converges to the prior estimate of $50\%$.This is because the tolerance is so high that it always accepts. Since this makes the condition $\|S-s^*\|\leq\delta$ tautological, ABC3aFISH is effectively sampling from the prior. Indeed, the leftmost box, shows individual proposals to be roughly evenly distributed across the entire range, as expected from our flat prior.

On the other hand, ABC3FISH and ABC2FISH are slower, but their sampling is closer to the posterior. Setting the tolerance is thus a balance between the number and the accuracy of accepted proposals, or between the known prior and the unknown posterior.

3. Choices of Tolerance

Now we can reduce the computation time if needs be, let’s think about how to keep our estimates accurate while doing so. Specifically, what value should we choose for the tolerance?

I mentioned at the beginning of the post that a higher acceptance rate made up for non-sufficient statistics, up to a certain point. The same is true for non-zero tolerances. As the number of proposals increases, the estimate will tend towards the incorrect posterior expectation. Decreasing the tolerance would result in the estimate tending towards a more-correct expectation, but the lower rate of acceptance would mean the convergence would be slower.

Say we draw the expected error against number of proposals – computation time – for both choices of tolerances. The curve for each would decrease as computation time increases, but flatten out at a certain level. The lower tolerance’s curve would fall more slowly at first, but eventually overtake the other tolerance’s curve as the latter flattens out, eventually flattening out at a lower level.

The lower tolerance will thus give a better estimate once the number of proposals is high enough. How large is this critical number? Who knows? Outside of simple examples – where the real answer is known, and so is the error of estimates – this is infeasible to find. Still, we can at least say that the optimal choice of tolerance decreases as the number of proposals increases.

What if we want something more specific? How about how quickly the optimal tolerance drops, or how quickly the error drops? It depends on how the error is defined, which may vary between individual problems. I’ll introduce one definition next time. The introductory overview’s done, unless I write about more complex variants of ABC some time. Things become more specific, and mathematical, from here.

Introduction to Approximate Bayesian Computation

A paper called The Rate of Convergence of Approximate Bayesian Computation, of which I’m a co-author, went online recently. As such, I’m going to break from beginner-friendly posts for a while, and outline the subject that the paper’s about. My goal is to work up to talking about the results proved in the paper.

We’re going to assume we want to make inference about some process, using some observations $x^*$. An example of observations would be our observing the outcome for previous players in a game of Pólya’s Lucky Dip. To understand the process, we have at least one model $M$ of how the process works, which takes some model parameters $\theta$ and generates some observations $X$.

We obviously need to work out whether the model is correct, but we’d also like to work out what the model parameters are if it is correct. This would then allow us to have a more complete model of how the process works. More importantly, knowledge of the parameters would allow us to make predictions about data observed in the future. We can thus forecast future results, and also use results we observe later to decide whether the model is adequately describing reality. Compare this to just predicting the values of the model parameters, which aren’t observable, and usually don’t exist in reality.

Let’s discuss the inference of the model parameters first. Under the Bayesian framework, we say that our belief in the values of the parameters, before making observations, has the prior probability density $p(\theta|M)$. We then want to update our belief after making observations, giving the posterior density $p(\theta|M,X=x^*)$. The usual way to find this is to use Bayes’s Theorem:

$p(\theta|M,X=x^*) =p(\theta|M) \frac{\mathbb{P} (X=x^*|M,\theta) } {\mathbb{P} (X=x^*|M) } .$

One possible problem is that the scaling factor $\mathbb{P} (X=x^*|M)$ can’t be calculated exactly. There are various ways to deal with this, such as doing numerical integration on $\int p(\theta,X=x^*|M)$. I won’t discuss these any further. Another, more complicated problem is when we cannot calculate the likelihood $\mathbb{P} (X=x^*|M,\theta)$ – a measure of how likely the observed data was to occur in comparison to different possible data – or when doing so is computationally infeasible.

This occurs rather often, because, if we wish to make inferences about a process that isn’t easily simplified, the way in which data is generated by the model can be exceedingly complicated. Typical problems we’d be looking at in this context are: trying to work out when different species of animal diverged from each other by examining fossilised DNA samples; trying to work out how modern humans spread out of Africa; analysing how the SARS virus spread in Hong Kong. Messy problems, complicated problems, problems that can’t be reduced to a model with simple likelihoods without losing important properties of how the process works.

So, what can we do instead? We can try some sort of Monte Carlo method. Approximate Bayesian Computation, or ABC, is one of the more naïve methods, because it doesn’t assume anything outside of what I mentioned above: the model $M$, the prior $p(\theta|M)$, and a method to generate data from model parameters, which is described by the incalculable likelihood $p(X|M,\theta)$ for any data value $X$.

There are about as many variants of ABC as there are people using it, if not more. I’ll give a few basic versions as we go. First, I’ll outline the basic philosophy: we generate samples of the model parameters $\theta$ from the prior distribution, then put these into the model to generate a sample dataset. We then compare this sample dataset to the observed dataset to decide how to use the proposed parameter values. We do this for a bunch of generated parameter proposals. Informed by the data comparisons, we then use these proposals to make an estimate of whatever we want to know about the model parameters.

There’s a lot of leeway in that description, so here’s a very basic example:

ABC1
1. Decide on the acceptance number $n$.
2. Sample a proposal $\hat{\theta}$ from the prior $p(\theta|M)$.
3. Generate a dataset $\hat{X}$ from the likelihood $p(X|M,\hat{\theta} )$.
4. Accept $\hat{\theta}$ if the $\hat{X} =x^*$, else reject.
5. Repeat steps $2-4$ until $n$ proposals have been accepted.
6. Estimate the posterior expectation of $\theta$ as $\mathbb{E} (\theta|M,X=x^*) \simeq\frac{1} {n} \sum_{k=1}^n \hat{\theta}_k$, the mean of the accepted proposals.

Since the proposals are only accepted when the generated data is equal to the observed data, the accepted proposals are taken exactly from the posterior distribution $p(\theta|X=x^*) .$

For example, say we’re again trying to get red fish out of a tank of red and blue fish. There are $f$ fish, and our prior on the number of red fish in the tank is the flat prior $p(r)=1/(n+1)$ for $r\in\{0,1,2,\ldots,n\} .$ Each player has three attempts to draw out a red fish, before he replaces his fish and the next player has a go. We observe several players, with a play record such as, say,

$(3,1,2,1,0,1,\ldots) ,$

where $0$ indicates a player losing, and a number in $\{1,2,3\}$ indicates a player winning on that draw. Our model $M$ will be the assumption that each remaining fish in the tank has an equal chance of being drawn each time.

Let’s say our play record is $(3,0,0,2,3) .$ Then our simple ABC method for approximating the number $r$ of red fish becomes the following.

ABC1FISH
1. Decide on the acceptance number $n$.
2. Sample a proposal $\hat{r} \in \{0,1,2,\ldots,f\} ,$ with each possibility equally likely.
3. Generate a play record of five players, where each player starts with $\hat{r}$ red fish and $f-\hat{r}$ blue fish in the tank.
4. Accept $\hat{r}$ if the generated play record matches the observed record $(3,0,0,2,3)$, else reject.
5. Repeat steps $2-4$ until $n$ proposals have been accepted.
6. Estimate the posterior expectation of $r$ as $\mathbb{E} (r|M,X=(3,0,0,2,3)) \simeq\frac{1} {n} \sum_{k=1}^n \hat{r}_k$, the mean of the accepted proposals.

I took a hundred such estimates for several values of $n$. The results are given as boxplots below. As expected, the boxplots appear to be sized proportionally to $\sqrt{n}$.

1. It can take a long time to accept any proposals if the play record is large, or is a record we would consider an extremely rare occurrence under the prior. For example, even for such a seemingly simple problem as this one, the above picture took a day or two to produce, so one estimate with $n=1000$ would have taken a few hours.

This is partly because the algorithm is being overly picky about whether to accept a proposal. It wants the record to be exactly the same, with the same order, when all we want is for the frequency of each player outcome to be the same. In this case, that means two losers, no winning first draws, one winning second draw, and two winning third draws. This can be mitigated by introducing summary statistics, which I’ll talk about next time.

2. Our estimate is in the form of a point estimate rather than an interval. That means it doesn’t give us any idea of how uncertain we are about the resulting estimate. This can be mitigated by using the accepted proposals to make an estimate of the posterior density rather than the expectation. Obviously we shouldn’t just assume there’s no uncertainty at all.

3. Both of the above can lead to making an estimate from a small number of accepted proposals. This can lead to some stupid reasoning if we assume there’s no uncertainty in the estimate. For example, say our record is $(1,1,1,1,1,1) ,$ i.e. we had six players, and they all happened to win immediately. We’d expect the resulting estimate to tend towards higher number of fish. However, if our only accepted proposal happens to be $\hat{r} =1,$ our resulting estimate would be guessing the number of red fish to be as small as practically possible. A result this heinous won’t happen too much, and the chance of it happening will go down as we accept more proposals, but with a point estimate we have no idea how likely events like this are for a particular acceptance number.

4. Since the amount of proposals generated to accept $n$ of them is random, the computation time of the ABC method above is random. This is often inconvenient for practical purposes, since we might be on time constraints for coming up with an estimate. A common alteration, is to set the number $N$ of generated proposals instead, in which case $n$ is random. I won’t talk about this much, partly to avoid discussing what happens when no proposals are accepted.

5. I’ve said nothing at all about making forecasts about future observables, which is unsatisfactory for the reasons I gave in the third paragraph. This doesn’t really get done with ABC, and it’s not clear whether there’s an efficient way to do so, since the usual way to make forecasts requires knowledge of the likelihood. You can’t use all the records generated, since they’re distributed based on the prior, and you can’t use the ones associated with the accepted proposals, since they’re all equal to our observed record. That would lead us to conclude that future play records are certain to be the same as the one we observed, which is absurd. My first guess would be to take each of the accepted proposals, and generate a set number of new records for each one, but this is infeasible for the more complex problems ABC is usually used for.

To sum up, ABC is simple to use, because it doesn’t use any assumptions outside of the prior and how to generate data from parameters under the model assumptions. However, we pay for the naïvety of this method by getting a point estimate with no sense of uncertainty, and, more importantly, with its taking a long time to calculate anything. The latter is particularly heinous – if the simple problem above would take a few hours to get a decently accurate estimate, imagine how long it takes to estimate how modern humans spread out of Africa. I’ll talk about ways of cutting the computation time in the next post.

Monte Carlo Example: Pólya’s Lucky Dip

Edit (2013/10/04): Under the first picture below, I mention a line which should be on all the pictures, but isn’t. This line should be at around 25 for all of them.

Probability has a few standard analogies. Let’s get to grips with one of them.

Let’s say we sit Ronald down in front of a bucket of red and blue magnetic toy fish, and the red fish have a prize written on them. He then catches one of the fish with a magnetic rod. It turns out to be blue, so he stores the fish by clipping it to his beard, and tries again. Assume the rather unlikely case where he knows that there are ten red fish and a hundred blue ones. What’s the chance of winning if he can catch up to three fish?

The standard mathematical way to solve this is to compare the picking-out of fish to one of the classic examples of picking balls out of an urn. But, frankly, it’s early in the morning, and I don’t want to deal with binomial coefficients before I’ve had a few drinks. What I could do with is convincing Ronald to sit around playing lucky dips all day, and see how often he wins. Since it’s a bit late to ask Ronald to fish, I’ll use a computer instead.

If we’d had Ronald to do this, I’d have to choose how many times he got to have a go. Since we’re using a computer, I’m just going to pick a few different numbers, and see what difference it makes. After a break for a drink, this is what I came back to.

The dots are the guesses, the line is what I know the true probability to be. However, if I run this again, the result can be very different.

Our guesses have a random element to them, so that shouldn’t be surprising. If I let Ronald play ten times, and then ten times again, the two sets of results needn’t have the same number of winners. What this means is that, if I make several guesses with the same number of plays, they’re going to be spread out. Since, in practice, we’re only going to make one guess, we’d like the spread to be pretty small. Hopefully, we can achieve this by increasing the number of plays.

I reworked the program to take a hundred guesses for each number of plays, and then use them to draw boxplots. After a bigger drink, I came back to this.

If you’re not used to boxplots, half of the estimates are inside the box, and the other half are inside the dotted lines. Dots are outliers that I’m just going to ignore here. The boxes get smaller as the number of entrants increases, and the box for 10000 plays is tiny. In other words, increasing the number of plays decreases how spread out the estimates will be.

What we can say is that, if you wanted your guess to be precise to within a percentage point or two, you’d need to simulate about ten thousand goes. Maybe it’s just as well we didn’t ask Ronald, I’m out of drinks to bribe him with for that long.

Does the box shrink towards the correct value? We don’t know, unless we work it out. In this case, my throat is now wet enough that I feel up to working it out on paper. In this case the true probability of winning is $\frac{82}{327}$, or about 25%, so it looks like the guesses tend towards the correct value as we increase the number of plays. It also looks like this would be a terrible lucky dip from the point of view of the person paying for the prizes, but never mind.

This is mainly because we know how to make direct guesses. By that, I mean the new data we’re generating is a direct statement about what we think the probability is. What we didn’t have to do, for example, was to be given two sets of Ronald’s win frequencies, and have to guess at whether he was fishing from the same bucket each time. We could generate more win frequencies, but those aren’t something we can directly use as a statement of whether or not the bucket is the same.

This requires more clever methods, and these more clever methods don’t necessarily tend to the correct answer if we run them for longer. That might be because the method we decide to use is very good. It might be because the data we can generate is so uninformative about the answer, that deriving one is going to introduce errors, regardless of what we do. But that deserves a separate post.

Why didn’t I just have a few drinks first, and go straight to getting the right answer? Well, I could have done, but sometimes you don’t have that option. More on that another time.

What are Monte Carlo Methods?

“Monte Carlo methods” is a technical term for a technique you’ve used since you were a small child.

Say you flip a coin. How often do you expect it to come up heads? Well, if we assume the coin won’t land on its edge, each face has the same chance of coming up, so half the time.

Say you roll a six-sided die. How often do you expect it to land on six? Same sort of argument. One in six.

Now imagine you’re a kid, and I again ask you how often you expect to roll a six. By kid, I mean someone who won’t be able to work out the exact answer given above.

How would you try to answer this? I suspect you’d roll the die a couple of times, see how large a proportion of the rolls come up sixes, and use that as a guess.

If so, congratulations! You discovered Monte Carlo methods even before your enthusiasm for maths was killed by algebra.

That’s really all Monte Carlo methods are: instead of working out the exact answer, you take a few random samples and use them to make a guess.

If it’s too much of an imaginative stretch to need to do this for such a simple problem, say I put a board game in front of you where you move your piece according to a roll of several dice. I then ask you how often you’ll land on a particular square by the end of the game.

To guess at an answer, you can then play the game a couple of times, and keep track of how many times you land on the square in each game.

Past a certain point in schooling, it’s easy to forget about making estimates like this. The questions usually posed to you, when being taught algebra and probability, are set up so that you can obtain the exact answer by hand. You can’t make a Monte Carlo estimate in an exam, using just pencil and paper. It’s not technically illegal to take dice into an exam, but rolling them will drive everyone else in the room crazy.

Once you look at real-life applications, however, you often find problems where you just can’t find the exact answer. The maths involved might be horrendous, or impossible. You simply mightn’t know everything you’d need to know to calculate the answer, because the needed information is impossible to observe.

So, what do you do? Well, you go back to rolling dice. At least, you get a computer to roll dice for you.

Usually Monte Carlo methods are a bit more advanced – in particular, we can often get some idea of how accurate we expect the estimate to be – but this constitutes the basics.

Hang on, though. We need to generate a lot of random samples for this to work, right? Won’t that take a while?

Yes, it will. Monte Carlo methods can take a long time to give an accurate. If your problem involves, say, predicting the weather, then the equivalent of rolling a die is now running a complete weather simulation. Depending on how sophisticated the model is, and how far ahead you’re forecasting, that can take hours, or days.

However, this is much less time than we’d expect to take finding the exact solution. The kid takes a while to roll his dice, and he would have spent less time finding the exact answer if he knew how to calculate it. Since he doesn’t, however, what we really need to compare against is how long he’d wait before he could work out how to do so. That’s probably years away. What if you need an answer now? Get rolling.

There are ways we can reduce the time these estimates take. Different methods make better use of the samples, so we can get away with using less samples, and still estimate with the same accuracy. Some are easier to implement. Some are specially designed to work very well for whatever particular problem the designer is working on. Some make better use of the way that modern computers are built – I’ll talk about this in another post. It’s a very general framework, so there’s a lot of flexibility.

Since there is a lot of flexibility, what you use depends on the problem. So, we’d better look at one. Next time, I’ll give an example of using these methods on a slightly more complex set of problems, commonly known as Pólya’s Urn. That’s a technical term mathematicians invented to justify playing at lucky dip boxes all day, on which we’ll be using a technique you’ve used since you were a child. Statistics is a very serious subject.