# Principal Components Analysis, Eigenfaces, and Karhunen-Loève Expansions

Principal Components Analysis

Instead of photographs, let’s suppose we’re just looking at vectors, i.e. points in space, randomly drawn from some unknown distribution.

Part of the crabs dataset in R’s MASS package.

In the image above, the points clearly vary most along some diagonal line. There are a few things we might want to do with this:

1. We might want to plot the data so that the axes represent key features. In this case, we’d like to rotate the plot, so that this diagonal line is along the horizontal axis.

2. We might want to plot the data so that a point’s position on one axis is uncorrelated with its position along the other ones. In this way, we could generate new points from the distribution without having to overly worry about covariance issues, because we can generate the position along each axis separately.

Happily, PCA does both of these.

The new perpendicular axes. Note one appears to go along some sort of central line in the data.

The data plotted with respect to the new axes. The data’s variance clearly increases from left to right, but the covariance is zero.
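Both points are easy to check numerically. Here’s a quick sketch in Python with NumPy rather than R; the correlated points are synthetic stand-ins for the crabs measurements, not the actual data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the crabs measurements: correlated 2-D points.
pts = rng.multivariate_normal([0.0, 0.0], [[4.0, 1.9], [1.9, 1.0]], size=500)

# The eigenvectors of the covariance matrix give the new perpendicular axes,
# ordered here so the highest-variance direction comes first.
cov = np.cov(pts.T)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Rotating the centred data onto these axes puts the diagonal trend along
# the horizontal axis, and makes the two coordinates uncorrelated.
rotated = (pts - pts.mean(axis=0)) @ eigvecs
print(np.cov(rotated.T))
```

The off-diagonal entries of the rotated covariance come out numerically zero, which is exactly the second point above.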

I might get round to looking at the mathematics behind how PCA does this in a later post. For now, let’s see how PCA works for our photographs.

Eigenface Decomposition

The first step is to find the mean face, which we did last time. If we now subtract the red, blue and green intensities for pixels of this mean face from those of the originals, we get this shifty group.

The “difference faces”, with thresholding.

These faces look very dark. The reason for this is fairly simple: when we subtract the mean face, a large number of the colour intensities will end up being negative, rather than in [0,1]. R refuses to plot anything when this happens, so I’ve simply told it to count negative values as being zero — pitch black. Later on, I’ll also count values greater than one as being one — pitch white. Jeremy Kun notices the same tendency to darkness, but doesn’t really explain why: I suspect Mathematica, the software Kun uses in his post, does this “thresholding” automatically.

As an alternative to thresholding, I’ll often be linearly “rescaling” images, so that the smallest intensity goes to zero, and the largest goes to one. For the “difference faces”, since subtracting the mean face is also a linear transformation, rescaling means we get an image that’s pretty similar to the original.

The difference faces with rescaling.
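For concreteness, here’s a minimal sketch of the two fixes in Python with NumPy, assuming the intensities are stored in an array; the toy values are my own:

```python
import numpy as np

def threshold(img):
    # Clamp out-of-range intensities: negatives to pitch black,
    # values above one to pitch white.
    return np.clip(img, 0.0, 1.0)

def rescale(img):
    # Linearly map the smallest intensity to zero and the largest to one.
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo)

# A toy "difference face": subtracting a mean pushes values negative.
diff = np.array([-0.4, 0.0, 0.2, 0.6])
print(threshold(diff))  # [0.  0.  0.2 0.6]
print(rescale(diff))    # [0.  0.4 0.6 1. ]
```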

So far, so good. Now for the eigenvectors, or “eigenfaces”. Remember, these represent the directions in which we expect variation to be independent, with highest-variance directions first. The photographs are $66$ by $58$ pixels in three colours, so these vectors have $66\times58\times3=11484$ elements each, and are each normalised so that the sum of the squares of their intensities is equal to one, so it shouldn’t be too surprising that their intensities are all close to zero. This makes thresholding useless, so we just show the rescaled version.

The eigenfaces, rescaled. The rescaling was done ignoring the last face, as this is essentially random noise, with a far larger intensity range, but no informative value.

These are certainly less sinister than those in Kun’s article. I assume this is the consequence of looking at such a homogeneous set of people, and Türk doing a good job of positioning everyone in the same place in the frame. Such a small and cooperative data set makes this blogger a happy little theorist.

We wouldn’t really expect these eigenfaces to exactly align with introducing or removing different face features, but looking at them shows some features a lot more obviously than others. For example: the first eigenface appears to be determining asymmetry in lighting on the two sides of the face; the second one partially accounts for glasses, and how far down the hair comes over the forehead; eigenface nine accounts for asymmetry in how far the hair comes down on each side of the forehead, and also looks rather similar to the original face nine, since that is the only face with such a drastic asymmetry; and so on.

So, what can we do with these eigenfaces? Well, we can easily decompose a face into the mean face, plus a linear sum of the eigenfaces — twenty-five of them, here. If we want to reduce the size of the data, we can start throwing out some of the eigenfaces. In particular, since we’ve calculated how much variability there is along each eigenface, we can throw out the least variable ones first. This way, we minimise how much of the variance between the faces is thrown out, and so keep the faces as distinct as possible.
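As a sketch of the decomposition itself, in Python with NumPy, and with random vectors standing in for the actual photographs: the eigenfaces fall out of a singular value decomposition of the centred data, and reconstruction keeps only the first $k$ components. The array shapes and names here are my own, not anything from the original analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in for the photos: 25 images flattened to vectors
# (the real ones have 66 * 58 * 3 = 11484 elements each).
faces = rng.random((25, 60))
mean_face = faces.mean(axis=0)
diffs = faces - mean_face

# The eigenfaces are the right singular vectors of the centred data,
# ordered by the variance they account for, each with unit norm.
_, svals, eigenfaces = np.linalg.svd(diffs, full_matrices=False)

def reconstruct(face, k):
    # Mean face plus a weighted sum of the k most variable eigenfaces.
    weights = eigenfaces[:k] @ (face - mean_face)
    return mean_face + weights @ eigenfaces[:k]

# Keeping every component reproduces the face exactly; dropping some
# loses a little of the variance between the faces.
err_all = np.linalg.norm(reconstruct(faces[0], 25) - faces[0])
err_few = np.linalg.norm(reconstruct(faces[0], 18) - faces[0])
print(err_all, err_few)
```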

To illustrate this, we can introduce the eigenfaces for a face one at a time, to observe the effect on the image:

Progression for face one.

Progression for the more distinctive face nine.

Some faces will become clear faster than others, but it seems like both faces become close to the originals by, say, eighteen components. Indeed, if we only use eighteen components for each face, we get the following:

Class Average I, reconstructed using only eighteen of twenty-five components.

Faces like number ten are a bit vague, but that’s mostly pretty good.

So, what else can we do? Well, since we know how faces vary along each eigenface, we can try generating sets of new faces. The results can be, well, rather mixed. Sometimes the results look OK, sometimes they don’t look convincing at all.

This set doesn’t look too convincing to me.

This one looks better.

The one on the left’s been in the wars. The one on the right looks shocked to be here.

This is probably mostly due to my sampling the values of the eigenface components from independent normal distributions, which makes no sense in the context of the problem.
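The generation step can be sketched as follows, again in Python with NumPy and with random stand-in data. Note the sketch bakes in exactly the independence assumption just mentioned:

```python
import numpy as np

rng = np.random.default_rng(2)

# Random stand-in data: rows are flattened images.
faces = rng.random((25, 60))
mean_face = faces.mean(axis=0)
diffs = faces - mean_face
_, svals, eigenfaces = np.linalg.svd(diffs, full_matrices=False)

# The variance along each eigenface comes from the singular values.
variances = svals**2 / (len(faces) - 1)

# Sample each component independently from a zero-mean normal with the
# matching variance, then add the mean face back on.
weights = rng.normal(0.0, np.sqrt(variances))
new_face = mean_face + weights @ eigenfaces
print(new_face.shape)
```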

That’s about it for now. There are a few diagnostic plots I can show off once I find them again, allowing you to do things like assigning a quantity to how distinctive each face is from the rest of the set (nine and twenty-one stand out the most), and a more quantitative assessment of how many eigenfaces to throw out while keeping the faces distinguishable.

# Let’s try this blog thing again

I haven’t posted anything for quite a while — the thesis wasn’t going to write itself — so I’m going to make another stab at this. In the meantime, some short observations:

1. Looking back at old posts, I probably stopped writing because I was worrying too much about writing rigorously enough. That’s not at all what I do in person, so I’ll try to loosen the writing up a bit. I might then finish the next PCA post some time in the next two years.

2. There’s a decent backlog of mathematical ideas I can write about. I went on an input-output economics and central place theory kick for a little while. The short postdoc I’ve been doing has introduced me to some aspects of applied statistics, such as sensitivity analysis, and how irritating MCMC is to tune. The thesis is done and will soon be freely available online, so I can try to write about that a bit too.

# A Short Note on Tesla and the Electrocution of Animals

Those with a high tolerance for the internet’s habit of deifying great scientific minds — especially when done in obnoxiously long infographics — might remember the Oatmeal making noises a few years ago about the endless virtues of Nikola Tesla. This contained the following note:

I could write a novel on the differences between Tesla and Edison, but seeing as how this comic is already huge I decided to leave many things out. For instance, Edison killed cats and dogs [to show AC was unsafe], but Tesla loved animals and had a cat as a child.

So imagine this blogger’s surprise when, following references from the Wikipedia article on ball lightning, he came across several online copies ([1][2]) of an early edition of Pearson’s magazine, where the journalist writes about his tour around Tesla’s factory. Tesla was indeed a smart man, and the tour begins with him doing a series of impressive tricks, such as conjuring up a ball of flame out of nowhere, rolling it along his body with no side effects, and then putting it inside a box. However, another part of the show is the following:

Some animal is now brought out from a cage, it is tied to a platform, an electric current is applied to its body and in a second the animal is dead. The tall young man calls your attention to the fact that the indicator registers only one thousand volts, and the dead animal being removed, he jumps upon the platform himself, and his assistants apply the same current to the dismay of the spectators.

One cannot be sure what definition of “animal lover” the author of the Oatmeal goes by. But it must be a warped one if it doesn’t include Edison for killing animals to make a point about safety, but does include Tesla for killing animals to show off to journalists. Perhaps it is excusable if you are a crazy cat lord.

# Principal Components Analysis, Eigenfaces, and Karhunen-Loève Expansions: Part One

Last time, we were talking about interpolation and orthogonal polynomials. For this series, we’re also going to end up talking about orthogonal vectors and orthogonal functions. But first, we’ll look at some old photos from the Seventies.

Principal Components Analysis
The Budapest National Gallery has its floors arranged in chronological order, so the first floor is given over to an overwhelming sea of medieval paintings of Bible stories and prophets. Well-painted as they are, when the founder of a country is a King Saint, you know you’re in for the long haul on this floor. However, the visitor who hasn’t run out of patience before they reach the floors for the 20th Century will find a cute pair of photographs by Péter Türk.

One is a collation of shots of Türk’s class. For the second, he chopped each of the shots into a grid of squares. The top-left squares from each shot were then placed together in one large square, the squares from the next grid position along were placed alongside, and so on. The result is something that, from a distance, looks like a new face.

Class Average I, Péter Türk, 1979.

Class Average II, Péter Türk, 1979.

Today, this would be done by computer — a mechanical Türk, so to speak — and the squares would be blended together instead of being placed in blocks, but what we have here is some idea of a “mean face”, the average of all the individual photos. It’s not exactly like any of the individual photos, but clearly shows a lot of their shared features. Especially the hair. If we let the computer take the mean over each pixel position, rather than placing them next to each other, we obtain something that looks surprisingly human.

The true mean face, or as close as we can get with low-quality online scans of the originals.

Class Average II rescaled, for comparison.

It’s worth mentioning — not for the last time — how good a job Türk did at keeping his photographs consistent, with regards to face positioning. This is a small group of people, yes, but if we take the eyes as an example, the eye positions are so similar that the mean face still has pupils!

This tinkering is well and good, but consider what we might want to do with a large collection of such photos.
1. Say we wanted to make a digital database with thousands of these photos. Maybe we want to set up some sort of photographic identification system, where we compare a new photo, of someone requesting access to something, against the photos we have on file. How would we decide if the new photo was similar enough to one on the database to grant them access?
2. Along similar lines, suppose we’re not interested specifically in the people in the photos, but we are interested in finding some rules of thumb for telling between different people in the same population. How would we do so?
3. Now suppose we’d also like to compress the photos somehow, to save on memory. How should we do so while keeping the photos as distinguishable as possible?

One method we can use for this is Principal Components Analysis, which I’ll be talking about over the next few posts. However, here’s a brief taste of what it allows us to do, statistically rather than by guesswork:
1. PCA gives us a way to take a new photo, and make some measure of its “distance” from our originals. We can then decide that it’s a photo of the person in the closest original, or that it’s a new person if all the “distances” are too large.
2. The most important features for distinguishing between people in the above set of photos are the side of their face the light source is on, and how far down their fringe comes.
3. We can compress the photos in such a way that we know how much of the original variance, or distinctiveness between the photos, that we keep. If we don’t mind compressing by different amounts for different photos, we could also keep a set amount of distinctiveness for each photo, rather than across the whole group.
4. We can try — with variable levels of success — to generate new faces from scratch.

Also worth noting is that, apart from 4., all of this can be done with only a covariance estimate: we make no assumptions about the distribution the photos are drawn from.

We’ll come back to these photos, and these applications, later in the series. Next time, we’ll look at something a bit simpler first.

Last time I talked about quadratures and orthogonal polynomials, so now it’s time to combine the two.

Theorem: A quadrature scheme with $n$ interpolation points will be exact for polynomials of order up to $2n-1$ if the quadrature points are chosen to be the roots of the associated $n$th-order orthogonal polynomial.

Note that, for an arbitrary choice of sample points, we would generally expect $n$ points to give a quadrature for polynomials of order $n-1,$ so $2n-1$ is a substantial improvement. For example, a single interpolation point can exactly integrate over polynomials of order one rather than just zero — as we had in our original example — and two interpolation points can integrate over polynomials of order three rather than one, even better than our original example. Three points can integrate up to order five!

Proof: We can split any $(2n-1)$-order polynomial $P_{2n-1}(x)$ according to the orthogonal polynomial $f_n(x).$ In particular, we can use the decomposition

$P_{2n-1}(x) =P_{n-1}(x)f_n(x)+Q_{n-1}(x),$

where $P_{n-1}$ and $Q_{n-1}$ are polynomials of order $n-1.$ Since $f_n$ is predetermined, we must specify a $(2n-1)$-order polynomial, with $2n$ coefficients, by choosing the polynomials $P_{n-1}$ and $Q_{n-1},$ with $2n$ coefficients between them. We have as many free coefficients to choose as we need, so this decomposition is always possible.

When we integrate over this expression, we see that

$\int_a^b P_{2n-1}(x)\, \textrm{d}x =\int_a^b (P_{n-1}(x) f_n(x)+Q_{n-1}(x) ) \,\textrm{d}x =\int_a^b Q_{n-1}(x) \,\textrm{d}x ,$

since $f_n(x)$ is orthogonal to any $P_{n-1},$ regardless of our choice of integration points. We therefore have the equality

$\int_a^b P_{2n-1}(x)\, \textrm{d}x =\int_a^b Q_{n-1}(x) \,\textrm{d}x = \sum_{k=1}^n \omega_k Q_{n-1}(x_k) ,$

so we must show that $\sum_{k=1}^n \omega_k P_{2n-1}(x_k)$ is equal to any one of these expressions. But, if we take $x_k$ to be the roots of $f_n(x) ,$ then equality to the third expression follows immediately, since the first term in the decomposition of $P_{2n-1}(x_k)$ disappears for every $x_k,$ and so $P_{2n-1}(x_k) = Q_{n-1}(x_k). \, \square$

We can, therefore, choose our interpolation points to be the roots of an orthogonal polynomial to optimise our quadrature method, with respect to the order of polynomials whose integrals are computed exactly. For our case over the interval $[-1,1],$ this gives the first optimal sets of interpolation points as

$\left\{0\right\}, \, \left\{-\frac{1}{\sqrt{3}} ,\frac{1}{\sqrt{3}}\right\}, \, \left\{-\sqrt{\frac{3}{5}} ,0,\sqrt{\frac{3}{5}} \right\}, \ldots$
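We can check the theorem numerically: NumPy’s `leggauss` returns exactly these nodes, along with the corresponding weights.

```python
import numpy as np

# Two-point Gauss-Legendre rule on [-1, 1]: nodes at -1/sqrt(3) and
# 1/sqrt(3), both with weight one.
nodes, weights = np.polynomial.legendre.leggauss(2)
print(nodes, weights)

# Exact for polynomials up to order 2n - 1 = 3; this cubic integrates
# to 4 over [-1, 1].
f = lambda x: 4 * x**3 + 3 * x**2 - 2 * x + 1
approx = weights @ f(nodes)
print(approx)
```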

As a final extension, suppose the function we are integrating is not polynomial, but it is a polynomial times some other function — $w(x),$ say — that is explicitly known. In this case, we wish to estimate

$\int_a^b w(x)f(x)\, \textrm{d}x.$

For this case, we can proceed as before, except that now we choose polynomials $f_i(x)$ that are orthogonal, in the sense that

$\langle f_i,f_j\rangle = \int_a^b w(x)f_i(x)f_j(x)\, \textrm{d}x = 0.$

In other words, the known function $w(x)$ becomes a weight function inside the integral, changing our choice of orthogonal polynomials, and of interpolation points. Common examples on the interval $[-1,1]$ are the Jacobi polynomials for $w(x) = (1-x)^{\alpha}(1+x)^{\beta},$ and the Chebyshev polynomials of the first kind for $w(x) = 1/\sqrt{1-x^2}.$
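The Chebyshev case has a particularly tidy quadrature rule: the nodes are the roots of the $n$th Chebyshev polynomial of the first kind, and every weight equals $\pi/n.$ A quick check in Python with NumPy:

```python
import numpy as np

# Gauss-Chebyshev quadrature for the weight w(x) = 1/sqrt(1 - x^2):
# the nodes are the roots of the nth Chebyshev polynomial of the first
# kind, and every weight equals pi/n.
n = 4
k = np.arange(1, n + 1)
nodes = np.cos((2 * k - 1) * np.pi / (2 * n))
weights = np.full(n, np.pi / n)

# The integral of x^2 / sqrt(1 - x^2) over [-1, 1] is pi/2.
approx = weights @ nodes**2
print(approx, np.pi / 2)
```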

That’s about all I have to say. Gaussian quadratures are a reasonably straightforward step-up from basic quadratures, but give a sizeable increase in efficiency if we’re dealing with some function of known type, with an unknown polynomial term.

It’s been about five months since I said this post was in draft, so it’s about time I reined in my perfectionism and published the damn thing.

Since this is a graduate student blog at the moment, it seems reasonable I should write a bit more about what I’m learning at any particular time. Late last year, our department had a talk by Erik van Doorn, from the University of Twente, which looked at birth-death processes, and asymptotic results for how quickly they can be expected to converge to their equilibrium distribution. It was an overview of his paper Representations for the decay parameter of a birth-death process based on the Courant-Fischer theorem, to appear in the Journal of Applied Probability.

A good deal of the talk introduced and used basic results on orthogonal polynomials, so I went to see if any of my books mentioned the subject. It turned out there was a chapter on them in Approximation Theory and Methods by Michael Powell – a book that’s been on my bookshelf for about five years but hardly been read – regarding their use in Gaussian quadratures. The following is mostly spliced together from that chapter, and my undergraduate notes on Numerical Analysis.

Interpolation

Before we talk about quadratures, it’s best if we start with interpolation. Say we have some function over some interval, where we can take a few sample values, with no measurement error, but we have no explicit formula and can’t afford to sample it everywhere. We thus would like to use our sample values to fit an approximating function to the whole interval. One simple way to do this is to try to fit a polynomial through the sample points. We can do this by assigning each sample point a Lagrange polynomial

$l_k(x) = \prod_{n \neq k} \frac{x-x_n} {x_k-x_n} ,$

with value one at that sample point and zero at all the others. For example, if we take our sample points at -1,-0.5,0,0.5, and 1, then the Lagrange polynomials are those in the plot below. There’s a light grey line at one to help check they are equal to one or zero at the sample points.

Our fitted curve will then just be a sum of these Lagrange polynomials, multiplied by their corresponding sample value, so we get a polynomial passing through all the sample points, and estimate the function $f(x)$ as

$\hat{f}(x) = \sum_k f(x_k) l_k(x) .$

This gives a curve that passes through all the interpolation points with the smallest-order polynomial possible. It works well for estimating functions that are, indeed, polynomials, but for other functions it can run into problems. In particular, there are cases where the difference between the interpolation curve and the true function at certain points increases when we increase the number of sample points, so we can’t necessarily improve the approximation by adding points. There’s also the question of where to sample the original function, if we have control over that. I’ll pass over these issues, and move on to integration.
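A minimal sketch of this interpolation scheme in Python with NumPy (the function names are my own):

```python
import numpy as np

def lagrange_basis(x, nodes, k):
    # l_k(x): equal to one at nodes[k] and zero at the other nodes.
    others = np.delete(nodes, k)
    return np.prod([(x - xn) / (nodes[k] - xn) for xn in others], axis=0)

def interpolate(f, nodes, x):
    # Sum the basis polynomials, weighted by the sampled values f(x_k).
    return sum(f(xk) * lagrange_basis(x, nodes, k)
               for k, xk in enumerate(nodes))

nodes = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
# Five sample points fit a quartic exactly, so the interpolant
# reproduces x^4 up to rounding.
print(interpolate(lambda x: x**4, nodes, 0.3))
```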

Now say that, instead of approximating a function with some samples, we want to approximate a function’s integral by sampling its value at a few points, or

$\int_a^b f(x) \, \textrm{d} x \simeq \omega_0f(x_0) +\omega_1f(x_1) +\ldots +\omega_nf(x_n) .$

If we want to focus on making the integration accurate when $f$ is a low-order polynomial, the quadrature with $n+1$ sample points is exact for polynomials up to order $n$ if we set the weights as

$\omega_k = \int_a^b l_k(x) \, \textrm{d} x.$

In other words, a quadrature is equivalent to fitting an interpolation curve, and integrating over it. For example, if we’re integrating a function over the interval $[-1,1] ,$ we could simply take one sample, with weight two. This would give a quadrature of $2f(x_0) ,$ which is exact for any zero-order, constant function, regardless of the position of $x_0.$

We could take samples at the endpoints, each with weight one, to get the quadrature $f(-1) +f(1) ,$ and we can set $f(x)$ to be constant, or proportional to $x,$ to see the result for first-order polynomials is exact.

We could also take the endpoints and the midpoint. Then we have $\frac{1} {3} f(-1) +\frac{4} {3} f(0) +\frac{1} {3} f(1) ,$ which is exact for polynomials up to order two.

However, occasionally we stumble onto a quadrature that does a little better than expected. For the first quadrature above, since our interval is symmetric around zero, if we let $x_0=0$ any first-order term will be antisymmetric around this midpoint, so this quadrature is exact for first-order polynomials too. By the same symmetry, the third quadrature is also exact for cubics. The symmetry does nothing for the second quadrature, though: odd powers of $x$ come for free, but it already fails on $x^2,$ so it remains exact only up to first order.

Considering what happened when we placed the sample points for the first quadrature at zero, we might guess this is something to do with where we place our sample points. If so, how should we place our sample points, and what’s the highest-order function we can exactly integrate with any set number of samples? To answer this, we can use orthogonal polynomials.
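The weights formula above is easy to check numerically. The following Python sketch builds each Lagrange polynomial in coefficient form, integrates it over $[-1,1],$ and recovers the three-point weights:

```python
import numpy as np
from numpy.polynomial import polynomial as P

def quad_weights(nodes, a=-1.0, b=1.0):
    # Each weight is the integral over [a, b] of the matching Lagrange
    # polynomial, built here from its coefficient representation.
    weights = []
    for k, xk in enumerate(nodes):
        coeffs = np.array([1.0])
        for xn in np.delete(nodes, k):
            coeffs = P.polymul(coeffs, np.array([-xn, 1.0]) / (xk - xn))
        integ = P.polyint(coeffs)
        weights.append(P.polyval(b, integ) - P.polyval(a, integ))
    return np.array(weights)

nodes = np.array([-1.0, 0.0, 1.0])
w = quad_weights(nodes)
print(w)  # the weights 1/3, 4/3, 1/3 from above

# By symmetry this rule also integrates x^3 exactly, to zero.
print(w @ nodes**3)
```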

Orthogonal polynomials

We say two vectors are orthogonal when their inner product is equal to zero. For example, if the inner product is simply the dot product, then

$\langle x,y \rangle = \sum_{k=1}^n x_ky_k =0,$

and so vectors are orthogonal if they are perpendicular to each other.

We have a similar definition and example for orthogonal polynomials, but now we choose an inner product that integrates over an interval instead of summing over two vectors’ dimensions:

$\langle f,g \rangle = \int_a^b f(x)g(x)\,\textrm{d} x =0.$

We can then choose a sequence of polynomials with increasing order that are all orthogonal to each other. For example, we can start the sequence with $f_0(x)=1,$ or some multiple of it. We then seek a first-order polynomial $f_1(x)=Ax+B$ such that

$\int_a^b 1\times(Ax+B) \,\textrm{d} x =\frac{A} {2} (b^2-a^2) +B(b-a) =0.$

This can be any multiple of $x-(b+a) /2 .$ In many cases we wish the orthogonal polynomials to be orthonormal, i.e. $\langle f_k,f_k\rangle =1,$ so for the above we require

$\int_a^b C_0^2 \,\textrm{d}x = C_0^2 (b-a) = 1,$

$\begin{aligned} \int_a^b C_1^2(x-(b+a)/2)^2 \,\textrm{d}x &= C_1^2 \frac{1}{3} \left[\left(b-\frac{b+a}{2}\right)^3 -\left(a-\frac{b+a}{2}\right)^3\right] \\&= C_1^2 \frac{2} {3} \left(\frac{b-a} {2}\right)^3 \\&=1,\end{aligned}$

and so on, giving a condition for the value of each scaling factor $C_k.$ We can then find the next term by looking for a second-order polynomial that is orthogonal to $1$ and $x,$ and so on. In the case where $a=-1$ and $b=1$ this gives a simple sequence of polynomials that begins with

$f_0(x)=\frac{1}{\sqrt{2}} ,\, f_1(x)=\sqrt{\frac{3}{2}} x,$

$f_2(x)=\sqrt{\frac{5}{2}} (\frac{3}{2} x^2-\frac{1}{2}) ,\, f_3(x) =\sqrt{\frac{7}{2}} (\frac{5}{2}x^3-\frac{3}{2}x),\ldots$

This is an orthonormal version of the Legendre polynomials.
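These polynomials are easy to check numerically. The sketch below, in Python with NumPy, evaluates the inner products $\langle f_i,f_j\rangle$ over $[-1,1]$ with a quadrature rule that is exact for polynomials of this order:

```python
import numpy as np

# The first few orthonormal Legendre polynomials from above.
fs = [
    lambda x: np.full_like(x, 1 / np.sqrt(2)),
    lambda x: np.sqrt(3 / 2) * x,
    lambda x: np.sqrt(5 / 2) * (1.5 * x**2 - 0.5),
    lambda x: np.sqrt(7 / 2) * (2.5 * x**3 - 1.5 * x),
]

# Evaluate <f_i, f_j> with a ten-point Gauss-Legendre rule, which is
# exact for all the polynomial products involved.
nodes, wts = np.polynomial.legendre.leggauss(10)
gram = np.array([[(wts * fi(nodes) * fj(nodes)).sum() for fj in fs]
                 for fi in fs])
print(np.round(gram, 6))  # approximately the 4x4 identity matrix
```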

Since any polynomial can then be expressed as a linear combination of members of this sequence, each polynomial in the sequence is also orthogonal to any polynomial with lower order. So, for example, $f_3$ is orthogonal to all constant, linear, and quadratic polynomials.

To be continued
The next post will explain why these orthogonal polynomials help us decide on interpolation points.

# In Draft This Week

It’s been a while since I’ve posted anything, so here are a few quick notes about things I’ve been learning and working on. I’ll probably write about these fairly soon.

1. Orthogonal polynomials are helpful when looking at Gaussian quadratures, an elegant extension of the standard quadratures I was taught. Not quite statistics, but approximation theory is close enough.

2. Karhunen-Loève expansions, which can be thought of as the equivalent of Principal Components Analysis for continuous stochastic processes. Instead of finding the eigenvectors of a covariance matrix, in order to put data along independent axes ordered by the amount of variance they account for, you’re finding the eigenfunctions of an autocovariance function, with independent random weights. This leads to slightly strange formulations, where Brownian motion, a process that is nowhere differentiable, can be written as an infinite sum of sine functions, each infinitely differentiable, which almost surely converges to the Brownian motion everywhere.

3. I’ve tried adding another distribution to the R script for plotting ABC posteriors. Specifically, I’ve added the case with a Poisson likelihood, since this is the only other one that can be done without adding distributions from other packages. The sliders I’m currently using make moving around discrete distributions rather odd to work with, though, so a new version’s on hold until I get around to adding options for other continuous cases.

4. The series on our paper from last year hit a roadblock when I started fussing over how much detail to give concerning asymptotics and Taylor’s theorem. The answer for the latter is probably a lot less than I’m trying to do, so once I’ve done a few quick posts on the topics above, I’ll get back to finishing it off.
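The Brownian motion construction in point 2 can be sketched numerically. This is an illustration in Python with NumPy, using the standard sine-series form of the expansion on $[0,1]$:

```python
import numpy as np

rng = np.random.default_rng(3)

def kl_brownian(t, n_terms=2000):
    # Karhunen-Loève expansion of Brownian motion on [0, 1]:
    # W(t) = sum_k Z_k * sqrt(2) * sin((k - 1/2) pi t) / ((k - 1/2) pi),
    # with independent standard normal weights Z_k.
    k = np.arange(1, n_terms + 1)
    freq = (k - 0.5) * np.pi
    Z = rng.normal(size=n_terms)
    return (Z * np.sqrt(2) * np.sin(np.outer(t, freq)) / freq).sum(axis=1)

# Every truncation is a smooth function of t, yet the variance of W(t)
# approaches t, as it should for Brownian motion.
t = np.array([0.25, 0.5, 1.0])
samples = np.array([kl_brownian(t) for _ in range(4000)])
print(samples.var(axis=0))
```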

# R Script: Interactive Plot for ABC Posterior Against True Posterior

The density plot I gave when describing different sources of error in the variance-bias decomposition post was made up, with two arbitrary normal densities plotted together. So, it would be nice if I could give a real example.

For the sake of simplicity, say we have one parameter with a standard normal prior distribution, and we have one observation that’s normally distributed around the parameter with unit variance. The true posterior is then also a normal distribution. However, since the ABC posterior is conditional on the observation lying in a certain ball, we have an expression in terms of the difference between two distribution functions. The resulting curve is almost normal, but not quite.

This isn’t too hard to plot for some specific values, but really what we’d like is a way to change values and see how the plot changes. Ideally I’d like to find a way to get this into a PDF, so I can use it during presentations, but at the moment I’ve got it running in an R script.

Code is below. It’s nothing too fancy, and was mostly a matter of fooling around with the tcltk package demo code to see what happened. Eventually I’d like to add:

- Sliders for prior and data variance
- Shifting the plot axes as the variables change
- A way to embed this in a PDF, if that’s possible
- Exception handling for zero tolerance

cdfs <- function(x.star,delta,cmean,cvar) {
# cdf of (x*+delta-cmean)/sqrt(cvar) minus that of (x*-delta-cmean)/sqrt(cvar)
pnorm(x.star, mean=cmean-delta, sd=sqrt(cvar)) -
pnorm(x.star, mean=cmean+delta, sd=sqrt(cvar))
}

dabcpost <- function(x,x.star,pmean,pvar,datvar,delta) {
dnorm(x,mean=pmean,sd=sqrt(pvar) ) *
cdfs(x.star,delta,x,datvar) /cdfs(x.star,delta,pmean,pvar+datvar)
}

ABCplot <- function(x.star,delta,xlim,ylim) {
curve(dnorm(x,mean=x.star/2,sd=1/sqrt(2)),
lty="dashed",xlab=expression(theta),
ylab="Density",main="ABC Posterior Error",
xlim=xlim,ylim=ylim)
curve(dabcpost(x,x.star,pmean=0,pvar=1,datvar=1,
delta=delta),
add=TRUE)
curve(dnorm(x),lty="dotted",add=TRUE) # prior, to match the legend
legend(x="topright",c("Prior","True posterior","ABC posterior"),
lty=c("dotted","dashed","solid"))
}

require(tcltk) || stop("tcltk support is absent")
require(graphics); require(stats)
local({
have_ttk <- as.character(tcl("info", "tclversion")) >= "8.5"
if(have_ttk) {
tkbutton <- ttkbutton
tkframe <- ttkframe
tklabel <- ttklabel
}

xlim <- c(-5,5)
ylim <- c(0,0.6)
x.star <- tclVar(3)
x.star.sav <- 3
bw <- tclVar(1)
bw.sav <- 1 # in case replot.maybe is called too early

replot <- function(...) {
bw.sav <<- b <- as.numeric(tclObj(bw))
x.star.sav <<- xs <- as.numeric(tclObj(x.star))
eval(substitute(ABCplot(xs,b,xlim,ylim)))
}

replot.maybe <- function(...)
{
if (as.numeric(tclObj(bw)) != bw.sav ||
as.numeric(tclObj(x.star)) != x.star.sav) replot()
}

regen <- function(...) {
xlim <<- c(min(0,as.numeric(tclObj(x.star) ) /2) -5,
max(0,as.numeric(tclObj(x.star) ) /2) +5)
replot()
}

grDevices::devAskNewPage(FALSE) # override setting in demo()
tclServiceMode(FALSE)
base <- tktoplevel()
tkwm.title(base, "Density")

spec.frm <- tkframe(base,borderwidth=2)
right.frm <- tkframe(spec.frm)

frame3 <-tkframe(right.frm,relief="groove",borderwidth=2)
tkpack(tklabel(frame3,text="Observation") )
tkpack(tkscale(frame3,command=replot.maybe,from=1,to=10,
showvalue=T,variable=x.star,
resolution=0.1,orient="horiz") )

frame4 <-tkframe(right.frm, relief="groove", borderwidth=2)
tkpack(tklabel (frame4, text="Tolerance"))
tkpack(tkscale(frame4, command=replot.maybe, from=0.05, to=16.00,
showvalue=T, variable=bw,
resolution=0.05, orient="horiz"))

tkpack(frame3,frame4, fill="x")
tkpack(right.frm,side="left", anchor="n")

## Bottom frame (on base):
q.but <- tkbutton(base,text="Quit",
command=function() tkdestroy(base))

tkpack(spec.frm, q.but)
tclServiceMode(TRUE)

regen()
})

# Approximate Bayesian Computation: Variance-Bias Decomposition

Now I’ve rambled about how to measure error, let’s relate it back to ABC. I mentioned previously that using ABC with a non-zero tolerance $\delta$ means our samples are taken from the density $p(\theta \,|\, \|S-s^*\| \leq \delta)$, instead of the true posterior $p(\theta \,|\, S=s^*)$ for a sufficient statistic S.

Say we write our estimate as $\hat{\theta} =\frac{1}{n} \sum_{i=1}^n \phi_i$, where each $\phi_i$ is an accepted sample. If we measure error as mean square error, then we can decompose the error as we did in the case of sampling from the wrong distribution:

$\mathbb{E}(L(\theta,\hat{\theta} ) \,|\, x^*) =\underbrace{\mathrm{Var} (\theta\,|\,x^*)}_{\textrm{True uncertainty} } +\underbrace{\frac{1}{n} \mathrm{Var} (\phi \,|\, x^*) }_{\textrm{Monte Carlo error} } +\underbrace{\mathbb{E} ((\mathbb{E} (\phi) -\mathbb{E} (\theta) )^2 \,|\, x^*) }_{\textrm{Square sampling bias} } .$

This is now conditional on the observed data, but this only changes the equation in the obvious way. For a graphical example, say the true posterior, and the ABC posterior our samples come from, look like this:

The true posterior density is, of course, a density with a non-zero variance rather than a single point. This describes the true uncertainty, i.e. what our estimate’s mean square error would be if our estimate was the optimal value $\mathbb{E} (\theta \,|\, S=s^*)$.

Next, imagine we could somehow calculate the ABC posterior, and so get its expectation $\mathbb{E} (\theta \,|\, \|S-s^*\| \leq \delta)$. Since the two expectations – the peaks, in the case shown in the picture above – are likely to not overlap, this estimate would have a slight bias. This introduces a sampling bias.

Finally, take the full case where we average over n samples from the ABC posterior. This now introduces the Monte Carlo error, since sampling like this will introduce more error due to the randomness involved. Note that $\mathrm{Var} (\phi \,|\, x^*) =\mathrm{Var} (\theta \,|\, \|S-s^*\| \leq\delta)$ will probably be larger than $\mathrm{Var} (\theta \,|\, x^*) =\mathrm{Var} (\theta \,|\, S=s^*)$, since $\|S-s^*\| \leq \delta$ provides less information than $S=s^*$.
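These three pieces can be seen in a quick simulation. Here’s a sketch in Python with NumPy rather than R, using the normal example from the interactive-plot post; the specific values of $s^*$ and $\delta$ are my own choices:

```python
import numpy as np

rng = np.random.default_rng(4)

# Standard normal prior, one observation normally distributed around
# theta with unit variance, so the true posterior is N(s*/2, 1/2).
s_star, delta = 3.0, 0.75
n_draws = 400_000

theta = rng.normal(0.0, 1.0, n_draws)      # draw from the prior
s = rng.normal(theta, 1.0)                 # simulate an observation each
phi = theta[np.abs(s - s_star) <= delta]   # ABC accept/reject step

true_mean, true_var = s_star / 2, 0.5
bias = phi.mean() - true_mean

# The sampling bias pulls the estimate towards the prior mean, and the
# ABC posterior variance exceeds the true posterior variance.
print(bias, phi.var(), true_var)
```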

A Quick Look at the Bias
Since the true uncertainty is not affected by our choice of $\delta$, I’m going to ignore it. In the paper, we never mention it, defining the MSE to be $\mathbb{E} ((\hat{\theta} -\mathbb{E} (\theta \,|\, S=s^*) )^2 \,|\, x^*)$, the sum of the other two error terms above.

We then have variance and square-bias terms that we can consider separately. The bias is easier, so let’s start with that. First, note that the bias doesn’t depend on the number of samples we take, so we only need to calculate the bias of a single sample $\phi$. After a bit of thought, and denoting the acceptance region as the ball $B_{\delta} (s^*)$ and the joint prior density of $\theta$ and $S$ as $p(\cdot,\cdot)$, we can write the bias as

$\mathbb{E} (\phi \,|\, s^*) -\mathbb{E} (\theta \,|\, s^*) =\dfrac{\iint_{s\in B_{\delta} (s^*) } t \, p(t,s) \, \textrm{d}s \, \textrm{d}t} {\iint_{s\in B_{\delta} (s^*) } p(t,s) \, \textrm{d}s \, \textrm{d}t} -\dfrac{\int t \, p(t,s^*) \, \textrm{d}t} {\int p(t,s^*) \, \textrm{d}t} .$

Unless we look at specific forms for the joint density $p(t,s)$, this is about as far as we can get exactly. To get any further, we need to work in terms of asymptotic behaviour, which I’ll introduce next time.

# Variance-Bias, or The Decomposition Trick for Quadratic Loss

Say we’ve decided to judge our estimator $\hat{\theta}$ for some parameter $\theta$ by determining the mean square error $\mathbb{E} \left((\theta-\hat{\theta} )^2 \right) ,$ i.e. we are using a quadratic loss function. The nice thing about using mean square error, or MSE, to determine optimality of an estimator is that it lends itself well to being split into different components.

Variance and Bias
For example, we can expand the MSE as

$\mathbb{E} \left(L(\theta,\hat{\theta} ) \right) =\mathbb{E} \left((\theta-\hat{\theta} )^2 \right) =\mathbb{E} \left((\theta-\mathbb{E} (\theta) +\mathbb{E} (\theta) -\hat{\theta} )^2 \right) .$

Why add more terms? Because it leads to a useful intuition about the nature of the loss. Say we now split the expression into two, each with two terms, i.e.

$\mathbb{E} \left(L(\theta,\hat{\theta} ) \right) =\mathbb{E} \left((\theta-\mathbb{E} (\theta) )^2 +2(\theta-\mathbb{E} (\theta) ) (\mathbb{E} (\theta) -\hat{\theta} ) +(\mathbb{E} (\theta) -\hat{\theta} )^2 \right) .$

Since $\hat{\theta}$ here is a fixed value, $\theta$ is the only random variable in the expression. The interaction term in the middle then has expectation $2\,\mathbb{E} (\theta-\mathbb{E} (\theta) ) \, (\mathbb{E} (\theta) -\hat{\theta} ) =0,$ so the MSE splits into

$\mathbb{E} \left(L(\theta,\hat{\theta} ) \right) =\mathbb{E} \left((\theta -\mathbb{E} (\theta) )^2 \right) +\big(\hat{\theta} -\mathbb{E} (\theta) \big)^2 =\mathrm{Var} (\theta) +\mathrm{bias} (\hat{\theta} )^2 .$
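This identity is easy to check numerically. The sketch below (Python, with an arbitrary Normal distribution for $\theta$ and an arbitrary fixed estimate, both chosen purely for illustration) computes both sides from the same set of samples:

```python
import random

random.seed(0)

theta = [random.gauss(2.0, 0.5) for _ in range(100_000)]
theta_hat = 1.7  # a fixed, non-random estimate

# Left-hand side: the mean square error directly.
mse = sum((t - theta_hat) ** 2 for t in theta) / len(theta)

# Right-hand side: variance plus squared bias.
mean = sum(theta) / len(theta)
variance = sum((t - mean) ** 2 for t in theta) / len(theta)
bias_sq = (theta_hat - mean) ** 2

print(mse, variance + bias_sq)  # the two agree, up to rounding
```

Note that with the sample mean plugged in, the decomposition holds exactly for the samples themselves, not just in expectation, which is why the two numbers match to machine precision rather than only up to Monte Carlo noise.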

Our expected loss is thus a combination of the uncertainty of our knowledge of $\theta,$ which we cannot do anything about, and the square of the bias of our estimator. Our optimal estimator, the mean, is thus the estimator that makes the bias equal to zero.

The nice thing about having an unbiased estimator like this one is that it is correct on average, i.e. it doesn’t have a tendency to either over- or under-estimate.

Imagine you’re firing a gun at a target. Assume, for the moment, that your aim is perfect! However, you’re testing a new gun whose performance is unknown. If your shots are tightly packed, i.e. have a small spread, then the variance of the shots is small. If they’re sprayed all over the place, the variance is high. If the cluster of shots is off-centre, they’re biased. If they’re on-target, or at least clustered around it, the bias is small, or even zero.

Having a small bias seems like a good thing. In fact, it seems like such a good thing that people often try to get unbiased estimators. This can turn out to be a bad idea, if it increases the variance too much.

Say we are at the firing range again, and you have two guns to test. One has a tight spread, but its shots are off-centre. The other’s shots are centred on the target, but scattered all over the place. If we cared only about being unbiased, the second gun would be deemed superior, but this goes completely against how most people would evaluate the guns’ performances. The bias in the first gun can be compensated for by adjusting the sights, but the second gun is barely usable. So, we still need to take account of both variance and bias.

Monte Carlo Error
However, we’re not done yet! Say we don’t know what the expectation of $\theta$ is. Then we need to decide on some other choice of estimate $\hat{\theta} .$ Let’s say, for example, that while we don’t know the expectation, we can draw samples from the whole distribution. How about if we generated a few samples, and took their average as our estimate? Well, this estimator is random, so the MSE is now an expectation over the estimate as well as $\theta$ itself.

However, we can still split the error as we did above. We can even still get rid of the interaction term, since the estimator and the parameter are independent. So, we get

$\mathbb{E} (L(\theta,\hat{\theta} ) ) =\mathrm{Var} (\theta) +\mathbb{E} ((\hat{\theta} -\mathbb{E} (\theta) )^2 ) .$

Now what? Well, the second term is the expected square difference between something random and something constant, just as we originally had in the simple case before. So, let’s try splitting again! Inserting the expectation of the random variable worked well last time, so let’s try that.

$\mathbb{E} (L(\theta,\hat{\theta} ) ) =\mathrm{Var} (\theta) +\mathbb{E} ((\hat{\theta} -\mathbb{E} (\hat{\theta} ) )^2 ) +(\mathbb{E} (\hat{\theta} ) -\mathbb{E} (\theta) )^2 .$

We get a variance term and a bias term again, fancy that. So, what is $\mathbb{E} (\hat{\theta} ) ?$ Well, it’s the expectation for an average of independent samples, so it’s equal to the expectation for one of them, which is just $\mathbb{E} (\theta) .$ The bias term disappears.

Similarly, the variance of an average of independent samples is the variance of a single sample divided by the number of samples. So, if we write the estimator as $\hat{\theta} =\frac{1} {n} \sum_{i=1}^n \phi_i,$ the MSE is

$\mathbb{E} (L(\theta,\hat{\theta} ) ) =\mathrm{Var} (\theta) +\frac{1} {n} \mathrm{Var} (\phi) .$
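As a quick numerical check of that $1/n$ behaviour, here's a Python sketch (taking both $\theta$ and the samples $\phi_i$ to be standard Normal is my own choice, purely for illustration). The predicted MSE is $\mathrm{Var}(\theta) + \mathrm{Var}(\phi)/n = 1 + 1/n$:

```python
import random

random.seed(0)

def mse_of_sample_mean(n, trials=20_000):
    """Estimate E[(theta - theta_hat)^2], where theta ~ N(0, 1) and
    theta_hat is the average of n fresh N(0, 1) samples."""
    total = 0.0
    for _ in range(trials):
        theta = random.gauss(0, 1)
        theta_hat = sum(random.gauss(0, 1) for _ in range(n)) / n
        total += (theta - theta_hat) ** 2
    return total / trials

# Theory predicts roughly 1 + 1/n: about 2.0, 1.1 and 1.01.
for n in (1, 10, 100):
    print(n, mse_of_sample_mean(n))
```

The MSE never drops below the $\mathrm{Var}(\theta) = 1$ floor, however many samples we take – that term is the true uncertainty, and only the Monte Carlo part shrinks.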

So we get closer to the optimal MSE as we take more samples. Makes sense. There are also variations used to reduce the MC error, such as using non-independent samples, but I’ll leave off for now.

Sampling from the Wrong Distribution
We’re still not done. Say that the sampling estimator we used above is taking samples from the wrong distribution. How does this affect the error? Well, the variance of each sample might change, but, more importantly, the bias term probably won’t disappear:

$\mathbb{E} (L(\theta,\hat{\theta} ) ) =\mathrm{Var} (\theta) +\frac{1} {n} \mathrm{Var} (\phi) +(\mathbb{E} (\phi) -\mathbb{E} (\theta) )^2 .$

One thing to note from this is that if we sample from a distribution with the same expectation, but with lower variance, we get a smaller MSE. The logical extreme is taking a distribution with zero variance. Then every sample is equal to the expectation, and we are just left with the natural parameter uncertainty.
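The same sketch works with the sampling distribution deliberately shifted (again a toy Normal setup of my own, not from the post): $\theta \sim N(0,1)$ but each sample comes from $N(0.3, 1)$, so the predicted MSE picks up an extra $0.3^2$ that no amount of sampling removes:

```python
import random

random.seed(0)

def mse_wrong_dist(n, shift=0.3, trials=20_000):
    """theta ~ N(0, 1), but the samples phi come from N(shift, 1).
    Predicted MSE: Var(theta) + Var(phi)/n + shift^2."""
    total = 0.0
    for _ in range(trials):
        theta = random.gauss(0, 1)
        theta_hat = sum(random.gauss(shift, 1) for _ in range(n)) / n
        total += (theta - theta_hat) ** 2
    return total / trials

# Predicted: 1 + 1/20 + 0.09 = 1.14; the squared bias doesn't shrink with n.
print(mse_wrong_dist(20))
```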

So, we now have three different sources of error. One is the inherent uncertainty of what we’re trying to estimate. Another is Monte Carlo error, introduced by averaging over samples instead of using the expectation directly. Finally, there is sampling bias, introduced by taking our samples from a distribution different to the one we want.

That’s about as far as we can go for this example, but this technique can also be used for other problems. Just try the same tactic of splitting the MSE into independent sources of error, by adding and subtracting a term in the middle. Then we can find what the different sources of error are, which we have control over, and so on.

The good news, though, is that the above is all we need to talk about the error introduced by using ABC, so I’ll get back to that next time.