# Approximate Bayesian Computation: Variance-Bias Decomposition

Now that I’ve rambled about how to measure error, let’s relate it back to ABC. I mentioned previously that using ABC with a non-zero tolerance $\delta$ means our samples are drawn from the density $p(\theta \,|\, \|S-s^*\| \leq \delta)$, rather than from the true posterior $p(\theta \,|\, S=s^*)$ for a sufficient statistic $S$.
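To make this concrete, here is a minimal rejection-ABC sketch. The model is a toy conjugate example of my own choosing, not something fixed by the discussion above: a standard normal prior, one normal observation, and the observation itself as the (sufficient) summary statistic, so the exact posterior $N(s^*/2, 1/2)$ is available for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

def abc_rejection(s_star, delta, n_accept):
    """Rejection ABC for a toy model (my assumption, not from the text):
    theta ~ N(0, 1) prior, one observation x ~ N(theta, 1), and summary
    statistic S = x, which is sufficient here. The accepted values are
    draws from p(theta | |S - s*| <= delta)."""
    accepted = []
    while len(accepted) < n_accept:
        theta = rng.normal(0.0, 1.0)   # draw a parameter from the prior
        s = rng.normal(theta, 1.0)     # simulate the summary statistic
        if abs(s - s_star) <= delta:   # accept if S lands in the delta-ball
            accepted.append(theta)
    return np.array(accepted)

samples = abc_rejection(s_star=1.0, delta=0.5, n_accept=2000)
# For this model the exact posterior is N(s*/2, 1/2), so the sample mean
# should sit near 0.5, up to a small delta-induced bias.
print(samples.mean())
```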

Say we write our estimate as $\hat{\theta} =\frac{1}{n} \sum_{i=1}^n \phi_i$, where each $\phi_i$ is an accepted sample. If we measure error as mean square error, then we can decompose the error as we did in the case of sampling from the wrong distribution:

$\mathbb{E}(L(\theta,\hat{\theta} ) \,|\, x^*) =\underbrace{\mathrm{Var} (\theta\,|\,x^*)}_{\textrm{True uncertainty} } +\underbrace{\frac{1}{n} \mathrm{Var} (\phi \,|\, x^*) }_{\textrm{Monte Carlo error} } +\underbrace{\mathbb{E} ((\mathbb{E} (\phi) -\mathbb{E} (\theta) )^2 \,|\, x^*) }_{\textrm{Square sampling bias} } .$
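We can sanity-check this decomposition by simulation. The sketch below uses a toy conjugate model that is my own assumption (standard normal prior, one observation $x \sim N(\theta, 1)$, summary $S = x$), for which $\theta \,|\, S = s^*$ is exactly $N(s^*/2, 1/2)$; it estimates the left-hand side directly and each of the three terms on the right.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy conjugate model (my assumption): theta ~ N(0, 1), one observation
# x ~ N(theta, 1), summary S = x. Then theta | S = s* is exactly N(s*/2, 1/2).
S_STAR, DELTA, N = 1.0, 0.5, 10

def abc_sample(size):
    """Draw `size` accepted thetas from p(theta | |S - s*| <= DELTA)."""
    out = np.empty(0)
    while out.size < size:
        theta = rng.normal(0.0, 1.0, size=8 * size)
        s = rng.normal(theta, 1.0)
        out = np.concatenate([out, theta[np.abs(s - S_STAR) <= DELTA]])
    return out[:size]

# Left-hand side: simulate E((theta - theta_hat)^2 | x*) directly, with
# theta drawn from the exact posterior and theta_hat an average of N
# ABC samples.
reps = 8000
theta = rng.normal(S_STAR / 2, np.sqrt(0.5), size=reps)
theta_hat = np.array([abc_sample(N).mean() for _ in range(reps)])
lhs = np.mean((theta - theta_hat) ** 2)

# Right-hand side: the three terms of the decomposition.
big = abc_sample(100_000)
true_uncertainty = 0.5                            # Var(theta | x*), known here
monte_carlo = big.var() / N                       # Var(phi | x*) / n
sq_bias = (big.mean() - S_STAR / 2) ** 2          # (E(phi) - E(theta | x*))^2
rhs = true_uncertainty + monte_carlo + sq_bias

print(lhs, rhs)  # the two sides should agree up to Monte Carlo noise
```

The agreement is exact in expectation because $\theta$ and $\hat{\theta}$ are independent given $x^*$, so the squared error splits into the two variances plus the squared difference of means.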

This is now conditional on the observed data, which only changes the earlier decomposition in the obvious way. For a graphical example, say the true posterior, and the ABC posterior our samples come from, look like this:

*(Figure: the true posterior density alongside the wider, slightly shifted ABC posterior density.)*

The true posterior is, of course, a density with non-zero variance rather than a point mass. This variance is the true uncertainty, i.e. the mean square error our estimate would achieve even if it were the optimal value $\mathbb{E} (\theta \,|\, S=s^*)$.

Next, imagine we could somehow calculate the ABC posterior exactly, and so obtain its expectation $\mathbb{E} (\theta \,|\, \|S-s^*\| \leq \delta)$. Since the two expectations – the peaks, in the case shown in the picture above – are unlikely to coincide, this estimate would be slightly off: this is the sampling bias.

Finally, take the full case where we average over $n$ samples from the ABC posterior. This introduces the Monte Carlo error: averaging a finite number of random samples adds error of its own. Note that $\mathrm{Var} (\phi \,|\, x^*) =\mathrm{Var} (\theta \,|\, \|S-s^*\| \leq\delta)$ will probably be larger than $\mathrm{Var} (\theta \,|\, x^*) =\mathrm{Var} (\theta \,|\, S=s^*)$, since $\|S-s^*\| \leq \delta$ provides less information than $S=s^*$.
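The variance inflation is easy to observe numerically. The sketch below, again for a toy conjugate model of my choosing (standard normal prior, one observation $x \sim N(\theta, 1)$, summary $S = x$, observed $s^* = 1$), estimates $\mathrm{Var}(\theta \,|\, \|S-s^*\| \leq \delta)$ for several tolerances; the exact value $\mathrm{Var}(\theta \,|\, S=s^*) = 1/2$ is the $\delta \to 0$ limit.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy conjugate model again (my assumption): theta ~ N(0, 1), x ~ N(theta, 1),
# S = x, observed s* = 1, so Var(theta | S = s*) = 1/2 exactly.
s_star = 1.0

def abc_variance(delta, n_accept=100_000):
    """Estimate Var(theta | |S - s*| <= delta) by rejection sampling."""
    out = np.empty(0)
    while out.size < n_accept:
        theta = rng.normal(0.0, 1.0, size=200_000)
        s = rng.normal(theta, 1.0)
        out = np.concatenate([out, theta[np.abs(s - s_star) <= delta]])
    return out[:n_accept].var()

variances = {delta: abc_variance(delta) for delta in (0.1, 0.5, 1.0, 2.0)}
for delta, v in variances.items():
    print(delta, v)  # creeps up from 1/2 as the tolerance grows
```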

## A Quick Look at the Bias
Since the true uncertainty is not affected by our choice of $\delta$, I’m going to ignore it. In the paper, we never mention it, defining the MSE to be $\mathbb{E} ((\hat{\theta} -\mathbb{E} (\theta \,|\, S=s^*) )^2 \,|\, x^*)$, the sum of the other two error terms above.

We then have variance and square-bias terms, which we can consider separately. The bias is easier, so let’s start with that. First, note that the bias doesn’t depend on the number of samples we take, so we only need to calculate the bias of a single sample $\phi$. After a bit of thought, and denoting the acceptance region by the ball $B_{\delta} (s^*)$ and the joint prior density of $\theta$ and $S$ by $p(\cdot,\cdot)$, we can write the bias as

$\mathbb{E} (\phi \,|\, s^*) -\mathbb{E} (\theta \,|\, s^*) =\dfrac{\iint_{s\in B_{\delta} (s^*) } t \, p(t,s) \, \textrm{d}s \, \textrm{d}t} {\iint_{s\in B_{\delta} (s^*) } p(t,s) \, \textrm{d}s \, \textrm{d}t} -\dfrac{\int t \, p(t,s^*) \, \textrm{d}t} {\int p(t,s^*) \, \textrm{d}t} .$
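To see the formula in action, here is a sketch that evaluates both ratios by grid quadrature for a toy Gaussian model of my own choosing, where $\theta \sim N(0,1)$ and $S \,|\, \theta \sim N(\theta, 1)$, so $p(t,s) \propto \exp(-t^2/2)\exp(-(s-t)^2/2)$.

```python
import numpy as np

# Toy Gaussian model (my assumption): theta ~ N(0, 1), S | theta ~ N(theta, 1),
# so the joint prior density is p(t, s) = N(t; 0, 1) * N(s; t, 1).
s_star, delta = 1.0, 0.5

def joint(t, s):
    """Unnormalised joint density p(t, s); constants cancel in the ratios."""
    return np.exp(-0.5 * t ** 2 - 0.5 * (s - t) ** 2)

# Grid quadrature: t over a wide range, s over the ball B_delta(s*).
# Uniform grid spacings cancel in each ratio, so plain sums suffice.
t = np.linspace(-6.0, 6.0, 2001)
s = np.linspace(s_star - delta, s_star + delta, 401)
T, S = np.meshgrid(t, s, indexing="ij")
P = joint(T, S)

abc_mean = np.sum(T * P) / np.sum(P)          # E(phi | s*), first ratio
post = joint(t, s_star)
post_mean = np.sum(t * post) / np.sum(post)   # E(theta | s*), second ratio
bias = abc_mean - post_mean
print(bias)  # small and negative: the ball drags the mean toward the prior
```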

Unless we look at specific cases for the form of $p(t,s)$, this is about as far as we can get exactly. To get any further, we need to work in terms of asymptotic behaviour, which I’ll introduce next time.