Variance-Bias, or The Decomposition Trick for Quadratic Loss

Say we’ve decided to judge our estimator \hat{\theta} for some parameter \theta by determining the mean square error \mathbb{E} \left((\theta-\hat{\theta} )^2 \right) , i.e. we are using a quadratic loss function. The nice thing about using mean square error, or MSE, to determine optimality of an estimator is that it lends itself well to being split into different components.

Variance and Bias
For example, we can expand the MSE as

\mathbb{E} \left(L(\theta,\hat{\theta} ) \right) =\mathbb{E} \left((\theta-\hat{\theta} )^2 \right) =\mathbb{E} \left((\theta-\mathbb{E} (\theta) +\mathbb{E} (\theta) -\hat{\theta} )^2 \right) .

Why add more terms? Because it leads to a useful intuition about the nature of the loss. Say we now group the expression into two differences, \theta-\mathbb{E} (\theta) and \mathbb{E} (\theta) -\hat{\theta} , and expand the square, i.e.

\mathbb{E} \left(L(\theta,\hat{\theta} ) \right) =\mathbb{E} \left((\theta-\mathbb{E} (\theta) )^2 +2(\theta-\mathbb{E} (\theta) ) (\mathbb{E} (\theta) -\hat{\theta} ) +(\mathbb{E} (\theta) -\hat{\theta} )^2 \right) .

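Since \theta is the only random variable in the expression, the factor \mathbb{E} (\theta) -\hat{\theta} is a constant and can be pulled out of the expectation. What remains is \mathbb{E} (\theta-\mathbb{E} (\theta) ) =0, so the interaction term in the middle is zero, and the MSE splits into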

\mathbb{E} \left(L(\theta,\hat{\theta} ) \right) =\mathbb{E} \left((\theta -\mathbb{E} (\theta) )^2 \right) +\big(\hat{\theta} -\mathbb{E} (\theta) \big)^2 =\mathrm{Var} (\theta) +\mathrm{bias} (\hat{\theta} )^2 .

Our expected loss is thus a combination of the uncertainty of our knowledge of \theta, which we cannot do anything about, and the square of the bias of our estimator. The optimal estimator is therefore the one that makes the bias equal to zero, namely the mean, \hat{\theta} =\mathbb{E} (\theta) .
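To see the split in numbers, here is a minimal sketch using only the Python standard library. The function name mse_decomposition, the normal distribution for \theta and the two guesses are all my own assumptions for illustration. It treats a list of draws as the whole distribution of \theta, so the identity holds up to floating-point rounding.

```python
import random
import statistics

def mse_decomposition(theta_samples, theta_hat):
    """Return (mse, variance, squared_bias) for a fixed estimate theta_hat.

    The samples are treated as the full distribution of theta, so the
    variance is the population variance (divide by N, not N - 1).
    """
    mean_theta = statistics.fmean(theta_samples)
    mse = statistics.fmean([(t - theta_hat) ** 2 for t in theta_samples])
    variance = statistics.fmean([(t - mean_theta) ** 2 for t in theta_samples])
    squared_bias = (theta_hat - mean_theta) ** 2
    return mse, variance, squared_bias

rng = random.Random(0)  # fixed seed so the run is repeatable
# Assumed distribution for theta: normal with mean 2 and standard deviation 1.5.
theta = [rng.gauss(2.0, 1.5) for _ in range(50_000)]

# First guess is the mean itself (zero bias), second is deliberately off.
for guess in (statistics.fmean(theta), 3.0):
    mse, var, bias2 = mse_decomposition(theta, guess)
    print(f"guess={guess:.3f}  MSE={mse:.4f}  Var+bias^2={var + bias2:.4f}")
```

For both guesses the two printed numbers should agree, and the MSE for the mean should be the smaller of the two, since only the squared bias changes between them.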

The nice thing about having an unbiased estimator like this one is that it is correct on average, i.e. it doesn’t have a tendency to either over- or under-estimate.

Imagine you’re firing a gun at a target. Assume, for the moment, that your aim is perfect! However, you’re testing a new gun whose performance is unknown. If your shots are tightly packed, i.e. have a small spread, then the variance of the shots is small. If they’re sprayed all over the place, the variance is high. If the cluster of shots is off-centre, they’re biased. If they’re on-target, or at least clustered around it, the bias is small, or even zero.

Having a small bias seems like a good thing. In fact, it seems like such a good thing that people often try to get unbiased estimators. This can turn out to be a bad idea, if it increases the variance too much.

Say we are at the firing range again. Suppose you had two guns to test. One has a tight spread, but shots are off-centre. The other’s shots are centred, but they’re scattered all over the place. If we were interested only in being unbiased, the second gun would be deemed superior, but this goes completely against how most people would evaluate the guns’ performances. If we could look at how each gun did and adjust it for next time, the bias in the first gun could be compensated for by adjusting the sights, but the second gun would still be barely usable. So, we still need to take account of both variance and bias.
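The two guns can be compared with a rough simulation. This is only a sketch: the function shot_mse, the one-dimensional target, the normally distributed shots and the particular bias and spread values are assumptions I made up for the example. The mean square miss distance of each gun should come out close to its variance plus its squared bias.

```python
import random

def shot_mse(bias, spread, target=0.0, shots=100_000, seed=0):
    """Estimate the mean square miss distance along one axis for one gun.

    Shots are assumed normal, centred at target + bias, with standard
    deviation spread.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(shots):
        shot = rng.gauss(target + bias, spread)
        total += (shot - target) ** 2
    return total / shots

# Gun 1: tight but off-centre. Formula: 0.5**2 + 1.0**2 = 1.25
tight_but_off = shot_mse(bias=1.0, spread=0.5)
# Gun 2: centred but scattered. Formula: 3.0**2 + 0.0**2 = 9.0
centred_but_wild = shot_mse(bias=0.0, spread=3.0)

print(f"tight but off-centre:  simulated MSE = {tight_but_off:.3f}")
print(f"centred but scattered: simulated MSE = {centred_but_wild:.3f}")
```

With these made-up numbers the biased gun has the smaller MSE by a wide margin, which matches the intuition that the unbiased gun is not automatically the better one.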

Monte Carlo Error
However, we’re not done yet! Say we don’t know what the expectation of \theta is. Then we need to decide on some other choice of estimate \hat{\theta} . Let’s say, for example, that while we don’t know the expectation, we can draw samples from the whole distribution. How about if we generated a few samples, and took their average as our estimate? Well, this estimator is random, so the MSE is now an expectation over the estimate as well as \theta itself.

However, we can still split the error as we did above. We can even still get rid of the interaction term, since the estimator and the parameter are independent. So, we get

\mathbb{E} (L(\theta,\hat{\theta} ) ) =\mathrm{Var} (\theta) +\mathbb{E} ((\hat{\theta} -\mathbb{E} (\theta) )^2 ) .

Now what? Well, the second term is the expected square difference between something random and something constant, as we originally had in the simple case before. So, let’s try splitting again! Inserting the expectation of the random variable worked well last time, so let’s try that.

\mathbb{E} (L(\theta,\hat{\theta} ) ) =\mathrm{Var} (\theta) +\mathbb{E} ((\hat{\theta} -\mathbb{E} (\hat{\theta} ) )^2 ) +(\mathbb{E} (\hat{\theta} ) -\mathbb{E} (\theta) )^2 .

We get a variance term and a bias term again, fancy that. So, what is \mathbb{E} (\hat{\theta} ) ? Well, it’s the expectation for an average of independent samples, so it’s equal to the expectation for one of them, which is just \mathbb{E} (\theta) . The bias term disappears.

Similarly, the variance of an average of independent samples is the variance of a single sample, divided by the number of samples. So, if we write the estimator as \hat{\theta} =\frac{1} {n} \sum_{i=1}^n \phi_i, the MSE is

\mathbb{E} (L(\theta,\hat{\theta} ) ) =\mathrm{Var} (\theta) +\frac{1} {n} \mathrm{Var} (\phi) .

So we get closer to the optimal MSE as we take more samples. Makes sense. There are also variations used to reduce the MC error, such as using non-independent samples, but I’ll leave off for now.
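Here is a hedged sketch of the Monte Carlo case. I assume, purely for illustration, that \theta and every sample \phi_i are independent standard normal draws, so the formula above predicts an MSE of 1 + 1/n. The function monte_carlo_mse and its parameters are hypothetical names of my own.

```python
import random

def monte_carlo_mse(n, trials=20_000, mean=0.0, sd=1.0, seed=0):
    """Simulate the MSE when theta_hat is the average of n independent samples.

    theta and the samples phi_i are all drawn independently from the same
    assumed normal distribution.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        theta = rng.gauss(mean, sd)
        theta_hat = sum(rng.gauss(mean, sd) for _ in range(n)) / n
        total += (theta - theta_hat) ** 2
    return total / trials

for n in (1, 5, 25):
    predicted = 1.0 + 1.0 / n  # Var(theta) + Var(phi) / n with both variances 1
    print(f"n={n:2d}  simulated={monte_carlo_mse(n):.3f}  predicted={predicted:.3f}")
```

The simulated values should sit near the predicted ones and shrink towards Var(\theta) = 1 as n grows, but never drop below it.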

Sampling from the Wrong Distribution
We’re still not done. Say that the sampling estimator we used above is taking samples from the wrong distribution. How does this affect the error? Well, the variance of each sample might change, but, more importantly, the bias term probably won’t disappear:

\mathbb{E} (L(\theta,\hat{\theta} ) ) =\mathrm{Var} (\theta) +\frac{1} {n} \mathrm{Var} (\phi) +(\mathbb{E} (\phi) -\mathbb{E} (\theta) )^2 .

One thing to note from this is that if we sample from a distribution with the same expectation, but with lower variance, we get a smaller MSE. The logical extreme is taking a distribution with zero variance. Then every sample is equal to the expectation, and we are just left with the natural parameter uncertainty.
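The same kind of simulation works when the samples come from the wrong distribution. Again, this is just a sketch under assumptions of my own: \theta is standard normal, the samples are normal with a mean and standard deviation I picked, and the names sampled_mse and predicted_mse are hypothetical. The third case below is the zero-variance extreme mentioned above.

```python
import random

def predicted_mse(n, phi_mean, phi_sd, theta_mean=0.0, theta_sd=1.0):
    """Var(theta) + Var(phi) / n + (E(phi) - E(theta))^2."""
    return theta_sd ** 2 + phi_sd ** 2 / n + (phi_mean - theta_mean) ** 2

def sampled_mse(n, phi_mean, phi_sd, theta_mean=0.0, theta_sd=1.0,
                trials=20_000, seed=0):
    """Simulate the MSE when the n samples come from a different distribution."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        theta = rng.gauss(theta_mean, theta_sd)
        theta_hat = sum(rng.gauss(phi_mean, phi_sd) for _ in range(n)) / n
        total += (theta - theta_hat) ** 2
    return total / trials

cases = {
    "right distribution": (0.0, 1.0),
    "wrong mean and variance": (0.5, 2.0),
    "right mean, zero variance": (0.0, 0.0),
}
for label, (phi_mean, phi_sd) in cases.items():
    sim = sampled_mse(10, phi_mean, phi_sd)
    pred = predicted_mse(10, phi_mean, phi_sd)
    print(f"{label:26s} simulated={sim:.3f}  predicted={pred:.3f}")
```

In the zero-variance case every sample equals the expectation, so the only error left should be Var(\theta) itself.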

So, we now have three different sources of error. One is the inherent uncertainty of what we’re trying to estimate. Another is Monte Carlo error, introduced by averaging over samples instead of using the expectation directly. Finally, there is sampling bias, introduced by taking our samples from a distribution different to the one we want.

That’s about as far as we can go for this example, but this technique can also be used for other problems. Just try the same tactic of splitting the MSE into independent sources of error, by adding and subtracting a term in the middle. Then we can find what the different sources of error are, which we have control over, and so on.

The good news, though, is that the above is all we need to talk about the error introduced by using ABC, so I’ll get back to that next time.
