# Measuring Error: What Are Loss Functions?

Last time I finished on the question of how we measure the error of an estimate. Let’s say we trying to estimate a parameter, whose true value is $\theta$, and our estimate is $\hat{\theta}$. If there were to be a difference between the two, how much would we regret it? We’d like some way to quantify the graveness of the error in our estimate. Specifically, we’d like to create some loss function $L(\theta,\hat{\theta})$. We could then determine how good an estimator is by calculating the resulting loss: a better estimator would have less loss, so the smaller the value of $L(\theta,\hat{\theta} )$ the better.

Now, there are some situations where our choice of loss function is obvious. An example would be if we’re selling a certain good, and we’d like to know how many of them to order in. We are then estimating the number of orders we’ll get before the next opportunity to restock. The loss function is then either proportional to the number of unfulfilled orders, if we understock, and the cost of storing the surplus, if we overstock.

In the more abstract case, where we’re estimating a parameter we will never observe, the choice of loss function isn’t as obvious. We’re not exactly charged money for making an inaccurate model. Instead, I’m going to suggest some properties we might want for the loss function, and then give a few examples.

If our estimate is exactly correct, obviously we wouldn’t regret it at all. In other words,

$L(\theta,\theta) =0 .$

Next, we’ll make some statements about symmetry, i.e. that we only care about the distance between the estimator and the true value, and not about the direction.

Say the empty circle in the middle of this number line is the true value. I propose that one property we’d like for our loss function is that the loss of the estimators at the two filled circles is the same, and that the loss of the estimators at the two empty squares is the same.

This is not a required property, and may not be desirable, depending on the problem. For instance, in the goods restocking example I mentioned above, the penalty for underestimating is often not the same as overestimating. One loses business, one just requires paying for longer storage for the surplus. Still, for the purposes of estimating some abstract model parameter on an arbitrary scale, I’d say assuming symmetry of loss is a reasonable property to assume.

I’d also say we’d like to depend on the distance, but not on the values, so the loss is some function of $\theta-\hat{\theta}$. Think of the loss function like a generalised voltmeter: it can measure the difference between a pair of points, but a single point has no meaning.

How about if we make two different estimates, and one is further from the truth? We’d want to penalise it at least as much as the other. In other words, if we have two estimates $\hat{\theta}_1$ and $\hat{\theta}_2$, and the distance $|\theta-\hat{\theta}_1|$ of the true value from the first estimator is smaller than the distance $|\theta-\hat{\theta}_2|$ from the second estimator, we’d like

$L(\theta,\hat{\theta} _1) \leq L(\theta,\hat{\theta} _2) .$

Of course, in practice we don’t know what $\theta$ is, so we try to minimise our expected loss $\mathbb{E} (L(\theta,\hat{\theta} ) )$. Usually we’d be minimising this expected loss based on some observations, but I’m keeping that out of the notation here for simplicity. Just assume the distribution we have on the parameter uses all our usable knowledge.

These properties leave a lot of options. Here are some of the more common ones.

0-1 Loss
Here the loss is simply equal to 1 if the estimator is different from the truth, and 0 if it’s not. This is pretty hard-line as loss functions go, because it considers being wrong to be so heinous that it makes no differentiation between different amounts of wrongness. Our expected loss is then simply the probability $\mathbb{P} (\theta\neq\hat{\theta} )$ of being wrong. Our optimal choice of estimator is then simply the most likely value of $\theta$. In other words, the mode is the optimal estimator for 0-1 loss.

There is also a similar case where the loss is 0 in a small region around the truth, and 1 outside it. The optimal estimator is determined by finding the point with the most chance of the truth being nearby, i.e. the middle of a highest-density region.

Absolute Difference
Here we take the loss function

$L(\hat{\theta} ,\theta) =|\theta-\hat{\theta} | .$

The seriousness of an error is thus proportional to the size of the error. In this case, the optimal estimator is the median.

In the case of there being several parameters, the median is also the optimal estimator when the expected loss is the expected Manhattan distance from the truth, i.e. the sum of the absolute differences for each parameter.

This is the most common loss function. For true value $\theta$ and estimate $\hat{\theta}$, the loss is

$L(\hat{\theta} ,\theta) =(\theta-\hat{\theta} )^2 .$

Large errors are considered far more serious here than in the case of absolute difference. This may, or may not, be a good idea. More on that in a minute. The expected loss, also called the mean square error, can be expanded as

$\mathbb{E} (L(\hat{\theta} ,\theta) )=\mathbb{E} (\theta^2-2\hat{\theta} \theta+\hat{\theta}^2 ) =\mathbb{E} (\theta^2) -2\hat{\theta} \mathbb{E} (\theta) +\hat{\theta}^2 .$

We want to choose our estimator $\hat{\theta}$ to minimise this expected loss. This is easily achieved by $\hat{\theta} =\mathbb{E} (\theta) .$ In other words, the (arithmetic) mean is the optimal estimator for quadratic loss.

In the case of several parameters, the mean is also the optimal estimator when the expected loss is the expected Euclidean distance from the truth.

This is the loss function I’ll be using from hereon. A few more comments before I finish.

Note that we have the “Big Three” of averages as the optimal estimators for the loss functions given above. The mode isn’t used that much, but absolute and quadratic loss can be useful for intuition about the difference between the mean and the median. Specifically, the median is less influenced by outliers than the mean. That can be important, because you might not want the outliers to count for much, especially if they’re suspected to be due to some observational error. This answer on Cross Validated addresses a good example.

We should also consider what we’re doing by choosing a loss function.

The obvious issue is that we’re making point estimates of a parameter, rather than making distributions or making predictions about future observables. I’ve briefly mentioned this before.

The other issue is that choosing a loss function can be subjective, to put it mildly. I suspect the main reason that the quadratic loss is the most common loss function is simply because means are easier to calculate, and it has nice properties in general. The same thing goes for how we decide what the optimal estimator is. I was describing the optimality of loss functions in terms of minimising the expected loss, i.e. the mean loss. But if we think absolute error is the better loss function, why would we would to think in terms of mean loss in the first place, rather than median loss? There is theory out there that considers the error of point estimates in terms of medians, but I have no experience with it whatsoever. Perhaps another time, this post is long enough already.

For now I’ll follow the idea that the mean is good enough in general. It’s easy, everyone knows how to calculate it, and quadratic loss has nice properties. Next post will look one of them, the variance-bias decomposition. It will also look at what happens when we can’t directly use the mean as our estimator, as is the case in Monte Carlo methods like ABC.