Approximate Bayesian Computation: Summary Statistics and Tolerance

Last time I introduced a basic example of the Approximate Bayesian Computation methodology for estimating posterior expectations in the case of unavailable likelihoods. One of the main issues I mentioned was that it could take a long time to accept proposals. This is caused by three factors.

1. The algorithm is being overly picky about its criteria for accepting proposals. In our Lucky Dip example, we saw that proposals were accepted when the generated play record for the players exactly matched the observed record. All we require is that the frequency of each possible outcome is the same between the two records. In other words, the algorithm worries about the order in which the outcomes occurred when it doesn’t need to.

2. The outcome, even when stripping out the extraneous information mentioned in the point above, is so unlikely that generating a matching dataset will still take a long time. This is especially true if any element of the data is continuous, because the probability of generating an exactly matching dataset will be zero.

3. It takes a long time to generate datasets due to the complexity of the model. There is little we can do about this, outside of making sure our model is no more complicated than is necessary.

We thus need to consider stripping out superfluous information, and accepting proposals whose data is merely close to the observed data, to mitigate factors 1 and 2 respectively.

1. Summary Statistics

If we flip a coin several times to decide whether it’s fair, we obviously don’t care about the order in which the heads and tails appear. We just care about how often they each turn up. We could also just look at the proportion of flips that come up heads, without recording how many flips we observed. These are both summary statistics: they take the original data and express it in a different, usually smaller, form. Formally, we take the data in the form of the random variable X, and calculate the summary statistic S=S(X), which is also a random variable.
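As a rough illustration in Python (the flips here are simulated purely for the example), the two summaries could be computed like so:

import numpy as np

rng = np.random.default_rng(0)
flips = rng.integers(0, 2, size=100)   # 100 coin flips: 1 = heads, 0 = tails

# First summary: the counts of heads and tails (keeps the number of flips).
counts = (int(np.sum(flips == 1)), int(np.sum(flips == 0)))

# Second summary: just the proportion of heads (forgets the number of flips).
proportion = float(flips.mean())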

Some summaries are better than others. A higher number of flips increases the accuracy of the result, but our second summary above doesn’t record how many flips there were. In other words, only recording the proportion of flips that come up heads loses relevant information, so we should use a summary such as the first one, which does not.

Ideal summary statistics, which don’t lose relevant information, are referred to as sufficient statistics. Formally, the posterior of the parameters given a sufficient statistic is equal to the posterior given the original data, for any possible values of the data and statistic:

p(\theta | S=s^*) = p(\theta | X=x^*) \textrm{ for all } \theta, x^* \textrm{ with } p(\theta,x^*) > 0, \textrm{ where } s^*=S(x^*).

The original data, of course, is itself a sufficient statistic. The best-case choice would be what is called a minimal sufficient statistic, which is a sufficient statistic with the smallest possible number of dimensions.

Minimal sufficient statistics are desirable, because, as the number of dimensions increases, the chance of hitting any region inside the space will generally decrease. This is known as the curse of dimensionality, and is why we often want to decrease the dimensions as much as possible.

Here’s a lazy analogy. Consider a square that contains a circle touching its sides. Now say we choose a random point inside the square, with each point equally likely. The probability of the point being inside the circle is equal to the proportion of the square’s area that the circle takes up, which is \pi/4.

Now add another dimension, so that we are choosing a point inside a cube, and seeing if it’s inside a sphere. That sphere now takes up less of the available space: the probability of choosing a point inside it is \pi/6. The probability decreases as the number of dimensions increases: for q dimensions the probability is \frac{\pi^{q/2} } {2^q\Gamma(q/2+1) } . Obviously, we’re seldom trying to hit a region so large compared to the entire space, but hopefully you get the idea that more dimensions means less chance of success.
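Here’s a small Monte Carlo sketch of that calculation, assuming a unit hypercube with an inscribed ball of radius one half; the estimated hit probability shrinks rapidly as the dimension q grows:

import numpy as np

rng = np.random.default_rng(0)

def hit_probability(q, samples=100_000):
    # Estimate the chance that a uniform point in the unit q-cube
    # lands inside the inscribed ball of radius 1/2.
    points = rng.uniform(-0.5, 0.5, size=(samples, q))
    return float(np.mean(np.linalg.norm(points, axis=1) <= 0.5))

for q in (2, 3, 5, 10):
    print(q, hit_probability(q))   # roughly pi^(q/2) / (2^q * Gamma(q/2 + 1))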

This leads us to the following ABC algorithm.

ABC2
1. Decide on the acceptance number n.
2. Calculate the observed statistic s^*=S(x^*) .
3. Sample a proposal \hat{\theta} from the prior p(\theta|M) .
4. Generate a dataset \hat{x} from the likelihood p(X|M,\hat{\theta} ) .
5. Calculate the statistic \hat{s} =S(\hat{x} ) for the generated dataset.
6. Accept \hat{\theta} if \hat{s} =s^*, else reject.
7. Repeat steps 3-6 until n proposals have been accepted.
8. Estimate the posterior expectation of \theta as \mathbb{E} (\theta|M,X=x^*) \simeq\frac{1} {n} \sum_{k=1}^n \hat{\theta}_k , the mean of the accepted proposals.
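As a minimal sketch of ABC2 in Python (the helpers prior_sample, simulate, and summary are placeholders standing in for the model-specific pieces, not functions defined anywhere in this post):

import numpy as np

def abc2(n, x_obs, prior_sample, simulate, summary):
    # Rejection ABC with an exact match on a summary statistic.
    # prior_sample() draws a proposal from the prior,
    # simulate(theta) draws a dataset from the likelihood,
    # summary(x) computes the statistic S(x).
    s_obs = np.asarray(summary(x_obs))        # observed statistic, computed once
    accepted = []
    while len(accepted) < n:
        theta = prior_sample()                # propose from the prior
        s_sim = np.asarray(summary(simulate(theta)))
        if np.array_equal(s_sim, s_obs):      # accept only on an exact match
            accepted.append(theta)
    return np.mean(accepted)                  # ABC estimate of the posterior expectation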

Since the proposals are only accepted when the generated statistic is equal to the observed statistic, the accepted proposals are taken exactly from the distribution p(\theta|S=s^*) . This is equal to the posterior distribution p(\theta|X=x^*) if the statistic is sufficient.

Let us consider the Lucky Dip example again. The observed player record is (3,0,0,2,3) , and we can show that counting the frequency of each outcome – in this case (2,0,1,2) – is sufficient. In fact, we can show it’s minimal sufficient. More on this later. For a tank of f fish, the ABC method for approximating the number r of red fish is the following.

ABC2FISH
1. Decide on the acceptance number n.
2. Sample a proposal \hat{r} \in \{0,1,2,\ldots,f\} , with each possibility equally likely.
3. Generate a play record of five players, where each player starts with \hat{r} red fish and f-\hat{r} blue fish in the tank.
4. Count the number of losers, and the number of players that win on the first, second, and third draws, to calculate the generated statistic.
5. Accept \hat{r} if the generated statistic matches the observed statistic (2,0,1,2) , else reject.
6. Repeat steps 2-5 until n proposals have been accepted.
7. Estimate the posterior expectation of r as \mathbb{E} (r|M,X=(3,0,0,2,3)) \simeq\frac{1} {n} \sum_{k=1}^n \hat{r}_k , the mean of the accepted proposals.
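Here’s one possible implementation sketch, assuming Lucky Dip rules along the lines of the earlier post: each player draws up to three fish from the tank without replacement, wins on the draw at which the first red fish appears (recorded as 1, 2 or 3), and loses otherwise (recorded as 0). Those rules, and the tank size f, are assumptions for illustration rather than a definitive implementation.

import numpy as np

rng = np.random.default_rng(0)

def play(r, f):
    # One player's game under the assumed rules: draw up to three fish
    # without replacement; return the draw of the first red, or 0 for a loss.
    tank = np.array([1] * r + [0] * (f - r))     # 1 = red, 0 = blue
    draws = rng.permutation(tank)[:3]
    reds = np.flatnonzero(draws == 1)
    return int(reds[0]) + 1 if reds.size else 0

def statistic(record):
    # Count losers and wins on the first, second and third draws.
    return np.array([sum(1 for x in record if x == k) for k in range(4)])

def abc2fish(n, f, s_obs=(2, 0, 1, 2), players=5):
    accepted = []
    while len(accepted) < n:
        r_hat = int(rng.integers(0, f + 1))                  # uniform prior on r
        record = [play(r_hat, f) for _ in range(players)]    # simulate five players
        if np.array_equal(statistic(record), s_obs):         # exact statistic match
            accepted.append(r_hat)
    return np.mean(accepted)                                 # estimate of the posterior mean of r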

This is like the ABC1FISH algorithm we used before for the same problem, but uses a summary statistic with four dimensions instead of the five-dimensional data. That isn’t a great reduction, but consider that, if the play record was longer, the statistic wouldn’t increase in size. For large amounts of observed data, this statistic will thus effect a large reduction in dimensionality.

Again, I took a hundred ABC estimates for each value of n. The results are given below.

[Figure: abc-luckydip2 – boxplots of the ABC2FISH estimates for each value of n]

Some comments:
1. Stripping away irrelevant information should give similar results in less time. Comparing the above boxplot with the one for ABC1FISH will show the results to be roughly the same. Since we haven’t lost any relevant information, this is to be expected. However, where ABC1FISH took a day or two to run, ABC2FISH took a few hours.

2. It is easy to find a sufficient statistic for such a simple problem. In fact, if the likelihood is known we can also find the minimal sufficient statistic, since this is just a matter of listing all the expressions in the likelihood in which the data elements appear.

For the example above, the likelihood of a single play, for a given parameter value, depends only on that play’s outcome. The plays are independent, so the total likelihood is the product of the single-play likelihoods, which equals the product of the likelihoods associated with each outcome, each raised to the power of the number of plays with that outcome:

\mathbb{P}(x_0 \textrm{ losers, } x_1 \textrm{ wins on first draw, etc.} | \theta) = \prod_{i=0}^3 \mathbb{P}(\textrm{Result } i | \theta)^{x_i} .
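As a quick numerical check of this factorisation (the per-outcome probabilities below are arbitrary placeholders, not the actual Lucky Dip probabilities):

import numpy as np

p_outcome = np.array([0.4, 0.3, 0.2, 0.1])    # placeholder values of P(Result i | theta)
record = np.array([3, 0, 0, 2, 3])            # the observed play record
counts = np.array([np.sum(record == i) for i in range(4)])   # (2, 0, 1, 2)

play_by_play = np.prod(p_outcome[record])     # product over the individual plays
via_counts = np.prod(p_outcome ** counts)     # product over outcomes, raised to the counts
print(np.isclose(play_by_play, via_counts))   # True: the data enters only through the counts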

Without the likelihood, this is less simple, and often impossible. There’s a growing amount of research on the optimal choice of summary statistic, but the chosen statistic will usually not be sufficient, and the ABC estimate will converge to the wrong value as the acceptance number increases. However, the higher acceptance probability is considered worth it: a higher acceptance rate allows more accepted proposals – a larger n – for the same running time, which decreases the error up to a point.

3. If any relevant element of the data is continuous, this still isn’t good enough, because our chance of getting any value for the statistic is still zero. This requires other measures, such as the concept of tolerance introduced below. Given such measures, summary statistics are still useful for the same reasons, so they are worth explaining first.

2. Tolerance

If the data is continuous, the probability of getting an exact match between datasets is zero, so at some point you have to say “close enough”. One way to do this is to decide on some way to measure the distance between two datasets, and accept the proposal if this distance is less than a certain value.

For example, say we define the distance between two datasets as the Euclidean distance, i.e. calculate the difference between each pair of respective elements, and take the square root of the sum of their squares. Then we’d accept any simulated data that lies within a ball around the observed data, with a radius equal to the tolerance. In one dimension this would be a symmetric interval, in two a circle, in three a sphere, and so on. Perhaps this would be a better place to explain the curse of dimensionality in terms of balls in boxes, but no matter.

Introducing tolerance leads to the following algorithm.

ABC3
1. Decide on the acceptance number n, the distance metric \|\cdot\| , and the tolerance \delta.
2. Calculate the observed statistic s^*=S(x^*) .
3. Sample a proposal \hat{\theta} from the prior p(\theta|M) .
4. Generate a dataset \hat{x} from the likelihood p(X|M,\hat{\theta} ) .
5. Calculate the statistic \hat{s} =S(\hat{x} ) for the generated dataset.
6. Accept \hat{\theta} if \left\|\hat{s} -s^*\right\| \leq \delta, else reject.
7. Repeat steps 3-6 until n proposals have been accepted.
8. Estimate the posterior expectation of \theta as \mathbb{E} (\theta|M,X=x^*) \simeq\frac{1} {n} \sum_{k=1}^n \hat{\theta}_k , the mean of the accepted proposals.
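In code, only the acceptance test changes from the ABC2 sketch above: the Euclidean distance between the statistics is compared to the tolerance rather than demanding an exact match (the helper names are again placeholders):

import numpy as np

def abc3(n, x_obs, prior_sample, simulate, summary, delta):
    # Rejection ABC with tolerance delta on the summary statistic.
    s_obs = np.asarray(summary(x_obs))
    accepted = []
    while len(accepted) < n:
        theta = prior_sample()
        s_sim = np.asarray(summary(simulate(theta)))
        if np.linalg.norm(s_sim - s_obs) <= delta:   # accept if within tolerance
            accepted.append(theta)
    return np.mean(accepted)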

The accepted proposals are now taken from the distribution p(\theta|\|S-s^*\|\leq\delta) , and we hope this is close enough to the true posterior to not introduce too much error.

We consider the Lucky Dip example once again. We choose the Euclidean distance metric, and the ABC method for approximating the number r of red fish is now the following.

ABC3FISH
1. Decide on the acceptance number n and the tolerance \delta.
2. Sample a proposal \hat{r} \in \{0,1,2,\ldots,f\} , with each possibility equally likely.
3. Generate a play record \hat{x} of five players, where each player starts with \hat{r} red fish and f-\hat{r} blue fish in the tank.
4. Count the number of losers, and the number of players that win on the first, second, and third draws, to calculate the generated statistic \hat{s} .
5. Accept \hat{r} if \|\hat{s} -(2,0,1,2) \|\leq\delta, else reject.
6. Repeat steps 2-5 until n proposals have been accepted.
7. Estimate the posterior expectation of r as \mathbb{E} (r|M,X=(3,0,0,2,3)) \simeq\frac{1} {n} \sum_{k=1}^n \hat{r}_k , the mean of the accepted proposals.
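Reusing the play and statistic helpers (and rng) from the ABC2FISH sketch earlier, a sketch of this version again only changes the acceptance test:

import numpy as np

def abc3fish(n, f, delta, s_obs=(2, 0, 1, 2), players=5):
    # Assumes play, statistic and rng as defined in the ABC2FISH sketch above.
    s_obs = np.asarray(s_obs)
    accepted = []
    while len(accepted) < n:
        r_hat = int(rng.integers(0, f + 1))                      # uniform prior on r
        record = [play(r_hat, f) for _ in range(players)]
        if np.linalg.norm(statistic(record) - s_obs) <= delta:   # within tolerance
            accepted.append(r_hat)
    return np.mean(accepted)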

Here are some results, with the tolerance set to one. This means a proposal is accepted if the generated statistic equals the observed statistic, or exactly one of its elements is off by one. The acceptance region is technically a ball, but since the statistics sit on a discrete uniform grid, a tolerance of one results in the acceptable statistics forming the shape of a cross.

[Figure: ABC3FISH results, tolerance one]

Here’s another with the tolerance set to 12, which I’ll call ABC3aFISH.

[Figure: ABC3aFISH results, tolerance twelve]

Some comments:
1. Since the data in the example is discrete, and the space of possible statistics is small, using a non-zero tolerance is probably not justified. Still, it’s good enough to illustrate the idea.

2. ABC3FISH took an hour and three quarters, not much less than the zero-tolerance ABC2FISH algorithm. ABC3aFISH, on the other hand, took a minute and a half.

3. While ABC3FISH converges to roughly the same answer, ABC3aFISH converges to the prior estimate of 50\%. This is because the tolerance is so high that every proposal is accepted. Since the condition \|S-s^*\|\leq\delta is then always satisfied, ABC3aFISH is effectively sampling from the prior. Indeed, the leftmost box shows individual proposals to be roughly evenly distributed across the entire range, as expected from our flat prior.

On the other hand, ABC3FISH and ABC2FISH are slower, but their sampling is closer to the posterior. Setting the tolerance is thus a balance between the number and the accuracy of accepted proposals, or between the known prior and the unknown posterior.

3. Choices of Tolerance

Now that we can reduce the computation time if need be, let’s think about how to keep our estimates accurate while doing so. Specifically, what value should we choose for the tolerance?

I mentioned earlier in the post that a higher acceptance rate makes up for non-sufficient statistics, but only up to a certain point. The same is true for non-zero tolerances. As the number of proposals increases, the estimate will tend towards an incorrect posterior expectation. Decreasing the tolerance would result in the estimate tending towards a more accurate expectation, but the lower acceptance rate would make the convergence slower.

Say we plot the expected error against the number of proposals – that is, against computation time – for both choices of tolerance. Each curve would decrease as computation time increases, but flatten out at a certain level. The lower tolerance’s curve would fall more slowly at first, but would eventually overtake the other curve as the latter flattens out, and would itself flatten out at a lower level.

The lower tolerance will thus give a better estimate once the number of proposals is high enough. How large is this critical number? Who knows? Outside of simple examples – where the real answer is known, and so is the error of estimates – this is infeasible to find. Still, we can at least say that the optimal choice of tolerance decreases as the number of proposals increases.

What if we want something more specific? How about how quickly the optimal tolerance drops, or how quickly the error drops? It depends on how the error is defined, which may vary between individual problems. I’ll introduce one definition next time. The introductory overview’s done, unless I write about more complex variants of ABC some time. Things become more specific, and mathematical, from here.
