Introduction to Approximate Bayesian Computation

A paper called The Rate of Convergence of Approximate Bayesian Computation, of which I’m a co-author, went online recently. As such, I’m going to break from beginner-friendly posts for a while, and outline the subject that the paper’s about. My goal is to work up to talking about the results proved in the paper.

We’re going to assume we want to make inference about some process, using some observations x^*. An example of observations would be our observing the outcome for previous players in a game of Pólya’s Lucky Dip. To understand the process, we have at least one model M of how the process works, which takes some model parameters \theta and generates some observations X.

We obviously need to work out whether the model is correct, but we’d also like to work out what the model parameters are if it is correct. This would then allow us to have a more complete model of how the process works. More importantly, knowledge of the parameters would allow us to make predictions about data observed in the future. We can thus forecast future results, and also use results we observe later to decide whether the model is adequately describing reality. Compare this to just predicting the values of the model parameters, which aren’t observable, and usually don’t exist in reality.

Let’s discuss the inference of the model parameters first. Under the Bayesian framework, we say that our belief in the values of the parameters, before making observations, has the prior probability density p(\theta|M) . We then want to update our belief after making observations, giving the posterior density p(\theta|M,X=x^*) . The usual way to find this is to use Bayes’s Theorem:

p(\theta|M,X=x^*) =p(\theta|M) \frac{\mathbb{P} (X=x^*|M,\theta) } {\mathbb{P} (X=x^*|M) } .

One possible problem is that the scaling factor \mathbb{P} (X=x^*|M) can’t be calculated exactly. There are various ways to deal with this, such as doing numerical integration on \int p(\theta,X=x^*|M) . I won’t discuss these any further. Another, more complicated problem is when we cannot calculate the likelihood \mathbb{P} (X=x^*|M,\theta) – a measure of how likely the observed data was to occur in comparison to different possible data – or when doing so is computationally infeasible.

This occurs rather often, because, if we wish to make inferences about a process that isn’t easily simplified, the way in which data is generated by the model can be exceedingly complicated. Typical problems we’d be looking at in this context are: trying to work out when different species of animal diverged from each other by examining fossilised DNA samples; trying to work out how modern humans spread out of Africa; analysing how the SARS virus spread in Hong Kong. Messy problems, complicated problems, problems that can’t be reduced to a model with simple likelihoods without losing important properties of how the process works.

So, what can we do instead? We can try some sort of Monte Carlo method. Approximate Bayesian Computation, or ABC, is one of the more naïve methods, because it doesn’t assume anything outside of what I mentioned above: the model M, the prior p(\theta|M) , and a method to generate data from model parameters, which is described by the incalculable likelihood p(X|M,\theta) for any data value X.

There are about as many variants of ABC as there are people using it, if not more. I’ll give a few basic versions as we go. First, I’ll outline the basic philosophy: we generate samples of the model parameters \theta from the prior distribution, then put these into the model to generate a sample dataset. We then compare this sample dataset to the observed dataset to decide how to use the proposed parameter values. We do this for a bunch of generated parameter proposals. Informed by the data comparisons, we then use these proposals to make an estimate of whatever we want to know about the model parameters.

There’s a lot of leeway in that description, so here’s a very basic example:

1. Decide on the acceptance number n.
2. Sample a proposal \hat{\theta} from the prior p(\theta|M) .
3. Generate a dataset \hat{X} from the likelihood p(X|M,\hat{\theta} ) .
4. Accept \hat{\theta} if the \hat{X} =x^*, else reject.
5. Repeat steps 2-4 until n proposals have been accepted.
6. Estimate the posterior expectation of \theta as \mathbb{E} (\theta|M,X=x^*) \simeq\frac{1} {n} \sum_{k=1}^n \hat{\theta}_k , the mean of the accepted proposals.

Since the proposals are only accepted when the generated data is equal to the observed data, the accepted proposals are taken exactly from the posterior distribution p(\theta|X=x^*) .

For example, say we’re again trying to get red fish out of a tank of red and blue fish. There are f fish, and our prior on the number of red fish in the tank is the flat prior p(r)=1/(n+1) for r\in\{0,1,2,\ldots,n\} . Each player has three attempts to draw out a red fish, before he replaces his fish and the next player has a go. We observe several players, with a play record such as, say,

(3,1,2,1,0,1,\ldots) ,

where 0 indicates a player losing, and a number in \{1,2,3\} indicates a player winning on that draw. Our model M will be the assumption that each remaining fish in the tank has an equal chance of being drawn each time.

Let’s say our play record is (3,0,0,2,3) . Then our simple ABC method for approximating the number r of red fish becomes the following.

1. Decide on the acceptance number n.
2. Sample a proposal \hat{r} \in \{0,1,2,\ldots,f\} , with each possibility equally likely.
3. Generate a play record of five players, where each player starts with \hat{r} red fish and f-\hat{r} blue fish in the tank.
4. Accept \hat{r} if the generated play record matches the observed record (3,0,0,2,3) , else reject.
5. Repeat steps 2-4 until n proposals have been accepted.
6. Estimate the posterior expectation of r as \mathbb{E} (r|M,X=(3,0,0,2,3)) \simeq\frac{1} {n} \sum_{k=1}^n \hat{r}_k , the mean of the accepted proposals.

I took a hundred such estimates for several values of n. The results are given as boxplots below. As expected, the boxplots appear to be sized proportionally to \sqrt{n} .


Some comments:
1. It can take a long time to accept any proposals if the play record is large, or is a record we would consider an extremely rare occurrence under the prior. For example, even for such a seemingly simple problem as this one, the above picture took a day or two to produce, so one estimate with n=1000 would have taken a few hours.

This is partly because the algorithm is being overly picky about whether to accept a proposal. It wants the record to be exactly the same, with the same order, when all we want is for the frequency of each player outcome to be the same. In this case, that means two losers, no winning first draws, one winning second draw, and two winning third draws. This can be mitigated by introducing summary statistics, which I’ll talk about next time.

2. Our estimate is in the form of a point estimate rather than an interval. That means it doesn’t give us any idea of how uncertain we are about the resulting estimate. This can be mitigated by using the accepted proposals to make an estimate of the posterior density rather than the expectation. Obviously we shouldn’t just assume there’s no uncertainty at all.

3. Both of the above can lead to making an estimate from a small number of accepted proposals. This can lead to some stupid reasoning if we assume there’s no uncertainty in the estimate. For example, say our record is (1,1,1,1,1,1) , i.e. we had six players, and they all happened to win immediately. We’d expect the resulting estimate to tend towards higher number of fish. However, if our only accepted proposal happens to be \hat{r} =1, our resulting estimate would be guessing the number of red fish to be as small as practically possible. A result this heinous won’t happen too much, and the chance of it happening will go down as we accept more proposals, but with a point estimate we have no idea how likely events like this are for a particular acceptance number.

4. Since the amount of proposals generated to accept n of them is random, the computation time of the ABC method above is random. This is often inconvenient for practical purposes, since we might be on time constraints for coming up with an estimate. A common alteration, is to set the number N of generated proposals instead, in which case n is random. I won’t talk about this much, partly to avoid discussing what happens when no proposals are accepted.

5. I’ve said nothing at all about making forecasts about future observables, which is unsatisfactory for the reasons I gave in the third paragraph. This doesn’t really get done with ABC, and it’s not clear whether there’s an efficient way to do so, since the usual way to make forecasts requires knowledge of the likelihood. You can’t use all the records generated, since they’re distributed based on the prior, and you can’t use the ones associated with the accepted proposals, since they’re all equal to our observed record. That would lead us to conclude that future play records are certain to be the same as the one we observed, which is absurd. My first guess would be to take each of the accepted proposals, and generate a set number of new records for each one, but this is infeasible for the more complex problems ABC is usually used for.

To sum up, ABC is simple to use, because it doesn’t use any assumptions outside of the prior and how to generate data from parameters under the model assumptions. However, we pay for the naïvety of this method by getting a point estimate with no sense of uncertainty, and, more importantly, with its taking a long time to calculate anything. The latter is particularly heinous – if the simple problem above would take a few hours to get a decently accurate estimate, imagine how long it takes to estimate how modern humans spread out of Africa. I’ll talk about ways of cutting the computation time in the next post.

4 thoughts on “Introduction to Approximate Bayesian Computation

      • But if they HAVE to be the same thing then wouldn’t it just equal out to zero anyways?

      • You mean the difference between the datasets? If the generated data has to be equal to the observed data to accept a parameter proposal, then yes, the difference between generated data for accepted proposals and the observed data will be zero.

        But we’re not interested in the generated dataset, we’re interested in the proposed parameter values that are associated with it, and those will vary.

        Maybe a simpler example would help? Say our parameter is the chance a certain coin comes up heads when flipped. Our observed data is that we flipped it once, and it came up heads. Then ABC1 would randomly choose a probability of heads, and use that to generate a dataset that is either “heads” or “tails”.

        All the accepted sets would have a dataset equal to “heads”. The accepted generated probabilities, which are what we’re interested in, would be on a range in [0,1], but would be more likely to take values closer to 1.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s