Overview
This article is the second of the Bayesian series. Previous episode here.
You probably already know Bayes' theorem, or have stumbled upon Naive Bayes when comparing machine learning models. Maybe, like me, you feel that you have barely seen the shores of the continent that is Bayesian land, and want to set foot on it.
I wrote this article from a Stanford course, with the intent of letting you understand the relationships between all the Bayesian concepts without having to go through an entire classroom (like I did). For readability, I hide most of the maths in dropdowns. This way, you can follow the logic easily and come back later to deep dive into the formulas.
So far we have dealt with Bayesian analysis of very simple models, mainly using conjugate priors, with few parameters to estimate. In the coin tossing example, we had just one parameter, the probability of heads. In the Gaussian model, we had two parameters, the mean μ and the variance σ². However, in modern applications one typically must estimate many parameters, using more complicated hierarchical models.
1. Empirical Bayes
1.1 Estimation of the normal mean
Suppose that we have N observations, each from an independent Gaussian with the same variance 1 but a different mean:

y_i ∼ N(μ_i, 1), i = 1, …, N

This is a high-dimensional model: there are as many parameters as there are data points. Another way to write it, with N-vectors:

y ∼ N(μ, I_N), with y, μ ∈ ℝ^N
Suppose we want to estimate the vector μ. The natural “frequentist” thing to do is just to use the MLE, which in this case is simply

μ̂ = y
On the other hand, if we performed a Bayesian analysis, it would be natural to choose the prior

μ_i ∼ N(0, τ²), independently for i = 1, …, N
We compute the difference of risk between the frequentist MLE and the Bayes estimator
Frequentist
The risk of the MLE (μ̂ = y) is the expectation of its squared error loss:

R(μ, μ̂) = E‖y − μ‖² = N

since the y_i − μ_i are all independent normals with mean 0 and variance 1.
Bayesian:
The Bayes estimator of each μ_i is the posterior mean

E[μ_i | y_i] = (τ²/(1 + τ²)) y_i

This is given by the following theorem, taking m = 0 and σ² = 1.

If μ ∼ N(m, τ²) and y | μ ∼ N(μ, σ²), then μ | y ∼ N(m̄, τ̄²)

where

m̄ = (τ² y + σ² m)/(τ² + σ²) and τ̄² = τ² σ²/(τ² + σ²)
So the posterior mean shrinks the MLE toward zero. Let's compute the frequentist risk of this estimator, which is needed to obtain the Bayes risk.
So the Bayes risk of this estimator is

N τ²/(1 + τ²)

So the difference between the risk of the MLE and the Bayes risk of the Bayes estimator is

N − N τ²/(1 + τ²) = N/(1 + τ²)

which can be very large if τ is small.
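A quick numerical check of these two risk formulas is straightforward; the sketch below simulates the model (the choices N = 50, τ = 0.5 and the replicate count are arbitrary):

```python
import numpy as np

# Numerical check of the two risk formulas for the normal means model:
# MLE risk = N, and Bayes risk of the Bayes estimator = N * tau^2 / (1 + tau^2).
rng = np.random.default_rng(0)
N, tau, reps = 50, 0.5, 20_000

mu = rng.normal(0.0, tau, size=(reps, N))   # mu_i ~ N(0, tau^2)
y = mu + rng.normal(size=(reps, N))         # y_i | mu_i ~ N(mu_i, 1)

mle_risk = np.mean(np.sum((y - mu) ** 2, axis=1))
shrink = tau**2 / (1 + tau**2)              # Bayes estimator: shrink * y
bayes_risk = np.mean(np.sum((shrink * y - mu) ** 2, axis=1))
```

With these values, `mle_risk` should come out close to N = 50 and `bayes_risk` close to N τ²/(1 + τ²) = 10, illustrating the N/(1 + τ²) gap.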
1.2 Admissibility and Empirical Bayes
Obviously we want estimators that have low risk. For any loss function L, we know the frequentist risk:

R(μ, μ̂) = E_μ[L(μ, μ̂(y))]

Frequentists deal with the fact that μ is unknown; the risk still depends on μ.

Bayesians deal with this issue by integrating a second time over μ, with respect to the prior distribution:

r(ξ) = ∫ R(μ, μ̂) p(μ | ξ) dμ

where ξ are the prior hyperparameters. Frequentists, however, don't use priors, so instead we can use admissibility and minimaxity.
An estimator μ̂ is inadmissible if there exists another estimator μ̂′ for which

R(μ, μ̂′) ≤ R(μ, μ̂) for every μ, with strict inequality for at least one μ

So an estimator is admissible if it is not inadmissible. We say μ̂′ dominates μ̂ if the inequality above holds for every μ.
1.3 The James-Stein estimator
If we don't know how to choose the parameter τ of the prior, we can try to estimate it from the data (empirical Bayes), which leads to the James-Stein estimator.

We prove that the JS estimator has a larger Bayes risk than the Bayes estimator (indeed it must), but that it dominates the MLE everywhere if the dimension is large enough (N ≥ 3).
So we have

y_i | μ_i ∼ N(μ_i, 1) and μ_i ∼ N(0, τ²)

and marginally

y_i ∼ N(0, 1 + τ²)

Thus

‖y‖²/(1 + τ²) ∼ χ², where the χ² has N degrees of freedom, and

E[(N − 2)/‖y‖²] = 1/(1 + τ²)
With our JS estimator defined as

μ̂^JS = (1 − (N − 2)/‖y‖²) y

the integrated (Bayes) risk of our JS estimator is:

N τ²/(1 + τ²) + 2/(1 + τ²)

This is larger than the Bayes risk found in 1.1, but only by 2/(1 + τ²). Thus the JS estimator is very similar to the Bayes estimator when N is large.
The proof also shows that

R(μ, μ̂^JS) < N for every μ, whenever N ≥ 3

So the JS estimator dominates the MLE if N ≥ 3.
See the full proof here.
So why would one still use the MLE in high dimensions? Because:

The JS estimator is biased, whereas the MLE is the minimum-variance unbiased estimator (Lehmann-Scheffé theorem). So JS reduces the variance but adds bias: good for point estimation, but challenging for frequentist interval estimation.

The MLE is admissible in 1 and 2 dimensions, and even in higher dimensions, if we focus on a single “outlier” entry, the MLE has a better MSE (see why below).
In high dimension, suppose we generate y ∼ N(μ, I_N), where every mean is zero except one large “outlier” entry. For each simulation replicate, we estimate μ by the MLE and by JS. We find that JS has a much smaller total MSE, but that on the outlier coordinate alone the MLE does better.
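A minimal sketch of this simulation (the outlier value 10 and N = 50 are arbitrary choices):

```python
import numpy as np

# Compare MLE and James-Stein on a mean vector that is zero except one outlier.
rng = np.random.default_rng(1)
N, reps = 50, 5_000
mu = np.zeros(N)
mu[0] = 10.0                                     # single "outlier" entry

y = mu + rng.normal(size=(reps, N))              # y ~ N(mu, I_N), per replicate
mle = y                                          # MLE is just y
factor = 1 - (N - 2) / np.sum(y**2, axis=1, keepdims=True)
js = factor * y                                  # James-Stein shrinkage

mse_mle = np.mean(np.sum((mle - mu) ** 2, axis=1))
mse_js = np.mean(np.sum((js - mu) ** 2, axis=1))
mse_mle_outlier = np.mean((mle[:, 0] - mu[0]) ** 2)
mse_js_outlier = np.mean((js[:, 0] - mu[0]) ** 2)
```

Running this, the total MSE of JS is well below the MLE's (which is N), while on the outlier coordinate the shrinkage pulls the estimate toward zero and JS does worse than the MLE.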
1.4 Full Bayes inference
We could conclude at this point that one should always use empirical Bayes to choose a prior. However, when it comes to estimating a parameter, the Bayesian approach is to put a prior on it.
Thus, if our problem is

y_i | μ_i ∼ N(μ_i, 1), μ_i | τ ∼ N(0, τ²)

then our problem is now to estimate τ. Since we want to estimate τ from our data, we have to place an objective prior on τ.
So we compute the posterior of μ, and find out that its posterior expectation is challenging to evaluate, even though our initial model is simple.
We now have:
Let’s compute the posterior
I spare you the lengthy proof of this result, which would need an article in itself.
We need some alternative way to do integration that doesn’t involve actually computing all of the integrals analytically.
2. Markov Chain Monte Carlo
We still have the same problem. Now we know that getting the posterior is hard if we can’t find the normalising constant through some known distribution.
Another choice is the half-Cauchy prior on τ (rather than an Inverse Gamma prior on the variance and a Normal prior on the mean). We show that computing the posterior becomes very ugly: the half-Cauchy prior is not a conjugate prior, so now we consider:
The posterior can be written
We try to find the normalising constant Z:
However, in high-dimensions we don’t really care about the whole posterior, but more about some functional of it, like the posterior expectations of a function f:
It turns out this integral can be approximated using Markov chain Monte Carlo (MCMC): we sample from the desired distribution by recording states of a Markov chain whose equilibrium distribution is the desired distribution.
Let’s see the difference with ordinary Monte Carlo.
Suppose we have a density g and we want to compute the quantity

E_g[f(θ)] = ∫ f(θ) g(θ) dθ

One idea is to take samples from g(θ), say

θ⁽¹⁾, …, θ⁽ⁿ⁾ ∼ g

and then use the empirical average (1/n) Σ f(θ⁽ᵗ⁾).
If we could sample Y ∼ G, it would be just ordinary Monte Carlo.
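A minimal ordinary Monte Carlo sketch, with the arbitrary choices g = N(0, 1) and f(θ) = θ², so that the true value E_g[f] = 1 is known:

```python
import numpy as np

# Ordinary Monte Carlo: approximate E_g[f(theta)] by averaging f over
# i.i.d. samples from g. Here g = N(0, 1) and f(t) = t^2, so E_g[f] = 1.
rng = np.random.default_rng(2)

def f(t):
    return t**2

samples = rng.normal(size=100_000)   # theta^(i) ~ g, i.i.d.
estimate = np.mean(f(samples))
```

The estimate converges to 1 at the usual 1/√n Monte Carlo rate.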
A Markov chain is a sequence of random variables in which the next state depends only on the current state, ignoring the rest of the history:

P(θ⁽ᵗ⁺¹⁾ | θ⁽ᵗ⁾, …, θ⁽⁰⁾) = P(θ⁽ᵗ⁺¹⁾ | θ⁽ᵗ⁾)
We hop from one state to the next via a transition kernel K. Some kernels have an invariant distribution.
Let's see how this suggests Gibbs sampling as a way to construct K.
so our transition matrix K is
In high dimension, K is no longer a matrix but an operator acting on density functions. This way, the density of the next state

can be written

π_{t+1}(θ′) = ∫ K(θ, θ′) π_t(θ) dθ
How can we use Markov Chain to compute expectations with respect to the posterior?
Some kernels K have an invariant distribution π:

πK = π

If θ⁽ᵗ⁾ has density π, this means θ⁽ᵗ⁺¹⁾ also has density π.
If the invariant measure exists and is unique then:
If we start the chain at an arbitrary distribution ν, the distribution of θ⁽ᵗ⁾ will look more and more like the invariant distribution as t grows.
Thus if we could make the posterior our invariant distribution, we could sample from it using Markov chains:
if θ⁽¹⁾, θ⁽²⁾, … is a Markov chain whose kernel has invariant density equal to the posterior, then

(1/n) Σ f(θ⁽ᵗ⁾) → E[f(θ) | y]

That means the empirical averages converge to the expected value we are looking for.
2.1 Gibbs sampling
We want to construct K for arbitrary target distributions, and we need to do so knowing the posterior only up to its normalising constant Z.
In Gibbs sampling the idea is to break the problem of sampling from the high-dimensional joint distribution into a series of samples from low-dimensional conditional distributions.
We illustrate the algorithm here:

1. Find a K with the posterior as invariant distribution π.
2. Start somewhere: θ⁽⁰⁾.
3. Simulate forward a lot of steps.
4. Throw out the first B steps (burn-in: early steps still far from convergence).
5. Average the rest.
Here steps 1 to 3 look like this:
as this has the invariant distribution
Here is an application of Gibbs sampling to a Probit model:
The Probit model can be written with a latent Gaussian variable:

z_i = 1{y_i > 0}, y_i ∼ N(w_iᵀβ, 1)

so that P(z_i = 1 | β) = Φ(w_iᵀβ). Thus the conditional of each latent y_i given β and z_i is a truncated normal, and if β has a Normal prior, then β | y is again Normal.
Thus, we can use Gibbs to do computations with this model.
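A compact sketch of this data-augmentation Gibbs sampler (in the style of Albert and Chib) on simulated data; the prior β ∼ N(0, 100 I), the sample size, and the true coefficients are all arbitrary choices, and the naive rejection sampler for the truncated normal is only adequate for moderate means:

```python
import numpy as np

# Gibbs sampler for a probit model via latent-variable augmentation:
#   z_i = 1{y_i > 0},  y_i ~ N(w_i' beta, 1),  prior beta ~ N(0, 100 I).
rng = np.random.default_rng(3)

# Simulated data with a hypothetical true coefficient vector.
n, d = 200, 2
W = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 1.0])
z = (W @ beta_true + rng.normal(size=n) > 0).astype(float)

B_inv = np.eye(d) / 100.0              # prior precision
V = np.linalg.inv(W.T @ W + B_inv)     # posterior covariance of beta given y
L = np.linalg.cholesky(V)

def sample_truncated(mean, positive):
    # Naive rejection sampler for N(mean, 1) truncated to y > 0 or y <= 0.
    while True:
        y = mean + rng.normal()
        if (y > 0) == positive:
            return y

beta = np.zeros(d)
draws = []
for t in range(1_500):
    m = W @ beta                       # 1) latent y_i | beta, z_i: truncated normal
    y = np.array([sample_truncated(m[i], z[i] == 1) for i in range(n)])
    beta = V @ (W.T @ y) + L @ rng.normal(size=d)   # 2) beta | y: Normal
    if t >= 500:                       # discard burn-in
        draws.append(beta)

beta_hat = np.mean(draws, axis=0)
```

The posterior mean `beta_hat` should land near the true coefficients for a dataset of this size.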
A nice implementation can be found here:
2.2 Metropolis-Hastings algorithm
If one of the Gibbs conditionals is not conjugate, a Metropolis-Hastings step is used inside the Gibbs iteration.
Here is an implementation: MCMC sampling for dummies
This step keeps the transition kernel reversible with respect to the target, which guarantees that the Markov chain converges to the correct invariant measure.

The core idea is to compute an acceptance ratio: the probability of accepting the proposed move.

We illustrate the algorithm here with an example and a measure of MSE to assess the quality of convergence.
Suppose we want to sample from a density π = γ/Z, with normalising constant Z unknown.

Choose a proposal with density q(θ* | θ). For example, a random walk: θ* = θ⁽ᵗ⁾ + ε.

At state θ⁽ᵗ⁾, propose the new state θ*.

Compute the acceptance ratio:

α = min(1, [γ(θ*) q(θ⁽ᵗ⁾ | θ*)] / [γ(θ⁽ᵗ⁾) q(θ* | θ⁽ᵗ⁾)])

With probability α, set θ⁽ᵗ⁺¹⁾ = θ*; otherwise, with probability 1 − α, keep θ⁽ᵗ⁺¹⁾ = θ⁽ᵗ⁾.
If we take the model:
the only parameter is , so:
Let’s use MH. For take:
with s an extra parameter that makes convergence slower or faster (an acceptance ratio around 0.23 is ideal in high dimensions). The acceptance ratio is
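A minimal random-walk Metropolis-Hastings sketch; the target here is an unnormalised standard normal (an arbitrary stand-in, chosen so the output can be checked), and s is the proposal scale:

```python
import numpy as np

# Random-walk Metropolis-Hastings targeting gamma(t) = exp(-t^2 / 2)
# (unnormalised standard normal). The random-walk proposal N(theta, s^2)
# is symmetric, so the ratio reduces to gamma(star) / gamma(theta).
rng = np.random.default_rng(4)

def log_gamma(t):
    return -0.5 * t**2

s = 2.4                                 # proposal scale (tuning parameter)
theta, chain, accepted = 0.0, [], 0
for _ in range(50_000):
    star = theta + s * rng.normal()     # propose a new state
    # Accept with probability min(1, gamma(star)/gamma(theta)), on log scale.
    if np.log(rng.uniform()) < log_gamma(star) - log_gamma(theta):
        theta, accepted = star, accepted + 1
    chain.append(theta)

chain = np.array(chain[5_000:])         # discard burn-in
acc_rate = accepted / 50_000
```

The retained samples should have mean near 0 and variance near 1, matching the target; s can be tuned until the acceptance rate is in a reasonable range.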
We know that:
We assess the error with the MSE (which is a frequentist criterion):
If we sample independently it becomes:
If we start at ν = μ, then the bias would be 0 but not the covariance.
To have convergence of the MSE, we would like an exponential decay of the covariance
We can diagnose convergence via:
Asymptotic variance (or effective sample size)
Trace plotting
Comparison across multiple chains
Coupling
2.2.a Asymptotic variance
The effective sample size (ESS) lets us estimate the asymptotic variance σ² appearing in the Markov chain Central Limit Theorem.
We have the Markov chain CLT:

√n ( (1/n) Σ f(θ⁽ᵗ⁾) − E_π[f] ) → N(0, σ²)

where

σ² = Var_π(f(θ⁽⁰⁾)) + 2 Σ_{t≥1} Cov(f(θ⁽⁰⁾), f(θ⁽ᵗ⁾))

We want to estimate σ². For batches of size b there are K = n/b such batches, and

σ̂² = (b/(K − 1)) Σ_k (f̄_k − f̄)²

where f̄_k is the mean of f over the k-th batch and f̄ is the overall mean. So the ESS is:

ESS = n s²/σ̂²

where s² is the empirical variance of f along the chain.
In the special case of i.i.d. Monte Carlo, ESS = n; but because of correlation in the Markov chain, the ESS can be much smaller.
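The batch-means recipe above can be sketched as follows; the AR(1) chain is an arbitrary stand-in whose true asymptotic variance, (1 + ρ)/(1 − ρ) = 19 here, is known:

```python
import numpy as np

# Batch-means estimate of the asymptotic variance and the effective sample
# size (ESS), applied to an AR(1) chain with strong positive correlation.
rng = np.random.default_rng(5)
n, rho = 100_000, 0.9
x = np.empty(n)
x[0] = rng.normal()
for t in range(1, n):                  # AR(1) with unit marginal variance
    x[t] = rho * x[t - 1] + np.sqrt(1 - rho**2) * rng.normal()

b = int(np.sqrt(n))                    # batch size
K = n // b                             # number of batches
batch_means = x[: K * b].reshape(K, b).mean(axis=1)
sigma2_hat = b * np.var(batch_means, ddof=1)   # asymptotic variance estimate
ess = n * np.var(x, ddof=1) / sigma2_hat       # ESS = n * s^2 / sigma2_hat
```

For this chain the ESS comes out far below n (roughly n/19), quantifying the loss due to autocorrelation.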
2.2.b Plotting Autocorrelation
We plot the autocorrelation of the chain; ideally it decreases quickly to zero (the chain then looks like white noise).
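A small sketch of the empirical autocorrelation function, checked on white noise (where the ACF should be essentially zero at all lags ≥ 1):

```python
import numpy as np

# Empirical autocorrelation of a chain at lags 0..max_lag; for a well-mixing
# chain it should drop to zero quickly.
def acf(x, max_lag):
    x = np.asarray(x) - np.mean(x)
    denom = x @ x                       # n * empirical variance
    return np.array([1.0] + [(x[:-k] @ x[k:]) / denom
                             for k in range(1, max_lag + 1)])

rng = np.random.default_rng(6)
white = rng.normal(size=20_000)         # i.i.d. chain: the ideal case
rho_white = acf(white, 5)
```

To visualise it, one can pass `rho_white` to something like `matplotlib.pyplot.stem`; a slowly decaying ACF signals a poorly mixing chain.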
Let's see how to use Gibbs sampling to sample from a bivariate normal via its conditional means and variances.
Suppose we are interested in sampling from a bivariate normal with correlation ρ:

(θ₁, θ₂) ∼ N(0, Σ), Σ = [[1, ρ], [ρ, 1]]

Let's use Gibbs, since sampling from the conditionals is straightforward: the conditional means and variances are given by

θ₁ | θ₂ ∼ N(ρ θ₂, 1 − ρ²) and θ₂ | θ₁ ∼ N(ρ θ₁, 1 − ρ²)
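A minimal Gibbs sampler for this bivariate normal (ρ = 0.8 is an arbitrary choice), alternating draws from the two conditionals:

```python
import numpy as np

# Gibbs sampling from a bivariate normal with correlation rho, alternating
# theta_1 | theta_2 ~ N(rho * theta_2, 1 - rho^2) and symmetrically.
rng = np.random.default_rng(7)
rho, n = 0.8, 50_000
sd = np.sqrt(1 - rho**2)               # conditional standard deviation

t1, t2, samples = 0.0, 0.0, []
for _ in range(n):
    t1 = rho * t2 + sd * rng.normal()  # draw theta_1 | theta_2
    t2 = rho * t1 + sd * rng.normal()  # draw theta_2 | theta_1
    samples.append((t1, t2))

samples = np.array(samples)[1_000:]    # discard burn-in
corr = np.corrcoef(samples.T)[0, 1]
```

The empirical correlation should be close to ρ and each marginal close to N(0, 1).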
2.2.c Comparison
We run multiple chains and compare them using the Gelman-Rubin diagnostic.
The Gelman-Rubin diagnostic is:
run m chains of length 2n, starting from over-dispersed points, and discard the first n steps of each

compute the between-chain variance:

B = (n/(m − 1)) Σ_j (θ̄_j − θ̄)²

and the within-chain variance

W = (1/m) Σ_j s_j²

and take the estimated variance to be

Var̂ = ((n − 1)/n) W + (1/n) B

The diagnostic is R̂ = √(Var̂/W), which should be close to 1 at convergence.
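The computation can be sketched as follows; for illustration the “chains” are synthetic i.i.d. draws standing in for post-burn-in MCMC output, one set converged to a common target and one set stuck around different points:

```python
import numpy as np

# Gelman-Rubin diagnostic from m chains of length n: between-chain variance B,
# within-chain variance W_, pooled variance estimate, and the ratio R-hat.
def gelman_rubin(chains):
    chains = np.asarray(chains)                       # shape (m, n)
    m, n = chains.shape
    means = chains.mean(axis=1)
    B = n * np.var(means, ddof=1)                     # between-chain variance
    W_ = np.mean(np.var(chains, axis=1, ddof=1))      # within-chain variance
    var_hat = (n - 1) / n * W_ + B / n
    return np.sqrt(var_hat / W_)

rng = np.random.default_rng(8)
converged = rng.normal(size=(4, 2_000))               # 4 chains, same target
stuck = rng.normal(loc=np.arange(4)[:, None] * 3.0,   # 4 chains, far apart
                   size=(4, 2_000))
r_ok, r_bad = gelman_rubin(converged), gelman_rubin(stuck)
```

R̂ close to 1 for the converged chains and well above 1 for the chains stuck in different regions.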
However, the shortcoming of this idea is that it only considers the effect of autocorrelations for the identity function. We would need to bound the autocorrelations for a huge class of functions if we want convergence in total variation.
2.2.d Coupling
We simulate m pairs of dependent Markov chains and run each pair until the two chains meet.
We have a pair of chains (X_t, Y_t) built on the same kernel,

X_t ∼ p_t, Y_t ∼ π

so that marginally each chain is a valid Markov chain. We define the total variation metric between distributions:

‖P − Q‖_TV = sup_A |P(A) − Q(A)|

The coupling inequality states that

‖P − Q‖_TV ≤ P(X ≠ Y)

Thus the idea is to simulate two dependent Markov chains started at different points, and run them until they meet.
For example, take the maximal coupling of p and q: sample X ∼ p and U ∼ Uniform(0, 1);

if U p(X) ≤ q(X), output (X, X);

else, sample Y* ∼ q and U* ∼ Uniform(0, 1) repeatedly until U* q(Y*) > p(Y*), and output (X, Y*).
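This standard maximal-coupling construction can be sketched for two univariate normals (the means 0 and 1 are arbitrary choices); the pair is equal with probability 1 − ‖p − q‖_TV:

```python
import numpy as np

# Maximal coupling of p = N(mean_p, 1) and q = N(mean_q, 1): output a pair
# (X, Y) with the correct marginals that is equal as often as possible.
rng = np.random.default_rng(9)

def normal_pdf(x, mean):
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2 * np.pi)

def maximal_coupling(mean_p, mean_q):
    x = mean_p + rng.normal()                            # X ~ p
    if rng.uniform() * normal_pdf(x, mean_p) <= normal_pdf(x, mean_q):
        return x, x                                      # the pair meets
    while True:                                          # residual part of q
        y = mean_q + rng.normal()
        if rng.uniform() * normal_pdf(y, mean_q) > normal_pdf(y, mean_p):
            return x, y

pairs = [maximal_coupling(0.0, 1.0) for _ in range(20_000)]
meet_rate = np.mean([x == y for x, y in pairs])
```

For N(0, 1) versus N(1, 1), the total variation distance is 2Φ(0.5) − 1 ≈ 0.383, so the meeting rate should be close to 0.617; for identical distributions the pair always meets.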
3. Nontrivial models
3.1 Linear regression
We see how Bayesian linear regression

z = Wβ + ε, ε ∼ N(0, σ² I)

easily leads to the ridge estimate, where z is the response, W is the design matrix, β is the vector of regression coefficients, and ε is the vector of random errors.

The usual least squares estimate is

β̂ = (WᵀW)⁻¹ Wᵀz

which is also the MLE for β. The MLE for σ² is given by

σ̂² = ‖z − Wβ̂‖²/N
These work fine when N ≫ d, but if d is big, using shrinkage makes sense. So we consider the Bayesian hierarchical model

β ∼ N(0, τ² I), z | β ∼ N(Wβ, σ² I)
Thus, we have
so
with
Thus the posterior mean is:

E[β | z] = (WᵀW + (σ²/τ²) I)⁻¹ Wᵀz

This is the usual ridge regression estimate, which solves

min_β ‖z − Wβ‖² + λ‖β‖², with λ = σ²/τ²
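A quick numerical check of this equivalence on random data (the dimensions and variances are arbitrary choices):

```python
import numpy as np

# Check that the posterior mean under beta ~ N(0, tau^2 I),
# z | beta ~ N(W beta, sigma^2 I) equals ridge with lambda = sigma^2 / tau^2.
rng = np.random.default_rng(10)
N, d = 100, 5
W = rng.normal(size=(N, d))
z = rng.normal(size=N)
sigma2, tau2 = 1.0, 0.5
lam = sigma2 / tau2

# Posterior mean from the Gaussian conditioning formula.
posterior_mean = np.linalg.solve(W.T @ W / sigma2 + np.eye(d) / tau2,
                                 W.T @ z / sigma2)

# Ridge estimate from its normal equations: (W'W + lam I) b = W'z.
ridge = np.linalg.solve(W.T @ W + lam * np.eye(d), W.T @ z)
```

The two vectors agree to machine precision: the Bayesian posterior mean is exactly the ridge solution.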
3.2 Tikhonov regularisation
More generally, we have Tikhonov regularisation

min_β ‖z − Wβ‖² + βᵀAβ

where A is symmetric and positive definite.
We regularise because:
Adding a bias may improve our estimation of a vector with more than two dimensions (see the JS estimator)

When d > N, least squares cannot be done (WᵀW is singular), but the regularised solution can
We will see that it is much harder to get A with empirical Bayes than with full Bayes (in particular when β doesn't follow a Normal distribution).

Empirical Bayes. We maximise the marginal likelihood (in terms of τ):
Pick a prior structure:
which can be reframed
so:
We can easily compute the likelihood:

Now we just need to multiply by the prior density and integrate. This is easy since the function is the kernel of an inverse-gamma distribution.
Then we pick according to:
If β were not coming from a Normal distribution, this problem could quickly become intractable (above all because β is high-dimensional).
Full Bayes. To estimate A, we put a prior on τ. What are the possibilities?
Discrete Uniform on a set of grid points
Uniform on an interval
Half-Cauchy
Inverse Gamma
We use the half-Cauchy: it requires no hyperparameters and is reasonably uninformative (it lets the data speak for themselves). The full model is:
This can be done using Gibbs sampling
The first two conditional distributions are easy to sample from. The third is not a standard distribution, so we sample it with a Metropolis-Hastings step.
3.3 Conjugate prior
Another option to solve Bayesian linear regression is using a conjugate prior.
We see how Zellner's g prior gives us a posterior whose mean is a shrunken version of the MLE.
We have:
which is equivalent to:
so:
And if
then the posterior is:
where
If we use the Zellner’s g prior:
and we find that the posterior is still Normal-Inverse-Gamma, but its mean has become

(g/(1 + g)) β̂
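A numerical check of this shrinkage (random data and g = 10 are arbitrary choices); the σ² factors cancel, so they can be dropped from the computation:

```python
import numpy as np

# Under Zellner's g prior beta ~ N(0, g * sigma^2 (W'W)^{-1}), the posterior
# mean of beta is the MLE shrunk by the factor g / (1 + g).
rng = np.random.default_rng(11)
N, d, g = 80, 3, 10.0
W = rng.normal(size=(N, d))
z = rng.normal(size=N)

beta_mle = np.linalg.solve(W.T @ W, W.T @ z)

# Gaussian conditioning with prior precision (W'W) / g (sigma^2 cancels):
posterior_mean = np.linalg.solve(W.T @ W + (W.T @ W) / g, W.T @ z)
```

The posterior mean equals g/(1 + g) times the MLE, confirming the shrinkage factor.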
3.4 Bayesian variable selection
We saw that for ridge, or ℓ2 penalisation, the problem was

min_β ‖z − Wβ‖² + λ‖β‖²

In high dimension, the lasso, or ℓ1 penalisation, allows us to select features automatically. The ideal (but combinatorial) problem in this case is best subset selection, or ℓ0 penalisation:

min_β ‖z − Wβ‖² + λ‖β‖₀
We still need to select λ with AIC (λ = 1) or BIC (λ = 2log(n)).
If we think of variable selection as hypothesis testing, we can compute the Bayes factor for each variable.
We can think of variable selection as a set of hypothesis tests:

H₀: β_j = 0 versus H₁: β_j ≠ 0, for each variable j

Bayesian testing in the case of one hypothesis is based on the Bayes factor

BF₁₀ = p(z | H₁)/p(z | H₀)

We can then compute p(z | H₁) and p(z | H₀) to find the Bayes factor.
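A minimal Bayes factor computation in the simplest setting, a single observation y ∼ N(μ, 1) with H₀: μ = 0 against H₁: μ ∼ N(0, τ²) (so that marginally y ∼ N(0, 1 + τ²) under H₁); this toy test is an illustrative stand-in for the per-variable tests above:

```python
import numpy as np

# Bayes factor BF_10 = p(y | H1) / p(y | H0) for one normal observation:
# under H0, y ~ N(0, 1); under H1 with mu ~ N(0, tau^2), y ~ N(0, 1 + tau^2).
def normal_pdf(y, var):
    return np.exp(-0.5 * y**2 / var) / np.sqrt(2 * np.pi * var)

def bayes_factor(y, tau2):
    return normal_pdf(y, 1 + tau2) / normal_pdf(y, 1.0)

bf_small = bayes_factor(0.1, tau2=4.0)   # y near 0: evidence for H0 (BF < 1)
bf_large = bayes_factor(4.0, tau2=4.0)   # y far from 0: evidence for H1 (BF > 1)
```

An observation near zero yields a Bayes factor below 1 (favouring H₀), while a large observation yields one well above 1 (favouring H₁).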
Final words
Thank you for reading! It has been quite a journey, and I hope that I successfully put these complex concepts into a short and easy form. There is some nice state-of-the-art work here and here that I leave you to explore. See you in the next post!