Hierarchical gaussian models, empirical bayes and shrinkage

Overview

This article is the second of the Bayesian series. Previous episode here.
You probably already know Bayes' theorem, or have stumbled upon Naive Bayes when comparing machine learning models. Maybe, like me, you feel that you have barely seen the shores of the continent that is Bayesian land and want to set foot on it.
I wrote this article from a Stanford course, with the intent of helping you understand the relationships between all the Bayesian concepts without having to go through an entire classroom (like I did). For readability, I hide most of the maths in dropdowns. This way, you can follow the logic easily and come back later to deep dive into the formulas.
So far we have dealt with Bayesian analysis of very simple models, mainly using conjugate priors, with few parameters to estimate. In the coin-tossing example, we had just one parameter, the probability of heads. In the Gaussian model, we had two parameters, the mean $\mu$ and the variance $\sigma^2$. However, in modern applications one typically must estimate many parameters, using more complicated hierarchical models.

1. Empirical Bayes

1.1 Estimation of the normal mean

Suppose that we have $N$ observations, each from independent Gaussians, all with the same variance 1 but with different means:
$$y_i \overset{ind}{\sim} N(\mu_i, 1)$$
This is a high-dimensional model: there are as many parameters as there are data. Another way to write it with $N$-vectors:
$$y \sim N(I\mu, I)$$
Suppose we want to estimate the vector μ. The natural “frequentist” thing to do is just to use the MLE, which in this case is simply
$$\hat{\mu} = y$$
On the other hand, if we performed a Bayesian analysis, it would be natural to choose the prior
$$\mu_i \overset{iid}{\sim} N(m, \tau^2)$$
Let's compute the difference in risk between the frequentist MLE and the Bayes estimator.
Frequentist:
The risk of the MLE ($\hat{\mu} = y$) is the expectation of its squared-error loss:
$$E[\|\mu-\hat{\mu}\|^2_2]=\sum^N_{i=1}E[(\mu_i-\hat{\mu}_i)^2]=\sum^N_{i=1}E[(\mu_i-y_i)^2]=N$$
since the $y_i$ are all independent normals with mean $\mu_i$ and variance 1.
Bayesian:
The Bayes estimator of each $\mu_i$ is
$$E[\mu_i \mid y, m, \tau^2]=\frac{y_i}{1+\tau^{-2}}$$
This is given by the following theorem, when we take $m = 0$:
If $y \sim N(\mu, \Sigma)$ and $\mu \sim N(m, \Psi)$, then $\mu \mid y \sim N(V\xi, V)$,
where $V = (\Psi^{-1}+\Sigma^{-1})^{-1}$ and $\xi = \Psi^{-1}m + \Sigma^{-1}y$.
So the posterior mean shrinks the MLE toward zero. Let's compute the frequentist risk of this estimator, which is needed to obtain the Bayes risk:
$$E\Big[\Big(\frac{y_i\tau^2}{1+\tau^2}-\mu_i\Big)^2\Big]=\mathrm{var}\Big(\frac{y_i\tau^2}{1+\tau^2}\Big)+\mathrm{bias}^2\Big(\frac{y_i\tau^2}{1+\tau^2}\Big)=\frac{\tau^4}{(1+\tau^2)^2}+\frac{\mu_i^2}{(1+\tau^2)^2}$$
So the Bayes risk of this estimator, using $E_{m,\tau^2}[\mu_i^2]=\tau^2$ (for $m = 0$), is
$$\sum_i E_{m,\tau^2}\Big[\frac{\tau^4}{(1+\tau^2)^2}+\frac{\mu_i^2}{(1+\tau^2)^2}\Big]=N\,\frac{\tau^4+\tau^2}{(1+\tau^2)^2}=N\,\frac{\tau^2}{\tau^2+1}$$
So the difference between the risk of the MLE and that of the Bayes estimator is
$$N-N\frac{\tau^2}{\tau^2+1}=\frac{N}{\tau^2+1}$$
That is, $R_{frequentist}-R_{Bayes}=\frac{N}{\tau^2+1}$, which can be very large if $\tau$ is small.
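To make this concrete, here is a small simulation (a Python sketch with arbitrary choices $N=50$, $\tau=0.5$) that checks both risks against the formulas above:

```python
import numpy as np

rng = np.random.default_rng(0)
N, tau, n_rep = 50, 0.5, 2000

# Draw mu from the prior, then y | mu ~ N(mu, I), over many replicates
mu = rng.normal(0.0, tau, size=(n_rep, N))
y = mu + rng.normal(size=(n_rep, N))

# Squared-error loss of the MLE (mu_hat = y) and of the Bayes posterior mean
mle_loss = ((y - mu) ** 2).sum(axis=1)
bayes_loss = ((y * tau**2 / (1 + tau**2) - mu) ** 2).sum(axis=1)

print("MLE risk   ~", mle_loss.mean(), "   theory:", N)
print("Bayes risk ~", bayes_loss.mean(), "   theory:", N * tau**2 / (1 + tau**2))
print("difference ~", (mle_loss - bayes_loss).mean(), "   theory:", N / (1 + tau**2))
```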

1.2 Admissibility and Empirical Bayes

Obviously we want estimators that have low risk. For any loss function $L$, we know that:
Frequentists deal with the fact that $\theta$ is unknown; the risk still depends on $\theta$:
$$R(\theta,\delta)=\int L(\theta,\delta(y))\,p(y|\theta)\,dy$$
Bayesians deal with this issue by integrating a second time over $\theta$, with respect to the prior distribution:
$$\bar{R}=\int\int L(\theta,\delta(y))\,p(y|\theta)\,p(\theta|\xi)\,dy\,d\theta=E_{\xi}[E_{\theta}[L(\theta,\delta(y))]]$$
where $\xi$ denotes the prior hyperparameters. Frequentists, however, don't use priors, so instead they rely on notions like admissibility and minimaxity.
An estimator $\delta^*$ is inadmissible if there exists another estimator $\delta$ for which
$$R(\theta,\delta)\leq R(\theta,\delta^*)\quad\text{for all }\theta,$$
with strict inequality for at least one $\theta$. An estimator is admissible if it is not inadmissible. We say $\delta$ dominates $\delta^*$ if $R(\theta, \delta)\leq R(\theta, \delta^*)$ for every $\theta$.

1.3 The James-Stein estimator

If we don't know how to choose the prior parameter $\tau$ of $\mu$, we can estimate it from the data: this empirical Bayes idea leads to the James-Stein estimator
$$\hat{\mu}^{(JS)}=\Big(1-\frac{N-2}{\|y\|^2}\Big)\,y$$
We prove that the JS estimator has a larger risk than the Bayes estimator (indeed, it must), but that it dominates the MLE everywhere if the dimension is large enough ($N \geq 3$).
So we have
$$y|\mu \sim N(\mu, I),\qquad \mu|\tau^2 \sim N(0, \tau^2 I)$$
and
$$y \sim N(0, (1+\tau^2)I).$$
Thus
$$S=\|y\|^2 \implies S \sim (\tau^2+1)\,\chi^2_N$$
where $\chi^2_N$ has $N$ degrees of freedom, and
$$E\Big[\frac{N-2}{S}\Big]=\frac{1}{\tau^2+1}$$
With our JS estimator defined as
$$\hat{\mu}^{(JS)}=\Big(1-\frac{N-2}{S}\Big)\,y$$
the integrated risk of our JS estimator is:
$$R(\mu,\hat{\mu}^{(JS)})=E\big[\|\mu-\hat{\mu}^{(JS)}\|^2\big]=E\Big[\Big\|\mu-y+\frac{N-2}{S}\,y\Big\|^2\Big]=\dots=N\frac{\tau^2}{\tau^2+1}+\frac{2}{\tau^2+1}$$
This is larger than the Bayes risk found in 1.1, and we have
$$\frac{\bar{R}^{(JS)}}{\bar{R}}=1+\frac{2}{N\tau^2}$$
Thus the JS estimator is very similar to the Bayes estimator when $N$ is large.
The proof also shows that
$$R(\mu,\hat{\mu}^{(JS)})=N-E_\mu\Big[\frac{(N-2)^2}{S}\Big]$$
So the JS estimator dominates the MLE whenever $N \geq 3$.
See the full proof here.
So why would one use the MLE in high dimensions? Because:
The JS estimator is biased, since the MLE is the minimum-variance unbiased estimator (Rao-Blackwell theorem). So JS reduces the variance but adds bias: good for point estimation, but challenging for frequentist interval estimation.
The MLE is admissible in 1 and 2 dimensions, and even in higher dimensions, if we focus on a single "outlier" entry, the MLE has a better MSE for that entry (see why below).
In high dimension, suppose we generate values from
$$y_i \sim N(\mu_i, 1)\quad\text{for } i=1,\dots,11$$
We put
$$\mu=(5,\,-1,\,-0.75,\,\dots,\,0.75,\,1)$$
For each simulation replicate, we estimate $\mu$ by the MLE and by JS. We find that JS achieves a lower total MSE, but on the outlier component $\mu_1=5$ the MLE does better.
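A sketch of this simulation in Python (the elided middle entries of $\mu$ are assumed here to be an equally spaced grid from $-1$ to $1$):

```python
import numpy as np

rng = np.random.default_rng(0)
# First component is the outlier; the rest are assumed equally spaced in [-1, 1]
mu = np.concatenate(([5.0], np.linspace(-1.0, 1.0, 10)))
N, n_rep = len(mu), 10_000

y = mu + rng.normal(size=(n_rep, N))          # the MLE is y itself
S = (y**2).sum(axis=1, keepdims=True)
js = (1 - (N - 2) / S) * y                    # James-Stein shrinks toward 0

mse_mle = ((y - mu) ** 2).mean(axis=0)        # per-component MSE
mse_js = ((js - mu) ** 2).mean(axis=0)

print("total MSE   - MLE:", mse_mle.sum().round(2), " JS:", mse_js.sum().round(2))
print("outlier MSE - MLE:", mse_mle[0].round(2), " JS:", mse_js[0].round(2))
```

In a run like this, JS wins on total MSE while the MLE does better on the outlier component, which is the point of the remark above.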

1.4 Full Bayes inference

We could conclude at this point that one should always use empirical Bayes to choose a prior. However, when it comes to estimating a parameter, the Bayesian approach is to put a prior on it.
Thus, if our problem is
$$y \sim N(\mu, I),\qquad \mu \sim N(0, \tau^2 I),\qquad \tau^2 \sim p(\tau^2)$$
our problem now becomes choosing $p(\tau^2)$. Since we want to estimate $\tau^2$ from our data, we place an objective (flat) prior on it: $p(\tau^2)=1$.
So we compute the posterior of $\mu$, only to find that its posterior expectation is challenging to evaluate, even though our initial model is simple.
We now have:
$$y \sim N(\mu, I),\qquad \mu \sim N(0, \tau^2 I),\qquad \tau^2 \sim p(\tau^2)=1$$
Let’s compute the posterior
$$p(\mu,\tau^2|y)\propto p(y|\mu)\,p(\mu|\tau^2)\,p(\tau^2) = (2\pi)^{-N}e^{-\frac{1}{2}(y-\mu)'I(y-\mu)}\,(\tau^2)^{-N/2}e^{-\frac{1}{2}\mu'(\tau^2I)^{-1}\mu}$$
$$=(2\pi)^{-N}(\tau^2)^{-N/2}\,e^{-\frac{1}{2}\big(\mu-(1+\tau^{-2})^{-1}y\big)'(1+\tau^{-2})I\,\big(\mu-(1+\tau^{-2})^{-1}y\big)}\,e^{-\frac{1}{2}(1+\tau^2)^{-1}y'y}$$
I spare you the lengthy proof of this result, which would need an article in itself.
We need some alternative way to do integration that doesn’t involve actually computing all of the integrals analytically.

2. Markov chain Monte Carlo

We still have the same problem. Now we know that getting the posterior is hard if we can’t find the normalising constant through some known distribution.
Another choice is the half-Cauchy prior on $\tau$:
$$p(\tau)\propto\frac{1}{1+\tau^2}$$
We also choose an Inverse-Gamma prior on $\sigma^2$ and a Normal prior on $m$.
We will see that computing the posterior becomes very ugly. The half-Cauchy prior is not a conjugate prior, so now we consider:
$$y \sim N(\mu,\sigma^2I),\qquad \mu \sim N(m,\sigma^2\tau^2I),\qquad p(\tau)\propto\frac{1}{1+\tau^2}$$
The posterior can be written
$$p(m,\tau^2,\sigma^2,\mu|y)\propto p(y|\mu,\sigma^2)\,p(\mu|\sigma^2,\tau^2,m)\,p(\sigma^2)\,p(\tau^2)\,p(m)$$
We try to find the normalizing constant $p(y)$ by integrating out $\mu$:
$$\int|2\pi\sigma^2I|^{-1/2}e^{-\frac{1}{2}(y-\mu)^T(\sigma^2I)^{-1}(y-\mu)}\,|2\pi\sigma^2\tau^2I|^{-1/2}e^{-\frac{1}{2}(\mu-m)^T(\sigma^2\tau^2I)^{-1}(\mu-m)}\,d\mu=\big(2\pi\sigma^2(1+\tau^2)\big)^{-N/2}e^{-\frac{1}{2}(y-m^*)^T(\Sigma^*)^{-1}(y-m^*)}$$
where $m^*=m\mathbf{1}$ and $\Sigma^*=\sigma^2(1+\tau^2)I$; this still has to be integrated over $m$, $\sigma^2$ and $\tau$.
However, in high dimensions we don't really care about the whole posterior, but rather about some functional of it, like the posterior expectation of a function $f$:
$$\int\frac{f(\theta)\,p(y|\theta)\,p(\theta)}{p(y)}\,d\theta$$
It turns out this integral can be approximated using Markov chain Monte Carlo (MCMC), which consists in sampling from the desired distribution by recording states of a Markov chain that has this distribution as its equilibrium distribution.
Let’s see the difference with ordinary Monte Carlo.
Suppose we have $Y \sim G$ and we want to compute the quantity $\mathbb{E}[\log Y]$.
One idea is to take samples from $G$, say
$$\theta_1,\dots,\theta_n \overset{iid}{\sim} G$$
and then use
$$\widehat{\mathbb{E}[\log Y]}=\frac{1}{n}\sum^{n}_{j=1}\log(\theta_j)$$
If we could sample $Y \sim G$ directly, this would be just ordinary Monte Carlo.
A Markov chain is a sequence of random variables for which the next state, given the last one, is independent of the rest of the history:
$$p(X_N|X_0,\dots,X_{N-1})=p(X_N|X_{N-1})$$
We hop from state $n$ to state $n+1$ via a transition kernel $K$. Some kernels $K$ have an invariant distribution.
Before seeing how Gibbs sampling constructs $K$, let's look at a simple two-state example:
$$X_n\in\{0,1\},\qquad \alpha,\beta\in[0,1]$$
$$X_n\,|\,(X_{n-1}=0)=\begin{cases}0 & \text{with probability } 1-\alpha\\ 1 & \text{with probability } \alpha\end{cases}$$
$$X_n\,|\,(X_{n-1}=1)=\begin{cases}0 & \text{with probability } \beta\\ 1 & \text{with probability } 1-\beta\end{cases}$$
so our transition matrix $K$ is
$$K=\begin{bmatrix}1-\alpha & \alpha\\ \beta & 1-\beta\end{bmatrix}$$
For a continuous state space, $K$ is no longer a matrix but an operator acting on densities: if $X_0$ has density $p$, the density $\bar{p}$ of $X_1 | X_0 \sim K(X_0,\cdot)$ can be written
$$\int p(x)\,k(x,y)\,dx=\bar{p}(y)$$
How can we use Markov chains to compute expectations with respect to the posterior?
Some $K$ have an invariant distribution: $\mu K=\mu$.
If $\mu$ has density $p$, this means
$$\int p(x)\,k(x,y)\,dx=p(y)$$
If the invariant measure $\mu$ exists and is unique, then
$$\lim_{n \rightarrow \infty}\nu K^n=\mu$$
If we start the chain from an initial distribution $\nu$, the distribution of $X_n$ looks more and more like the invariant distribution.
Thus, if we could make the posterior our invariant distribution, we could sample from it using Markov chains: if $X_0 \sim \nu$ and $X_n|X_{n-1} \sim K(X_{n-1},\cdot)$, then
$$\frac{1}{n}\sum^{n-1}_{k=0}\phi(X_k)\;\xrightarrow[n\rightarrow\infty]{\text{i.p.}}\;\int\phi(x)\,p(x)\,dx$$
if $K$ has invariant density $p$. That means the empirical averages converge to the expected value we are looking for.
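As a sanity check, here is a minimal Python simulation of the two-state chain above, comparing the ergodic average to the invariant probability of state 1 (which solves $\pi K=\pi$, giving $\pi(1)=\alpha/(\alpha+\beta)$):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, n = 0.3, 0.1, 200_000

# Simulate the two-state chain with the kernel K above
x = np.empty(n, dtype=int)
x[0] = 0
for t in range(1, n):
    p_one = alpha if x[t - 1] == 0 else 1 - beta
    x[t] = int(rng.uniform() < p_one)

# Ergodic average of phi(x) = x vs the invariant probability pi(1)
print("empirical:", x.mean(), "  invariant:", alpha / (alpha + beta))
```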

2.1 Gibbs sampling

We want to construct $K$ for arbitrary distributions, and we need to work with the posterior without knowing the normalising constant $p(y)$.
In Gibbs sampling, the idea is to break the problem of sampling from the high-dimensional joint distribution into a series of samples from low-dimensional conditional distributions.
We illustrate the algorithm here:
Find a $K$ with invariant distribution $p(\theta|y)$.
Start somewhere: $X_0 \sim \nu$.
Simulate forward a lot of steps.
Throw out the first $B$ steps (burn-in: early states, still far from convergence).
Average the rest.
Here steps 1 to 3 look like this:
Initialize $\theta^{(0)}\in\mathbb{R}^{D}$ and the number of samples $N$
for $i=0$ to $N-1$:
$\quad\bullet\;\theta^{(i+1)}_1\sim p(\theta_1\,|\,\theta_2^{(i)},\dots,\theta^{(i)}_D)$
$\quad\dots$
$\quad\bullet\;\theta^{(i+1)}_j\sim p(\theta_j\,|\,\theta_1^{(i+1)},\dots,\theta^{(i+1)}_{j-1},\theta^{(i)}_{j+1},\dots,\theta^{(i)}_D)$
$\quad\bullet\;\theta^{(i+1)}_D\sim p(\theta_D\,|\,\theta_1^{(i+1)},\dots,\theta^{(i+1)}_{D-1})$
return $\{\theta^{(i)}\}^{N}_{i=1}$
as this chain has the invariant distribution $p(\theta|y)$.
Here is an application of Gibbs sampling to a probit model:
$$y_i \overset{iid}{\sim} \mathrm{Bern}(\Phi(\theta))=\mathrm{Bern}(P(Z \leq \theta)),\qquad Z\sim N(0,1)$$
so, introducing latent variables:
$$y_i=\mathbb{I}_{[z_i>0]},\qquad z_i \sim N(\theta,1)$$
thus
$$z_i|y_i \sim \begin{cases}N_{(0,\infty)}(\theta,1) & \text{if } y_i=1\\ N_{(-\infty,0]}(\theta,1) & \text{if } y_i=0\end{cases}$$
Now, if $\theta \sim N(0,\tau^2)$, then
$$\theta|z \sim N\Big(\frac{n\bar{z}}{\tau^{-2}+n},\frac{1}{\tau^{-2}+n}\Big)$$
Thus, we can use Gibbs to do computations with this model.
A nice implementation can be found here:
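Here is also a minimal sketch of such a sampler in Python (the function name and the simulated data are mine, not from the course):

```python
import numpy as np
from scipy.stats import truncnorm

def gibbs_probit(y, tau2=10.0, n_iter=5000, burn_in=1000, seed=0):
    """Gibbs sampler for y_i = 1[z_i > 0], z_i ~ N(theta, 1), theta ~ N(0, tau2)."""
    rng = np.random.default_rng(seed)
    n, theta, thetas = len(y), 0.0, []
    for _ in range(n_iter):
        # z_i | y_i, theta: truncated normal (bounds standardized around theta)
        a = np.where(y == 1, -theta, -np.inf)   # lower bounds for z - theta
        b = np.where(y == 1, np.inf, -theta)    # upper bounds for z - theta
        z = theta + truncnorm.rvs(a, b, random_state=rng)
        # theta | z ~ N(n*zbar / (tau^-2 + n), 1 / (tau^-2 + n))
        prec = 1.0 / tau2 + n
        theta = rng.normal(z.sum() / prec, np.sqrt(1.0 / prec))
        thetas.append(theta)
    return np.array(thetas[burn_in:])

# Usage: simulate data with true theta = 0.5 and recover it
rng = np.random.default_rng(1)
y = (rng.normal(0.5, 1.0, size=200) > 0).astype(int)
print("posterior mean of theta:", gibbs_probit(y).mean())
```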

2.2 Metropolis-Hastings algorithm

If one of the Gibbs conditionals is not conjugate, a Metropolis-Hastings step is used inside the Gibbs iteration.
Here is an implementation: MCMC sampling for dummies
This new step ensures the transition kernel is reversible, which guarantees that the Markov chain converges to the correct invariant measure.
The core idea is to compute an acceptance ratio, which is the probability of accepting the proposed move.
We illustrate the algorithm here with an example and a measure of MSE to assess the quality of convergence.
Suppose we want to sample from a density $f(x, a)M(a)$, with $M(a)$ the normalising constant.
Choose a proposal $Q(x,\cdot)$ with density $q(x,y)$; for example, $y \sim N(x, s)$.
At state $n$, propose a new state:
$$X_n^* \sim Q(X_{n-1},\cdot)$$
Compute the acceptance ratio:
$$\alpha(X_{n-1},X_n^*)=\min\bigg(1, \frac{f(X_n^*,a)\,q(X_{n-1}|X^*_n)}{f(X_{n-1},a)\,q(X_n^*|X_{n-1})}\bigg)$$
With probability $\alpha$ set $X_n=X_n^*$; otherwise $X_n=X_{n-1}$.
If we take the model:
$$y_i \sim N(0,\tau^2),\qquad p(\tau)\propto\frac{1}{1+\tau^2}$$
the only parameter is $\tau$, so:
$$p(\tau|y)\propto p(y|\tau)\,p(\tau) \propto \bigg(\prod^n_{i=1}\frac{1}{\sqrt{2\pi\tau^2}}e^{-y_i^2/2\tau^2} \bigg)\frac{1}{1+\tau^2}$$
Let's use MH. For $q(\tau, \tau^*)$, take a log-normal random walk:
$$\tau^* \sim \mathrm{LogNormal}(\log\tau,\,s)$$
with $s$ a step-size parameter that makes convergence slower or faster (an acceptance rate around 0.23 is considered ideal). The acceptance ratio is
$$\alpha=\min\bigg(1,\frac{p(\tau^*|y)\,q(\tau^*,\tau)}{p(\tau|y)\,q(\tau,\tau^*)}\bigg),\qquad \frac{q(\tau^*,\tau)}{q(\tau,\tau^*)}=\frac{\tau^*}{\tau}$$
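A minimal Metropolis-Hastings sampler for this model might look as follows (a Python sketch; the data and tuning values are arbitrary):

```python
import numpy as np

def mh_tau(y, s=0.5, n_iter=10_000, seed=0):
    """MH for p(tau|y) with y_i ~ N(0, tau^2) and p(tau) prop. to 1/(1+tau^2),
    using a log-normal random-walk proposal."""
    rng = np.random.default_rng(seed)

    def log_post(tau):
        # log p(y|tau) + log p(tau), up to an additive constant
        return -len(y) * np.log(tau) - np.sum(y**2) / (2 * tau**2) - np.log(1 + tau**2)

    tau, taus, accepted = 1.0, [], 0
    for _ in range(n_iter):
        tau_star = np.exp(rng.normal(np.log(tau), s))
        # log acceptance ratio; log(tau*/tau) is the Hastings correction
        log_alpha = log_post(tau_star) - log_post(tau) + np.log(tau_star / tau)
        if np.log(rng.uniform()) < log_alpha:
            tau, accepted = tau_star, accepted + 1
        taus.append(tau)
    print("acceptance rate:", accepted / n_iter)
    return np.array(taus)

# Usage: data generated with true tau = 2
rng = np.random.default_rng(1)
y = rng.normal(0, 2.0, size=100)
print("posterior mean of tau:", mh_tau(y)[2000:].mean())
```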
We know that:
$$\frac{1}{n}\sum^{n-1}_{k=0}\phi(X_k)\;\xrightarrow{\;n\rightarrow\infty\;}\;\int\phi(x)\,p(x)\,dx=\mu\phi$$
We assess the error with the MSE (which is a frequentist criterion):
$$\mathbb{E}\Big[\Big(\frac{1}{n}\sum_k\phi(X_k)-\mu\phi\Big)^2\Big]=\frac{1}{n^2}\sum^{n-1}_{j=0}\sum^{n-1}_{l=0}\mathrm{Cov}(\phi(X_j),\phi(X_l))+\frac{1}{n^2}\sum^{n-1}_{j=0}\sum^{n-1}_{l=0}(\nu K^j-\mu)\phi\,(\nu K^l-\mu)\phi$$
If we sample independently, it becomes:
$$\frac{1}{n^2}\sum^{n-1}_{j=0}\mathbb{V}[\phi(X_j)]=\frac{1}{n} \mathbb{V}[\phi(X)]$$
If we start at $\nu = \mu$, the bias term is 0, but not the covariance term.
To have convergence of the MSE, we would like an exponential decay of the covariances:
$$\mathrm{Cov}(\phi(X_j),\phi(X_l))=\bar{\alpha}^{|j-l|}\quad\text{with}\;\bar{\alpha}\in(0, 1)$$
We can diagnose convergence via:
Asymptotic variance (or effective sample size)
Trace plotting
Comparison across multiple chains
Coupling

2.2.a Asymptotic variance

Thanks to the Markov chain Central Limit Theorem, we can estimate the asymptotic variance $\sigma^2$ and derive an effective sample size (ESS).
We have:
$$\sqrt{n}\bigg[\frac{1}{n}\sum^{n-1}_{j=0}\phi(X_j)-\mu\phi\bigg]\Rightarrow N(0,\sigma^2)$$
where
$$\sigma^2=\mathbb{V}[\phi(X_0)]\,\Big(1+2\sum^{\infty}_{j=1}\mathrm{Corr}(\phi(X_0),\phi(X_j))\Big)$$
We want to estimate $\sigma^2$. For batches of length $b_n = n^{1/3}$, there are $n - b_n$ overlapping batches, and
$$\hat{\sigma}^2=\frac{(n-1)\,b_n}{(n-b_n)(n-b_n-1)}\sum^{n-b_n-1}_{j=0}\big(\bar{X}_j(b_n)-\bar{X}_n\big)^2$$
where
$$\bar{X}_j(b_n)=\frac{1}{b_n}\sum^{b_n}_{i=1}X_{j+i}, \qquad \bar{X}_n=\frac{1}{n}\sum^{n}_{i=1}X_i$$
So the ESS is:
$$ESS=n\,\frac{\hat{\mathbb{V}}[X_0]}{\hat{\sigma}^2}$$
For ordinary Monte Carlo (independent samples), ESS $= n$; because of the correlation in a Markov chain, the ESS can be much smaller.
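Here is a sketch of the overlapping-batch-means estimator and the resulting ESS (implementing the $\hat{\sigma}^2$ formula above; the AR(1) test chain is just an illustration):

```python
import numpy as np

def ess_obm(x):
    """ESS via overlapping batch means, following the sigma^2-hat formula above."""
    n = len(x)
    b = int(n ** (1 / 3))                                  # batch length b_n = n^{1/3}
    csum = np.concatenate(([0.0], np.cumsum(x)))
    batch_means = ((csum[b:] - csum[:-b]) / b)[: n - b]    # the X-bar_j(b_n)
    sigma2_hat = ((n - 1) * b / ((n - b) * (n - b - 1))) * np.sum(
        (batch_means - x.mean()) ** 2
    )
    return n * x.var() / sigma2_hat

# Usage: a strongly autocorrelated AR(1) chain has ESS << n
rng = np.random.default_rng(0)
n, rho = 100_000, 0.95
x = np.empty(n)
x[0] = rng.normal()
for t in range(1, n):
    x[t] = rho * x[t - 1] + rng.normal()
print("n =", n, " ESS ~", round(ess_obm(x)))
```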

2.2.b Plotting Autocorrelation

We plot the autocorrelation of the chain; ideally it decreases quickly to zero (the chain itself then looks like white noise).
Let's see how Gibbs sampling applies to a multivariate normal, where the conditional means and variances are available in closed form.
Suppose we are interested in sampling
$$Y \sim N(\mu,\Sigma)$$
Let's use Gibbs, since sampling from the conditionals is straightforward: if
$$\binom{y_A}{y_B} \sim N\bigg(\binom{\mu_A}{\mu_B}, \begin{bmatrix}\Sigma_{AA}&\Sigma_{AB}\\\Sigma_{BA}&\Sigma_{BB}\end{bmatrix}\bigg)$$
then the conditional mean and variance of $y_A$ given $y_B$ are
$$\mu^*=\mu_A+\Sigma_{AB}\Sigma_{BB}^{-1}(y_B-\mu_B),\qquad \Sigma^*=\Sigma_{AA}-\Sigma_{AB}\Sigma_{BB}^{-1}\Sigma_{BA}$$
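A minimal Gibbs sampler for the bivariate case, with a quick look at the autocorrelation it produces (a sketch; the strong correlation in $\Sigma$ is chosen to make the chain mix slowly):

```python
import numpy as np

def gibbs_mvn(mu, Sigma, n_iter=5000, seed=0):
    """Gibbs for a bivariate normal using the conditionals mu*, Sigma* above."""
    rng = np.random.default_rng(seed)
    x, out = np.zeros(2), np.empty((n_iter, 2))
    for i in range(n_iter):
        for a in (0, 1):
            b = 1 - a
            m = mu[a] + Sigma[a, b] / Sigma[b, b] * (x[b] - mu[b])   # mu*
            v = Sigma[a, a] - Sigma[a, b] ** 2 / Sigma[b, b]         # Sigma*
            x[a] = rng.normal(m, np.sqrt(v))
        out[i] = x
    return out

# Usage: correlation 0.95 gives slowly decaying autocorrelation
draws = gibbs_mvn(np.zeros(2), np.array([[1.0, 0.95], [0.95, 1.0]]))
x0 = draws[:, 0] - draws[:, 0].mean()
acf = [np.corrcoef(x0[:-k], x0[k:])[0, 1] for k in range(1, 10)]
print(np.round(acf, 2))
```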

2.2.c Comparison

We run multiple chains and compare them using the Gelman-Rubin diagnostic:
Run $m$ chains of length $2n$, starting from over-dispersed points; discard the first $n$ draws.
Compute the between-chain variance:
$$\frac{B}{n}=\frac{1}{m-1}\sum^m_{i=1}(\bar{x}_i-\bar{x})^2$$
and the within-chain variance:
$$W=\frac{1}{m}\sum_{i=1}^m \frac{1}{n-1}\sum^n_{j=1}(x_{ij}-\bar{x}_i)^2$$
and take the estimated variance to be
$$\hat{\sigma}^2_+=\frac{n-1}{n}W+\frac{B}{n}$$
However, the shortcoming of this idea is that it only considers the effect of autocorrelations for the identity function. We would need to bound autocorrelations for a huge class of functions to guarantee convergence in total variation.

2.2.d Coupling

We run $m$ pairs of dependent Markov chains, and run each pair until the two chains meet.
A coupling of $\nu_0$ and $\nu_1$ is a pair $(X,Y)\sim\Gamma(\nu_0,\nu_1)$ such that, marginally,
$$X\sim\nu_0\quad\text{and}\quad Y\sim\nu_1$$
We define the total variation distance between distributions:
$$\|\nu_0-\nu_1\|_{TV}=\sup_A|\nu_0(A)-\nu_1(A)|$$
The coupling inequality states that
$$P[X \neq Y]\geq\|\nu_0-\nu_1\|_{TV}$$
Thus the idea is to simulate two dependent Markov chains started at different points,
$$X_n\sim\nu_0K^n\quad\text{and}\quad Y_n\sim\nu_1K^n,$$
and run them until they meet.
For example, to couple two densities $f$ and $g$, take:
$$X\sim f,\qquad W|X\sim U(0,f(X))$$
If $W < g(X)$, output $(X,X)$;
else sample $Y^* \sim g$ and $W^*|Y^*\sim U(0, g(Y^*))$ until $W^* > f(Y^*)$, and output $(X,Y^*)$.
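A Python sketch of this coupling for two concrete densities (here two unit-variance normals; the helper names are mine):

```python
import numpy as np
from scipy.stats import norm

def coupled_pair(rng, f, f_rvs, g, g_rvs):
    """One draw (X, Y) with X ~ f and Y ~ g, equal as often as possible."""
    X = f_rvs(rng)
    W = rng.uniform(0, f(X))
    if W < g(X):
        return X, X                    # the two samples meet
    while True:                        # otherwise, rejection step for Y
        Y = g_rvs(rng)
        if rng.uniform(0, g(Y)) > f(Y):
            return X, Y

# Usage: couple N(0,1) and N(1,1); P[X = Y] equals the overlap of the densities
rng = np.random.default_rng(0)
pairs = [
    coupled_pair(rng, norm(0, 1).pdf, lambda r: r.normal(0, 1),
                 norm(1, 1).pdf, lambda r: r.normal(1, 1))
    for _ in range(20_000)
]
print("P[X = Y] ~", np.mean([x == y for x, y in pairs]))  # about 0.62
```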

3. Nontrivial models

3.1 Linear regression

We see how Bayesian linear regression,
$$z_{N\times 1}=W_{N\times d}\,\beta_{d\times 1}+\epsilon_{N\times 1},\qquad \epsilon\sim N(0,\sigma^2I),$$
easily leads to the ridge estimate
$$\hat{\beta}_{ridge}=\mathbb{E}[\beta|W,\sigma^2,z,\tau^2]=\Big(W^TW+\frac{1}{\tau^2}I\Big)^{-1}W^Tz$$
where $z$ is the response, $W$ is the design matrix, $\beta$ is the vector of regression coefficients, and $\epsilon$ is the vector of random errors.
The usual least-squares estimate is
$$\hat{\beta}_{OLS}=(W^TW)^{-1}W^Tz$$
which is also the MLE for $\beta$. The MLE for $\sigma^2$ is given by
$$\hat{\sigma}^2=\frac{1}{N}(z-W\hat{\beta})^T(z-W\hat{\beta})=\frac{1}{N}\big\|(I_N-W(W^TW)^{-1}W^T)z\big\|^2$$
These work fine when $N \gg d$, but if $d$ is big, using shrinkage makes sense. So we consider the Bayesian hierarchical model
$$z\sim N(W\beta,\sigma^2I),\qquad \beta\sim N(0,\tau^2\sigma^2I)$$
Thus, we have
$$-2\log\big(L(z|\beta,W,\sigma^2)\,p(\beta)\big)\propto \frac{1}{\sigma^2}(z-W\beta)^T(z-W\beta)+\frac{1}{\sigma^2\tau^2}\beta^T\beta\propto\beta^T\Big[\frac{1}{\sigma^2}W^TW+\frac{1}{\sigma^2\tau^2}I\Big]\beta-2\beta^T\Big(\frac{W^Tz}{\sigma^2}\Big)$$
so
$$\beta|W,\sigma^2,\tau^2\sim N(V\mu^*,V)$$
with
$$V = \Big(\frac{1}{\sigma^2}W^TW+\frac{1}{\sigma^2\tau^2}I\Big)^{-1},\qquad \mu^*=\frac{1}{\sigma^2}W^Tz$$
Thus the posterior mean is:
$$\hat{\beta}_{ridge}=\mathbb{E}[\beta|W,\sigma^2,z,\tau^2]=V\mu^*=\Big(W^TW+\frac{1}{\tau^2}I\Big)^{-1}W^Tz$$
This is the usual ridge regression estimate, which solves
$$\hat{\beta}_{ridge}=\arg\max_{\beta}\;-\|W\beta-z\|^2-\tau^{-2}\|\beta\|^2$$
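A quick numerical check of this correspondence (a sketch; the dimensions and noise levels are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma, tau = 50, 10, 1.0, 0.5

# Simulate the hierarchical model: beta ~ N(0, tau^2 sigma^2 I), z = W beta + eps
W = rng.normal(size=(N, d))
beta = rng.normal(0, tau * sigma, size=d)
z = W @ beta + rng.normal(0, sigma, size=N)

# Posterior mean / ridge estimate vs OLS
beta_ridge = np.linalg.solve(W.T @ W + np.eye(d) / tau**2, W.T @ z)
beta_ols = np.linalg.solve(W.T @ W, W.T @ z)

# The ridge estimate is shrunk: smaller norm than OLS
print("||beta_ols||   =", np.linalg.norm(beta_ols).round(3))
print("||beta_ridge|| =", np.linalg.norm(beta_ridge).round(3))
```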

3.2 Tikhonov regularisation

More generally, we have Tikhonov regularisation:
$$\hat{\beta}_{ridge}=(W^TW+A^{-1})^{-1}W^Tz$$
where $A$ is symmetric positive definite.
We regularise because:
Adding a bias may improve our estimation of a vector with more than 2 dimensions (see the JS estimator).
When $d > N$, least squares can't be done, but the regularised solution can.
We will see that it is much harder to get $A$ with empirical Bayes than with full Bayes (in particular when $\beta$ doesn't follow a Normal distribution).
Empirical Bayes: we maximise the marginal likelihood in terms of $\tau^2$.
Pick a prior structure:
$$\beta|\sigma^2,\tau^2\sim N(0,\sigma^2\tau^2I),\qquad \sigma^2\sim \mathrm{InvGamma}(a,b)$$
which can be reframed as
$$\beta\sim N(0,\sigma^2\tau^2I),\qquad z\sim N(W\beta,\sigma^2I)$$
so, marginalising over $\beta$:
$$z\sim N\big(0,\sigma^2(\tau^{2}WW^T+I)\big)$$
We can easily compute the likelihood $p(z|\sigma^2,\tau^2,W)$.
Now we just need to multiply by the prior density for $\sigma^2$ and integrate over $\sigma^2$. This is easy, since the integrand is the kernel of an inverse-gamma distribution.
Then we pick $\tau$ according to:
$$\max_{\tau^2}\int p(z|W,\sigma^2,\beta,\tau^2)\,p(\beta,\sigma^2)\,d\beta\, d\sigma^2$$
If $\beta$ did not come from a Normal distribution, this problem could quickly become intractable (above all because $\beta$ is high-dimensional).
Full Bayes: to estimate $A$, we put a prior on $\tau$. What are the possibilities?
Discrete Uniform on a set of grid points
Uniform on an interval
Half-Cauchy
Inverse Gamma
We use the half-Cauchy: it requires no hyperparameters, and it is reasonably uninformative (it lets the data speak for themselves). The full model is:
$$\beta|\sigma^2,\tau^2\sim N(0, \sigma^2\tau^2I),\qquad \sigma^2\sim \mathrm{InvGamma}(a,b),\qquad p(\tau)\propto \frac{1}{1+\tau^2}$$
This can be done using Gibbs sampling over the conditionals
$$\beta|\tau^2,\sigma^2,W,z,\qquad \sigma^2|\tau^2,\beta,W,z,\qquad \tau^2|\sigma^2,\beta$$
The first two distributions are easy to sample from. The third is not a standard distribution, so we sample it with a Metropolis-Hastings step inside the Gibbs iteration.

3.3 Conjugate prior

Another option for solving Bayesian linear regression is to use a conjugate prior.
We will see how Zellner's g-prior gives a posterior whose mean is a shrunken version of the MLE.
We have the Normal-Inverse-Gamma prior:
$$\beta,\sigma^2\sim N\Gamma^{-1}(\mu,V,a,b)$$
which is equivalent to:
$$\sigma^2\sim \mathrm{InvGamma}(a,b),\qquad \beta|\sigma^2\sim N(\mu,\sigma^2 V)$$
so:
$$p(\beta,\sigma^2)\propto (\sigma^2)^{-d/2}(\sigma^2)^{-a-1}e^{-\frac{1}{\sigma^2}\big(b+\frac{1}{2}(\beta-\mu)^TV^{-1}(\beta-\mu)\big)}$$
And if
$$z\sim N(W\beta,\sigma^2I)$$
then the posterior is:
$$\beta,\sigma^2|z, W\sim N\Gamma^{-1}(\mu^*,V^*,a^*,b^*)$$
where
$$\mu^*=(V^{-1}+W^TW)^{-1}(V^{-1}\mu+W^Tz),\qquad V^*=(V^{-1}+W^TW)^{-1}$$
$$a^*=a+N/2,\qquad b^*=b+\tfrac{1}{2}\big[\mu^TV^{-1}\mu+z^Tz-(\mu^*)^T(V^*)^{-1}\mu^*\big]$$
If we use Zellner's g-prior,
$$\beta|\sigma^2,g\sim N\big(0,\sigma^2g(W^TW)^{-1}\big),\qquad p(\sigma^2)\propto \frac{1}{\sigma^2}$$
we find that the posterior is still Normal-Inverse-Gamma, but its mean has become
$$\mu^*=\frac{g}{g+1}(W^TW)^{-1}W^Tz=\frac{g}{g+1}\hat{\beta}_{OLS}$$
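For instance (a sketch, with made-up data), the posterior mean under the g-prior is just the OLS estimate scaled by $g/(g+1)$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, g = 40, 5, 10.0

W = rng.normal(size=(N, d))
z = W @ np.ones(d) + rng.normal(size=N)

beta_ols = np.linalg.solve(W.T @ W, W.T @ z)
beta_zellner = g / (g + 1) * beta_ols   # posterior mean under Zellner's g-prior

print(beta_ols.round(2))
print(beta_zellner.round(2))            # shrunk toward 0 by the factor g/(g+1)
```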

3.4 Bayesian variable selection

We saw that ridge corresponds to an $\ell_2$ penalty; the lasso, or $\ell_1$ penalisation, solves
$$\min_{\beta}\|W\beta-z\|^2+\lambda\|\beta\|_1$$
and in high dimension it selects features automatically. The ideal problem in this case is the $\ell_0$-penalised one:
$$\min_{\beta}\|W\beta-z\|^2+\lambda\|\beta\|_0$$
We still need to select $\lambda$, with AIC ($\lambda = 1$) or BIC ($\lambda = 2\log(n)$).
If we think of variable selection as hypothesis testing, we can compute a Bayes factor for each variable. We can encode the selection as:
$$\gamma_j=\begin{cases} 1 & \text{if }\beta_j\neq0 \quad (H_{1j})\\0 & \text{if }\beta_j=0 \quad (H_{0j})\end{cases}$$
Bayesian testing of a single hypothesis, with prior probability $q = P(H_0)$, gives
$$P[H_0|z]=\frac{1}{1+\frac{1-q}{q}\frac{p(z|H_1)}{p(z|H_0)}}$$
We can then compute $p(z|H_1)$ and $p(z|H_0)$ to find the Bayes factor.
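As a toy illustration of the formula (the values here are made up):

```python
# Posterior probability of H0 given the prior weight q = P(H0)
# and the Bayes factor BF10 = p(z|H1) / p(z|H0)
def post_h0(q, bf10):
    return 1.0 / (1.0 + (1 - q) / q * bf10)

print(post_h0(q=0.5, bf10=3.0))  # 0.25: the data favour H1 three to one
```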

Final words

Thank you for reading; it has been quite a journey, and I hope I successfully put these complex concepts into a short and easy form. There are some nice state-of-the-art papers here and here that I leave you to explore. See you in the next post!