Overview
This article is the second of the Bayesian series. Previous episode here.
You probably already know Bayes' theorem, or have stumbled upon Naive Bayes when comparing machine learning models. Maybe, like me, you feel that you have barely seen the shores of the continent that is Bayesian land, and want to set foot on it.
I wrote this article from a Stanford course, with the intent of letting you understand the relationships between all the Bayesian concepts without having to go through an entire classroom (like I did). For readability, I hide most of the maths in dropdowns. This way, you can follow the logic easily and come back later to deep dive into the formulas.
So far we have dealt with Bayesian analysis of very simple models, mainly using conjugate priors, with few parameters to estimate. In the coin tossing example, we had just one parameter, the probability of heads. In the Gaussian model, we had two parameters, the mean μ and the variance σ². However, in modern applications one typically must estimate many parameters, using more complicated hierarchical models.
1. Empirical Bayes
1.1 Estimation of the normal mean
Suppose that we have N observations, each from an independent Gaussian with the same variance 1 but a different mean:

y_i ∼ N(μ_i, 1), i = 1, …, N

This is a high-dimensional model: there are as many parameters as there are data points. Another way to write it, with N-vectors:

y ∼ N(μ, I_N), with y, μ ∈ ℝ^N
Suppose we want to estimate the vector μ. The natural “frequentist” thing to do is just to use the MLE, which in this case is simply

μ̂ = y
On the other hand, if we performed a Bayesian analysis, it would be natural to choose the prior

μ_i ∼ N(0, τ²), independently for i = 1, …, N
We compute the difference of risk between the frequentist MLE and the Bayes estimator
Frequentist
The risk of the MLE (μ̂ = y) is the expectation of its squared error loss:

R(μ, μ̂) = E‖y − μ‖² = N

since the y_i − μ_i are all independent normals with mean 0 and variance 1.
Bayesian:
The Bayes estimator of each μ_i is the posterior mean

E[μ_i | y_i] = (τ²/(1 + τ²)) y_i

This is given by the following theorem, taking m = 0 and σ² = 1.

If μ ∼ N(m, τ²) and y | μ ∼ N(μ, σ²), then μ | y ∼ N(m̄, τ̄²)

where

m̄ = (τ² y + σ² m)/(τ² + σ²) and τ̄² = τ² σ²/(τ² + σ²)
So the posterior mean shrinks the MLE toward zero. Let's compute the frequentist risk of this estimator, which is needed to obtain the Bayes risk.
So the Bayes risk of this estimator is

N τ²/(1 + τ²)

So the difference between the risk of the MLE and the Bayes risk of the Bayes estimator is

N − N τ²/(1 + τ²) = N/(1 + τ²)

which can be very large if τ is small.
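A quick numerical check of these two risk formulas is straightforward; the sketch below simulates the model (the choices N = 50, τ = 0.5 and the replicate count are arbitrary):

```python
import numpy as np

# Numerical check of the two risk formulas for the normal means model:
# MLE risk = N, and Bayes risk of the Bayes estimator = N * tau^2 / (1 + tau^2).
rng = np.random.default_rng(0)
N, tau, reps = 50, 0.5, 20_000

mu = rng.normal(0.0, tau, size=(reps, N))   # mu_i ~ N(0, tau^2)
y = mu + rng.normal(size=(reps, N))         # y_i | mu_i ~ N(mu_i, 1)

mle_risk = np.mean(np.sum((y - mu) ** 2, axis=1))
shrink = tau**2 / (1 + tau**2)              # Bayes estimator: shrink * y
bayes_risk = np.mean(np.sum((shrink * y - mu) ** 2, axis=1))
```

With these values, `mle_risk` should come out close to N = 50 and `bayes_risk` close to N τ²/(1 + τ²) = 10, illustrating the N/(1 + τ²) gap.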
1.2 Admissibility and Empirical Bayes
Obviously we want estimators that have low risk. For any loss function L, we know the frequentist risk:

R(μ, μ̂) = E_μ[L(μ, μ̂(y))]

Frequentists deal with the fact that μ is unknown; the risk still depends on μ.

Bayesians deal with this issue by integrating a second time over μ, with respect to the prior distribution:

r(ξ) = ∫ R(μ, μ̂) p(μ | ξ) dμ

where ξ are the prior hyperparameters. Frequentists, however, don't use priors, so instead we can use admissibility and minimaxity.
An estimator μ̂ is inadmissible if there exists another estimator μ̂′ for which

R(μ, μ̂′) ≤ R(μ, μ̂) for every μ, with strict inequality for at least one μ

So an estimator is admissible if it is not inadmissible. We say μ̂′ dominates μ̂ if the inequality above holds for every μ.
1.3 The James-Stein estimator
If we don't know how to choose the parameter τ of the prior, we can try to estimate it from the data (empirical Bayes), which leads to the James-Stein estimator.

We prove that the JS estimator has a larger Bayes risk than the Bayes estimator (indeed it must), but that it dominates the MLE everywhere if the dimension is large enough (N ≥ 3).
So we have

y_i | μ_i ∼ N(μ_i, 1) and μ_i ∼ N(0, τ²)

and marginally

y_i ∼ N(0, 1 + τ²)

Thus

‖y‖²/(1 + τ²) ∼ χ², where the χ² has N degrees of freedom, and

E[(N − 2)/‖y‖²] = 1/(1 + τ²)
With our JS estimator defined as

μ̂^JS = (1 − (N − 2)/‖y‖²) y

the integrated (Bayes) risk of our JS estimator is:

N τ²/(1 + τ²) + 2/(1 + τ²)

This is larger than the Bayes risk found in 1.1, but only by 2/(1 + τ²). Thus the JS estimator is very similar to the Bayes estimator when N is large.
The proof also shows that

R(μ, μ̂^JS) < N for every μ, whenever N ≥ 3

So the JS estimator dominates the MLE if N ≥ 3.
See the full proof here.
So why would one still use the MLE in high dimensions? Because:

The JS estimator is biased, whereas the MLE is the minimum-variance unbiased estimator (Lehmann-Scheffé theorem). So JS reduces the variance but adds bias: good for point estimation, but challenging for frequentist interval estimation.

The MLE is admissible in 1 and 2 dimensions, and even in higher dimensions, if we focus on a single “outlier” entry, the MLE has a better MSE (see why below).
In high dimension, suppose we generate y ∼ N(μ, I_N), where every mean is zero except one large “outlier” entry. For each simulation replicate, we estimate μ by the MLE and by JS. We find that JS has a much smaller total MSE, but that on the outlier coordinate alone the MLE does better.
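A minimal sketch of this simulation (the outlier value 10 and N = 50 are arbitrary choices):

```python
import numpy as np

# Compare MLE and James-Stein on a mean vector that is zero except one outlier.
rng = np.random.default_rng(1)
N, reps = 50, 5_000
mu = np.zeros(N)
mu[0] = 10.0                                     # single "outlier" entry

y = mu + rng.normal(size=(reps, N))              # y ~ N(mu, I_N), per replicate
mle = y                                          # MLE is just y
factor = 1 - (N - 2) / np.sum(y**2, axis=1, keepdims=True)
js = factor * y                                  # James-Stein shrinkage

mse_mle = np.mean(np.sum((mle - mu) ** 2, axis=1))
mse_js = np.mean(np.sum((js - mu) ** 2, axis=1))
mse_mle_outlier = np.mean((mle[:, 0] - mu[0]) ** 2)
mse_js_outlier = np.mean((js[:, 0] - mu[0]) ** 2)
```

Running this, the total MSE of JS is well below the MLE's (which is N), while on the outlier coordinate the shrinkage pulls the estimate toward zero and JS does worse than the MLE.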
1.4 Full Bayes inference
We could conclude at this point that one should always use empirical Bayes to choose a prior. However, when it comes to estimating a parameter, the Bayesian approach is to put a prior on it.
Thus, if our problem is

y_i | μ_i ∼ N(μ_i, 1), μ_i | τ ∼ N(0, τ²)

then our problem is now to estimate τ. Since we want to estimate τ from our data, we have to place an objective prior on τ.
So we compute the posterior of μ, and find out that its posterior expectation is challenging to evaluate, even though our initial model is simple.
We now have:
Let’s compute the posterior
I spare you the lengthy proof of this result, which would need an article in itself.
We need some alternative way to do integration that doesn’t involve actually computing all of the integrals analytically.
2. Markov Chain Monte Carlo
We still have the same problem. Now we know that getting the posterior is hard if we can’t find the normalising constant through some known distribution.
Another choice is the half-Cauchy prior on τ (rather than an Inverse Gamma prior on the variance and a Normal prior on the mean). We show that computing the posterior becomes very ugly: the half-Cauchy prior is not a conjugate prior, so now we consider:
The posterior can be written
We try to find the normalising constant Z:
However, in high-dimensions we don’t really care about the whole posterior, but more about some functional of it, like the posterior expectations of a function f:
It turns out this integral can be approximated using Markov chain Monte Carlo (MCMC): we sample from the desired distribution by recording states of a Markov chain whose equilibrium distribution is the desired distribution.
Let’s see the difference with ordinary Monte Carlo.
Suppose we have a density g and we want to compute the quantity

E_g[f(θ)] = ∫ f(θ) g(θ) dθ

One idea is to take samples from g(θ), say

θ⁽¹⁾, …, θ⁽ⁿ⁾ ∼ g

and then use the empirical average (1/n) Σ f(θ⁽ᵗ⁾).
If we could sample Y ∼ G, it would be just ordinary Monte Carlo.
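A minimal ordinary Monte Carlo sketch, with the arbitrary choices g = N(0, 1) and f(θ) = θ², so that the true value E_g[f] = 1 is known:

```python
import numpy as np

# Ordinary Monte Carlo: approximate E_g[f(theta)] by averaging f over
# i.i.d. samples from g. Here g = N(0, 1) and f(t) = t^2, so E_g[f] = 1.
rng = np.random.default_rng(2)

def f(t):
    return t**2

samples = rng.normal(size=100_000)   # theta^(i) ~ g, i.i.d.
estimate = np.mean(f(samples))
```

The estimate converges to 1 at the usual 1/√n Monte Carlo rate.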
A Markov chain is a sequence of random variables in which the next state depends only on the current state, ignoring the rest of the history:

P(θ⁽ᵗ⁺¹⁾ | θ⁽ᵗ⁾, …, θ⁽⁰⁾) = P(θ⁽ᵗ⁺¹⁾ | θ⁽ᵗ⁾)
We hop from one state to the next via a transition kernel K. Some kernels have an invariant distribution.
Let's see how this suggests Gibbs sampling as a way to construct K.
so our transition matrix K is
In high dimension, K is no longer a matrix but an operator acting on density functions. This way, the density of the next state

can be written

π_{t+1}(θ′) = ∫ K(θ, θ′) π_t(θ) dθ
How can we use Markov Chain to compute expectations with respect to the posterior?
Some kernels K have an invariant distribution π:

πK = π

If θ⁽ᵗ⁾ has density π, this means θ⁽ᵗ⁺¹⁾ also has density π.
If the invariant measure exists and is unique then:
If we start the chain at an arbitrary distribution ν, the distribution of θ⁽ᵗ⁾ will look more and more like the invariant distribution as t grows.
Thus if we could make the posterior our invariant distribution, we could sample from it using Markov chains:
if θ⁽¹⁾, θ⁽²⁾, … is a Markov chain whose kernel has invariant density equal to the posterior, then

(1/n) Σ f(θ⁽ᵗ⁾) → E[f(θ) | y]

That means the empirical averages converge to the expected value we are looking for.
2.1 Gibbs sampling
We want to construct K for arbitrary target distributions, and we need to do so knowing the posterior only up to its normalising constant Z.
In Gibbs sampling the idea is to break the problem of sampling from the high-dimensional joint distribution into a series of samples from low-dimensional conditional distributions.
We illustrate the algorithm here:

1. Find a K with the posterior as invariant distribution π.
2. Start somewhere: θ⁽⁰⁾.
3. Simulate forward a lot of steps.
4. Throw out the first B steps (burn-in: early steps still far from convergence).
5. Average the rest.
Here steps 1 to 3 look like this:
as this has the invariant distribution
Here is an application of Gibbs sampling to a Probit model:
The Probit model can be written with a latent Gaussian variable:

z_i = 1{y_i > 0}, y_i ∼ N(w_iᵀβ, 1)

so that P(z_i = 1 | β) = Φ(w_iᵀβ). Thus the conditional of each latent y_i given β and z_i is a truncated normal, and if β has a Normal prior, then β | y is again Normal.
Thus, we can use Gibbs to do computations with this model.
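A compact sketch of this data-augmentation Gibbs sampler (in the style of Albert and Chib) on simulated data; the prior β ∼ N(0, 100 I), the sample size, and the true coefficients are all arbitrary choices, and the naive rejection sampler for the truncated normal is only adequate for moderate means:

```python
import numpy as np

# Gibbs sampler for a probit model via latent-variable augmentation:
#   z_i = 1{y_i > 0},  y_i ~ N(w_i' beta, 1),  prior beta ~ N(0, 100 I).
rng = np.random.default_rng(3)

# Simulated data with a hypothetical true coefficient vector.
n, d = 200, 2
W = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 1.0])
z = (W @ beta_true + rng.normal(size=n) > 0).astype(float)

B_inv = np.eye(d) / 100.0              # prior precision
V = np.linalg.inv(W.T @ W + B_inv)     # posterior covariance of beta given y
L = np.linalg.cholesky(V)

def sample_truncated(mean, positive):
    # Naive rejection sampler for N(mean, 1) truncated to y > 0 or y <= 0.
    while True:
        y = mean + rng.normal()
        if (y > 0) == positive:
            return y

beta = np.zeros(d)
draws = []
for t in range(1_500):
    m = W @ beta                       # 1) latent y_i | beta, z_i: truncated normal
    y = np.array([sample_truncated(m[i], z[i] == 1) for i in range(n)])
    beta = V @ (W.T @ y) + L @ rng.normal(size=d)   # 2) beta | y: Normal
    if t >= 500:                       # discard burn-in
        draws.append(beta)

beta_hat = np.mean(draws, axis=0)
```

The posterior mean `beta_hat` should land near the true coefficients for a dataset of this size.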
A nice implementation can be found here:
2.2 Metropolis-Hastings algorithm
If one of the Gibbs conditionals is not conjugate, a Metropolis-Hastings step is used inside the Gibbs iteration.
Here is an implementation: MCMC sampling for dummies
This step keeps the transition kernel reversible with respect to the target, which guarantees that the Markov chain converges to the correct invariant measure.

The core idea is to compute an acceptance ratio: the probability of accepting the proposed move.

We illustrate the algorithm here with an example and a measure of MSE to assess the quality of convergence.
Suppose we want to sample from a density π = γ/Z, with normalising constant Z unknown.

Choose a proposal with density q(θ* | θ). For example, a random walk: θ* = θ⁽ᵗ⁾ + ε.

At state θ⁽ᵗ⁾, propose the new state θ*.

Compute the acceptance ratio:

α = min(1, [γ(θ*) q(θ⁽ᵗ⁾ | θ*)] / [γ(θ⁽ᵗ⁾) q(θ* | θ⁽ᵗ⁾)])

With probability α, set θ⁽ᵗ⁺¹⁾ = θ*; otherwise, with probability 1 − α, keep θ⁽ᵗ⁺¹⁾ = θ⁽ᵗ⁾.
If we take the model:
the only parameter is , so:
Let’s use MH. For take:
with s an extra parameter that makes convergence slower or faster (an acceptance ratio around 0.23 is ideal in high dimensions). The acceptance ratio is
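A minimal random-walk Metropolis-Hastings sketch; the target here is an unnormalised standard normal (an arbitrary stand-in, chosen so the output can be checked), and s is the proposal scale:

```python
import numpy as np

# Random-walk Metropolis-Hastings targeting gamma(t) = exp(-t^2 / 2)
# (unnormalised standard normal). The random-walk proposal N(theta, s^2)
# is symmetric, so the ratio reduces to gamma(star) / gamma(theta).
rng = np.random.default_rng(4)

def log_gamma(t):
    return -0.5 * t**2

s = 2.4                                 # proposal scale (tuning parameter)
theta, chain, accepted = 0.0, [], 0
for _ in range(50_000):
    star = theta + s * rng.normal()     # propose a new state
    # Accept with probability min(1, gamma(star)/gamma(theta)), on log scale.
    if np.log(rng.uniform()) < log_gamma(star) - log_gamma(theta):
        theta, accepted = star, accepted + 1
    chain.append(theta)

chain = np.array(chain[5_000:])         # discard burn-in
acc_rate = accepted / 50_000
```

The retained samples should have mean near 0 and variance near 1, matching the target; s can be tuned until the acceptance rate is in a reasonable range.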
We know that:
We assess the error with the MSE (which is a frequentist criterion):
If we sample independently it becomes:
If we start at ν = μ, then the bias would be 0 but not the covariance.
To have convergence of the MSE, we would like an exponential decay of the covariance
We can diagnose convergence via:
Asymptotic variance (or effective sample size)
Trace plotting
Comparison across multiple chains
Coupling
2.2.a Asymptotic variance
The effective sample size (ESS) lets us estimate the asymptotic variance σ² appearing in the Markov chain Central Limit Theorem.
We have the Markov chain CLT:

√n ( (1/n) Σ f(θ⁽ᵗ⁾) − E_π[f] ) → N(0, σ²)

where

σ² = Var_π(f(θ⁽⁰⁾)) + 2 Σ_{t≥1} Cov(f(θ⁽⁰⁾), f(θ⁽ᵗ⁾))

We want to estimate σ². For batches of size b there are K = n/b such batches, and

σ̂² = (b/(K − 1)) Σ_k (f̄_k − f̄)²

where f̄_k is the mean of f over the k-th batch and f̄ is the overall mean. So the ESS is:

ESS = n s²/σ̂²

where s² is the empirical variance of f along the chain.
In the special case of i.i.d. Monte Carlo, ESS = n; but because of correlation in the Markov chain, the ESS can be much smaller.
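The batch-means recipe above can be sketched as follows; the AR(1) chain is an arbitrary stand-in whose true asymptotic variance, (1 + ρ)/(1 − ρ) = 19 here, is known:

```python
import numpy as np

# Batch-means estimate of the asymptotic variance and the effective sample
# size (ESS), applied to an AR(1) chain with strong positive correlation.
rng = np.random.default_rng(5)
n, rho = 100_000, 0.9
x = np.empty(n)
x[0] = rng.normal()
for t in range(1, n):                  # AR(1) with unit marginal variance
    x[t] = rho * x[t - 1] + np.sqrt(1 - rho**2) * rng.normal()

b = int(np.sqrt(n))                    # batch size
K = n // b                             # number of batches
batch_means = x[: K * b].reshape(K, b).mean(axis=1)
sigma2_hat = b * np.var(batch_means, ddof=1)   # asymptotic variance estimate
ess = n * np.var(x, ddof=1) / sigma2_hat       # ESS = n * s^2 / sigma2_hat
```

For this chain the ESS comes out far below n (roughly n/19), quantifying the loss due to autocorrelation.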
2.2.b Plotting Autocorrelation
We plot the autocorrelation of the chain; ideally it decreases quickly to zero (the chain then looks like white noise).
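A small sketch of the empirical autocorrelation function, checked on white noise (where the ACF should be essentially zero at all lags ≥ 1):

```python
import numpy as np

# Empirical autocorrelation of a chain at lags 0..max_lag; for a well-mixing
# chain it should drop to zero quickly.
def acf(x, max_lag):
    x = np.asarray(x) - np.mean(x)
    denom = x @ x                       # n * empirical variance
    return np.array([1.0] + [(x[:-k] @ x[k:]) / denom
                             for k in range(1, max_lag + 1)])

rng = np.random.default_rng(6)
white = rng.normal(size=20_000)         # i.i.d. chain: the ideal case
rho_white = acf(white, 5)
```

To visualise it, one can pass `rho_white` to something like `matplotlib.pyplot.stem`; a slowly decaying ACF signals a poorly mixing chain.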
Let's see how to use Gibbs sampling to sample from a bivariate normal via its conditional means and variances.
Suppose we are interested in sampling from a bivariate normal with correlation ρ:

(θ₁, θ₂) ∼ N(0, Σ), Σ = [[1, ρ], [ρ, 1]]

Let's use Gibbs, since sampling from the conditionals is straightforward: the conditional means and variances are given by

θ₁ | θ₂ ∼ N(ρ θ₂, 1 − ρ²) and θ₂ | θ₁ ∼ N(ρ θ₁, 1 − ρ²)
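A minimal Gibbs sampler for this bivariate normal (ρ = 0.8 is an arbitrary choice), alternating draws from the two conditionals:

```python
import numpy as np

# Gibbs sampling from a bivariate normal with correlation rho, alternating
# theta_1 | theta_2 ~ N(rho * theta_2, 1 - rho^2) and symmetrically.
rng = np.random.default_rng(7)
rho, n = 0.8, 50_000
sd = np.sqrt(1 - rho**2)               # conditional standard deviation

t1, t2, samples = 0.0, 0.0, []
for _ in range(n):
    t1 = rho * t2 + sd * rng.normal()  # draw theta_1 | theta_2
    t2 = rho * t1 + sd * rng.normal()  # draw theta_2 | theta_1
    samples.append((t1, t2))

samples = np.array(samples)[1_000:]    # discard burn-in
corr = np.corrcoef(samples.T)[0, 1]
```

The empirical correlation should be close to ρ and each marginal close to N(0, 1).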
2.2.c Comparison
We run multiple chains and compare them using the Gelman-Rubin diagnostic.
The Gelman-Rubin diagnostic is:
run m chains of length 2n, starting from over-dispersed points, and discard the first n steps of each

compute the between-chain variance:

B = (n/(m − 1)) Σ_j (θ̄_j − θ̄)²

and the within-chain variance

W = (1/m) Σ_j s_j²

and take the estimated variance to be

Var̂ = ((n − 1)/n) W + (1/n) B

The diagnostic is R̂ = √(Var̂/W), which should be close to 1 at convergence.
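The computation can be sketched as follows; for illustration the “chains” are synthetic i.i.d. draws standing in for post-burn-in MCMC output, one set converged to a common target and one set stuck around different points:

```python
import numpy as np

# Gelman-Rubin diagnostic from m chains of length n: between-chain variance B,
# within-chain variance W_, pooled variance estimate, and the ratio R-hat.
def gelman_rubin(chains):
    chains = np.asarray(chains)                       # shape (m, n)
    m, n = chains.shape
    means = chains.mean(axis=1)
    B = n * np.var(means, ddof=1)                     # between-chain variance
    W_ = np.mean(np.var(chains, axis=1, ddof=1))      # within-chain variance
    var_hat = (n - 1) / n * W_ + B / n
    return np.sqrt(var_hat / W_)

rng = np.random.default_rng(8)
converged = rng.normal(size=(4, 2_000))               # 4 chains, same target
stuck = rng.normal(loc=np.arange(4)[:, None] * 3.0,   # 4 chains, far apart
                   size=(4, 2_000))
r_ok, r_bad = gelman_rubin(converged), gelman_rubin(stuck)
```

R̂ close to 1 for the converged chains and well above 1 for the chains stuck in different regions.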
However, the shortcoming of this idea is that it only considers the effect of autocorrelations for the identity function. We would need to bound the autocorrelations for a huge class of functions if we want convergence in total variation.
2.2.d Coupling
We simulate m pairs of dependent Markov chains and run each pair until the two chains meet.
We have a pair of chains (X_t, Y_t) built on the same kernel,

X_t ∼ p_t, Y_t ∼ π

so that marginally each chain is a valid Markov chain. We define the total variation metric between distributions:

‖P − Q‖_TV = sup_A |P(A) − Q(A)|

The coupling inequality states that

‖P − Q‖_TV ≤ P(X ≠ Y)

Thus the idea is to simulate two dependent Markov chains started at different points, and run them until they meet.
For example, take the maximal coupling of p and q: sample X ∼ p and U ∼ Uniform(0, 1);

if U p(X) ≤ q(X), output (X, X);

else, sample Y* ∼ q and U* ∼ Uniform(0, 1) repeatedly until U* q(Y*) > p(Y*), and output (X, Y*).
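This standard maximal-coupling construction can be sketched for two univariate normals (the means 0 and 1 are arbitrary choices); the pair is equal with probability 1 − ‖p − q‖_TV:

```python
import numpy as np

# Maximal coupling of p = N(mean_p, 1) and q = N(mean_q, 1): output a pair
# (X, Y) with the correct marginals that is equal as often as possible.
rng = np.random.default_rng(9)

def normal_pdf(x, mean):
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2 * np.pi)

def maximal_coupling(mean_p, mean_q):
    x = mean_p + rng.normal()                            # X ~ p
    if rng.uniform() * normal_pdf(x, mean_p) <= normal_pdf(x, mean_q):
        return x, x                                      # the pair meets
    while True:                                          # residual part of q
        y = mean_q + rng.normal()
        if rng.uniform() * normal_pdf(y, mean_q) > normal_pdf(y, mean_p):
            return x, y

pairs = [maximal_coupling(0.0, 1.0) for _ in range(20_000)]
meet_rate = np.mean([x == y for x, y in pairs])
```

For N(0, 1) versus N(1, 1), the total variation distance is 2Φ(0.5) − 1 ≈ 0.383, so the meeting rate should be close to 0.617; for identical distributions the pair always meets.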
3. Nontrivial models
3.1 Linear regression
We see how Bayesian linear regression

z = Wβ + ε, ε ∼ N(0, σ² I)

easily leads to the ridge estimate, where z is the response, W is the design matrix, β is the vector of regression coefficients, and ε is the vector of random errors.

The usual least squares estimate is

β̂ = (WᵀW)⁻¹ Wᵀz

which is also the MLE for β. The MLE for σ² is given by

σ̂² = ‖z − Wβ̂‖²/N
These work fine when N ≫ d, but if d is big, using shrinkage makes sense. So we consider the Bayesian hierarchical model

β ∼ N(0, τ² I), z | β ∼ N(Wβ, σ² I)
Thus, we have
so
with
Thus the posterior mean is:

E[β | z] = (WᵀW + (σ²/τ²) I)⁻¹ Wᵀz

This is the usual ridge regression estimate, which solves

min_β ‖z − Wβ‖² + λ‖β‖², with λ = σ²/τ²
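A quick numerical check of this equivalence on random data (the dimensions and variances are arbitrary choices):

```python
import numpy as np

# Check that the posterior mean under beta ~ N(0, tau^2 I),
# z | beta ~ N(W beta, sigma^2 I) equals ridge with lambda = sigma^2 / tau^2.
rng = np.random.default_rng(10)
N, d = 100, 5
W = rng.normal(size=(N, d))
z = rng.normal(size=N)
sigma2, tau2 = 1.0, 0.5
lam = sigma2 / tau2

# Posterior mean from the Gaussian conditioning formula.
posterior_mean = np.linalg.solve(W.T @ W / sigma2 + np.eye(d) / tau2,
                                 W.T @ z / sigma2)

# Ridge estimate from its normal equations: (W'W + lam I) b = W'z.
ridge = np.linalg.solve(W.T @ W + lam * np.eye(d), W.T @ z)
```

The two vectors agree to machine precision: the Bayesian posterior mean is exactly the ridge solution.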
3.2 Tikhonov regularisation
More generally, we have Tikhonov regularisation

min_β ‖z − Wβ‖² + βᵀAβ

where A is symmetric and positive definite.
We regularise because:
Adding a bias may improve our estimation of a vector with more than two dimensions (see the JS estimator)

When d > N, least squares cannot be done (WᵀW is singular), but the regularised solution can
We will see that it is much harder to get A with empirical Bayes than with full Bayes (in particular when β doesn't follow a Normal distribution).

Empirical Bayes. We maximise the marginal likelihood (in terms of τ):
Pick a prior structure:
which can be reframed
so:
We can easily compute the likelihood:

Now we just need to multiply by the prior density and integrate. This is easy since the function is the kernel of an inverse-gamma distribution.
Then we pick according to:
If β were not coming from a Normal distribution, this problem could quickly become intractable (above all because β is high-dimensional).
Full Bayes. To estimate A, we put a prior on τ. What are the possibilities?
Discrete Uniform on a set of grid points
Uniform on an interval
Half-Cauchy
Inverse Gamma
We use the half-Cauchy: it requires no hyperparameters and is reasonably uninformative (it lets the data speak for themselves). The full model is:
This can be done using Gibbs sampling
The first two conditional distributions are easy to sample from. The third is not a standard distribution, so we sample it with a Metropolis-Hastings step.
3.3 Conjugate prior
Another option to solve Bayesian linear regression is using a conjugate prior.
We see how Zellner's g prior gives us a posterior whose mean is a shrunken version of the MLE.
We have:
which is equivalent to:
so:
And if
then the posterior is:
where
If we use the Zellner’s g prior:
and we find that the posterior is still Normal-Inverse-Gamma, but its mean has become

(g/(1 + g)) β̂
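A numerical check of this shrinkage (random data and g = 10 are arbitrary choices); the σ² factors cancel, so they can be dropped from the computation:

```python
import numpy as np

# Under Zellner's g prior beta ~ N(0, g * sigma^2 (W'W)^{-1}), the posterior
# mean of beta is the MLE shrunk by the factor g / (1 + g).
rng = np.random.default_rng(11)
N, d, g = 80, 3, 10.0
W = rng.normal(size=(N, d))
z = rng.normal(size=N)

beta_mle = np.linalg.solve(W.T @ W, W.T @ z)

# Gaussian conditioning with prior precision (W'W) / g (sigma^2 cancels):
posterior_mean = np.linalg.solve(W.T @ W + (W.T @ W) / g, W.T @ z)
```

The posterior mean equals g/(1 + g) times the MLE, confirming the shrinkage factor.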
3.4 Bayesian variable selection
We saw that for ridge, or ℓ2 penalisation, the problem was

min_β ‖z − Wβ‖² + λ‖β‖²

In high dimension, the lasso, or ℓ1 penalisation, allows us to select features automatically. The ideal (but combinatorial) problem in this case is best subset selection, or ℓ0 penalisation:

min_β ‖z − Wβ‖² + λ‖β‖₀
We still need to select λ with AIC (λ = 1) or BIC (λ = 2log(n)).
If we think of variable selection as hypothesis testing, we can compute the Bayes factor for each variable.
We can think of variable selection as a set of hypothesis tests:

H₀: β_j = 0 versus H₁: β_j ≠ 0, for each variable j

Bayesian testing in the case of one hypothesis is based on the Bayes factor

BF₁₀ = p(z | H₁)/p(z | H₀)

We can then compute p(z | H₁) and p(z | H₀) to find the Bayes factor.
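A minimal Bayes factor computation in the simplest setting, a single observation y ∼ N(μ, 1) with H₀: μ = 0 against H₁: μ ∼ N(0, τ²) (so that marginally y ∼ N(0, 1 + τ²) under H₁); this toy test is an illustrative stand-in for the per-variable tests above:

```python
import numpy as np

# Bayes factor BF_10 = p(y | H1) / p(y | H0) for one normal observation:
# under H0, y ~ N(0, 1); under H1 with mu ~ N(0, tau^2), y ~ N(0, 1 + tau^2).
def normal_pdf(y, var):
    return np.exp(-0.5 * y**2 / var) / np.sqrt(2 * np.pi * var)

def bayes_factor(y, tau2):
    return normal_pdf(y, 1 + tau2) / normal_pdf(y, 1.0)

bf_small = bayes_factor(0.1, tau2=4.0)   # y near 0: evidence for H0 (BF < 1)
bf_large = bayes_factor(4.0, tau2=4.0)   # y far from 0: evidence for H1 (BF > 1)
```

An observation near zero yields a Bayes factor below 1 (favouring H₀), while a large observation yields one well above 1 (favouring H₁).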
Final words
Thank you for reading! It has been quite a journey, and I hope that I successfully put these complex concepts into a short and easy form. There is some nice state-of-the-art work here and here that I leave you to explore. See you in the next post!