Proba ML
4. Statistics
4.6 Bayesian Statistics

4.6 Bayesian statistics

Inference is modeling uncertainty about parameters using a probability distribution (instead of a point estimate). In Bayesian statistics we use the posterior distribution.

p(θD)=p(Dθ)p(θ)p(D)p(\theta|\mathcal{D})=\frac{p(\mathcal{D}|\theta)p(\theta)}{p(\mathcal{D})}

Once we have computed the posterior over θ\theta, we compute the posterior predictive distribution by marginalizing out θ\theta. In the supervised setting:

p(yx,D)=p(yx,θ)p(θD)dθp(y|x,\mathcal{D})=\int p(y|x,\theta)p(\theta|\mathcal{D})d\theta

This is Bayes model averaging (BMA), since we use an infinite amount set of models, weighted by how likely their parameters are.

4.6.1 Conjugate prior

A prior p(θ)Fp(\theta)\in\mathcal{F} is conjugate for a likelihood function p(Dθ)p(\mathcal{D}|\theta) if the posterior p(θD)Fp(\theta|\mathcal{D})\in\mathcal{F}. The family F\mathcal{F} is closed under bayesian updating.

For simplicity, we consider a model without covariates.

4.6.2 Beta-binomial model

4.6.2.1 Bernoulli likelihood

We toss a coin NN times to estimate the head probability θ\theta. We want to compute p(θD)p(\theta|\mathcal{D}).

Under the iid hypothesis:

p(Dθ)=θN1(1θ)N0p(\mathcal{D}|\theta)=\theta^{N_1}(1-\theta)^{N_0}

4.6.2.2 Binomial likelihood

Instead of observing a sequence of coin toss, we can count the number of heads using a binomial likelihood

p(Dθ)=Bin(yN,θ)=(Ny)θy(1θ)Nyp(\mathcal{D}|\theta)=\mathrm{Bin}(y|N,\theta)=\binom{N}{y}\theta^{y}(1-\theta)^{N-y}

Since these likelihood are proportional, we will use the same inference about θ\theta for both models.

4.6.2.3 Prior

The Beta distribution is the conjugate prior to Bernoulli or Binomial likelihood.

p(θ)=Beta(θa,b)=θa1(1θ)b1p(\theta)=\mathrm{Beta}(\theta|a,b)=\theta^{a-1}(1-\theta)^{b-1}

4.6.2.4 Posterior

If we multiply the Bernoulli likelihood with the Beta prior, we obtained the following posterior:

p(θD)=θa1+N1(1θ)b1+N0=Beta(θa+N1,b+N0)=Beta(θa˘,b˘)\begin{align} p(\theta|\mathcal{D})&=\theta^{a-1+N_1}(1-\theta)^{b-1+N_0} \\ &=\mathrm{Beta}(\theta|a+N_1,b+N_0) \\ &=\mathrm{Beta}(\theta|\breve{a},\breve{b}) \end{align}

Here the hyper-parameters of the prior a,ba,b play an similar role to the sufficient statistics N1,N0N_1,N_0. We call a,ba,b pseudo counts and N1,N0N_1,N_0 observed counts.

4.6.2.5 Example

If we set a=b=2a=b=2, it would means that we have already observed 2 heads and 2 tails before we see the actual data. This is a weak preference for the value of θ=0.5\theta=0.5 (small update in figure (a) below).

If we set a=b=1a=b=1, the prior become p(θ)=θ0(1θ)0=Unif(θ0,1)p(\theta)=\theta^{0}(1-\theta)^0=\mathrm{Unif}(\theta|0,1). The prior is uninformative, so there is no update.

Screen Shot 2022-11-20 at 19.19.51.png

4.2.2.6 Posterior mode (MAP estimate)

Similarly to the MAP estimate already we saw in the regularization section, we have

θ^map=a+N11a+b+N2\hat{\theta}_{map}=\frac{a+N_1-1}{a+b+N-2}

If we use a=b=2a=b=2, we find back the add-one smoothing.

If we use a=b=1a=b=1, we have θ^map=θ^mle\hat{\theta}_{map}=\hat{\theta}_{mle}

4.6.2.7 Posterior mean

The posterior mean is a more robust estimation than the posterior mode since it integrates over the whole space (instead of a single point).

If p(θD)=Beta(θa˘,b˘)p(\theta|\mathcal{D})=\mathrm{Beta}(\theta|\breve{a},\breve{b}), the posterior mean is:

θˉE[θD]=a˘N˘\bar{\theta} \triangleq \mathbb{E}[\theta|\mathcal{D}]=\frac{\breve{a}}{\breve{N}}

with N˘=a˘+b˘\breve{N}=\breve{a}+\breve{b} the strength of the posterior.

The posterior mean is a convex combination of the prior mean, m=aNabm=\frac{a}{N_{ab}} and the MLE: θ^mle=N1N\hat{\theta}_{mle}=\frac{N_1}{N}

E[θD]=a+N1a+b+N1+N0=NabN+NabaNab+NN+NabN1N=λm+(1λ)θ^mle\begin{align} \mathbb{E}[\theta|\mathcal{D}]&=\frac{a+N_1}{a+b+N_1+N_0}\\ &=\frac{N_{ab}}{N+N_{ab}}\frac{a}{N_{ab}}+\frac{N}{N+N_{ab}}\frac{N_1}{N} \\ &= \lambda m+(1-\lambda)\hat{\theta}_{mle} \end{align}

4.6.2.8 Posterior variance

The variance of a Beta posterior is given by:

V[θD]=a˘b˘(a˘+b˘)2(a˘+b˘+1)N1N0N3=θ(1θ)N\mathbb{V}[\theta|\mathcal{D}]=\frac{\breve{a}\breve{b}}{(\breve{a}+\breve{b})^2(\breve{a}+\breve{b}+1)}\approx \frac{N_1N_0}{N^3}=\frac{\theta(1-\theta)}{N}

Thus, the standard error of our estimate (aka the posterior variance) is:

se(θ)=V[θD]=θ(1θ)Nse(\theta)=\sqrt{\mathbb{V}[\theta|\mathcal{D}]}=\sqrt{\frac{\theta(1-\theta)}{N}}

4.6.2.9 Posterior predictive

For the Bernoulli distribution, the posterior predictive distribution has the form:

p(y=1D)=01p(y=1θ)p(θD)dθ=01θBeta(θa,b)dθ=E[θD]=a˘a˘+b˘\begin{align} p(y=1|\mathcal{D})&=\int^1_0 p(y=1|\theta)p(\theta|\mathcal{D})d\theta \\ &= \int^1_0\theta\, \mathrm{Beta} (\theta|a,b)d\theta \\ &= \mathbb{E}[\theta|\mathcal{D}] \\ &= \frac{\breve{a}}{\breve{a}+\breve{b}} \end{align}

For the Binomial distribution, the posterior predictive distribution has the form of the Beta-Binomial distribution:

p(yD,M)=01Bin(yM,θ)Beta(θa˘,b˘)dθ=(My)1B(a˘,b˘)01θy(1θ)Myθa˘1(1θ)b˘1dθ=(My)B(y+a˘,My+b˘)B(a˘,b˘)Bb(yM,a˘,b˘)\begin{align} p(y|\mathcal{D},M)&=\int^1_0\mathrm{Bin}(y|M,\theta)\,\mathrm{Beta}(\theta|\breve{a},\breve{b})d\theta \\ &= \binom{M}{y}\frac{1}{\Beta(\breve{a},\breve{b})}\int^1_0\theta^y(1-\theta)^{M-y}\theta^{\breve{a}-1}(1-\theta)^{\breve{b}-1}d\theta \\ &=\binom{M}{y}\frac{\Beta(y+\breve{a},M-y+\breve{b})}{\Beta(\breve{a},\breve{b})} \\ &\triangleq \mathrm{Bb}(y|M,\breve{a},\breve{b}) \end{align}

We plot in figure (a) below the posterior predictive distribution for M=10,N1=4,N0=1M=10,N_1=4,N_0=1, with a uniform prior Beta(θ1,1)\mathrm{Beta}(\theta|1,1).

(Remember that a˘=a+N1\breve{a}=a+N_1 and b˘=b+N0\breve{b}=b+N_0)

We plot in figure (b) the plugin approximation, where θ^map\hat{\theta}_{map} is directly injected into the likelihood: p(yD,M)=Bin(yM,θ^map)p(y|\mathcal{D},M)=\mathrm{Bin}(y|M,\hat{\theta}_{map})

Screen Shot 2022-11-20 at 23.14.45.png

The long tail of the Bayesian approach is less prone to overfitting and Black Swan paradoxes. In both case, the prior is uniform, so this Bayesian property is due to the integration over unknown parameters.

4.6.2.10 Marginal likelihood

The marginal likelihood for a model M\mathcal{M} is defined as:

p(DM)=p(Dθ,M)p(θM)dθp(\mathcal{D|M})=\int p(\mathcal{D}|\theta,\mathcal{M})p(\theta|\mathcal{M})d\theta

Since it is constant in θ\theta, it is not useful for parameter inference, but can help for model or hyperparameters selection (aka Empirical Bayes)

The marginal likelihood for the Beta Binomial is:

p(θD)=p(Dθ)p(θ)p(D)=(NN1)1p(D)1B(a,b)θa+N11(1θ)b+N01\begin{align} p(\theta|\mathcal{D})&=\frac{p(\mathcal{D}|\theta)p(\theta)}{p(\mathcal{D})} \\ &= \binom{N}{N_1}\frac{1}{p(\mathcal{D})}\frac{1}{\Beta(a,b)}\theta^{a+N_1-1}(1-\theta)^{b+N_0-1} \end{align}

so

1B(a+N1,b+N0)=(NN1)1p(D)1B(a,b)p(D)=(NN1)B(a+N1,b+N0)B(a,b) \begin{align} \frac{1}{\Beta(a+N_1, b+N_0)}&=\binom{N}{N_1}\frac{1}{p(\mathcal{D})}\frac{1}{\Beta(a,b)} \\ p(\mathcal{D})&= \binom{N}{N_1}\frac{\Beta(a+N_1,b+N_0)}{\Beta(a,b)} \end{align}

The marginal likelihood of the Beta Bernoulli is the same as above without the (NN1)\binom{N}{N_1} term.

4.6.2.11 Mixture of conjugate priors

Our current prior is rather restrictive. If we want to represent a coin that may be fair but also be equally biased towards head. We can use a Mixture of beta distributions:

p(θ)=0.5Beta(20,20)+0.5Beta(30,10)p(\theta)=0.5\,\mathrm{Beta}(20,20)+0.5\,\mathrm{Beta}(30,10)

We introduce h=kh=k meaning that θ\theta comes from mixture hh, so the mixture prior is:

p(θ)=kp(h=k)p(θh=k)p(\theta)=\sum_k p(h=k)p(\theta|h=k)

We can show that the posterior can also be written as a mixture of conjugate distributions:

p(θD)=kp(h=kD)p(θD,h=k)p(\theta|\mathcal{D})=\sum_k p(h=k|\mathcal{D})p(\theta|\mathcal{D},h=k)

where the posterior mixing weights are:

p(h=kD)=p(Dh=k)p(h=k)kp(Dh=k)p(h=k)p(h=k|\mathcal{D})=\frac{p(\mathcal{D}|h=k)p(h=k)}{\sum_{k'}p(\mathcal{D}|h=k')p(h=k')}

We can compute this quantity using the equation (40) of the marginal likelihood. If we observe N1=20N_1=20 heads and N0=10N_0=10 tails, the posterior is:

p(θD)=0.346Beta(θ40,30)+0.654Beta(θ50,20)p(\theta|\mathcal{D})=0.346\,\mathrm{Beta}(\theta|40,30)+0.654\,\mathrm{Beta}(\theta|50,20)

Screen Shot 2022-11-28 at 08.42.40.png

We can compute the probability that the coin is biased towards head:

Pr(θ>0.5D)=kPr(θ>0.5D,h=k)p(h=kD)=0.9604\Pr(\theta>0.5|\mathcal{D})=\sum_k\Pr(\theta>0.5|\mathcal{D},h=k)p(h=k|\mathcal{D})=0.9604

4.6.3 The Dirichlet-multinomial model

We generalize the results from binary variable to categories (e.g. dice).

4.6.3.1 Likelihood

Let YCat(θ)Y\sim \mathrm{Cat}(\theta) be a discrete random variable drawn from categorical distribution. The likelihood has the form:

p(Dθ)=n=1nCat(ynθ)=n=1Nc=1CθcI(yn=c)=c=1CθcNcp(\mathcal{D}|\theta)=\prod^n_{n=1}\mathrm{Cat}(y_n|\theta)=\prod^N_{n=1}\prod^C_{c=1}\theta_c^{\mathbb{I}(y_n=c)}=\prod^C_{c=1} \theta_c^{N_c}

4.6.3.2 Prior

The conjugate prior is a Dirichlet distribution, which is a multivariate of a beta distribution. It has support over the probability simplex:

SK={θ:0θk1,k=1Kθk=1}S_K=\{\theta:0\leq\theta_k\leq1, \sum^K_{k=1}\theta_k=1 \}

The pdf of Dirichlet:

Dir(θa)=1B(a)k=1Kθkak1I(θSK)\mathrm{Dir}(\theta|\bold{a})=\frac{1}{\Beta(\bold{a})}\prod_{k=1}^K\theta_k^{a_k-1}\mathbb{I}(\theta \in S_K)

with the multivariate Beta function:

B(a)=k=1KΓ(ak)Γ(k=1Kak)\Beta(\bold{a})=\frac{\sum_{k=1}^K\Gamma(a_k)}{\Gamma({\sum_{k=1}^Ka_k})}

Screen Shot 2022-11-28 at 09.10.51.png

4.6.3.3 Posterior

The posterior is:

p(θD)p(Dθ)Dir(θa)=kθkNkkθkak1=Dir(θa1+N1,...,aK+Nk)=Dir(θa˘)\begin{align} p(\theta|\mathcal{D})&\propto p(\mathcal{D}|\theta)\mathrm{Dir}(\theta|\bold{a}) \\ &= \prod_k \theta_k^{N_k} \prod_k\theta_k^{a_k-1} \\ &= \mathrm{Dir}(\theta|a_1+N_1,...,a_K+N_k)\\ &= \mathrm{Dir}(\theta|\breve{\bold{a}}) \end{align}

The posterior mean is:

θˉk=a˘kk=1Ka˘k\bar{\theta}_k=\frac{\breve{a}_k}{\sum^K_{k'=1}\breve{a}_{k'}}

The posterior mode (which corresponds to the MAP) is:

θ^k=a˘k1k=1K(a˘k1)\hat{\theta}_k=\frac{\breve{a}_k-1}{\sum_{k'=1}^K(\breve{a}_{k'}-1)}

If ak=1a_k=1, corresponding to a uniform prior, the MAP becomes the MLE:

θk=NkN\theta_k=\frac{N_k}{N}

4.6.3.4 Posterior predictive

The posterior predictive distribution is given by:

p(y=kD)=p(y=kθ)p(θD)dθ=θkp(θkD)dθk=E[θkD]=a˘kka˘k\begin{align} p(y=k|\mathcal{D}) &= \int p(y=k|\theta)p(\theta |\mathcal{D})d\theta \\ &= \int \theta_k p(\theta_k|\mathcal{D})d\theta_k = \mathbb{E}[\theta_k|\mathcal{D}]=\frac{\breve{a}_k}{\sum_{k'}\breve{a}_{k'}} \end{align}

For a batch of data to predict (instead of a single point), it becomes

p(y~y)=p(y,y~)p(y)p(\bold{\tilde{y}|y})=\frac{p(\bold{y,\tilde{y}})}{p(\bold{y})}

Where the denominator and the numerator are marginal likelihoods on the training and training + future test data.

4.6.3.5 Marginal likelihood

The marginal likelihood of the Dirichlet-categorical are given by:

p(D)=B(a+N)B(a)p(\mathcal{D})=\frac{\Beta(\bold{a}+N)}{\Beta(\bold{a})}

4.6.4 Gaussian-Gaussian model

We derive the posterior for the parameters for a Gaussian distribution, assuming that the variance is known for simplicity.

4.6.4.1 Univariate case

We can show that the conjugate prior is also Gaussian. We then apply the Bayes rule for Gaussian, with the observed precision κ\kappa and the prior precision λ\lambda. The posterior is:

p(μD,κ1)=N(μm˘,λ˘1)λ˘=λ+Nκm˘=Nκyˉ+λmλ˘=Nκλ+Nκyˉ+λλ+Nκm\begin{align} p(\mu|\mathcal{D},\kappa^{-1}) &= \mathcal{N}(\mu|\breve{m},\breve{\lambda}^{-1}) \\ \breve{\lambda} &= \lambda +N \kappa \\ \breve{m} &= \frac{N \kappa \bar{y}+\lambda m}{\breve{\lambda}}=\frac{N \kappa}{\lambda+N\kappa}\bar{y}+\frac{\lambda}{\lambda+N\kappa}m \end{align}

The posterior mean is a convex combination of the empirical mean and the prior mean.

To gain more intuition, after seeing N=1N=1 point, our posterior mean is:

m˘=κλ˘y+λλ˘m=m+κλ˘(ym)=yλλ˘(ym)\begin{align} \breve{m} &=\frac{\kappa}{\breve{\lambda}}y+\frac{\lambda}{\breve{\lambda}}m \\ &= m + \frac{\kappa}{\breve{\lambda}}(y-m) \\ &= y-\frac{\lambda}{\breve{\lambda}}(y-m) \end{align}

The second equation is the prior mean adjusted towards the data y.

The third equation is the data adjusted towards the prior mean, called a shrinkage estimate.

For a Gaussian estimate the posterior mean and posterior mode are the same, thus we can use the above equations to perform MAP estimation.

If we set an uninformative prior λ=0\lambda=0 and approximate the observed variance σ2\sigma^2 by

s2=1Nn=1N(ynyˉ)s^2=\frac{1}{N}\sum_{n=1}^N(y_n-\bar{y})

then the posterior variance of our mean estimate is:

se(μ)=V[μD]=1λ˘=sNse(\mu)=\sqrt{\mathbb{V}[\mu|\mathcal{D}]}=\frac{1}{\sqrt{\breve{\lambda}}}=\frac{s}{\sqrt{N}}

We see that uncertainty in μ\mu reduces at the rate 1/N1/\sqrt{N}.

Because 95% of the Gaussian distribution is contained within 2 standard deviations of the mean, the 95% credible interval for μ\mu is:

I.95(μD)=yˉ±2sNI_{.95}(\mu|\mathcal{D})=\bar{y} ± 2\frac{s}{\sqrt{N}}

4.6.4.2 Multivariate case

For DD-dimensional data, the likelihood has the form:

p(Dμ)=n=1NN(ynμ,Σ)=N(2π)D/2Σ1/2exp(12n=1N(ynμ)Σ1(ynμ))=N(yˉμ,1NΣ)\begin{align} p(\mathcal{D}|\mu)&=\prod_{n=1}^N \mathcal{N}(\bold{y}_n|\mu,\Sigma) \\ &= \frac{N}{(2\pi)^{D/2}|\Sigma|^{1/2}}\exp\big( -\frac{1}{2}\sum_{n=1}^N(\bold{y}_n-\mu)^\top\Sigma^{-1}(\bold{y}_n-\mu) \big) \\ &= \mathcal{N}(\bold{\bar{y}}|\mu,\frac{1}{N}\Sigma) \end{align}

Thus we replace the set of observations with their mean and scale down the covariance.

We obtain the same expression for the posterior of the mean than in the univariate case.

Screen Shot 2022-12-04 at 12.34.57.png

4.6.5 Beyond conjugate priors

For most models, there is no prior in the exponential family that conjugates to the likelihood. We briefly discuss various other kinds of prior.

4.6.5.1 Non informative prior

When we have little to no domain specific knowledge, we prefer “let the data speak for itself” (we get closer to the frequentist statistics).

We can for example use a flat prior p(μ)1p(\mu) \propto1, which can be viewed as an infinitely wide Gaussian.

4.6.5.2 Hierarchical prior

The parameters of the prior p(θ)p(\theta) are called hyper-parameters ϕ\phi. If there are unknown, we can put a prior on them: this defines a hierarchical Bayesian model.

If we set a non informative prior on ϕ\phi, the joint distribution has the form:

p(ϕ,θ,D)=p(Dθ)p(θϕ)p(ϕ)p(\phi,\theta,\mathcal{D})=p(\mathcal{D}|\theta)p(\theta|\phi)p(\phi)

We aim at learning ϕ\phi by treating the parameters as data points. This is useful when we want to estimate multiple related parameters from different subpopulations.

4.6.5.3 Empirical prior

The joint distribution of the hierarchical prior can be challenging to compute.

Instead, we make a point wise estimation of the hyper-parameters by maximizing the marginal likelihood:

ϕ^mml(D)=arg maxϕp(Dϕ)=arg maxϕp(Dθ)p(θϕ)dθ\hat{\phi}_{mml}(\mathcal{D})=\argmax _{\phi}p(\mathcal{D}|\phi)=\argmax_{\phi}\int p(\mathcal{D}|\theta)p(\theta|\phi)d\theta

We then compute the conditional posterior p(θϕ^,D)p(\theta|\hat{\phi},\mathcal{D}) in the usual way.

This violates the principle that the prior should be chosen independently of the data, but we can view it as a computationally cheap approximation of the hierarchical prior.

The more integral one performs, the “more Bayesian” one becomes:

Screen Shot 2022-12-04 at 13.14.41.png

Note that ML-II is less likely to overfit than ML since there are often fewer hyper-parameters ϕ\phi than parameters θ\theta.

4.6.6 Credible intervals

The posterior distribution is hard to represent in high dimension. We can summarize it via a point estimate (posterior mean or mode) and compute the associated credible interval (which is different from the frequentist confidence interval).

The Credible Interval for 100(1α)%100(1-\alpha)\% is:

Cα(D)=(,u):P(θuD)=1αC_{\alpha}(\mathcal{D})=(\ell, u):P(\ell \leq \theta \leq u|\mathcal{D})=1-\alpha

If the posterior has a known functional form, we can compute its inverse cdf to find the interval: =F1(α/2)\ell=F^{-1}(\alpha/2) and u=F1(1α/2)u=F^{-1}(1-\alpha/2).

In general, finding the inverse cdf is hard, so instead we rank samples from the posterior distribution and select our target percentiles.

Sometimes, points that are outside the interval have higher density probability than inner points.

We use the highest posterior density (HPD) to counter this, by finding the threshold pp^*:

1α=θ:p(θD)>pp(θD)dθ1-\alpha=\int_{\theta:p(\theta|\mathcal{D})> p^*}p(\theta|\mathcal{D})d\theta

and then defines the HPD as:

Cα(D)=θ:p(θD)pC_{\alpha}(\mathcal{D})=\theta:p(\theta|\mathcal{D})\geq p^*

Screen Shot 2022-12-04 at 20.02.27.png

4.6.7 Bayesian machine learning

So far, we have focused on unconditional models of the form p(yθ)p(y|\theta).

In supervised machine learning, we focus on conditional models of the form p(yx,θ)p(y|x,\theta).

Our posterior becomes p(θD)p(\theta|\mathcal{D}) where D={(xn,yn):n=1:N}\mathcal{D}=\{(x_n,y_n):n=1:N\}. This approach is called Bayesian machine learning.

4.6.7.1 Plugin approximation

Once we have computed the posterior over the θ\theta, we can compute the posterior predictive distribution over yy given xx:

p(yx,D)=p(yx,θ)p(θD)dθp(y|x,\mathcal{D})=\int p(y|x,\theta)p(\theta|\mathcal{D})d\theta

Computing this integral is often intractable. Instead, we approximate that there is a single best model θ^\hat{\theta}, such as the MLE.

p(θD)=δ(θθ^)p(\theta|\mathcal{D})=\delta(\theta-\hat{\theta})

Using the plugin approximation, the predictive distribution is now:

p(yx,D)p(yx,θ)δ(θθ^)dθ=p(yx,θ^)p(y|x,\mathcal{D})\approx\int p(y|x,\theta)\delta(\theta-\hat{\theta})d\theta=p(y|x,\hat{\theta})

This is the standard machine learning approach of fitting a model θ^\hat{\theta} then making prediction.

However, as this approximation can overfit, we average over a few plausible parameters values (instead of the fully Bayesian approach above). Here are 2 examples:

4.6.7.2 Example: scalar input, binary output

To perform binary classification, we can use a logistic regression of the form:

p(yx,θ)=Ber(yσ(wx+b))p(y|x,\theta)=\mathrm{Ber}(y|\sigma(w^\top x+b))

In other words:

p(y=1x,θ)=σ(wx+b)=11+ewx+bp(y=1|x,\theta)=\sigma(w^\top x+b)=\frac{1}{1+e^{-w^\top x + b}}

We find the MLE θ^\hat{\theta} for this 1d logistic regression and use the plugin approximation to get the posterior predictive p(y=1x,θ^)p(y=1|x,\hat{\theta})

The decision boundary is defined as:

σ(wx+b)=1/2x=bw\sigma(w^\top x^* + b)=1/2 \Rightarrow x^*=-\frac{b}{w}

We capture the uncertainty by approximating the posterior p(θD)p(\theta|\mathcal{D}).

Given this, we can approximate the mean and the 95% credible interval of the posterior predictive distribution using a Monte Carlo approximation:

p(y=1x,D)1Ss=1Sp(y=1x,θs)Cα(x,D)={θs:p(θsx,D)>p}\begin{align}p(y=1|x,\mathcal{D})&\approx \frac{1}{S}\sum_{s=1}^Sp(y=1|x,\theta_s) \\ C_{\alpha}(x,\mathcal{D}) &= \{\theta_s:p(\theta_s|x,\mathcal{D})> p^*\} \end{align}

where θsp(θD)\theta_s \sim p(\theta|\mathcal{D}) is a posterior sample.

We can also compute a distribution over the decision boundary:

p(xD)=1Ss=1Sδ(x(bsws))p(x^*|\mathcal{D})=\frac{1}{S}\sum_{s=1}^S \delta(x^*-(-\frac{b_s}{w_s}))

where (bs,ws)=θs(b_s,w_s)=\theta_s

Screen Shot 2022-12-04 at 20.13.18.png

4.6.7.3 Exemple: binary input, scalar output

We now predict the delivery time yy for a package from company A (x=0x=0) or company B

(x=1x=1).

Our model is:

p(yx,θ)=N(yμx,σx2)p(y|x,\theta)=\mathcal{N}(y|\mu_x,\sigma^2_x)

The parameters are θ=(μ0,μ1,σ0,σ1)\theta=(\mu_0, \mu_1, \sigma_0, \sigma_1) so we can fit this model using MLE.

However, the issue with the plug-in approximation p(yx,θ^)p(y|x,\hat{\theta}) is that in the case of one single point of observation, we don’t capture uncertainty, our MLE for standard deviations are σ^0=σ^1=0\hat{\sigma}_0=\hat{\sigma}_1=0.

Instead, the Bayesian approach using the Bayes rule for Gaussian allows to compute the posterior variance and the credible interval.

4.6.8 Computational issues

Usually, performing the Bayesian update is intractable except for simple cases like conjugate models or latent variables with few possible values.

We perform approximate posterior inference for a Beta-Bernoulli model:

p(θD)[n=1NBin(ynθ)]Beta(θa,b)p(\theta|\mathcal{D})\propto \Bigg[\prod_{n=1}^N \mathrm{Bin}(y_n|\theta)\Bigg] \mathrm{Beta}(\theta | a, b)

with D\mathcal{D} consisting in 10 heads and 1 tails, with a uniform prior.

4.6.8.1 Grid approximation

The simplest approach is to partition θ\theta in a finite set θ1,...,θK\theta_1,...,\theta_K and approximate the posterior by brute-force enumeration:

p(θ=θkD)=p(Dθk)p(θk)k=1Kp(Dθk)p(θk)p(\theta=\theta_k|\mathcal{D})=\frac{p(\mathcal{D} | \theta_k)p(\theta_k)}{\sum_{k'=1}^K p(\mathcal{D}|\theta_{k'})p(\theta_{k'})}

This captures the skewness of our posterior, but the number of grid points scales exponentially in high dimension.

4.6.8.2 Quadratic (Laplace) approximation

We approximate the posterior using a multivariate Gaussian as follow:

p(θD)1ZeE(θ)p(\theta|\mathcal{D}) \approx \frac{1}{Z}e^{-\mathcal{E}(\theta)}

where Z=p(D)Z=p(\mathcal{D}) is the normalization constant and E(θ)=logp(θ,D)\mathcal{E}(\theta)=-\log p(\theta,\mathcal{D}) is called an energy function.

Performing a Taylor series around the mode θ^\hat{\theta} (i.e. the lowest energy state), we get:

E(θ)E(θ^)+(θθ^)g+12(θθ^)H(θθ^)\mathcal{E}(\theta)\approx \mathcal{E}(\hat{\theta})+(\theta-\hat{\theta})^\top \bold{g}+\frac{1}{2}(\theta-\hat{\theta})^\top H(\theta-\hat{\theta})

where g\bold{g} is the gradient at the mode and HH the Hessian. Since θ^\hat{\theta} is the mode, the gradient is zero.

Hence:

p^(θ,D)=eE(θ^)exp[12(θθ^)H(θθ^)]p^(θD)=1Zp^(θ,D)=N(θθ^,H1)Z=eE(θ^)(2π)D/2H1/2\begin{align} \hat{p}(\theta,\mathcal{D})&=e^{-\mathcal{E}(\hat{\theta})}\exp\Big[-\frac{1}{2}(\theta-\hat{\theta})^\top H (\theta-\hat{\theta})\Big] \\ \hat{p}(\theta|\mathcal{D}) &= \frac{1}{Z} \hat{p}(\theta,\mathcal{D})=\mathcal{N}(\theta|\hat{\theta},H^{-1}) \\ Z &= e^{-\mathcal{E}(\hat{\theta})}(2\pi)^{D/2}|H|^{-1/2} \end{align}

The last line follows from the normalization constant of the MVN.

This is easy to apply, since after computing the MAP estimate, we just need to compute the Hessian at the mode (in high dimensional space, we use a diagonal approximation).

However we see that the Laplace approximation fits poorly, because the posterior is skewed whereas Gaussian is symmetric.

Additionally, the parameter of interest lies in the constraint interval θ[0,1]\theta\in[0,1] whereas the Gaussian assumes an unconstrained space θR\theta\in\R. This can be solved later by a change of variable α=logit(θ)\alpha=\mathrm{logit}(\theta).

Screen Shot 2022-12-06 at 09.34.22.png

4.6.8.3 Variational inference (VI)

VI is another optimization base posterior approximation. It aims at minimizing some discrepancy DD between our posterior distribution pp and a distribution qq from some tractable family (like MVN):

q=arg minqQD(q,p)q^*=\argmin_{q \in \mathcal{Q}} D(q,p)

If DD is the KL divergence, we can derivate the evidence lower bound (ELBO) of the marginal likelihood. Maximizing the ELBO improves the posterior approximation.

4.6.8.4 Markov Chain Monte Carlo (MCMC)

Although VI is fast, it can give a biased approximation since it is restricted to a specific function form qQq\in\mathcal{Q}.

Monte Carlo is a more flexible, non-parametric approach:

q(θ)=1Ss=1Sδ(θθs)q(\theta)=\frac{1}{S}\sum^S_{s=1}\delta(\theta-\theta_s)

The key issue is to create posterior samples θsp(θD)\theta_s \sim p(\theta|\mathcal{D}) without having to evaluate the normalization constant p(D)=p(θ,D)dθp(\mathcal{D})=\int p(\theta,\mathcal{D})d\theta

MCMC —and Hamiltonian Monte Carlo (HMC)— speeds up this method by augmenting this algorithm with gradient based information derived from logp(θ,D)\nabla \log p(\theta,\mathcal{D}).