A primer on Bayes

Overview

This article is the first of the Bayesian series.
You probably already know Bayes' theorem, or have stumbled upon Naive Bayes when comparing machine learning models. Maybe, like me, you feel that you have barely glimpsed the shores of the continent that is Bayesian land and want to set foot on it.
I wrote this article from a Stanford course, with the intent of helping you understand how all the Bayesian concepts relate to one another without having to go through an entire classroom (like I did). For readability, I hide most of the maths in dropdowns, so you can follow the logic easily and come back later to deep dive into the formulas.
We start by comparing frequentist and Bayesian methods, before focusing on different distributions and conjugate families.

1. Frequentist analysis

Suppose we want to estimate the probability of heads of a given (possibly biased) coin. We flip the coin $N$ times, generating a sequence of observations $(x_1, \dots, x_N)$ of iid random variables $(X_1, \dots, X_N)$, each of which has a Bernoulli distribution with unknown heads probability $\theta$.
We compute the pmf and the likelihood here
Each observation has pmf
$$f_\theta(x) = \theta^x (1-\theta)^{1-x}, \quad x \in \{0, 1\}$$
so by independence, the likelihood can be written as:
$$\prod_{i=1}^N f_\theta(x_i) = \theta^{S_N}(1-\theta)^{N-S_N}$$
where $S_N = \sum_{i=1}^N x_i$.
The fundamental notion here is that randomness comes from sampling/replicates of the experiment, and all probability statements made in frequentist inference are statements with respect to the probability distribution induced by hypothetical repetitions of the experiment.
This leads us to the basic program of frequentist inference: point estimation, interval estimation, and testing.

1.1 Point estimation

The laws of large numbers imply that the sample mean is in some sense a β€œgood” frequentist estimator for the expectation of a random variable.
We toss a coin $N$ times and use the sample mean of the random variable $X = \mathbf{1}_{\text{heads}}$ as an estimator of the probability of heads $\theta$. Is it a good estimator?
As frequentists, we might answer that question by computing its mean squared error (MSE).
The MSE decomposes into variance plus squared bias:
$$\mathbb{E}[(\bar{X}_N-\theta)^2] = \mathbb{E}[(\bar{X}_N-\mathbb{E}\bar{X}_N)^2] + (\mathbb{E}\bar{X}_N-\theta)^2 = \mathrm{var}(\bar{X}_N) + \mathrm{bias}^2(\bar{X}_N)$$
Always keep in mind that for frequentists, it is the statistic or estimator (in this case, $\bar{X}_N$) that is random, and the parameter $\theta$ is some fixed, unknown number.
If $\bar{X}_N = \frac{1}{N} S_N = \frac{1}{N}\sum_{i=1}^N x_i$,
then $\mathrm{var}(S_N) = N\theta(1-\theta)$ and $\mathbb{E}(S_N) = N\theta$,
so $\mathrm{var}(\bar{X}_N) = N^{-1}\theta(1-\theta)$ and $\mathbb{E}(\bar{X}_N) = \theta$.
Thus, $\bar{X}_N$ is unbiased, and its MSE equals its variance, $N^{-1}\theta(1-\theta)$.
The sample mean is the optimal unbiased estimator in the sense that, among all unbiased estimators, it has minimum variance (and thus minimum MSE; see the Rao-Blackwell theorem). The point is that the sample mean is a β€œgood” estimator of $\theta$ because, in a hypothetical infinite sequence of experiments just like the one we performed, it will typically give us a number that is close to $\theta$.
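To make the hypothetical repetitions concrete, here is a minimal simulation sketch in Python (the values of $\theta$, $N$ and the number of replications are made up): the simulated MSE of the sample mean should come out close to $\theta(1-\theta)/N$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, N, n_rep = 0.7, 50, 50_000    # hypothetical true bias, sample size, replications

# Hypothetical repetitions of the experiment: each row is one run of N coin flips.
flips = rng.binomial(1, theta, size=(n_rep, N))
theta_hat = flips.mean(axis=1)       # sample mean for each replication

mse = np.mean((theta_hat - theta) ** 2)
print(mse, theta * (1 - theta) / N)  # simulated MSE vs. theoretical variance theta(1-theta)/N
```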

1.2 Interval estimation

Point estimation is an exercise in producing a single number that is in some sense a β€œbest guess” at the value of a parameter ΞΈ\thetaο»Ώ. Interval estimation aims to produce a range of values that are β€œplausible”.
What would an exact $1-\alpha$ confidence interval for $\theta$ look like for $N$ coin tosses?
Let $\Phi$ be the standard Gaussian CDF and $\widehat{SE}(\hat{\theta})$ the estimated standard error of $\hat{\theta}$, with $\widehat{SE}(\hat{\theta}) = \frac{\hat{\sigma}}{\sqrt{N}}$ where $\hat{\sigma}$ estimates the standard deviation of a single observation.
We use the CDF of the binomial distribution to build a Clopper-Pearson interval
Well, we know that $S_N$ has the binomial distribution with parameters $N$ and $\theta$. Its CDF can be expressed as:
$$F_\theta(k) = P(S_N \leq k) = \sum_{j=0}^{k} \binom{N}{j} \theta^j (1-\theta)^{N-j}$$
We construct a Clopper-Pearson interval $[a(S_N), b(S_N)]$ with at least $(1-\alpha)$ coverage by solving:
$$\sum_{j=S_N}^{N} \binom{N}{j} a^j (1-a)^{N-j} = \alpha/2, \qquad \sum_{j=0}^{S_N} \binom{N}{j} b^j (1-b)^{N-j} = \alpha/2$$
This leads to heavy computations. Instead, we prefer an asymptotic approach using the central limit theorem. The interval takes the general form:
$$\hat{\theta} \pm \Phi^{-1}(1-\alpha/2)\,\widehat{SE}(\hat{\theta})$$
with, for the coin-flipping example:
$$\mathrm{var}(\hat{\theta}) = N^{-2}\mathrm{var}(S_N) = N^{-1}\theta(1-\theta), \qquad \widehat{SE}(\hat{\theta})^2 = N^{-1}\bar{X}_N(1-\bar{X}_N)$$
therefore
$$a = \bar{X}_N - \Phi^{-1}(1-\alpha/2)\, N^{-1/2}\sqrt{\bar{X}_N(1-\bar{X}_N)}, \qquad b = \bar{X}_N + \Phi^{-1}(1-\alpha/2)\, N^{-1/2}\sqrt{\bar{X}_N(1-\bar{X}_N)}$$
so that, asymptotically,
$$P(\theta \in [a, b]) \approx 1-\alpha$$
In practice, then, we use this asymptotic $1-\alpha$ interval rather than the computationally heavier exact one.
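As a rough sketch of how one might compute both intervals with SciPy (the counts below are hypothetical), using the usual Beta-quantile representation of the Clopper-Pearson interval:

```python
import numpy as np
from scipy import stats

N, S = 100, 62          # hypothetical number of tosses and number of heads
alpha = 0.05
theta_hat = S / N

# Asymptotic (Wald) interval from the central limit theorem.
z = stats.norm.ppf(1 - alpha / 2)
se = np.sqrt(theta_hat * (1 - theta_hat) / N)
wald = (theta_hat - z * se, theta_hat + z * se)

# Exact Clopper-Pearson interval via its Beta-quantile representation.
cp = (stats.beta.ppf(alpha / 2, S, N - S + 1),
      stats.beta.ppf(1 - alpha / 2, S + 1, N - S))

print(wald, cp)
```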

1.3 Testing

We want to test whether the coin is fair via the hypotheses
$H_0: \theta = 1/2$ and $H_1: \theta \neq 1/2$.
The classical way is to use Neyman's fixed type I error rate testing. We reject $H_0$ when our estimate $\hat{\theta}$ is far from $1/2$, using the binomial distribution of $\hat{\theta}$ to find $c$ (according to our chosen type I error rate) such that our test rejects when
$$|\hat{\theta} - 1/2| > c$$
Under the null, the sampling distribution of $\hat{\theta}$ is given by
$$N\hat{\theta} \sim \mathrm{Binomial}(N, 1/2)$$
So to figure out $c$, we compute the probability of rejection (which must be no bigger than $\alpha$):
$$\sum_{j=0}^{N(1/2-c)} \binom{N}{j} 2^{-N} + \sum_{j=N(1/2+c)}^{N} \binom{N}{j} 2^{-N}$$
Solving for $c < 1/2$ to make this quantity as large as possible while still being less than $\alpha$ gives a test with type I error rate at most $\alpha$.
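As an illustration, a small Python sketch (with made-up $N$ and $\alpha$) can scan integer cutoffs for $S_N$ and report the first one whose rejection probability under the null drops below $\alpha$:

```python
from scipy import stats

N, alpha = 100, 0.05                        # hypothetical sample size and level
null = stats.binom(N, 0.5)                  # distribution of S_N = N * theta_hat under H0

# Find the smallest cutoff d such that rejecting when |S_N - N/2| > d
# has type I error at most alpha; c = d / N on the theta_hat scale.
for d in range(N // 2 + 1):
    reject_prob = null.cdf(N / 2 - d - 1) + null.sf(N / 2 + d)
    if reject_prob <= alpha:
        print(f"c = {d / N}, type I error = {reject_prob:.4f}")
        break
```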

2. Bayesian analysis of coin tossing

There are two components here.
Likelihood (also present in frequentist inference). Writing it as a conditional distribution suggests that we are going to consider the success probability $\theta$ itself to be a random variable:
$$p(x_1, \dots, x_N \mid \theta) = \prod_{i=1}^N f_\theta(x_i) = \theta^{S_N}(1-\theta)^{N-S_N}$$
Prior distribution. It encapsulates our prior beliefs about $\theta$ before observing the data $(x_1, \dots, x_N)$. There are a number of possibilities for how we can proceed.
i) Informative prior choice. An informative prior choice would likely put a lot of prior mass near 1/2. However, exactly how much mass to place is tricky. We could try to elicit a prior from β€œexperts” (I guess gamblers?).
ii) Objective prior choice. A prior that will have the smallest possible impact on the outcome of our analysis. Its properties are close to frequentist ones, but it is a poor choice in high-dimensional and non-parametric models.
iii) Empirical prior choice. Estimate the prior from the data, then plug it in and perform a Bayesian analysis. Advantage: no somewhat arbitrary prior choice. Shortcoming: it lies in the grey area between the β€œsafety” of frequentist guarantees and that of Bayesian analysis using informative priors based on real prior information.
These two components allow us to define the posterior:
$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{\int p(x \mid \theta)\, p(\theta)\, d\theta} = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$$
Bayes’ theorem tells us how we should update our prior beliefs about parameters after observing data distributed according to the likelihood.

2.1 Prior choice

A very common choice is to pick a conjugate prior: the posterior and the prior belong to the same family of probability distributions, so the marginal likelihood $p(x)$ is often available analytically. For our coin-tossing example, with the likelihood of $N$ Bernoulli trials, the conjugate prior is in the Beta family.
We have:
$$p(x \mid \theta) = \theta^{S_N}(1-\theta)^{N-S_N}$$
so the posterior can be expressed as:
$$p(\theta \mid x) = C(x)\, p(x \mid \theta)\, p(\theta)$$
We take a prior of the same form as the likelihood, namely a $Beta(a, b)$ density proportional to $\theta^{a-1}(1-\theta)^{b-1}$. The posterior becomes:
$$p(\theta \mid x) = C(x)\, \theta^{S_N+a-1}(1-\theta)^{N-S_N+b-1}$$
For $a > 0$, $b > 0$ this function is integrable on the unit interval, and the integral is a Beta function:
$$C(x) \int_0^1 \theta^{S_N+a-1}(1-\theta)^{N-S_N+b-1}\, d\theta = C(x)\, B(S_N+a,\, N-S_N+b) = 1$$
so
$$C(x) = B(S_N+a,\, N-S_N+b)^{-1}$$
and then
$$p(\theta \mid x) = \frac{\theta^{S_N+a-1}(1-\theta)^{N-S_N+b-1}}{B(S_N+a,\, N-S_N+b)}$$
that is, the posterior is a $Beta(S_N+a,\, N-S_N+b)$ distribution.
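As a minimal sketch of this conjugate update with SciPy (the prior parameters and coin-flip counts below are hypothetical):

```python
from scipy import stats

a, b = 2, 2                 # hypothetical Beta(a, b) prior parameters
N, S = 100, 62              # hypothetical data: N tosses, S heads

# Conjugate update: Beta(a, b) prior + Bernoulli likelihood -> Beta(S + a, N - S + b) posterior
posterior = stats.beta(S + a, N - S + b)
print(posterior.mean(), posterior.std())
```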

2.2 Point estimation

If the posterior contains everything we want to know about the parameters, then we must be able to use it to construct point estimates of those parameters. The posterior is a distribution and parameters are numbers, so point estimates must be maps from the space of distributions to real numbers (for example, expectations).
The posterior expectation (called the Bayes estimator) minimizes the integrated mean squared error (IMSE), so we use it as our point estimate:
$$\hat{\theta}_B = \mathbb{E}[\theta \mid x] = \int \theta\, p(\theta \mid x)\, d\theta$$
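For the coin-tossing example, the posterior is the $Beta(S_N+a,\, N-S_N+b)$ distribution derived above, and the mean of a $Beta(\alpha, \beta)$ distribution is $\alpha/(\alpha+\beta)$, so the Bayes estimator is
$$\hat{\theta}_B = \frac{S_N + a}{N + a + b}$$
In other words, the prior acts like $a$ pseudo-heads and $b$ pseudo-tails added to the observed data, shrinking the sample mean towards $a/(a+b)$.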

2.3 Interval estimation

One obvious way to construct an interval is to use quantiles of the posterior. An equal-tailed interval leaves equal posterior probability to the left and to the right of the endpoints of the interval.
Our interval satisfies
$$P(\theta \in [a, b]) = 1-\alpha$$
where the probability here is the posterior probability, not the probability with respect to hypothetical repeated sampling.
For our example, we use the CDF of the Beta distribution to compute the equal-tailed interval
If we want a $(1-\alpha)$ equal-tailed interval, it is the interval $[a, b]$ with:
$$a = \sup_{a'} \Big\{ a' : \int_{-\infty}^{a'} p(\theta \mid x)\, d\theta < \alpha/2 \Big\}$$
$$b = \inf_{b'} \Big\{ b' : \int_{b'}^{\infty} p(\theta \mid x)\, d\theta < \alpha/2 \Big\}$$
The posterior is
$Beta(S_N+a,\, N-S_N+b)$, so an equal-tailed credible interval can be computed from the quantiles of the Beta distribution. If $I(x;\, a, b)$ is the CDF of the Beta distribution, then the interval is:
$$\big[\, I^{-1}(\alpha/2;\, S_N+a,\, N-S_N+b),\; I^{-1}(1-\alpha/2;\, S_N+a,\, N-S_N+b)\, \big]$$
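A small SciPy sketch of this interval (again with hypothetical prior parameters and data):

```python
from scipy import stats

a, b = 2, 2                 # hypothetical prior parameters
N, S = 100, 62              # hypothetical data
alpha = 0.05

# Equal-tailed credible interval = quantiles of the Beta(S + a, N - S + b) posterior
lo = stats.beta.ppf(alpha / 2, S + a, N - S + b)
hi = stats.beta.ppf(1 - alpha / 2, S + a, N - S + b)
print(lo, hi)
```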

2.4 Hypothesis testing

We again need to be able to compute everything using $p(\theta \mid x)$. Hypotheses are subsets of the parameter space; in this case, the subset is just $\{1/2\}$.
If the posterior distribution is continuous, then the posterior probability of the null hypothesis is
$$P(\theta = 1/2 \mid x) = 0$$
So we need to give positive probability to the null hypothesis by choosing a mixture prior
$$p(\theta) = q\, \delta(\theta - 1/2) + (1-q)\, f(\theta)$$
where $f(\theta)$ is the density of a $Beta(a, b)$ distribution and $q \in [0, 1]$. The first term is associated with $H_0$ and the second with $H_1$.
This is the first thing that might seem odd: in order to carry out the standard Bayes hypothesis test, I have to change my prior.
We compute our new posterior below, which can then be compared with frequentist p-values.
Our prior is:
$$p(\theta) = q\, \delta(\theta - 1/2) + (1-q)\, f(\theta)$$
The posterior is now:
$$p(\theta \mid x) = C(x) \binom{N}{S_N} \theta^{S_N}(1-\theta)^{N-S_N} \big\{ q\, \delta(\theta - 1/2) + (1-q)\, f(\theta) \big\}$$
with
$$f(\theta) = \frac{1}{B(a, b)}\, \theta^{a-1}(1-\theta)^{b-1}$$
Integrating only the part involving $\theta$, we find $C(x)$ and thus the posterior (left as an exercise for the reader).
So we can compute the posterior probability of the null hypothesis:
$$P(\theta = 1/2 \mid x) = \int_{\{1/2\}} p(\theta \mid x)\, d\theta = \frac{1}{1 + \frac{1-q}{q} \frac{B(a+S_N,\, N-S_N+b)}{B(a, b)}\, 2^N}$$
More generally, if our null hypothesis is $H_0: \theta = c$, our prior becomes:
$$p(\theta) = q\, \delta(\theta - c) + (1-q)\, f(\theta)$$
so our posterior density is
$$p(\theta \mid x) = C(x)\, p(x \mid \theta) \big\{ q\, \delta(\theta - c) + (1-q)\, f(\theta) \big\}$$
Integrating the part involving $\theta$, we find $C(x)$:
$$C(x)^{-1} = q\, p(x \mid \theta = c) + (1-q) \int p(x \mid \theta)\, f(\theta)\, d\theta = q\, p(x \mid \gamma = 0) + (1-q)\, p(x \mid \gamma = 1)$$
where $\gamma$ indicates which hypothesis holds ($\gamma = 0$ under $H_0$, $\gamma = 1$ under $H_1$). $p(x \mid \gamma = 0)$ and $p(x \mid \gamma = 1)$ are called marginal likelihoods, because they are obtained by integrating the likelihood function over the components of the prior associated with $H_0$ and $H_1$.
So:
$$P(\theta = c \mid x) = \int_{\{c\}} p(\theta \mid x)\, d\theta = \int_{\{c\}} \frac{p(x \mid \theta)\big\{ q\,\delta(\theta - c) + (1-q)\, f(\theta) \big\}}{q\, p(x \mid \gamma=0) + (1-q)\, p(x \mid \gamma=1)}\, d\theta = \frac{q\, p(x \mid \gamma=0)}{q\, p(x \mid \gamma=0) + (1-q)\, p(x \mid \gamma=1)} = \frac{1}{1 + \frac{1-q}{q}\frac{p(x \mid \gamma=1)}{p(x \mid \gamma=0)}}$$
The ratio $BF(x) = \frac{p(x \mid \gamma=1)}{p(x \mid \gamma=0)}$ is called the Bayes factor.
Thus, unlike point estimation, where Bayesians and frequentists mostly agree, and interval estimation, where typically the Bayesian credible intervals are fairly similar to frequentist confidence intervals unless a strong prior is chosen, Bayesian hypothesis tests can reach very different conclusions from frequentist ones. In this case, the Bayesian finds less evidence against the null than the frequentist does.
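To see this numerically, here is a rough sketch (hypothetical data; $q$, $a$, $b$ are made-up prior choices) computing the posterior probability of the null from the formula above, next to an exact two-sided binomial p-value (binomtest requires a reasonably recent SciPy):

```python
import numpy as np
from scipy import stats
from scipy.special import betaln

N, S = 100, 62              # hypothetical data: 62 heads in 100 tosses
a, b, q = 1.0, 1.0, 0.5     # hypothetical Beta(a, b) prior under H1 and prior mass q on H0

# Posterior probability of H0: theta = 1/2, computed in log space for stability
log_bf = betaln(a + S, b + N - S) - betaln(a, b) + N * np.log(2.0)  # log of B(a+S, N-S+b)/B(a,b) * 2^N
post_h0 = 1.0 / (1.0 + (1.0 - q) / q * np.exp(log_bf))

# Frequentist two-sided p-value for H0: theta = 1/2
p_value = stats.binomtest(S, N, 0.5).pvalue

print(post_h0, p_value)
```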

2.5 Objective priors

One of the most commonly used objective priors in applications is the Jeffreys prior. It is defined as
$$p(\theta) \propto |I(\theta)|^{1/2}$$
where $I(\theta)$ is the Fisher information matrix.
We show that for the coin-tossing model this prior is actually the $Beta(1/2, 1/2)$ prior.
The Fisher information is defined as:
$$I(\theta) = \mathbb{E}\Big[ \Big( \frac{\partial}{\partial\theta} \log f(x; \theta) \Big)^2 \,\Big|\, \theta \Big]$$
so for Bernoulli sampling, using the equivalent expression $I(\theta) = -\mathbb{E}\big[\frac{\partial^2}{\partial\theta^2} \log f(x;\theta) \,\big|\, \theta\big]$, we have
$$-\frac{\partial^2}{\partial\theta^2}\big\{ x \log(\theta) + (1-x)\log(1-\theta) \big\} = \frac{x}{\theta^2} + \frac{1-x}{(1-\theta)^2}$$
and so
$$I(\theta) = \mathbb{E}_{x \mid \theta}\Big[ \frac{x}{\theta^2} + \frac{1-x}{(1-\theta)^2} \Big] = \frac{1}{\theta} + \frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)}$$
So the Jeffreys prior is
$$p(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}$$
which is the kernel of a $Beta(1/2, 1/2)$ distribution.
Moreover, under a transformation $\mu = g(\theta)$ we have
$$I(\theta) = J^T I(\mu) J$$
where $J$ is the Jacobian, $J_{ij} = \frac{\partial \mu_i}{\partial \theta_j}$.
Therefore:
$$|I(\theta)|^{1/2} = |J|\, |I(\mu)|^{1/2}$$
The Jeffreys prior is therefore invariant under one-to-one reparametrizations: if we choose to parametrize the model in terms of $\mu = g(\theta)$, the prior still takes the form $|I(\mu)|^{1/2}$.
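Since the Jeffreys prior for this model is $Beta(1/2, 1/2)$, it is itself conjugate; here is a tiny sketch (with hypothetical counts) comparing it with a uniform $Beta(1, 1)$ prior:

```python
from scipy import stats

N, S = 100, 62              # hypothetical coin-toss data

# The Jeffreys prior Beta(1/2, 1/2) is conjugate, so the posterior is
# Beta(S + 1/2, N - S + 1/2); compare with a uniform Beta(1, 1) prior.
jeffreys_post = stats.beta(S + 0.5, N - S + 0.5)
uniform_post = stats.beta(S + 1.0, N - S + 1.0)
print(jeffreys_post.mean(), uniform_post.mean(), S / N)
```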

3 Conjugate families

3.1 The Poisson likelihood

Suppose we observe $N$ iid data points $x_1, \dots, x_N$ from a $Poisson(\theta)$ distribution. Viewed as a function of $\theta$, the likelihood has the form of a Gamma kernel, so we choose a prior from the same distribution family.
We show that the posterior then also has the Gamma form.
We have:
$$p(x \mid \theta) = \prod_{i=1}^N e^{-\theta} \frac{\theta^{x_i}}{x_i!} = e^{-N\theta} \frac{\theta^{S_N}}{\prod_i x_i!}$$
This has the form of the kernel of a Gamma distribution: $\theta^a e^{-b\theta}$.
So we choose our prior as a Gamma distribution:
$$p(\theta) = \frac{b^a}{\Gamma(a)}\, \theta^{a-1} e^{-b\theta}$$
Thus:
$$p(\theta \mid x) = C(x)\, e^{-(N+b)\theta}\, \theta^{S_N+a-1}$$
so the posterior is $Gamma(S_N+a,\, N+b)$.
The Bayes estimator is therefore
$$\mathbb{E}_{\theta \mid x}(\theta) = \frac{S_N+a}{N+b}$$
since the mean of a $Gamma(a, b)$ distribution, with density $\frac{b^a}{\Gamma(a)}\theta^{a-1}e^{-b\theta}$, is $a/b$.
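A short SciPy sketch of the Poisson-Gamma update (the prior parameters and counts are made up; note that SciPy's Gamma is parametrized by a scale, the inverse of the rate $b$):

```python
import numpy as np
from scipy import stats

a, b = 2.0, 1.0                      # hypothetical Gamma(a, b) prior (b is a rate)
x = np.array([3, 1, 4, 2, 2, 5, 3])  # hypothetical Poisson counts
N, S = len(x), x.sum()

# Conjugate update: Gamma(a, b) prior + Poisson likelihood -> Gamma(S + a, N + b) posterior.
posterior = stats.gamma(a=S + a, scale=1.0 / (N + b))
print(posterior.mean(), (S + a) / (N + b))   # Bayes estimator computed two ways
```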

3.2 The Normal Distribution

Viewed as a function of $\sigma^{-2}$ (with $\mu$ fixed), the normal likelihood looks like the kernel of a Gamma distribution, and viewed as a function of $\mu$ (with $\sigma^2$ fixed) it looks like a normal kernel, so we build the prior accordingly. The conjugate prior is, in fact, the normal-inverse-gamma distribution.
We claim this is conjugate, and we find the parameters of the posterior.
The likelihood is:
$$p(x \mid \mu, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\Big( -\frac{1}{2\sigma^2} \sum_{i=1}^N (x_i-\mu)^2 \Big)$$
If $\mu$ is fixed, this has the form of a Gamma kernel in the parameter $\sigma^{-2}$, so the prior on $\sigma^2$ is an inverse-Gamma distribution:
$$p(\sigma^2) \propto (\sigma^2)^{-a-1} e^{-\frac{b}{\sigma^2}}$$
Also, if $\sigma^2$ were a constant, the likelihood as a function of $\mu$ would look like the kernel of a normal random variable. The conjugate prior is:
$$p(\mu, \sigma^2) = p(\mu \mid \sigma^2)\, p(\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2\tau^2}} \exp\Big( -\frac{(\mu-m)^2}{2\sigma^2\tau^2} \Big) \cdot \frac{b^a}{\Gamma(a)} (\sigma^2)^{-a-1} \exp\Big( -\frac{b}{\sigma^2} \Big)$$
$$\propto (\sigma^2)^{-a-1-1/2} \exp\Big( -\frac{1}{\sigma^2}\Big( b + \frac{1}{2}\tau^{-2}(\mu-m)^2 \Big) \Big) \propto N\Gamma^{-1}(m, \tau^2, a, b)$$
This is the normal-inverse gamma distribution. Finally, our posterior is
$$p(\mu, \sigma^2 \mid x) \propto p(x \mid \mu, \sigma^2)\, p(\mu, \sigma^2) \propto (\sigma^2)^{-N/2-a-1-1/2} \exp\Big( -\frac{1}{\sigma^2}\Big( b + \frac{1}{2}\tau^{-2}(\mu-m)^2 + \frac{1}{2}\sum_{i=1}^N (x_i-\mu)^2 \Big) \Big)$$
Which is also a normal-inverse-gamma distribution. After some algebra on the parameters, we find:
$$p(\mu, \sigma^2 \mid x) \propto N\Gamma^{-1}\Big( \frac{\tau^{-2}m + N\bar{x}}{N+\tau^{-2}},\; (N+\tau^{-2})^{-1},\; a+\frac{N}{2},\; b + \frac{1}{2}SSE(x) + \frac{1}{2}\frac{N\tau^{-2}}{N+\tau^{-2}}(\bar{x}-m)^2 \Big)$$
where $SSE(x) = \sum_{i=1}^N (x_i - \bar{x})^2$.
For any $N\Gamma^{-1}(m, \tau^2, a, b)$ distribution we have $\mu \mid \sigma^2 \sim N(m, \tau^2\sigma^2)$, so for the posterior
$$\mu \mid \sigma^2, x \;\sim\; N\Big( \frac{\tau^{-2}m + N\bar{x}}{N+\tau^{-2}},\; \frac{\sigma^2}{N+\tau^{-2}} \Big)$$
As $N$ grows, the posterior mean approaches $\bar{x}$ and the conditional variance approaches $\sigma^2/N$, so the Bayesian posterior and the frequentist sampling distribution approach one another as the sample size grows.
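Here is a minimal sketch of this normal-inverse-gamma update in Python, following the parameter formulas above (the hyperparameters and data are made up):

```python
import numpy as np

# Hypothetical prior hyperparameters and data
m, tau2, a, b = 0.0, 10.0, 2.0, 2.0
x = np.array([1.2, 0.7, 2.1, 1.5, 0.9, 1.8])
N, xbar = len(x), x.mean()
sse = np.sum((x - xbar) ** 2)

# Normal-inverse-gamma posterior parameters, following the update rule above
prec = 1.0 / tau2                                   # tau^{-2}
m_post = (prec * m + N * xbar) / (N + prec)
tau2_post = 1.0 / (N + prec)
a_post = a + N / 2.0
b_post = b + 0.5 * sse + 0.5 * (N * prec / (N + prec)) * (xbar - m) ** 2

print(m_post, tau2_post, a_post, b_post)
```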

3.3 Multinomial Distribution

The conjugate prior of the multinomial likelihood is a multivariate generalization of the Beta distribution, called the Dirichlet distribution.
The multinomial likelihood for $d$ categories is given by
$$p(x_1, \dots, x_d \mid \theta_1, \dots, \theta_d) = \frac{N!}{x_1! \cdots x_d!} \prod_{j=1}^d \theta_j^{x_j}$$
The conjugate prior should be a distribution on the $(d-1)$-dimensional simplex that takes the form
$$p(\theta) \propto \prod_{j=1}^d \theta_j^{a_j-1}$$
The resulting probability distribution is called the Dirichlet distribution and has pdf:
$$p(\theta_1, \dots, \theta_d) = \frac{1}{B(a)} \prod_{j=1}^d \theta_j^{a_j-1}\, \mathbf{1}\{\theta \in \mathbb{S}^{d-1}\}$$
By conjugacy, the posterior is again Dirichlet, with parameters $(a_1+x_1, \dots, a_d+x_d)$.
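Finally, a short sketch of the Dirichlet-multinomial update with SciPy (the prior and counts are hypothetical):

```python
import numpy as np
from scipy import stats

a = np.array([1.0, 1.0, 1.0])      # hypothetical Dirichlet prior (uniform on the simplex)
x = np.array([12, 30, 8])          # hypothetical multinomial counts for d = 3 categories

# Conjugate update: Dirichlet(a) prior + multinomial counts x -> Dirichlet(a + x) posterior
a_post = a + x
print(a_post / a_post.sum())               # posterior mean of (theta_1, ..., theta_d)
print(stats.dirichlet(a_post).rvs(3))      # a few posterior draws
```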

Next

Thank you for reading, and I hope this guide has been helpful so far. Now that you have taken a first bite of crunchy Bayes, we will spice things up in the next article with high-dimensional models.