Overview
This article is the first of the Bayesian series.
You probably already know Bayes' theorem, or stumbled upon Naive Bayes when comparing machine learning models. Maybe, like me, you feel that you have barely seen the shores of the continent that is the Bayesian land and want to set foot on it.
I wrote this article from a Stanford course with the intent for you to understand the relationships between all the Bayesian concepts without having to go through an entire classroom (like I did). For readability, I hide most of the maths in dropdowns. This way, you can follow the logic easily and come back later to deep-dive into the formulas.
We start by comparing frequentist and Bayesian methods, before focusing on different distributions and conjugate families.
1. Frequentist analysis
Suppose we want to estimate the probability of heads of a given (possibly biased) coin. We flip the coin $N$ times, generating a sequence of observations $x_1, \dots, x_N$ of iid random variables $X_1, \dots, X_N$, each of which has the Bernoulli distribution with unknown head probability $\theta$.
We compute the pmf and likelihood here
Each observation has pmf
$$p(x_i \mid \theta) = \theta^{x_i} (1-\theta)^{1-x_i}, \qquad x_i \in \{0, 1\},$$
so by independence, the likelihood can be written as:
$$p(x_{1:N} \mid \theta) = \prod_{i=1}^{N} \theta^{x_i} (1-\theta)^{1-x_i} = \theta^{S} (1-\theta)^{N-S}$$
where $S = \sum_{i=1}^{N} x_i$ is the number of heads.
The fundamental notion here is that randomness comes from sampling/replicates of the experiment, and all probability statements made in frequentist inference are statements with respect to the probability distribution induced by hypothetical repetitions of the experiment.
This leads us to the basic program of frequentist inference: point estimation, interval estimation, and testing.
1.1 Point estimation
The laws of large numbers imply that the sample mean is in some sense a "good" frequentist estimator for the expectation of a random variable.
We toss a coin $N$ times and use the sample mean $\bar{X}$ of the random variables $X_1, \dots, X_N$ as an estimator of the probability of heads $\theta$. Is it a good estimator?
As frequentists, we might answer that question by computing its mean squared error (MSE).
The MSE decomposes as variance plus squared bias:
$$\mathrm{MSE}(\bar{X}) = \mathbb{E}\big[(\bar{X} - \theta)^2\big] = \mathrm{Var}(\bar{X}) + \mathrm{Bias}(\bar{X})^2$$
Always keep in mind that for frequentists, it is the statistic or estimator (in this case, $\bar{X}$) that is random, and the parameter $\theta$ is some fixed, unknown number.
If
$$\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i$$
then $\mathbb{E}[\bar{X}] = \theta$ and $\mathrm{Var}(\bar{X}) = \frac{\theta(1-\theta)}{N}$,
so
$$\mathrm{MSE}(\bar{X}) = \frac{\theta(1-\theta)}{N}$$
Thus, $\bar{X}$ is unbiased, and its MSE is just its variance.
The sample mean is the optimal unbiased estimator in the sense that among all unbiased estimators, it has minimum variance, and thus minimum MSE (see the Rao-Blackwell and Lehmann-Scheffé theorems). The point is that the sample mean is a "good" estimator of $\theta$ because in a hypothetical infinite sequence of experiments just like the one we performed, it will typically give us a number that is close to $\theta$.
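As a quick sanity check, the exact formula above can be compared against a simulation of many hypothetical repetitions of the experiment. This is a small Python sketch (the simulation itself is not in the course material; function names are mine):

```python
import random

def sample_mean_mse(theta, N):
    # For the unbiased sample mean, MSE = Var = theta * (1 - theta) / N
    return theta * (1 - theta) / N

def empirical_mse(theta, N, reps=20_000, seed=0):
    # Monte Carlo estimate: average squared error of the sample mean
    # over many hypothetical repetitions of the N-toss experiment
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        heads = sum(rng.random() < theta for _ in range(N))
        total += (heads / N - theta) ** 2
    return total / reps

exact = sample_mean_mse(0.3, 50)   # 0.3 * 0.7 / 50 = 0.0042
approx = empirical_mse(0.3, 50)    # close to 0.0042
```

The two numbers agree up to Monte Carlo noise, which is exactly the frequentist reading of the MSE: a long-run average over replicated experiments.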
1.2 Interval estimation
Point estimation is an exercise in producing a single number that is in some sense a "best guess" at the value of a parameter $\theta$. Interval estimation aims to produce a range of values that are "plausible".
What would an exact $1-\alpha$ confidence interval for $\theta$ look like for $N$ coin tosses?
Let $\Phi$ be the standard Gaussian CDF and $\widehat{\mathrm{se}}$ the estimated standard error of $\bar{X}$, $\widehat{\mathrm{se}} = \sqrt{\bar{X}(1-\bar{X})/N}$.
We use the CDF of the binomial distribution to build a Clopper-Pearson interval
Well, we know that $S = \sum_{i=1}^{N} X_i$ has the binomial distribution with parameters $N$, $\theta$. Its CDF can be expressed as:
$$F(s; N, \theta) = \sum_{k=0}^{s} \binom{N}{k} \theta^k (1-\theta)^{N-k}$$
We construct a Clopper-Pearson interval $(\theta_L, \theta_U)$ with at least $1-\alpha$ coverage by solving:
$$\sum_{k=s}^{N} \binom{N}{k} \theta_L^k (1-\theta_L)^{N-k} = \frac{\alpha}{2}, \qquad \sum_{k=0}^{s} \binom{N}{k} \theta_U^k (1-\theta_U)^{N-k} = \frac{\alpha}{2}$$
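The two tail equations can be solved numerically. Here is a stdlib-only Python sketch (the bisection solver and function names are mine; in practice you would use a statistics library):

```python
import math

def binom_pmf(k, N, theta):
    return math.comb(N, k) * theta**k * (1 - theta) ** (N - k)

def clopper_pearson(s, N, alpha=0.05, tol=1e-10):
    # P(S >= s; theta) is increasing in theta, P(S <= s; theta) is decreasing,
    # so each tail equation has a unique root we can find by bisection.
    def upper_tail(theta):  # P(S >= s)
        return sum(binom_pmf(k, N, theta) for k in range(s, N + 1))
    def lower_tail(theta):  # P(S <= s)
        return sum(binom_pmf(k, N, theta) for k in range(0, s + 1))
    def bisect(f, target):  # f must be increasing in theta
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if f(mid) < target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    theta_L = bisect(upper_tail, alpha / 2)
    theta_U = bisect(lambda t: -lower_tail(t), -alpha / 2)
    return theta_L, theta_U

lo, hi = clopper_pearson(7, 10)   # exact 95% interval for 7 heads in 10 tosses
```

Even for this toy case, each evaluation sums binomial terms, which hints at why the exact approach gets heavy for large $N$.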
It leads to heavy computations. Instead, we prefer an asymptotic approach using the central limit theorem. The interval takes the general form:
$$\hat{\theta} \pm z_{1-\alpha/2} \, \widehat{\mathrm{se}}, \qquad z_{1-\alpha/2} = \Phi^{-1}(1-\alpha/2)$$
with (for the coin-flipping example):
$$\hat{\theta} = \bar{X}, \qquad \mathrm{se} = \sqrt{\frac{\theta(1-\theta)}{N}}$$
therefore, plugging in $\bar{X}$ for the unknown $\theta$,
$$\widehat{\mathrm{se}} = \sqrt{\frac{\bar{X}(1-\bar{X})}{N}}$$
so that
$$\left( \bar{X} - z_{1-\alpha/2} \sqrt{\frac{\bar{X}(1-\bar{X})}{N}}, \; \bar{X} + z_{1-\alpha/2} \sqrt{\frac{\bar{X}(1-\bar{X})}{N}} \right)$$
In short, the exact approach is too heavy in computation. Instead, we consider intervals that have asymptotic $1-\alpha$ coverage, justified by the central limit theorem, of the general form:
$$\bar{X} \pm z_{1-\alpha/2} \sqrt{\frac{\bar{X}(1-\bar{X})}{N}}$$
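This asymptotic (Wald) interval takes two lines of Python with only the standard library (a sketch; `wald_interval` is my name for it):

```python
import math
from statistics import NormalDist

def wald_interval(heads, N, alpha=0.05):
    # Asymptotic 1 - alpha interval: theta_hat +/- z_{1-alpha/2} * se_hat
    theta_hat = heads / N
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    se_hat = math.sqrt(theta_hat * (1 - theta_hat) / N)
    return theta_hat - z * se_hat, theta_hat + z * se_hat

ci = wald_interval(55, 100)   # roughly (0.452, 0.648)
```

Note how cheap this is compared with solving the binomial tail equations: one quantile of the Gaussian and a square root.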
1.3 Testing
We want to test whether the coin is fair through the hypotheses:
$$H_0: \theta = \tfrac{1}{2} \qquad \text{and} \qquad H_1: \theta \neq \tfrac{1}{2}$$
The classical way is to use Neyman's fixed type I error rate testing. We reject $H_0$ when our estimate $\bar{X}$ is far from $1/2$. We use the binomial distribution of $S = N\bar{X}$ to find $c$ (according to our chosen type I error rate $\alpha$) such that:
$$P_{\theta = 1/2}\left( \left| \bar{X} - \tfrac{1}{2} \right| > c \right) \leq \alpha$$
Our test is
$$\text{reject } H_0 \iff \left| \bar{X} - \tfrac{1}{2} \right| > c$$
Under the null, the sampling distribution of $S = N\bar{X}$ is given by
$$P(S = k) = \binom{N}{k} \left(\tfrac{1}{2}\right)^{N}, \qquad k = 0, \dots, N$$
So to figure out $c$ we compute the probability of rejection (which must be no bigger than $\alpha$)
$$P_{\theta = 1/2}\left( \left| \bar{X} - \tfrac{1}{2} \right| > c \right) = \sum_{k \,:\, |k/N - 1/2| > c} \binom{N}{k} \left(\tfrac{1}{2}\right)^{N}$$
Solving for $c$ to make this quantity as large as possible while still being no greater than $\alpha$ gives a test with at most $\alpha$ type I error rate.
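The search for the critical value can be done exactly with the binomial pmf. A Python sketch (working on the count scale $S = N\bar{X}$, so the threshold is an integer $k$ with $c = (k-1)/N$; function names are mine):

```python
import math

def two_sided_tail(k, N):
    # P(|S - N/2| >= k) under the null theta = 1/2, with S ~ Binomial(N, 1/2)
    half = 0.5 ** N
    return sum(math.comb(N, s) * half for s in range(N + 1) if abs(s - N / 2) >= k)

def critical_k(N, alpha=0.05):
    # Smallest k whose rejection region |S - N/2| >= k has level <= alpha
    k = 0
    while two_sided_tail(k, N) > alpha:
        k += 1
    return k

k = critical_k(100, 0.05)   # reject H0 when S <= 50 - k or S >= 50 + k
```

For $N = 100$ and $\alpha = 0.05$ this gives $k = 11$: we reject the fair-coin hypothesis when we see at most 39 or at least 61 heads.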
2. Bayesian analysis of coin tossing
There are two components here.
Likelihood (also appearing in frequentist inference). Writing it as a conditional distribution $p(x_{1:N} \mid \theta)$ suggests that we are going to consider the success probability $\theta$ itself to be a random variable:
$$p(x_{1:N} \mid \theta) = \theta^{S} (1-\theta)^{N-S}, \qquad S = \sum_{i=1}^{N} x_i$$
Prior distribution. It encapsulates our prior beliefs about $\theta$ before observing the data $x_{1:N}$. There are a number of possibilities for how we can proceed.
i) Informative prior choice.
An informative prior for a coin would likely put a lot of prior mass near $1/2$. However, exactly how much mass to place is tricky. We could try to elicit a prior from "experts" (I guess gamblers?).
ii) Objective prior choice. A prior that will have the smallest possible impact on the outcome of our analysis. Its properties are close to frequentist ones, but it is a poor choice in high-dimensional and non-parametric models.
iii) Empirical prior choice. Estimate the prior from the data, then plug the prior in and perform a Bayesian analysis. Advantage: no somewhat arbitrary prior choice. Shortcoming: it lies in the grey area between the "safety" of frequentist guarantees and that of Bayesian analysis using informative priors based on real prior information.
These two components allow us to define the posterior:
$$p(\theta \mid x_{1:N}) = \frac{p(x_{1:N} \mid \theta) \, p(\theta)}{p(x_{1:N})}$$
Bayes' theorem tells us how we should update our prior beliefs about parameters after observing data distributed according to the likelihood.
2.1 Prior choice
A very common choice is to pick a conjugate prior: posterior and prior belong to the same family of probability distributions, so that the marginal likelihood $p(x_{1:N})$ is often available analytically. For our coin-tossing example, with the Bernoulli likelihood over $N$ tosses, our prior is in the Beta family.
We have:
$$p(x_{1:N} \mid \theta) = \theta^{S} (1-\theta)^{N-S}$$
so the posterior can be expressed as:
$$p(\theta \mid x_{1:N}) \propto p(\theta) \, \theta^{S} (1-\theta)^{N-S}$$
We suppose our prior has the same form as our likelihood, with parameters $a$ and $b$, so that $p(\theta) \propto \theta^{a-1} (1-\theta)^{b-1}$. The posterior becomes:
$$p(\theta \mid x_{1:N}) \propto \theta^{a+S-1} (1-\theta)^{b+N-S-1}$$
For $a, b > 0$ this function is integrable on the unit interval, and the result is the beta function:
$$B(a, b) = \int_0^1 \theta^{a-1} (1-\theta)^{b-1} \, d\theta = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$$
so
$$p(\theta) = \frac{\theta^{a-1} (1-\theta)^{b-1}}{B(a, b)}, \qquad \theta \sim \mathrm{Beta}(a, b)$$
and then
$$\theta \mid x_{1:N} \sim \mathrm{Beta}(a + S, \; b + N - S)$$
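The conjugate update is literally one line of arithmetic. A minimal Python sketch (function name is mine):

```python
def beta_posterior(a, b, heads, N):
    # Conjugate update: Beta(a, b) prior + S heads in N tosses
    # -> Beta(a + S, b + N - S) posterior
    return a + heads, b + N - heads

# The prior acts like a pseudo-sample of a prior heads and b prior tails
print(beta_posterior(1, 1, 7, 10))   # (8, 4)
```

This pseudo-count reading is one reason the Beta family is such a natural prior for coin flipping.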
2.2 Point estimation
If the posterior contains everything we want to know about parameters, then we must be able to use it to construct point estimates of the parameters. The posterior is a distribution, and parameters are numbers, so point estimates must be maps from the space of distributions to real numbers, such as expectations.
We minimize the integrated mean squared error (IMSE), and indeed the posterior expected squared error, by using the posterior expectation (called the Bayes estimator) as our point estimate. For the coin example:
$$\hat{\theta}_{\mathrm{Bayes}} = \mathbb{E}[\theta \mid x_{1:N}] = \frac{a + S}{a + b + N}$$
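It is worth noticing that this estimator is a weighted average of the prior mean $a/(a+b)$ and the sample mean $S/N$, with the data progressively taking over as $N$ grows. A small Python sketch of that identity (names are mine):

```python
def bayes_estimator(a, b, heads, N):
    # Posterior mean of the Beta(a + S, b + N - S) posterior
    return (a + heads) / (a + b + N)

def as_weighted_average(a, b, heads, N):
    # Same number, written as a blend of prior mean and sample mean
    w = (a + b) / (a + b + N)
    return w * a / (a + b) + (1 - w) * heads / N
```

With a uniform Beta(1, 1) prior and 7 heads in 10 tosses, both give 2/3, slightly shrunk from the sample mean 0.7 toward the prior mean 0.5.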
2.3 Interval estimation
One obvious way to construct an interval is to use quantiles of the posterior. An equal-tailed interval leaves equal posterior probability to the left and to the right of the endpoints of the interval.
Our interval $(\theta_L, \theta_U)$ satisfies
$$P(\theta_L < \theta < \theta_U \mid x_{1:N}) = 1 - \alpha$$
where the probability here is the posterior probability, not the probability with respect to hypothetical repeated sampling.
For our example, we use the CDF of the Beta distribution to compute the equal-tailed interval
If we want a $1-\alpha$ equal-tailed interval, it would be the interval $(\theta_L, \theta_U)$ such that:
$$P(\theta < \theta_L \mid x_{1:N}) = \frac{\alpha}{2}$$
$$P(\theta > \theta_U \mid x_{1:N}) = \frac{\alpha}{2}$$
The posterior is
$\theta \mid x_{1:N} \sim \mathrm{Beta}(a + S, \; b + N - S)$, so an equal-tailed credible interval can be computed from the quantiles of the Beta distribution. If $F$ is the CDF of the $\mathrm{Beta}(a + S, \; b + N - S)$ distribution, then we have:
$$\theta_L = F^{-1}\!\left(\frac{\alpha}{2}\right), \qquad \theta_U = F^{-1}\!\left(1 - \frac{\alpha}{2}\right)$$
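In practice you would call a library routine such as `scipy.stats.beta.ppf` for the Beta quantiles. To keep things self-contained, here is a stdlib-only numerical sketch (trapezoidal CDF plus bisection; adequate for posterior shape parameters greater than 1, which is our case with a Beta(1, 1) prior):

```python
import math

def beta_logpdf(x, a, b):
    logB = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - logB

def beta_cdf(x, a, b, n=4000):
    # Trapezoidal integration of the pdf on (0, x); fine for a, b > 1
    h = x / n
    total = 0.5 * math.exp(beta_logpdf(x, a, b))
    for i in range(1, n):
        total += math.exp(beta_logpdf(i * h, a, b))
    return total * h

def beta_quantile(q, a, b, tol=1e-7):
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def credible_interval(a, b, heads, N, alpha=0.05):
    a_post, b_post = a + heads, b + N - heads
    return (beta_quantile(alpha / 2, a_post, b_post),
            beta_quantile(1 - alpha / 2, a_post, b_post))

lo, hi = credible_interval(1, 1, 7, 10)   # 95% equal-tailed interval
```

For 7 heads in 10 tosses with a uniform prior, the 95% equal-tailed interval sits roughly on (0.39, 0.89), bracketing the posterior mean 2/3.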
2.4 Hypothesis testing
We again need to be able to compute everything using the posterior $p(\theta \mid x_{1:N})$. Hypotheses are subsets of the parameter space; in this case, the null subset is just $\{1/2\}$.
If the posterior distribution is continuous, then the posterior probability of the null hypothesis is
$$P\left(\theta = \tfrac{1}{2} \,\middle|\, x_{1:N}\right) = 0$$
So we need to give positive probability to the null hypothesis by choosing a mixture prior:
$$p(\theta) = \frac{1}{2} \, \delta_{1/2}(\theta) + \frac{1}{2} \, g(\theta)$$
where $g$ is the density of a $\mathrm{Beta}(a, b)$ distribution and $\delta_{1/2}$ is a point mass at $1/2$. The left part is associated with $H_0$ and the right with $H_1$.
This is the first thing that might seem odd: in order to carry out the standard Bayes hypothesis test, I have to change my prior.
We compute our new posterior, and we compare it with frequentist p-values in the figure below.
Our prior is:
$$p(\theta) = \frac{1}{2} \, \delta_{1/2}(\theta) + \frac{1}{2} \, g(\theta), \qquad g = \mathrm{Beta}(a, b)$$
The posterior is now:
$$p(\theta \mid x_{1:N}) = \frac{1}{p(x_{1:N})} \left[ \frac{1}{2} \left(\tfrac{1}{2}\right)^{N} \delta_{1/2}(\theta) + \frac{1}{2} \, g(\theta) \, \theta^{S} (1-\theta)^{N-S} \right]$$
with
$$p(x_{1:N}) = \frac{1}{2} \left(\tfrac{1}{2}\right)^{N} + \frac{1}{2} \int_0^1 \theta^{S} (1-\theta)^{N-S} g(\theta) \, d\theta$$
Integrating only the part involving $g$ we find $\frac{B(a+S, \; b+N-S)}{B(a, b)}$ and thus the posterior (left as an exercise for the reader).
So we can compute the posterior probability of the null hypothesis
$$P\left(\theta = \tfrac{1}{2} \,\middle|\, x_{1:N}\right) = \frac{\frac{1}{2} \left(\tfrac{1}{2}\right)^{N}}{\frac{1}{2} \left(\tfrac{1}{2}\right)^{N} + \frac{1}{2} \frac{B(a+S, \; b+N-S)}{B(a, b)}}$$
More generally, if our null hypothesis is $H_0: \theta = \theta_0$, our prior becomes:
$$p(\theta) = q \, \delta_{\theta_0}(\theta) + (1 - q) \, g(\theta)$$
so our posterior density is
$$p(\theta \mid x_{1:N}) = \frac{1}{p(x_{1:N})} \left[ q \, \theta_0^{S} (1-\theta_0)^{N-S} \, \delta_{\theta_0}(\theta) + (1 - q) \, g(\theta) \, \theta^{S} (1-\theta)^{N-S} \right]$$
Integrating the part involving $g$ we find
$$p_1(x_{1:N}) = \int_0^1 \theta^{S} (1-\theta)^{N-S} g(\theta) \, d\theta = \frac{B(a+S, \; b+N-S)}{B(a, b)}$$
while the part associated with the null is
$$p_0(x_{1:N}) = \theta_0^{S} (1-\theta_0)^{N-S}$$
$p_0$ and $p_1$ are called marginal likelihoods, because they are obtained by integrating the likelihood function over the components of the prior associated with $H_0$ and $H_1$.
So:
$$P(\theta = \theta_0 \mid x_{1:N}) = \frac{q \, p_0(x_{1:N})}{q \, p_0(x_{1:N}) + (1 - q) \, p_1(x_{1:N})}$$
The Bayes factor is
$$BF_{01} = \frac{p_0(x_{1:N})}{p_1(x_{1:N})}$$
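Both quantities are easy to compute in log space with `math.lgamma`. A Python sketch, assuming a Beta(a, b) density for the alternative component (function name is mine):

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def null_posterior_prob(heads, N, a=1.0, b=1.0, theta0=0.5, q=0.5):
    # p0 = theta0^S (1 - theta0)^(N - S); p1 = B(a + S, b + N - S) / B(a, b)
    S = heads
    p0 = theta0**S * (1 - theta0) ** (N - S)
    p1 = math.exp(log_beta(a + S, b + N - S) - log_beta(a, b))
    bf01 = p0 / p1
    post = q * p0 / (q * p0 + (1 - q) * p1)
    return post, bf01

post, bf01 = null_posterior_prob(7, 10)
```

With 7 heads in 10 tosses and a uniform alternative, the posterior probability of a fair coin stays above 1/2 and the Bayes factor is about 1.29: the point-mass prior shields the null, which is the effect discussed below.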
Thus, unlike point estimation, where Bayesians and frequentists mostly agree, and interval estimation, where typically the Bayesian credible intervals are fairly similar to frequentist confidence intervals unless a strong prior is chosen, Bayesian hypothesis tests can reach very different conclusions from frequentist ones. In this case, the Bayesian finds less evidence against the null than the frequentist does.
2.5 Objective priors
One of the most commonly used objective priors in applications is the Jeffreys prior. It is defined as
$$p(\theta) \propto \sqrt{\det I(\theta)}$$
where $I(\theta)$ is the Fisher information matrix.
We show that this prior is actually the $\mathrm{Beta}(1/2, \; 1/2)$ prior.
Fisher information is defined as:
$$I(\theta) = -\mathbb{E}\left[ \frac{\partial^2}{\partial \theta^2} \log p(x \mid \theta) \right]$$
so for Bernoulli sampling, we have
$$\log p(x \mid \theta) = x \log \theta + (1 - x) \log(1 - \theta), \qquad \frac{\partial^2}{\partial \theta^2} \log p(x \mid \theta) = -\frac{x}{\theta^2} - \frac{1-x}{(1-\theta)^2}$$
and so, taking the expectation with $\mathbb{E}[x] = \theta$,
$$I(\theta) = \frac{1}{\theta} + \frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)}$$
So the Jeffreys prior is
$$p(\theta) \propto \theta^{-1/2} (1-\theta)^{-1/2}$$
which is the kernel of a $\mathrm{Beta}(1/2, \; 1/2)$ distribution. Now, with a one-to-one transformation
$$\phi = h(\theta)$$
we will have
$$p_\phi(\phi) = p_\theta\big(h^{-1}(\phi)\big) \, |J|$$
where $J$ is the Jacobian, $J = \frac{d\theta}{d\phi}$.
Therefore, since the Fisher information transforms as $I_\phi(\phi) = I_\theta(\theta) \left( \frac{d\theta}{d\phi} \right)^2$:
$$p_\phi(\phi) \propto \sqrt{I_\theta(\theta)} \left| \frac{d\theta}{d\phi} \right| = \sqrt{I_\phi(\phi)}$$
The Jeffreys prior is invariant under one-to-one reparametrizations, which means that if we choose to parametrize the model in terms of
$$\phi = h(\theta)$$
then the prior will still take the form
$$p(\phi) \propto \sqrt{\det I(\phi)}$$
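A quick numerical check of the Bernoulli case (a Python sketch, not from the course): the square root of the Fisher information should be proportional to the Beta(1/2, 1/2) density, with constant ratio $B(1/2, 1/2) = \pi$.

```python
import math

def sqrt_fisher(theta):
    # sqrt(I(theta)) = 1 / sqrt(theta * (1 - theta)) for Bernoulli sampling
    return 1.0 / math.sqrt(theta * (1 - theta))

def beta_half_pdf(theta):
    # Beta(1/2, 1/2) density: theta^(-1/2) (1 - theta)^(-1/2) / B(1/2, 1/2)
    logB = 2 * math.lgamma(0.5) - math.lgamma(1.0)
    return math.exp(-0.5 * math.log(theta) - 0.5 * math.log(1 - theta) - logB)

# The ratio is constant over theta, so the two agree up to normalization
ratios = [sqrt_fisher(t) / beta_half_pdf(t) for t in (0.1, 0.25, 0.5, 0.9)]
```

Every ratio equals $\pi$, confirming that the Jeffreys prior for a coin is exactly Beta(1/2, 1/2).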
3. Conjugate families
3.1 The Poisson likelihood
Suppose we observe $N$ iid data points $x_1, \dots, x_N$ from a $\mathrm{Poisson}(\lambda)$ distribution. As a function of $\lambda$, the likelihood has the form of a Gamma kernel, so we choose a prior from the same distribution family.
We show that the posterior also has the Gamma form.
We have:
$$p(x_{1:N} \mid \lambda) = \prod_{i=1}^{N} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} \propto \lambda^{S} e^{-N\lambda}, \qquad S = \sum_{i=1}^{N} x_i$$
This has the form of the kernel of a Gamma distribution: $\lambda^{a-1} e^{-b\lambda}$
So we choose our prior as a Gamma distribution
$$p(\lambda) = \frac{b^a}{\Gamma(a)} \lambda^{a-1} e^{-b\lambda}$$
Thus:
$$p(\lambda \mid x_{1:N}) \propto \lambda^{a+S-1} e^{-(b+N)\lambda}$$
so the posterior is
$$\lambda \mid x_{1:N} \sim \mathrm{Gamma}(a + S, \; b + N)$$
The Bayes estimator is therefore
$$\hat{\lambda}_{\mathrm{Bayes}} = \mathbb{E}[\lambda \mid x_{1:N}] = \frac{a + S}{b + N}$$
which is a weighted average of the prior mean and the sample mean:
$$\frac{a + S}{b + N} = \frac{b}{b + N} \cdot \frac{a}{b} + \frac{N}{b + N} \cdot \bar{x}$$
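The Gamma-Poisson update mirrors the Beta-Bernoulli one. A minimal Python sketch (function names are mine):

```python
def gamma_posterior(a, b, xs):
    # Conjugate update: Gamma(a, b) prior + Poisson data -> Gamma(a + S, b + N)
    S, N = sum(xs), len(xs)
    return a + S, b + N

def bayes_estimator(a, b, xs):
    a_post, b_post = gamma_posterior(a, b, xs)
    return a_post / b_post   # posterior mean (a + S) / (b + N)

est = bayes_estimator(2, 1, [3, 4, 5])   # (2 + 12) / (1 + 3) = 3.5
```

Here the prior mean is 2 and the sample mean is 4; the posterior mean 3.5 is exactly the weighted blend from the formula above, with weights 1/4 and 3/4.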
3.2 The Normal Distribution
As a function of $\sigma^2$, the likelihood of the normal distribution looks like the kernel of an inverse-Gamma distribution, so we define the prior this way. The joint conjugate prior on $(\mu, \sigma^2)$ is, in fact, the normal-inverse-gamma distribution.
We claim this is conjugate, and we find the parameters of the posterior.
The likelihood is:
$$p(x_{1:N} \mid \mu, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (x_i - \mu)^2 \right)$$
If $\mu$ is fixed, as a function of $\sigma^2$ it has the form of an inverse-Gamma kernel, $(\sigma^2)^{-N/2} e^{-c/\sigma^2}$.
So the prior on $\sigma^2$ is an inverse-Gamma distribution
$$p(\sigma^2) \propto (\sigma^2)^{-a-1} e^{-b/\sigma^2}$$
Also, if $\sigma^2$ were a constant, the likelihood would look like the kernel of a normal random variable in $\mu$. The conjugate prior is:
$$\mu \mid \sigma^2 \sim \mathcal{N}\left(\mu_0, \; \frac{\sigma^2}{\kappa_0}\right)$$
This is the normal-inverse-gamma distribution $\mathrm{NIG}(\mu_0, \kappa_0, a, b)$. Finally, our posterior is
$$p(\mu, \sigma^2 \mid x_{1:N}) \propto p(x_{1:N} \mid \mu, \sigma^2) \, p(\mu \mid \sigma^2) \, p(\sigma^2)$$
which is also a normal-inverse-gamma distribution. After a few computations on the parameters, we find:
$$\mu_N = \frac{\kappa_0 \mu_0 + N \bar{x}}{\kappa_0 + N}, \quad \kappa_N = \kappa_0 + N, \quad a_N = a + \frac{N}{2}, \quad b_N = b + \frac{1}{2} \sum_{i=1}^{N} (x_i - \bar{x})^2 + \frac{\kappa_0 N (\bar{x} - \mu_0)^2}{2(\kappa_0 + N)}$$
For any $\mathrm{NIG}(\mu_0, \kappa, a, b)$ distribution, the marginal distribution of $\mu$ is a Student-t centered at $\mu_0$ with scale $\sqrt{b / (a\kappa)}$; for the posterior, as $N$ grows, $\mu_N \to \bar{x}$ and this scale shrinks like $\sqrt{s^2 / N}$.
So the Bayesian posterior and the frequentist sampling distribution approach one another as the sample size grows.
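The update equations can be packaged into a small function. A hedged Python sketch (`nig_posterior` is my name; the formulas are the standard normal-inverse-gamma conjugate updates stated above):

```python
def nig_posterior(mu0, kappa0, a, b, xs):
    # Normal-inverse-gamma conjugate update for iid N(mu, sigma^2) data
    N = len(xs)
    xbar = sum(xs) / N
    ss = sum((x - xbar) ** 2 for x in xs)   # within-sample sum of squares
    kappaN = kappa0 + N
    muN = (kappa0 * mu0 + N * xbar) / kappaN
    aN = a + N / 2
    bN = b + 0.5 * ss + kappa0 * N * (xbar - mu0) ** 2 / (2 * kappaN)
    return muN, kappaN, aN, bN
```

Note how `muN` shrinks the sample mean toward the prior mean with weight $\kappa_0 / (\kappa_0 + N)$, and how `bN` charges the prior for disagreeing with the data through the $(\bar{x} - \mu_0)^2$ term.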
3.3 Multinomial Distribution
The conjugate prior of the multinomial likelihood is a multivariate generalization of the Beta distribution called the Dirichlet distribution.
The multinomial likelihood for $d$ categories is given by
$$p(x_{1:N} \mid \theta) \propto \prod_{j=1}^{d} \theta_j^{n_j}, \qquad n_j = \#\{i : x_i = j\}, \qquad \sum_{j=1}^{d} \theta_j = 1$$
The conjugate prior should be a distribution on the $(d-1)$-dimensional simplex that takes the form
$$p(\theta) \propto \prod_{j=1}^{d} \theta_j^{a_j - 1}$$
The resulting probability distribution is called the Dirichlet distribution and has pdf:
$$p(\theta) = \frac{\Gamma\left(\sum_{j=1}^{d} a_j\right)}{\prod_{j=1}^{d} \Gamma(a_j)} \prod_{j=1}^{d} \theta_j^{a_j - 1}$$
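As with the Beta-Bernoulli case, the posterior just adds the observed counts to the prior pseudo-counts. A minimal Python sketch (function names are mine):

```python
def dirichlet_posterior(alpha, counts):
    # Conjugate update: Dirichlet(alpha) prior + multinomial counts n_j
    # -> Dirichlet(alpha_j + n_j)
    return [a + n for a, n in zip(alpha, counts)]

def posterior_mean(alpha, counts):
    post = dirichlet_posterior(alpha, counts)
    total = sum(post)
    return [a / total for a in post]   # E[theta_j | x] = (a_j + n_j) / total
```

With a uniform Dirichlet(1, 1, 1) prior and counts (3, 5, 2), the posterior is Dirichlet(4, 6, 3) and the posterior mean is a smoothed version of the empirical frequencies; this is exactly the Laplace-smoothing rule you may know from Naive Bayes.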
Next
Thank you for reading, I hope this guide has been helpful so far. Now that you have taken a bite of crunchy Bayes, we will spice things up in the next article with high-dimensional models.