Overview
This article is the first of the Bayesian series.
You probably already know Bayes' theorem, or stumbled upon Naive Bayes when comparing machine learning models. Maybe, like me, you feel that you have barely seen the shores of the continent that is the Bayesian land and want to set foot on it.
I wrote this article from a Stanford course with the intent for you to understand the relationships between all the Bayesian concepts without having to go through an entire classroom (like I did). For readability, I hide most of the maths in dropdowns. This way, you can follow the logic easily and come back later to deep-dive into the formulas.
We start by comparing frequentist and Bayesian methods, before focusing on different distributions and conjugate families.
1. Frequentist analysis
Suppose we want to estimate the probability of heads of a given (possibly biased) coin. We flip the coin $N$ times, generating a sequence of observations $x_1, \dots, x_N$ of iid random variables $X_1, \dots, X_N$, each of which has the Bernoulli distribution with unknown head probability $\theta$.
We compute the pmf and likelihood here
Each observation has pmf
$$p(x_i \mid \theta) = \theta^{x_i} (1-\theta)^{1-x_i}, \qquad x_i \in \{0, 1\},$$
so by independence, the likelihood can be written as:
$$p(x_{1:N} \mid \theta) = \prod_{i=1}^{N} \theta^{x_i} (1-\theta)^{1-x_i} = \theta^{S} (1-\theta)^{N-S}$$
where $S = \sum_{i=1}^{N} x_i$ is the number of heads.
The fundamental notion here is that randomness comes from sampling/replicates of the experiment, and all probability statements made in frequentist inference are statements with respect to the probability distribution induced by hypothetical repetitions of the experiment.
This leads us to the basic program of frequentist inference: point estimation, interval estimation, and testing.
1.1 Point estimation
The laws of large numbers imply that the sample mean is in some sense a "good" frequentist estimator for the expectation of a random variable.
We toss a coin $N$ times and use the sample mean $\bar{X}$ of the random variables $X_1, \dots, X_N$ as an estimator of the probability of heads $\theta$. Is it a good estimator?
As frequentists, we might answer that question by computing its mean squared error (MSE).
The MSE decomposes as variance plus squared bias:
$$\mathrm{MSE}(\bar{X}) = \mathbb{E}\big[(\bar{X} - \theta)^2\big] = \mathrm{Var}(\bar{X}) + \mathrm{Bias}(\bar{X})^2$$
Always keep in mind that for frequentists, it is the statistic or estimator (in this case, $\bar{X}$) that is random, and the parameter $\theta$ is some fixed, unknown number.
If
$$\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i$$
then $\mathbb{E}[\bar{X}] = \theta$ and $\mathrm{Var}(\bar{X}) = \frac{\theta(1-\theta)}{N}$,
so
$$\mathrm{MSE}(\bar{X}) = \frac{\theta(1-\theta)}{N}$$
Thus, $\bar{X}$ is unbiased, and its MSE is just its variance.
The sample mean is the optimal unbiased estimator in the sense that among all unbiased estimators, it has minimum variance, and thus minimum MSE (see the Rao-Blackwell and Lehmann-Scheffé theorems). The point is that the sample mean is a "good" estimator of $\theta$ because in a hypothetical infinite sequence of experiments just like the one we performed, it will typically give us a number that is close to $\theta$.
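As a quick sanity check, the exact formula above can be compared against a simulation of many hypothetical repetitions of the experiment. This is a small Python sketch (the simulation itself is not in the course material; function names are mine):

```python
import random

def sample_mean_mse(theta, N):
    # For the unbiased sample mean, MSE = Var = theta * (1 - theta) / N
    return theta * (1 - theta) / N

def empirical_mse(theta, N, reps=20_000, seed=0):
    # Monte Carlo estimate: average squared error of the sample mean
    # over many hypothetical repetitions of the N-toss experiment
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        heads = sum(rng.random() < theta for _ in range(N))
        total += (heads / N - theta) ** 2
    return total / reps

exact = sample_mean_mse(0.3, 50)   # 0.3 * 0.7 / 50 = 0.0042
approx = empirical_mse(0.3, 50)    # close to 0.0042
```

The two numbers agree up to Monte Carlo noise, which is exactly the frequentist reading of the MSE: a long-run average over replicated experiments.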
1.2 Interval estimation
Point estimation is an exercise in producing a single number that is in some sense a "best guess" at the value of a parameter $\theta$. Interval estimation aims to produce a range of values that are "plausible".
What would an exact $1-\alpha$ confidence interval for $\theta$ look like for $N$ coin tosses?
Let $\Phi$ be the standard Gaussian CDF and $\widehat{\mathrm{se}}$ the estimated standard error of $\bar{X}$, $\widehat{\mathrm{se}} = \sqrt{\bar{X}(1-\bar{X})/N}$.
We use the CDF of the binomial distribution to build a Clopper-Pearson interval
Well, we know that $S = \sum_{i=1}^{N} X_i$ has the binomial distribution with parameters $N$, $\theta$. Its CDF can be expressed as:
$$F(s; N, \theta) = \sum_{k=0}^{s} \binom{N}{k} \theta^k (1-\theta)^{N-k}$$
We construct a Clopper-Pearson interval $(\theta_L, \theta_U)$ with at least $1-\alpha$ coverage by solving:
$$\sum_{k=s}^{N} \binom{N}{k} \theta_L^k (1-\theta_L)^{N-k} = \frac{\alpha}{2}, \qquad \sum_{k=0}^{s} \binom{N}{k} \theta_U^k (1-\theta_U)^{N-k} = \frac{\alpha}{2}$$
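The two tail equations can be solved numerically. Here is a stdlib-only Python sketch (the bisection solver and function names are mine; in practice you would use a statistics library):

```python
import math

def binom_pmf(k, N, theta):
    return math.comb(N, k) * theta**k * (1 - theta) ** (N - k)

def clopper_pearson(s, N, alpha=0.05, tol=1e-10):
    # P(S >= s; theta) is increasing in theta, P(S <= s; theta) is decreasing,
    # so each tail equation has a unique root we can find by bisection.
    def upper_tail(theta):  # P(S >= s)
        return sum(binom_pmf(k, N, theta) for k in range(s, N + 1))
    def lower_tail(theta):  # P(S <= s)
        return sum(binom_pmf(k, N, theta) for k in range(0, s + 1))
    def bisect(f, target):  # f must be increasing in theta
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if f(mid) < target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    theta_L = bisect(upper_tail, alpha / 2)
    theta_U = bisect(lambda t: -lower_tail(t), -alpha / 2)
    return theta_L, theta_U

lo, hi = clopper_pearson(7, 10)   # exact 95% interval for 7 heads in 10 tosses
```

Even for this toy case, each evaluation sums binomial terms, which hints at why the exact approach gets heavy for large $N$.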
It leads to heavy computations. Instead, we prefer an asymptotic approach using the central limit theorem. The interval takes the general form:
$$\hat{\theta} \pm z_{1-\alpha/2} \, \widehat{\mathrm{se}}, \qquad z_{1-\alpha/2} = \Phi^{-1}(1-\alpha/2)$$
with (for the coin-flipping example):
$$\hat{\theta} = \bar{X}, \qquad \mathrm{se} = \sqrt{\frac{\theta(1-\theta)}{N}}$$
therefore, plugging in $\bar{X}$ for the unknown $\theta$,
$$\widehat{\mathrm{se}} = \sqrt{\frac{\bar{X}(1-\bar{X})}{N}}$$
so that
$$\left( \bar{X} - z_{1-\alpha/2} \sqrt{\frac{\bar{X}(1-\bar{X})}{N}}, \; \bar{X} + z_{1-\alpha/2} \sqrt{\frac{\bar{X}(1-\bar{X})}{N}} \right)$$
In short, the exact approach is too heavy in computation. Instead, we consider intervals that have asymptotic $1-\alpha$ coverage, justified by the central limit theorem, of the general form:
$$\bar{X} \pm z_{1-\alpha/2} \sqrt{\frac{\bar{X}(1-\bar{X})}{N}}$$
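This asymptotic (Wald) interval takes two lines of Python with only the standard library (a sketch; `wald_interval` is my name for it):

```python
import math
from statistics import NormalDist

def wald_interval(heads, N, alpha=0.05):
    # Asymptotic 1 - alpha interval: theta_hat +/- z_{1-alpha/2} * se_hat
    theta_hat = heads / N
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    se_hat = math.sqrt(theta_hat * (1 - theta_hat) / N)
    return theta_hat - z * se_hat, theta_hat + z * se_hat

ci = wald_interval(55, 100)   # roughly (0.452, 0.648)
```

Note how cheap this is compared with solving the binomial tail equations: one quantile of the Gaussian and a square root.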
1.3 Testing
We want to test whether the coin is fair through the hypotheses:
$$H_0: \theta = \tfrac{1}{2} \qquad \text{and} \qquad H_1: \theta \neq \tfrac{1}{2}$$
The classical way is to use Neyman's fixed type I error rate testing. We reject $H_0$ when our estimate $\bar{X}$ is far from $1/2$. We use the binomial distribution of $S = N\bar{X}$ to find $c$ (according to our chosen type I error rate $\alpha$) such that:
$$P_{\theta = 1/2}\left( \left| \bar{X} - \tfrac{1}{2} \right| > c \right) \leq \alpha$$
Our test is
$$\text{reject } H_0 \iff \left| \bar{X} - \tfrac{1}{2} \right| > c$$
Under the null, the sampling distribution of $S = N\bar{X}$ is given by
$$P(S = k) = \binom{N}{k} \left(\tfrac{1}{2}\right)^{N}, \qquad k = 0, \dots, N$$
So to figure out $c$ we compute the probability of rejection (which must be no bigger than $\alpha$)
$$P_{\theta = 1/2}\left( \left| \bar{X} - \tfrac{1}{2} \right| > c \right) = \sum_{k \,:\, |k/N - 1/2| > c} \binom{N}{k} \left(\tfrac{1}{2}\right)^{N}$$
Solving for $c$ to make this quantity as large as possible while still being no greater than $\alpha$ gives a test with at most $\alpha$ type I error rate.
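The search for the critical value can be done exactly with the binomial pmf. A Python sketch (working on the count scale $S = N\bar{X}$, so the threshold is an integer $k$ with $c = (k-1)/N$; function names are mine):

```python
import math

def two_sided_tail(k, N):
    # P(|S - N/2| >= k) under the null theta = 1/2, with S ~ Binomial(N, 1/2)
    half = 0.5 ** N
    return sum(math.comb(N, s) * half for s in range(N + 1) if abs(s - N / 2) >= k)

def critical_k(N, alpha=0.05):
    # Smallest k whose rejection region |S - N/2| >= k has level <= alpha
    k = 0
    while two_sided_tail(k, N) > alpha:
        k += 1
    return k

k = critical_k(100, 0.05)   # reject H0 when S <= 50 - k or S >= 50 + k
```

For $N = 100$ and $\alpha = 0.05$ this gives $k = 11$: we reject the fair-coin hypothesis when we see at most 39 or at least 61 heads.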
2. Bayesian analysis of coin tossing
There are two components here.
Likelihood (also appearing in frequentist inference). Writing it as a conditional distribution $p(x_{1:N} \mid \theta)$ suggests that we are going to consider the success probability $\theta$ itself to be a random variable:
$$p(x_{1:N} \mid \theta) = \theta^{S} (1-\theta)^{N-S}, \qquad S = \sum_{i=1}^{N} x_i$$
Prior distribution. It encapsulates our prior beliefs about $\theta$ before observing the data $x_{1:N}$. There are a number of possibilities for how we can proceed.
i) Informative prior choice.
An informative prior for a coin would likely put a lot of prior mass near $1/2$. However, exactly how much mass to place is tricky. We could try to elicit a prior from "experts" (I guess gamblers?).
ii) Objective prior choice. A prior that will have the smallest possible impact on the outcome of our analysis. Its properties are close to frequentist ones, but it is a poor choice in high-dimensional and non-parametric models.
iii) Empirical prior choice. Estimate the prior from the data, then plug the prior in and perform a Bayesian analysis. Advantage: no somewhat arbitrary prior choice. Shortcoming: it lies in the grey area between the "safety" of frequentist guarantees and that of Bayesian analysis using informative priors based on real prior information.
These two components allow us to define the posterior:
$$p(\theta \mid x_{1:N}) = \frac{p(x_{1:N} \mid \theta) \, p(\theta)}{p(x_{1:N})}$$
Bayes' theorem tells us how we should update our prior beliefs about parameters after observing data distributed according to the likelihood.
2.1 Prior choice
A very common choice is to pick a conjugate prior: posterior and prior belong to the same family of probability distributions, so that the marginal likelihood $p(x_{1:N})$ is often available analytically. For our coin-tossing example, with the Bernoulli likelihood over $N$ tosses, our prior is in the Beta family.
We have:
$$p(x_{1:N} \mid \theta) = \theta^{S} (1-\theta)^{N-S}$$
so the posterior can be expressed as:
$$p(\theta \mid x_{1:N}) \propto p(\theta) \, \theta^{S} (1-\theta)^{N-S}$$
We suppose our prior has the same form as our likelihood, with parameters $a$ and $b$, so that $p(\theta) \propto \theta^{a-1} (1-\theta)^{b-1}$. The posterior becomes:
$$p(\theta \mid x_{1:N}) \propto \theta^{a+S-1} (1-\theta)^{b+N-S-1}$$
For $a, b > 0$ this function is integrable on the unit interval, and the result is the beta function:
$$B(a, b) = \int_0^1 \theta^{a-1} (1-\theta)^{b-1} \, d\theta = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$$
so
$$p(\theta) = \frac{\theta^{a-1} (1-\theta)^{b-1}}{B(a, b)}, \qquad \theta \sim \mathrm{Beta}(a, b)$$
and then
$$\theta \mid x_{1:N} \sim \mathrm{Beta}(a + S, \; b + N - S)$$
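The conjugate update is literally one line of arithmetic. A minimal Python sketch (function name is mine):

```python
def beta_posterior(a, b, heads, N):
    # Conjugate update: Beta(a, b) prior + S heads in N tosses
    # -> Beta(a + S, b + N - S) posterior
    return a + heads, b + N - heads

# The prior acts like a pseudo-sample of a prior heads and b prior tails
print(beta_posterior(1, 1, 7, 10))   # (8, 4)
```

This pseudo-count reading is one reason the Beta family is such a natural prior for coin flipping.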
2.2 Point estimation
If the posterior contains everything we want to know about parameters, then we must be able to use it to construct point estimates of the parameters. The posterior is a distribution, and parameters are numbers, so point estimates must be maps from the space of distributions to real numbers, such as expectations.
We minimize the integrated mean squared error (IMSE), and indeed the posterior expected squared error, by using the posterior expectation (called the Bayes estimator) as our point estimate. For the coin example:
$$\hat{\theta}_{\mathrm{Bayes}} = \mathbb{E}[\theta \mid x_{1:N}] = \frac{a + S}{a + b + N}$$
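It is worth noticing that this estimator is a weighted average of the prior mean $a/(a+b)$ and the sample mean $S/N$, with the data progressively taking over as $N$ grows. A small Python sketch of that identity (names are mine):

```python
def bayes_estimator(a, b, heads, N):
    # Posterior mean of the Beta(a + S, b + N - S) posterior
    return (a + heads) / (a + b + N)

def as_weighted_average(a, b, heads, N):
    # Same number, written as a blend of prior mean and sample mean
    w = (a + b) / (a + b + N)
    return w * a / (a + b) + (1 - w) * heads / N
```

With a uniform Beta(1, 1) prior and 7 heads in 10 tosses, both give 2/3, slightly shrunk from the sample mean 0.7 toward the prior mean 0.5.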
2.3 Interval estimation
One obvious way to construct an interval is to use quantiles of the posterior. An equal-tailed interval leaves equal posterior probability to the left and to the right of the endpoints of the interval.
Our interval $(\theta_L, \theta_U)$ satisfies
$$P(\theta_L < \theta < \theta_U \mid x_{1:N}) = 1 - \alpha$$
where the probability here is the posterior probability, not the probability with respect to hypothetical repeated sampling.
For our example, we use the CDF of the Beta distribution to compute the equal-tailed interval
If we want a $1-\alpha$ equal-tailed interval, it would be the interval $(\theta_L, \theta_U)$ such that:
$$P(\theta < \theta_L \mid x_{1:N}) = \frac{\alpha}{2}$$
$$P(\theta > \theta_U \mid x_{1:N}) = \frac{\alpha}{2}$$
The posterior is
$\theta \mid x_{1:N} \sim \mathrm{Beta}(a + S, \; b + N - S)$, so an equal-tailed credible interval can be computed from the quantiles of the Beta distribution. If $F$ is the CDF of the $\mathrm{Beta}(a + S, \; b + N - S)$ distribution, then we have:
$$\theta_L = F^{-1}\!\left(\frac{\alpha}{2}\right), \qquad \theta_U = F^{-1}\!\left(1 - \frac{\alpha}{2}\right)$$
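In practice you would call a library routine such as `scipy.stats.beta.ppf` for the Beta quantiles. To keep things self-contained, here is a stdlib-only numerical sketch (trapezoidal CDF plus bisection; adequate for posterior shape parameters greater than 1, which is our case with a Beta(1, 1) prior):

```python
import math

def beta_logpdf(x, a, b):
    logB = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - logB

def beta_cdf(x, a, b, n=4000):
    # Trapezoidal integration of the pdf on (0, x); fine for a, b > 1
    h = x / n
    total = 0.5 * math.exp(beta_logpdf(x, a, b))
    for i in range(1, n):
        total += math.exp(beta_logpdf(i * h, a, b))
    return total * h

def beta_quantile(q, a, b, tol=1e-7):
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def credible_interval(a, b, heads, N, alpha=0.05):
    a_post, b_post = a + heads, b + N - heads
    return (beta_quantile(alpha / 2, a_post, b_post),
            beta_quantile(1 - alpha / 2, a_post, b_post))

lo, hi = credible_interval(1, 1, 7, 10)   # 95% equal-tailed interval
```

For 7 heads in 10 tosses with a uniform prior, the 95% equal-tailed interval sits roughly on (0.39, 0.89), bracketing the posterior mean 2/3.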
2.4 Hypothesis testing
We again need to be able to compute everything using the posterior $p(\theta \mid x_{1:N})$. Hypotheses are subsets of the parameter space; in this case, the null subset is just $\{1/2\}$.
If the posterior distribution is continuous, then the posterior probability of the null hypothesis is
$$P\left(\theta = \tfrac{1}{2} \,\middle|\, x_{1:N}\right) = 0$$
So we need to give positive probability to the null hypothesis by choosing a mixture prior:
$$p(\theta) = \frac{1}{2} \, \delta_{1/2}(\theta) + \frac{1}{2} \, g(\theta)$$
where $g$ is the density of a $\mathrm{Beta}(a, b)$ distribution and $\delta_{1/2}$ is a point mass at $1/2$. The left part is associated with $H_0$ and the right with $H_1$.
This is the first thing that might seem odd: in order to carry out the standard Bayes hypothesis test, I have to change my prior.
We compute our new posterior, and we compare it with frequentist p-values in the figure below.
Our prior is:
$$p(\theta) = \frac{1}{2} \, \delta_{1/2}(\theta) + \frac{1}{2} \, g(\theta), \qquad g = \mathrm{Beta}(a, b)$$
The posterior is now:
$$p(\theta \mid x_{1:N}) = \frac{1}{p(x_{1:N})} \left[ \frac{1}{2} \left(\tfrac{1}{2}\right)^{N} \delta_{1/2}(\theta) + \frac{1}{2} \, g(\theta) \, \theta^{S} (1-\theta)^{N-S} \right]$$
with
$$p(x_{1:N}) = \frac{1}{2} \left(\tfrac{1}{2}\right)^{N} + \frac{1}{2} \int_0^1 \theta^{S} (1-\theta)^{N-S} g(\theta) \, d\theta$$
Integrating only the part involving $g$ we find $\frac{B(a+S, \; b+N-S)}{B(a, b)}$ and thus the posterior (left as an exercise for the reader).
So we can compute the posterior probability of the null hypothesis
$$P\left(\theta = \tfrac{1}{2} \,\middle|\, x_{1:N}\right) = \frac{\frac{1}{2} \left(\tfrac{1}{2}\right)^{N}}{\frac{1}{2} \left(\tfrac{1}{2}\right)^{N} + \frac{1}{2} \frac{B(a+S, \; b+N-S)}{B(a, b)}}$$
More generally, if our null hypothesis is $H_0: \theta = \theta_0$, our prior becomes:
$$p(\theta) = q \, \delta_{\theta_0}(\theta) + (1 - q) \, g(\theta)$$
so our posterior density is
$$p(\theta \mid x_{1:N}) = \frac{1}{p(x_{1:N})} \left[ q \, \theta_0^{S} (1-\theta_0)^{N-S} \, \delta_{\theta_0}(\theta) + (1 - q) \, g(\theta) \, \theta^{S} (1-\theta)^{N-S} \right]$$
Integrating the part involving $g$ we find
$$p_1(x_{1:N}) = \int_0^1 \theta^{S} (1-\theta)^{N-S} g(\theta) \, d\theta = \frac{B(a+S, \; b+N-S)}{B(a, b)}$$
while the part associated with the null is
$$p_0(x_{1:N}) = \theta_0^{S} (1-\theta_0)^{N-S}$$
$p_0$ and $p_1$ are called marginal likelihoods, because they are obtained by integrating the likelihood function over the components of the prior associated with $H_0$ and $H_1$.
So:
$$P(\theta = \theta_0 \mid x_{1:N}) = \frac{q \, p_0(x_{1:N})}{q \, p_0(x_{1:N}) + (1 - q) \, p_1(x_{1:N})}$$
The Bayes factor is
$$BF_{01} = \frac{p_0(x_{1:N})}{p_1(x_{1:N})}$$
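Both quantities are easy to compute in log space with `math.lgamma`. A Python sketch, assuming a Beta(a, b) density for the alternative component (function name is mine):

```python
import math

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def null_posterior_prob(heads, N, a=1.0, b=1.0, theta0=0.5, q=0.5):
    # p0 = theta0^S (1 - theta0)^(N - S); p1 = B(a + S, b + N - S) / B(a, b)
    S = heads
    p0 = theta0**S * (1 - theta0) ** (N - S)
    p1 = math.exp(log_beta(a + S, b + N - S) - log_beta(a, b))
    bf01 = p0 / p1
    post = q * p0 / (q * p0 + (1 - q) * p1)
    return post, bf01

post, bf01 = null_posterior_prob(7, 10)
```

With 7 heads in 10 tosses and a uniform alternative, the posterior probability of a fair coin stays above 1/2 and the Bayes factor is about 1.29: the point-mass prior shields the null, which is the effect discussed below.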
Thus, unlike point estimation, where Bayesians and frequentists mostly agree, and interval estimation, where typically the Bayesian credible intervals are fairly similar to frequentist confidence intervals unless a strong prior is chosen, Bayesian hypothesis tests can reach very different conclusions from frequentist ones. In this case, the Bayesian finds less evidence against the null than the frequentist does.
2.5 Objective priors
One of the most commonly used objective priors in applications is the Jeffreys prior. It is defined as
$$p(\theta) \propto \sqrt{\det I(\theta)}$$
where $I(\theta)$ is the Fisher information matrix.
We show that this prior is actually the $\mathrm{Beta}(1/2, \; 1/2)$ prior.
Fisher information is defined as:
$$I(\theta) = -\mathbb{E}\left[ \frac{\partial^2}{\partial \theta^2} \log p(x \mid \theta) \right]$$
so for Bernoulli sampling, we have
$$\log p(x \mid \theta) = x \log \theta + (1 - x) \log(1 - \theta), \qquad \frac{\partial^2}{\partial \theta^2} \log p(x \mid \theta) = -\frac{x}{\theta^2} - \frac{1-x}{(1-\theta)^2}$$
and so, taking the expectation with $\mathbb{E}[x] = \theta$,
$$I(\theta) = \frac{1}{\theta} + \frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)}$$
So the Jeffreys prior is
$$p(\theta) \propto \theta^{-1/2} (1-\theta)^{-1/2}$$
which is the kernel of a $\mathrm{Beta}(1/2, \; 1/2)$ distribution. Now, with a one-to-one transformation
$$\phi = h(\theta)$$
we will have
$$p_\phi(\phi) = p_\theta\big(h^{-1}(\phi)\big) \, |J|$$
where $J$ is the Jacobian, $J = \frac{d\theta}{d\phi}$.
Therefore, since the Fisher information transforms as $I_\phi(\phi) = I_\theta(\theta) \left( \frac{d\theta}{d\phi} \right)^2$:
$$p_\phi(\phi) \propto \sqrt{I_\theta(\theta)} \left| \frac{d\theta}{d\phi} \right| = \sqrt{I_\phi(\phi)}$$
The Jeffreys prior is invariant under one-to-one reparametrizations, which means that if we choose to parametrize the model in terms of
$$\phi = h(\theta)$$
then the prior will still take the form
$$p(\phi) \propto \sqrt{\det I(\phi)}$$
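A quick numerical check of the Bernoulli case (a Python sketch, not from the course): the square root of the Fisher information should be proportional to the Beta(1/2, 1/2) density, with constant ratio $B(1/2, 1/2) = \pi$.

```python
import math

def sqrt_fisher(theta):
    # sqrt(I(theta)) = 1 / sqrt(theta * (1 - theta)) for Bernoulli sampling
    return 1.0 / math.sqrt(theta * (1 - theta))

def beta_half_pdf(theta):
    # Beta(1/2, 1/2) density: theta^(-1/2) (1 - theta)^(-1/2) / B(1/2, 1/2)
    logB = 2 * math.lgamma(0.5) - math.lgamma(1.0)
    return math.exp(-0.5 * math.log(theta) - 0.5 * math.log(1 - theta) - logB)

# The ratio is constant over theta, so the two agree up to normalization
ratios = [sqrt_fisher(t) / beta_half_pdf(t) for t in (0.1, 0.25, 0.5, 0.9)]
```

Every ratio equals $\pi$, confirming that the Jeffreys prior for a coin is exactly Beta(1/2, 1/2).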
3. Conjugate families
3.1 The Poisson likelihood
Suppose we observe $N$ iid data points $x_1, \dots, x_N$ from a $\mathrm{Poisson}(\lambda)$ distribution. As a function of $\lambda$, the likelihood has the form of a Gamma kernel, so we choose a prior from the same distribution family.
We show that the posterior also has the Gamma form.
We have:
$$p(x_{1:N} \mid \lambda) = \prod_{i=1}^{N} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} \propto \lambda^{S} e^{-N\lambda}, \qquad S = \sum_{i=1}^{N} x_i$$
This has the form of the kernel of a Gamma distribution: $\lambda^{a-1} e^{-b\lambda}$
So we choose our prior as a Gamma distribution
$$p(\lambda) = \frac{b^a}{\Gamma(a)} \lambda^{a-1} e^{-b\lambda}$$
Thus:
$$p(\lambda \mid x_{1:N}) \propto \lambda^{a+S-1} e^{-(b+N)\lambda}$$
so the posterior is
$$\lambda \mid x_{1:N} \sim \mathrm{Gamma}(a + S, \; b + N)$$
The Bayes estimator is therefore
$$\hat{\lambda}_{\mathrm{Bayes}} = \mathbb{E}[\lambda \mid x_{1:N}] = \frac{a + S}{b + N}$$
which is a weighted average of the prior mean and the sample mean:
$$\frac{a + S}{b + N} = \frac{b}{b + N} \cdot \frac{a}{b} + \frac{N}{b + N} \cdot \bar{x}$$
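The Gamma-Poisson update mirrors the Beta-Bernoulli one. A minimal Python sketch (function names are mine):

```python
def gamma_posterior(a, b, xs):
    # Conjugate update: Gamma(a, b) prior + Poisson data -> Gamma(a + S, b + N)
    S, N = sum(xs), len(xs)
    return a + S, b + N

def bayes_estimator(a, b, xs):
    a_post, b_post = gamma_posterior(a, b, xs)
    return a_post / b_post   # posterior mean (a + S) / (b + N)

est = bayes_estimator(2, 1, [3, 4, 5])   # (2 + 12) / (1 + 3) = 3.5
```

Here the prior mean is 2 and the sample mean is 4; the posterior mean 3.5 is exactly the weighted blend from the formula above, with weights 1/4 and 3/4.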
3.2 The Normal Distribution
As a function of $\sigma^2$, the likelihood of the normal distribution looks like the kernel of an inverse-Gamma distribution, so we define the prior this way. The joint conjugate prior on $(\mu, \sigma^2)$ is, in fact, the normal-inverse-gamma distribution.
We claim this is conjugate, and we find the parameters of the posterior.
The likelihood is:
$$p(x_{1:N} \mid \mu, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (x_i - \mu)^2 \right)$$
If $\mu$ is fixed, as a function of $\sigma^2$ it has the form of an inverse-Gamma kernel, $(\sigma^2)^{-N/2} e^{-c/\sigma^2}$.
So the prior on $\sigma^2$ is an inverse-Gamma distribution
$$p(\sigma^2) \propto (\sigma^2)^{-a-1} e^{-b/\sigma^2}$$
Also, if $\sigma^2$ were a constant, the likelihood would look like the kernel of a normal random variable in $\mu$. The conjugate prior is:
$$\mu \mid \sigma^2 \sim \mathcal{N}\left(\mu_0, \; \frac{\sigma^2}{\kappa_0}\right)$$
This is the normal-inverse-gamma distribution $\mathrm{NIG}(\mu_0, \kappa_0, a, b)$. Finally, our posterior is
$$p(\mu, \sigma^2 \mid x_{1:N}) \propto p(x_{1:N} \mid \mu, \sigma^2) \, p(\mu \mid \sigma^2) \, p(\sigma^2)$$
which is also a normal-inverse-gamma distribution. After a few computations on the parameters, we find:
$$\mu_N = \frac{\kappa_0 \mu_0 + N \bar{x}}{\kappa_0 + N}, \quad \kappa_N = \kappa_0 + N, \quad a_N = a + \frac{N}{2}, \quad b_N = b + \frac{1}{2} \sum_{i=1}^{N} (x_i - \bar{x})^2 + \frac{\kappa_0 N (\bar{x} - \mu_0)^2}{2(\kappa_0 + N)}$$
For any $\mathrm{NIG}(\mu_0, \kappa, a, b)$ distribution, the marginal distribution of $\mu$ is a Student-t centered at $\mu_0$ with scale $\sqrt{b / (a\kappa)}$; for the posterior, as $N$ grows, $\mu_N \to \bar{x}$ and this scale shrinks like $\sqrt{s^2 / N}$.
So the Bayesian posterior and the frequentist sampling distribution approach one another as the sample size grows.
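The update equations can be packaged into a small function. A hedged Python sketch (`nig_posterior` is my name; the formulas are the standard normal-inverse-gamma conjugate updates stated above):

```python
def nig_posterior(mu0, kappa0, a, b, xs):
    # Normal-inverse-gamma conjugate update for iid N(mu, sigma^2) data
    N = len(xs)
    xbar = sum(xs) / N
    ss = sum((x - xbar) ** 2 for x in xs)   # within-sample sum of squares
    kappaN = kappa0 + N
    muN = (kappa0 * mu0 + N * xbar) / kappaN
    aN = a + N / 2
    bN = b + 0.5 * ss + kappa0 * N * (xbar - mu0) ** 2 / (2 * kappaN)
    return muN, kappaN, aN, bN
```

Note how `muN` shrinks the sample mean toward the prior mean with weight $\kappa_0 / (\kappa_0 + N)$, and how `bN` charges the prior for disagreeing with the data through the $(\bar{x} - \mu_0)^2$ term.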
3.3 Multinomial Distribution
The conjugate prior of the multinomial likelihood is a multivariate generalization of the Beta distribution called the Dirichlet distribution.
The multinomial likelihood for $d$ categories is given by
$$p(x_{1:N} \mid \theta) \propto \prod_{j=1}^{d} \theta_j^{n_j}, \qquad n_j = \#\{i : x_i = j\}, \qquad \sum_{j=1}^{d} \theta_j = 1$$
The conjugate prior should be a distribution on the $(d-1)$-dimensional simplex that takes the form
$$p(\theta) \propto \prod_{j=1}^{d} \theta_j^{a_j - 1}$$
The resulting probability distribution is called the Dirichlet distribution and has pdf:
$$p(\theta) = \frac{\Gamma\left(\sum_{j=1}^{d} a_j\right)}{\prod_{j=1}^{d} \Gamma(a_j)} \prod_{j=1}^{d} \theta_j^{a_j - 1}$$
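As with the Beta-Bernoulli case, the posterior just adds the observed counts to the prior pseudo-counts. A minimal Python sketch (function names are mine):

```python
def dirichlet_posterior(alpha, counts):
    # Conjugate update: Dirichlet(alpha) prior + multinomial counts n_j
    # -> Dirichlet(alpha_j + n_j)
    return [a + n for a, n in zip(alpha, counts)]

def posterior_mean(alpha, counts):
    post = dirichlet_posterior(alpha, counts)
    total = sum(post)
    return [a / total for a in post]   # E[theta_j | x] = (a_j + n_j) / total
```

With a uniform Dirichlet(1, 1, 1) prior and counts (3, 5, 2), the posterior is Dirichlet(4, 6, 3) and the posterior mean is a smoothed version of the empirical frequencies; this is exactly the Laplace-smoothing rule you may know from Naive Bayes.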
Next
Thank you for reading, I hope this guide has been helpful so far. Now that you have taken a bite of crunchy Bayes, we will spice things up in the next article with high-dimensional models.