
5.5 Frequentist hypothesis testing

The Bayes factor $p(\mathcal{D}|H_0)/p(\mathcal{D}|H_1)$ is expensive to compute, since it requires integrating over all parameterizations of the models $H_0$ and $H_1$. It is also sensitive to the choice of prior.

5.5.1 Likelihood ratio test

If we use the 0-1 loss and assume that $p(H_0)=p(H_1)$, then the optimal decision rule is to accept $H_0$ iff:

$$\frac{p(\mathcal{D}|H_0)}{p(\mathcal{D}|H_1)}>1$$

Gaussian means

If we have two Gaussian distributions with means $\mu_0$ and $\mu_1$ and a known shared variance $\sigma^2$, the likelihood ratio is:

$$\begin{align} \frac{p(\mathcal{D}|H_0)}{p(\mathcal{D}|H_1)}&=\frac{\exp\Big(-\frac{1}{2\sigma^2}\sum_{n=1}^N (x_n-\mu_0)^2\Big)}{\exp\Big(-\frac{1}{2\sigma^2}\sum_{n=1}^N (x_n-\mu_1)^2\Big)} \\ &= \exp \Big(\frac{1}{2\sigma^2}\big(2N\bar{x}(\mu_0-\mu_1)+N\mu_1^2-N\mu_0^2\big)\Big) \end{align}$$

Thus the test depends on the observed data only through the sufficient statistic $\bar{x}$. From the figure below, we see that we accept $H_0$ if $\bar{x}<x^*$:

*Figure: densities of $\bar{x}$ under $H_0$ and $H_1$, with the decision threshold $x^*$.*
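
As a concrete illustration, here is a minimal sketch of this test in Python; the values of $\mu_0$, $\mu_1$, $\sigma$, and the simulated data are illustrative assumptions, not from the text:

```python
import numpy as np

# Illustrative values (assumptions, not from the text).
mu_0, mu_1, sigma = 0.0, 1.0, 1.0

rng = np.random.default_rng(0)
x = rng.normal(mu_1, sigma, size=50)  # simulate data that actually comes from H1

# Log-likelihood ratio log p(D|H0) - log p(D|H1), using the simplified form above.
N, x_bar = len(x), x.mean()
log_lr = (2 * N * x_bar * (mu_0 - mu_1) + N * mu_1**2 - N * mu_0**2) / (2 * sigma**2)

print("accept H0" if log_lr > 0 else "accept H1")
```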

Simple vs compound hypotheses

In our simple hypothesis test above, all parameters were either fully specified ($\mu_0,\mu_1$) or shared ($\sigma^2$).

A compound hypothesis does not specify all parameters; ideally, we should integrate out the unknown parameters, as in Bayesian hypothesis testing:

$$\frac{p(\mathcal{D}|H_0)}{p(\mathcal{D}|H_1)}=\frac{\int_{\theta \in H_0} p(\theta)p_{\theta}(\mathcal{D})}{\int_{\theta \in H_1} p(\theta)p_{\theta}(\mathcal{D})} \approx \frac{\max_{\theta \in H_0}p_\theta(\mathcal{D})}{\max_{\theta \in H_1} p_\theta(\mathcal{D})}$$

As an approximation, we can maximize over the unknown parameters instead of integrating them out, giving the maximum likelihood ratio.
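
A minimal sketch of this approximation, assuming a simple null $H_0: \mu=0$ against a compound alternative $H_1: \mu \neq 0$ with known $\sigma$ (all values are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

sigma = 1.0
rng = np.random.default_rng(1)
x = rng.normal(0.3, sigma, size=100)  # illustrative data

def log_lik(mu):
    return norm.logpdf(x, loc=mu, scale=sigma).sum()

# Under H0 the only allowed value is mu = 0; under H1 the MLE is the sample mean.
log_max_lr = log_lik(0.0) - log_lik(x.mean())
```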

5.5.2 Null hypothesis significance testing (NHST)

Instead of assuming the 0-1 loss, we design a decision rule with a false positive (type I error) probability of $\alpha$, called the significance of the test.

In our Gaussian example:

$$\begin{align} \alpha(\mu_0)&=p(\mathrm{reject}\ H_0\mid H_0\ \mathrm{is\ true})\\ &= p(\bar{X}(\mathcal{D})>x^*\mid\mathcal{D}\sim H_0) \\ &= p\Big(\frac{\bar{X}-\mu_0}{\sigma/\sqrt{N}}>\frac{x^*-\mu_0}{\sigma /\sqrt{N}}\Big) \end{align}$$

Hence:

$$x^*=z_\alpha \sigma /\sqrt{N} + \mu_0$$

where $z_\alpha$ is the upper $\alpha$-quantile of the standard normal distribution.
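
For example, we can compute $x^*$ numerically with scipy; the values of $\mu_0$, $\sigma$, and $N$ are illustrative assumptions:

```python
from scipy.stats import norm

mu_0, sigma, N, alpha = 0.0, 1.0, 50, 0.05  # illustrative values
z_alpha = norm.ppf(1 - alpha)               # upper alpha-quantile of N(0, 1)
x_star = z_alpha * sigma / N**0.5 + mu_0    # reject H0 when x_bar > x_star
```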

Let $\beta$ be the false negative (type II error) probability:

$$\beta(\mu_1)=p(\mathrm{accept}\ H_0\mid H_1\ \mathrm{is\ true})$$

The power of a test is $1-\beta(\mu_1)$: the probability of rejecting $H_0$ when $H_1$ is true.

The power is lowest when the two Gaussians completely overlap ($\mu_1=\mu_0$); in that case, $1-\beta(\mu_1)=\alpha(\mu_0)$.

When $\mathrm{power}(B)\geq\mathrm{power}(A)$ at the same type I error rate, we say that test $B$ dominates test $A$.
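
A short sketch of the power computation, continuing the Gaussian example (values illustrative): the power at $\mu_1$ is $\Pr(\bar{X} > x^* \mid \mu_1)$.

```python
from scipy.stats import norm

mu_0, mu_1, sigma, N, alpha = 0.0, 0.5, 1.0, 50, 0.05  # illustrative values
se = sigma / N**0.5
x_star = norm.ppf(1 - alpha) * se + mu_0

# Power = P(X_bar > x* | X_bar ~ N(mu_1, se^2)) = 1 - beta(mu_1).
power = 1 - norm.cdf((x_star - mu_1) / se)
```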

5.5.3 p-values

Rather than arbitrarily declaring a result significant or not, we compute its p-value:

$$\mathrm{pval}(\mathrm{test}(\mathcal{D})) \triangleq \Pr\Big(\mathrm{test}(\tilde{\mathcal{D}})\geq \mathrm{test}(\mathcal{D})\mid \tilde{\mathcal{D}}\sim H_0\Big)$$

Rejecting $H_0$ whenever the p-value falls below $\alpha$ yields a test with type I error rate $\alpha$.

If we reject $H_0$ whenever the p-value is below $\alpha=0.05$, then, when $H_0$ is true, we wrongly reject it only 5% of the time. However, this does not mean that $H_1$ is true with probability 0.95: that claim corresponds to the Bayesian posterior statement $p(H_1|\mathcal{D})=0.95$, which is a different quantity.
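
In the Gaussian example, the one-sided p-value can be computed from the standardized test statistic; here is a sketch with illustrative numbers:

```python
from scipy.stats import norm

mu_0, sigma, N = 0.0, 1.0, 50   # illustrative values
x_bar = 0.3                     # illustrative observed sample mean

z = (x_bar - mu_0) / (sigma / N**0.5)
pval = 1 - norm.cdf(z)          # P(test statistic >= observed value | H0)
```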

5.5.4 p-values considered harmful

A frequent but invalid line of reasoning about p-values is:

“If $H_0$ is true, then this test statistic would probably not occur. This statistic did occur; therefore $H_0$ is false.”

Applying the same logic elsewhere gives: “If this person is American, he is probably not a member of Congress. He is a member of Congress. Therefore he is probably not American.”

This is induction: reasoning backward from observed data to probable causes using statistical regularities, rather than forward from premises using logical implications of the form $P \Rightarrow Q$, as in deduction.

To perform induction correctly, we need to compute the posterior probability of $H_0$:

$$p(H_0|\mathcal{D})=\frac{p(\mathcal{D}|H_0)p(H_0)}{p(\mathcal{D}|H_1)p(H_1)+p(\mathcal{D}|H_0)p(H_0)}=\frac{LR}{1+LR}$$

where the prior is uniform, $p(H_0)=p(H_1)=0.5$, and $LR=p(\mathcal{D}|H_0)/p(\mathcal{D}|H_1)$ is the likelihood ratio.

If “being an American” is $H_0$ and “being a member of Congress” is $\mathcal{D}$, then $p(\mathcal{D}|H_0)$ is low and $p(\mathcal{D}|H_1)$ is zero, so the posterior probability of $H_0$ is 1, which matches intuition.

NHST ignores both $p(\mathcal{D}|H_1)$ and $p(H_0)$, which is why p-values can be very different from $p(H_0|\mathcal{D})$.

*Figure: out of 200 hypotheses tested, 180 are null ($H_0$ true) and 20 are non-null ($H_1$ true); with $\alpha=0.05$ and power $1-\beta=0.8$, we expect $180\times 0.05=9$ false positives and $20\times 0.8=16$ true positives.*

p(H0"Sig")=p("Sig"H0)p(H0)p("Sig"H0)p(H0)+p("Sig"H1)p(H1)=αp(H0)αp(H0)+(1β)p(H1)0.36\begin{align} p(H_0|"Sig")&=\frac{p("Sig"|H_0)p(H_0)}{p("Sig"|H_0)p(H_0)+p("Sig"|H_1)p(H_1)} \\ &= \frac{\alpha p(H_0)}{\alpha p(H_0)+(1-\beta)p(H_1)} \\ &\approx 0.36 \end{align}

This is far greater than the 5% error probability people often associate with a significance level of $\alpha=9/180=0.05$.
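
The computation above as a short sketch, with the priors and power taken from the figure's setup:

```python
alpha, power = 0.05, 0.8          # significance and power from the figure's setup
p_h0, p_h1 = 180 / 200, 20 / 200  # prior probabilities of H0 and H1

p_h0_given_sig = alpha * p_h0 / (alpha * p_h0 + power * p_h1)
print(p_h0_given_sig)             # 0.36
```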

5.5.5 Why isn’t everyone a Bayesian?

The frequentist theory yields counter-intuitive results because it violates the likelihood principle, which says that inference should be based on the likelihood of the observed data, not on hypothetical data sets that were never observed.

Bradley Efron wrote “Why Isn’t Everyone a Bayesian?”, stating that if the 19th century was Bayesian and the 20th frequentist, the 21st could be Bayesian again.

Some journals, like The American Statistician, have banned or warned against the use of p-values and NHST.

Computation has traditionally been a major roadblock for Bayesian methods, but this is less of an issue nowadays thanks to fast algorithms and powerful computers.

Bayesian modeling assumptions can also be restrictive, but the same is true of frequentist methods, since the sampling distribution relies on assumptions about how the data were generated.

We can check modeling assumptions empirically using cross-validation, calibration, and Bayesian model checking.