Proba ML
9. Linear Discriminant Analysis

9.3 Naive Bayes classifiers

We discuss a simple generative approach where all features are considered conditionally independent given the class label.

Even when this assumption does not hold, the resulting classifiers often work well. Since the model has only $O(CD)$ parameters, it is relatively immune to overfitting.

We use the following class conditional density:

$$p(x|y=c,\theta)=\prod_{d=1}^D p(x_d|y=c,\theta_{dc})$$

Hence, the posterior is:

$$p(y=c|x,\theta)=\frac{p(y=c|\pi)\prod_{d=1}^D p(x_d|y=c,\theta_{dc})}{\sum_{c'} p(y=c'|\pi)\prod_{d=1}^D p(x_d|y=c',\theta_{dc'})}$$

where $\pi_c$ is the prior probability of class $c$, and $\theta=(\pi,\{\theta_{dc}\})$ denotes all the parameters.

9.3.1 Example models

We need to specify the conditional likelihood $p(x_d|y=c,\theta_{dc})$, depending on the nature of the features (a numerical sketch follows the list):

  • If $x_d$ is binary, we can use the Bernoulli distribution:
$$p(x|y=c,\theta)=\prod_{d=1}^D Ber(x_d|\theta_{dc})$$

where $\theta_{dc}$ is the probability that $x_d=1$ in class $c$.

This is called the multivariate Bernoulli naive Bayes model. Applied to MNIST, this approach achieves a surprisingly good 84% accuracy.


  • If $x_d$ is categorical, we can use the categorical distribution:
$$p(x|y=c,\theta)=\prod_{d=1}^D Cat(x_d|\theta_{dc})$$
  • If $x_d \in \mathbb{R}$, we can use the Gaussian distribution:
$$p(x|y=c,\theta)=\prod_{d=1}^D \mathcal{N}(x_d|\mu_{dc},\sigma^2_{dc})$$
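
To make the Gaussian case concrete, here is a minimal NumPy sketch that plugs class-conditional Gaussian densities into the posterior formula above. All parameter values and variable names are made up for illustration.

```python
import numpy as np

# Toy Gaussian naive Bayes posterior with C=2 classes and D=2 real features.
prior = np.array([0.6, 0.4])               # class priors pi_c (made up)
mu = np.array([[0.0, 1.0],                 # mu[c, d]: mean of feature d in class c
               [2.0, -1.0]])
sigma = np.array([[1.0, 0.5],              # sigma[c, d]: std of feature d in class c
                  [1.5, 1.0]])

x = np.array([1.0, 0.2])                   # a single observation

# log N(x_d | mu_dc, sigma_dc^2), summed over features d for each class c
log_lik = -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2
log_joint = np.log(prior) + log_lik.sum(axis=1)

# p(y=c | x, theta): normalize in log space for numerical stability
posterior = np.exp(log_joint - log_joint.max())
posterior /= posterior.sum()
print(posterior)                           # sums to 1 over the two classes
```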

9.3.2 Model fitting

We can fit a naive Bayes classifier using the MLE. The likelihood is:

$$\begin{aligned} p(\mathcal{D}|\theta)&=\prod_{n=1}^N Cat(y_n|\pi) \prod_{d=1}^D p(x_{nd}|y_n,\theta_{d}) \\ &= \prod_{n=1}^N Cat(y_n|\pi) \prod_{d=1}^D \prod_{c=1}^C p(x_{nd}|y_n,\theta_{dc})^{\mathbb{I}(y_n=c)} \end{aligned}$$

The log-likelihood therefore decomposes into separate terms for $\pi$ and for each $\theta_{dc}$, which can be maximized independently:

$$\log p(\mathcal{D}|\theta)=\sum_{n=1}^N \sum_{c=1}^C \mathbb{I}(y_n=c)\log \pi_c+\sum_{c=1}^C \sum_{d=1}^D \sum_{n:y_n=c} \log p(x_{nd}|y_n,\theta_{dc})$$

The MLE for the prior is:

$$\hat{\pi}_c=\frac{N_c}{N}$$

where $N_c=\sum_{n=1}^N \mathbb{I}(y_n=c)$ is the number of examples in class $c$.

The MLE for $\theta_{dc}$ depends on the class-conditional density (a counting-based sketch follows the list):

  • If $x_d$ is binary, we use the Bernoulli, so:
$$\hat{\theta}_{dc}=\frac{N_{dc}}{N_c}$$
where $N_{dc}$ is the number of class-$c$ examples with $x_d=1$.
  • If $x_d$ is categorical, we use the categorical, so:
$$\hat{\theta}_{dck}=\frac{N_{dck}}{\sum_{k'}N_{dck'}}=\frac{N_{dck}}{N_{c}}$$
  • If $x_d\in\mathbb{R}$, the Gaussian gives us:
$$\hat{\mu}_{dc}=\frac{1}{N_c}\sum_{n:y_n=c} x_{nd},\qquad \hat{\sigma}^2_{dc}=\frac{1}{N_c}\sum_{n:y_n=c}(x_{nd}-\hat{\mu}_{dc})^2$$
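
Since all these MLEs are simple counts and moments, fitting takes a few lines. Here is a minimal sketch for the Bernoulli case on made-up data (all variable names are illustrative):

```python
import numpy as np

# MLE for a Bernoulli naive Bayes model is just counting (toy, made-up data).
rng = np.random.default_rng(0)
N, D, C = 100, 5, 3
X = rng.integers(0, 2, size=(N, D))   # binary features x_nd
y = rng.integers(0, C, size=N)        # class labels y_n

N_c = np.array([(y == c).sum() for c in range(C)])
pi_hat = N_c / N                      # pi_hat_c = N_c / N

# theta_hat[d, c] = N_dc / N_c: fraction of class-c examples with x_d = 1
theta_hat = np.stack([X[y == c].mean(axis=0) for c in range(C)], axis=1)

print(pi_hat)                         # shape (C,)
print(theta_hat.shape)                # (D, C)
```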

9.3.3 Bayesian Naive Bayes

We now compute the posterior distribution over the parameters. Let's assume we have categorical features, so:

$$p(x_d|\theta_{dc})=Cat(x_d|\theta_{dc})$$

where $\theta_{dck}=p(x_d=k|y=c)$.

As shown in Section 4.6.3, the conjugate prior for the categorical distribution is the Dirichlet:

$$p(\theta_{dc})=Dir(\theta_{dc}|\beta_{dc})$$

Similarly, we use a Dirichlet prior for the class label distribution:

$$p(\pi)=Dir(\pi|\alpha)$$

We can compute the posterior in closed form:

$$p(\theta|\mathcal{D})=Dir(\pi|\breve{\alpha}) \prod_{d=1}^D\prod_{c=1}^C Dir(\theta_{dc}|\breve{\beta}_{dc})$$

where $\breve{\alpha}_c=\alpha_c+N_c$ and $\breve{\beta}_{dck}=\beta_{dck}+N_{dck}$.
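
In other words, the posterior update just adds the observed counts to the prior pseudo-counts. A minimal NumPy sketch for categorical features, on made-up data, might look like this:

```python
import numpy as np

# Conjugate (Dirichlet) posterior update for categorical naive Bayes.
rng = np.random.default_rng(1)
N, D, C, K = 200, 4, 3, 5
X = rng.integers(0, K, size=(N, D))   # categorical features with K values
y = rng.integers(0, C, size=N)        # class labels

alpha = np.ones(C)                    # Dir(alpha) prior on pi
beta = np.ones((D, C, K))             # Dir(beta_dc) prior on each theta_dc

N_c = np.bincount(y, minlength=C)
N_dck = np.zeros((D, C, K))
for n in range(N):
    N_dck[np.arange(D), y[n], X[n]] += 1   # count of x_nd = k within class y_n

alpha_post = alpha + N_c              # breve alpha_c  = alpha_c  + N_c
beta_post = beta + N_dck              # breve beta_dck = beta_dck + N_dck
print(alpha_post, beta_post.shape)
```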

We derive the posterior predictive distribution as follows.

The posterior predictive for the label is given by:

$$p(y|\mathcal{D})=Cat(y|\bar{\pi})$$

where $\bar{\pi}_c=\frac{\breve{\alpha}_c}{\sum_{c'} \breve{\alpha}_{c'}}$.

For the features, we have:

$$p(x_d=k|y=c,\mathcal{D})=\bar{\theta}_{dck}$$

where

$$\bar{\theta}_{dck} = \frac{\breve{\beta}_{dck}}{\sum_{k'}\breve{\beta}_{dck'}} = \frac{\beta_{dck}+N_{dck}}{\sum_{k'}(\beta_{dck'}+N_{dck'})}$$

is the posterior mean of the parameters.

If $\beta_{dck}=0$, this reduces to the MLE.

If $\beta_{dck}=1$, we add 1 to all the empirical counts before normalizing. This is called add-one or Laplace smoothing.

In the binary case, this gives us:

$$\bar{\theta}_{dc}=\frac{1+N_{dc1}}{2+N_c}$$

Once we have computed the parameter posterior, the posterior predictive distribution over the label is:

$$p(y=c|x,\mathcal{D}) \propto p(y=c|\mathcal{D})\prod_{d=1}^D p(x_d|y=c,\mathcal{D})=\bar{\pi}_c \prod_{d=1}^D \prod_k \bar{\theta}_{dck}^{\mathbb{I}(x_d=k)}$$

This gives us a fully Bayesian form of naive Bayes, in which we have integrated out all the parameters; the prediction depends only on their posterior means.
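
Putting the pieces together, a sketch of the Bayesian predictive step might look as follows; `pi_bar` and `theta_bar` stand for the posterior means defined above, with made-up values:

```python
import numpy as np

# Posterior predictive for Bayesian naive Bayes with categorical features.
rng = np.random.default_rng(2)
C, D, K = 2, 3, 4
pi_bar = np.array([0.5, 0.5])                        # posterior mean of pi
theta_bar = rng.dirichlet(np.ones(K), size=(D, C))   # theta_bar[d, c, k], made up

x = np.array([0, 3, 1])                              # one observation, x_d in {0,...,K-1}

# log p(y=c | x, D) = log pi_bar_c + sum_d log theta_bar[d, c, x_d] + const
log_post = np.log(pi_bar)
for c in range(C):
    for d in range(D):
        log_post[c] += np.log(theta_bar[d, c, x[d]])

post = np.exp(log_post - log_post.max())
post /= post.sum()                                   # normalize over classes
print(post)
```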

9.3.4 The connection between naive Bayes and logistic regression

Let $x_{dk}=\mathbb{I}(x_d=k)$, so that the vector $x_d=(x_{d1},\dots,x_{dK})$ is a one-hot encoding of feature $d$.

The class conditional density is:

$$p(x|y=c,\theta)=\prod_{d=1}^D Cat(x_d|y=c,\theta_{dc}) =\prod_d \prod_k \theta_{dck}^{x_{dk}}$$

So the posterior over classes is:

$$p(y=c|x,\theta)=\frac{\pi_c \prod_d \prod_k \theta_{dck}^{x_{dk}}} {\sum_{c'}\pi_{c'} \prod_d \prod_k \theta_{dc'k}^{x_{dk}}} =\frac{\exp \big(\log \pi_c + \sum_d \sum_k x_{dk}\log \theta_{dck}\big)} {\sum_{c'} \exp \big(\log \pi_{c'} +\sum_d \sum_k x_{dk}\log \theta_{dc'k}\big)}$$

which can be written as a softmax:

$$p(y=c|x,\theta)=\frac{\exp(\beta_c^\top x+\gamma_c)}{\sum_{c'} \exp (\beta_{c'}^\top x+\gamma_{c'})}$$

where $x$ is the stacked one-hot encoding, $\beta_c$ contains the $\log \theta_{dck}$ terms, and $\gamma_c=\log \pi_c$,

which corresponds to multinomial logistic regression.

The difference is that the NBC maximizes the joint likelihood $\prod_n p(x_n,y_n|\theta)$, whereas logistic regression maximizes the conditional likelihood $\prod_n p(y_n|x_n,\theta)$.
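
As a sanity check of this equivalence, the following sketch (with made-up parameters) computes the naive Bayes posterior directly and via the softmax parameterization, and verifies that they match:

```python
import numpy as np

# Check that categorical naive Bayes yields the same class posterior as a
# softmax over one-hot features. All parameters are made up for illustration.
rng = np.random.default_rng(3)
C, D, K = 3, 2, 4
prior = rng.dirichlet(np.ones(C))                 # pi_c
theta = rng.dirichlet(np.ones(K), size=(D, C))    # theta[d, c, k] = p(x_d=k | y=c)

x = np.array([2, 0])                              # raw categorical observation
x_onehot = np.zeros((D, K))
x_onehot[np.arange(D), x] = 1                     # x_{dk} = I(x_d = k)

# Naive Bayes posterior, computed directly from the generative model
joint = prior * np.array([np.prod([theta[d, c, x[d]] for d in range(D)])
                          for c in range(C)])
nb_post = joint / joint.sum()

# Same posterior as a softmax: gamma_c = log pi_c, weights = log theta_{dck}
gamma = np.log(prior)
logits = gamma + np.einsum("dk,dck->c", x_onehot, np.log(theta))
softmax_post = np.exp(logits - logits.max())
softmax_post /= softmax_post.sum()

print(np.allclose(nb_post, softmax_post))         # True: the two forms agree
```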