Proba ML
9. Linear Discriminant Analysis

9.2 Gaussian Discriminant Analysis

GDA is a generative classifier in which the class-conditional densities are multivariate Gaussians:

$$p(x|y=c,\theta)=\mathcal{N}(x|\mu_c,\Sigma_c)$$

The corresponding class posterior has the form:

$$p(y=c|x,\theta) \propto \pi_c\, \mathcal{N}(x|\mu_c,\Sigma_c)$$

9.2.1 Quadratic decision boundaries

The log posterior over the class labels is:

$$\log p(y=c|x,\theta)=\log \pi_c- \frac{1}{2}\log |2\pi \Sigma_c | -\frac{1}{2} (x-\mu_c)^\top \Sigma_c^{-1} (x-\mu_c)+\mathrm{const}$$

This is called the discriminant function. The decision boundary between any two classes $c$ and $c'$ is a quadratic function of $x$.

This technique is called Quadratic Discriminant Analysis (QDA).

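As an illustration, here is a minimal numpy sketch (with hypothetical toy parameters of my own) that evaluates the QDA discriminant function $\log \pi_c -\frac{1}{2}\log|2\pi\Sigma_c| -\frac{1}{2}(x-\mu_c)^\top\Sigma_c^{-1}(x-\mu_c)$ for each class and predicts the argmax:

```python
import numpy as np

def qda_log_posterior(x, priors, mus, Sigmas):
    """Unnormalized log p(y=c|x) for each class c (QDA discriminant function)."""
    scores = []
    for pi_c, mu_c, Sigma_c in zip(priors, mus, Sigmas):
        diff = x - mu_c
        _, logdet = np.linalg.slogdet(2 * np.pi * Sigma_c)
        maha = diff @ np.linalg.solve(Sigma_c, diff)
        scores.append(np.log(pi_c) - 0.5 * logdet - 0.5 * maha)
    return np.array(scores)

# Hypothetical 2-class, 2d example: unequal covariances give a quadratic boundary.
priors = [0.5, 0.5]
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]

x = np.array([1.0, 1.5])
scores = qda_log_posterior(x, priors, mus, Sigmas)
print(scores, "-> predicted class:", scores.argmax())
```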

9.2.2 Linear decision boundaries

We now consider the case of a shared covariance matrix, $\Sigma_c=\Sigma$:

$$\begin{align} \log p(y=c|x,\theta)&=\log \pi_c - \frac{1}{2}(x-\mu_c)^\top \Sigma^{-1}(x-\mu_c)+\mathrm{const} \\ &= \underbrace{\log \pi_c -\frac{1}{2}\mu_c^\top \Sigma^{-1} \mu_c}_{\gamma_c} +x^\top\underbrace{\Sigma^{-1}\mu_c}_{\beta_c} \underbrace{-\frac{1}{2}x^\top\Sigma^{-1} x +\mathrm{const}}_{-\kappa} \\ &= \gamma_c+x^\top \beta_c -\kappa \end{align}$$

$\kappa$ is shared across all classes (it does not depend on $c$), so it can be dropped when comparing classes.

This technique is called Linear Discriminant Analysis (LDA), since the decision boundaries are now linear in $x$.

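With a shared covariance, the discriminant reduces to the linear score $\gamma_c + \beta_c^\top x$. A short sketch along the lines of the QDA example above (all names hypothetical):

```python
import numpy as np

def lda_scores(x, priors, mus, Sigma):
    """Linear discriminant scores gamma_c + beta_c^T x under a shared covariance."""
    Sigma_inv = np.linalg.inv(Sigma)
    betas = [Sigma_inv @ mu_c for mu_c in mus]
    gammas = [np.log(pi_c) - 0.5 * mu_c @ Sigma_inv @ mu_c
              for pi_c, mu_c in zip(priors, mus)]
    return np.array([g + x @ b for g, b in zip(gammas, betas)])

priors = [0.3, 0.7]
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
print(lda_scores(np.array([1.0, 1.0]), priors, mus, Sigma).argmax())
```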

9.2.3 Connection between LDA and Logistic Regression

From the LDA equation, we can write:

$$p(y=c|x,\theta)=\frac{e^{\beta_c^\top x+\gamma_c}}{\sum_{c'}e^{\beta_{c'}^\top x+\gamma_{c'}}}$$

This has the same form as multinomial logistic regression; the two models differ in how the parameters are estimated:

  • In LDA, we first fit the Gaussians (and the class prior) by maximizing the joint likelihood $p(x,y|\theta)$, then derive $w_c=[\gamma_c,\beta_c]$ from $\hat{\theta}$
  • In logistic regression, we estimate $w$ directly by maximizing the conditional likelihood $p(y|x,\theta)$

In the binary case, this reduces to:

$$\begin{align} p(y=1|x,\theta)&=\frac{1}{1+\exp \big((\beta_0-\beta_1)^\top x+\gamma_0-\gamma_1\big)}\\&=\sigma\big((\beta_1-\beta_0)^\top x+\gamma_1-\gamma_0\big) \end{align}$$

With:

$$\begin{align} \gamma_1-\gamma_0&=\frac{1}{2}\mu_0^\top \Sigma^{-1} \mu_0 -\frac{1}{2}\mu_1^\top \Sigma^{-1} \mu_1+\log \frac{\pi_1}{\pi_0} \\ &= -\frac{1}{2}(\mu_1-\mu_0)^\top\Sigma^{-1}(\mu_1+\mu_0)+\log \frac{\pi_1}{\pi_0} \end{align}$$

And by setting:

$$\begin{align} w&=\beta_1-\beta_0=\Sigma^{-1}(\mu_1-\mu_0) \\ w^\top x_0&=-(\gamma_1-\gamma_0) \end{align}$$

We end up with the binary logistic regression form:

$$p(y=1|x,\theta)=\sigma\big(w^\top (x-x_0)\big)$$

The MAP decision rule is:

$$\hat{y}(x)=1 \quad \mathrm{iff} \quad w^\top x>w^\top x_0$$

In the case where $\pi_0=\pi_1$ and $\Sigma=\sigma^2I$, we have $w=\sigma^{-2}(\mu_1-\mu_0)$: we project $x$ onto the line joining the two class means and compare the projection to $x_0=\frac{1}{2}(\mu_0+\mu_1)$, the midpoint of the means.
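As a sanity check on this equivalence, here is a minimal numerical sketch (toy parameters of my own choosing) verifying that the GDA posterior for class 1 matches the sigmoid of the linear score:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical binary model with a shared covariance.
pi = np.array([0.4, 0.6])
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
Sigma_inv = np.linalg.inv(Sigma)

betas = [Sigma_inv @ mu for mu in mus]
gammas = [np.log(p) - 0.5 * mu @ Sigma_inv @ mu for p, mu in zip(pi, mus)]
w = betas[1] - betas[0]

x = np.array([0.7, -0.2])

# Posterior from the softmax over the linear LDA scores.
scores = np.array([g + x @ b for g, b in zip(gammas, betas)])
p1_softmax = np.exp(scores[1]) / np.exp(scores).sum()

# Posterior from the sigmoid form sigma(w^T x + gamma_1 - gamma_0).
p1_sigmoid = sigmoid(w @ x + gammas[1] - gammas[0])

print(np.isclose(p1_softmax, p1_sigmoid))  # True
```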


9.2.4 Model fitting

We now discuss how to fit the GDA model using MLE. The likelihood is:

$$p(\mathcal{D}|\theta)=\prod_{n=1}^N \mathrm{Cat}(y_n|\pi)\prod_{c=1}^C\mathcal{N}(x_n|\mu_c,\Sigma_c)^{\mathbb{I}(y_n=c)}$$

Hence the log-likelihood is:

$$\log p(\mathcal{D}|\theta)= \sum_{n=1}^N \sum_{c=1}^C \mathbb{I}(y_n=c)\log \pi_c+ \sum^C_{c=1} \sum_{n:y_n=c} \log\mathcal{N}(x_n|\mu_c,\Sigma_c)$$

We can optimize $\pi_c$ and $(\mu_c,\Sigma_c)$ separately:

  • From section 4.2.4, the MLE of the prior is $\hat{\pi}_c=\frac{N_c}{N}$
  • From section 4.2.6, the MLEs for the Gaussian parameters are:

$$\begin{align} \hat{\mu}_c&=\frac{1}{N_c}\sum_{n:y_n=c} x_n \\ \hat{\Sigma}_c &= \frac{1}{N_c}\sum_{n:y_n=c} (x_n-\hat{\mu}_c)(x_n-\hat{\mu}_c)^\top \end{align}$$
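A minimal numpy sketch of these MLE estimates (function and variable names are my own):

```python
import numpy as np

def fit_gda_mle(X, y, n_classes):
    """MLE for GDA: class priors, class means, and per-class covariances."""
    N, D = X.shape
    priors, mus, Sigmas = [], [], []
    for c in range(n_classes):
        Xc = X[y == c]
        Nc = len(Xc)
        mu_c = Xc.mean(axis=0)
        diff = Xc - mu_c
        priors.append(Nc / N)
        mus.append(mu_c)
        Sigmas.append(diff.T @ diff / Nc)   # MLE divides by N_c, not N_c - 1
    return np.array(priors), np.array(mus), np.array(Sigmas)

# Hypothetical toy data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (70, 2))])
y = np.array([0] * 50 + [1] * 70)
priors, mus, Sigmas = fit_gda_mle(X, y, n_classes=2)
print(priors)
```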

Unfortunately, the MLE $\hat{\Sigma}_c$ can easily overfit when $N_c<D$ (the estimate is then singular). Below are some workarounds.

9.2.4.1 Tied covariance

If we tie the covariance matrices, $\Sigma_c=\Sigma$, we get linear decision boundaries. This also lets us pool the samples from all classes:

$$\hat{\Sigma}=\frac{1}{N}\sum_{c=1}^C \sum_{n:y_n=c}(x_n-\hat{\mu}_c)(x_n-\hat{\mu}_c)^\top$$
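Continuing the sketch above, a hedged implementation of the pooled estimate:

```python
import numpy as np

def tied_covariance(X, y, mus):
    """Pooled covariance: scatter around each point's own class mean, divided by N."""
    diffs = X - mus[y]          # mus is a (C, D) array; mus[y] picks each point's class mean
    return diffs.T @ diffs / len(X)
```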

9.2.4.2 Diagonal covariance

If we force $\Sigma_c$ to be diagonal, we reduce the number of parameters from $O(CD^2)$ to $O(CD)$, which avoids the overfitting problem. However, we lose the ability to capture correlations between features (this is the naive Bayes assumption).

We can restrict the model further by using a tied diagonal covariance matrix: this is called diagonal LDA.

9.2.4.3 MAP estimation

Forcing the covariance matrices to be diagonal is a rather strong assumption. Instead, we can perform MAP estimation of the covariance, using the prior from section 4.5.2:

$$\hat{\Sigma}_{map} = \lambda\, \mathrm{diag}(\hat{\Sigma}_{mle})+(1-\lambda)\hat{\Sigma}_{mle}$$

This technique is called regularized discriminant analysis (RDA).
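A one-line sketch of this shrinkage estimator, given an MLE covariance and a weight $\lambda\in[0,1]$ (names are my own):

```python
import numpy as np

def rda_covariance(Sigma_mle, lam):
    """Shrink the MLE covariance toward its diagonal (regularized discriminant analysis)."""
    return lam * np.diag(np.diag(Sigma_mle)) + (1 - lam) * Sigma_mle
```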

9.2.5 Nearest Centroid Classifier

If we assume a uniform prior over classes (and a tied covariance $\Sigma$), we compute the most probable label as:

$$\hat{y}(x)=\argmax_c \log p(y=c|x,\theta)=\argmin_c\, (x-\mu_c)^\top\Sigma^{-1}(x-\mu_c)$$

This is called the nearest centroid classifier, where the distance is measured using squared Mahalanobis distance.
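A minimal sketch of this rule, assuming class means `mus` and a tied covariance `Sigma` are given:

```python
import numpy as np

def nearest_centroid(x, mus, Sigma):
    """Return the class whose mean is closest in squared Mahalanobis distance."""
    Sigma_inv = np.linalg.inv(Sigma)
    d2 = [(x - mu) @ Sigma_inv @ (x - mu) for mu in mus]
    return int(np.argmin(d2))
```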

We can replace this distance with any other distance metric:

$$\hat{y}(x)=\argmin_c d^2(x,\mu_c)$$

One simple approach is to learn the distance metric, parameterized as:

$$d^2(x,\mu_c)=\|x-\mu_c\|_W^2=(x-\mu_c)^\top(W^\top W)(x-\mu_c) = \|W(x-\mu_c)\|^2$$

The class posterior becomes:

$$p(y=c|x,\mu,W)=\frac{\exp(-\frac{1}{2}\|W(x-\mu_c)\|^2_2)}{\sum_{c'}\exp(-\frac{1}{2}\|W(x-\mu_{c'})\|^2_2)}$$

We can optimize $W$ using gradient descent applied to the discriminative loss. This is called nearest class mean metric learning.

This technique can be used for one-shot learning of new classes, since we only need a single labeled prototype $\mu_c$ per class (assuming we have learned a good $W$).
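Below is a rough sketch of nearest class mean metric learning with plain gradient descent on the softmax cross-entropy of the posterior above; the analytic gradient and all names are my own, not taken from the source:

```python
import numpy as np

def softmax(logits):
    logits = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def fit_metric(X, y, mus, K, n_steps=200, lr=0.1, seed=0):
    """Learn W (K x D) by gradient descent on the cross-entropy of
    p(y=c|x) = softmax_c(-0.5 ||W (x - mu_c)||^2).
    X: (N, D), y: integer labels in [0, C), mus: (C, D) array of class means."""
    N, D = X.shape
    C = len(mus)
    W = 0.1 * np.random.default_rng(seed).normal(size=(K, D))
    diffs = X[:, None, :] - mus[None, :, :]          # (N, C, D): x_n - mu_c
    onehot = np.eye(C)[y]                            # (N, C)
    for _ in range(n_steps):
        proj = np.einsum('kd,ncd->nck', W, diffs)    # W (x_n - mu_c)
        logits = -0.5 * (proj ** 2).sum(axis=2)      # (N, C)
        P = softmax(logits)
        # dL/dW = -(1/N) W sum_{n,c} (P - onehot)_{nc} (x_n - mu_c)(x_n - mu_c)^T
        S = np.einsum('nc,ncd,nce->de', P - onehot, diffs, diffs) / N
        W -= lr * (-W @ S)
    return W
```

The learned $W$ can then be plugged into the class posterior above, either for the training classes or for the means of previously unseen classes.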

9.2.6 Fisher LDA

Discriminant analysis is a generative approach to classification that requires fitting an MVN to the features. This can be problematic in high dimensions.

An alternative is to reduce the features to $z\in\mathbb{R}^K$ by finding $W\in\mathbb{R}^{K\times D}$ such that $z=Wx$, and then fit an MVN in the low-dimensional space. There are several ways to choose $W$:

  • The simplest approach is PCA, but PCA is unsupervised, and ignoring the labels can lead to suboptimal low-dimensional features
  • Alternatively, we could use a gradient-based method to optimize the log-likelihood derived from the class posterior in the low-dimensional space, as we did with nearest class mean metric learning
  • A third approach relies on eigenvalue decomposition: find $W$ such that the low-dimensional data can be classified as well as possible using a Gaussian class-conditional density model. This is Fisher's linear discriminant analysis (FLDA)

The drawback of FLDA is that it is restricted to $K\leq C-1$ dimensions, regardless of $D$.

In the two-class case, we seek a single vector $w$ onto which we project the data.

9.2.6.1 Derivation of the optimal 1d projection

We now estimate the optimal direction $w$. The class-conditional means are:

$$\mu_k=\frac{1}{N_k}\sum_{n:y_n=k}x_n\quad \forall k\in\{1,2\}$$

Let $m_k=w^\top \mu_k$ be the projection of each class mean onto the line, and $z_n=w^\top x_n$ the projection of each data point. The variance of the projected data within class $k$ is:

$$s_k^2=\sum_{n:y_n=k}(z_n-m_k)^2$$

The goal is to maximize the distance between the projected means while minimizing the variance of the projected clusters. We maximize:

$$J(w)=\frac{(m_2-m_1)^2}{s^2_1+s^2_2} = \frac{w^\top S_B w}{w^\top S_W w}$$

where:

  • $S_B=(\mu_2-\mu_1)(\mu_2-\mu_1)^\top$ is the between-class scatter matrix
  • $S_W=\sum_{n:y_n=1}(x_n-\mu_1)(x_n-\mu_1)^\top +\sum_{n:y_n=2}(x_n-\mu_2)(x_n-\mu_2)^\top$ is the within-class scatter matrix

$J(w)$ is maximized when:

$$S_B w=\lambda S_W w$$

with $\lambda=\frac{w^\top S_B w}{w^\top S_W w}$.

Therefore, if $S_W$ is invertible:

$$\begin{align} S_W^{-1}S_B w &=\lambda w \\ &= S_W^{-1}(\mu_2-\mu_1)(m_2-m_1) \end{align}$$

Therefore:

$$w \propto S_W^{-1}(\mu_2-\mu_1)$$

And since we only care about the direction and not the scale:

$$w=S_W^{-1}(\mu_2-\mu_1)$$

If $S_W\propto I$, then $w$ is proportional to $\mu_2-\mu_1$, which is intuitive.
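A minimal sketch of the two-class Fisher direction (assuming binary labels $y_n\in\{0,1\}$; names are my own):

```python
import numpy as np

def flda_direction(X, y):
    """Fisher direction w = S_W^{-1} (mu_2 - mu_1) for binary labels y in {0, 1}."""
    mu1, mu2 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    S_W = sum((Xk - mk).T @ (Xk - mk)
              for Xk, mk in [(X[y == 0], mu1), (X[y == 1], mu2)])
    return np.linalg.solve(S_W, mu2 - mu1)
```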


9.2.6.2 Multiple classes

We now define:

$$\begin{align} z_n&=Wx_n\\ m_c&=\frac{1}{N_c}\sum_{n:y_n=c} z_n\\ m&=\frac{1}{N}\sum_c N_c m_c \end{align}$$

We have the following scatter matrices:

$$\begin{align} \tilde{S}_B&=\sum_{c=1}^C N_c (m_c-m)(m_c-m)^\top \\ \tilde{S}_W&=\sum_{c=1}^C \sum_{n:y_n=c}(z_n-m_c)(z_n-m_c)^\top \end{align}$$

The objective function is:

$$J(W)=\frac{|\tilde{S}_B|}{|\tilde{S}_W|}=\frac{|W^\top S_B W|}{|W^\top S_W W|}$$

where $S_B$ and $S_W$ are the corresponding scatter matrices computed in the original feature space.

This is maximized by:

$$W=S_W^{-1/2}U$$

where $U$ contains the $K$ leading eigenvectors of $S_W^{-1/2}S_B S_W^{-1/2}$, assuming $S_W$ is non-singular.

FLDA typically discriminates classes better than PCA, but it is limited to $K\leq C-1$ dimensions, which limits its usefulness.
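A hedged numpy sketch of this solution (computing $S_W^{-1/2}$ via an eigendecomposition of $S_W$; I store $W$ as a $D\times K$ matrix and project with $Z = XW$, which is one possible convention):

```python
import numpy as np

def flda_multiclass(X, y, K):
    """K leading FLDA directions via the eigenvectors of S_W^{-1/2} S_B S_W^{-1/2}."""
    N, D = X.shape
    mu = X.mean(axis=0)
    S_B = np.zeros((D, D))
    S_W = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_B += len(Xc) * np.outer(mu_c - mu, mu_c - mu)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)
    # Symmetric inverse square root of S_W (assumes S_W is non-singular).
    evals, evecs = np.linalg.eigh(S_W)
    S_W_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    # K leading eigenvectors of the whitened between-class scatter.
    e, U = np.linalg.eigh(S_W_inv_sqrt @ S_B @ S_W_inv_sqrt)
    U = U[:, np.argsort(e)[::-1][:K]]
    return S_W_inv_sqrt @ U     # (D, K); low-dimensional features are Z = X @ W
```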
