Proba ML
9. Linear Discriminant Analysis

9.4 Generative vs Discriminative Classifiers

A model of the form $p(x,y)=p(y)p(x|y)$ is generative, since it can be used to generate features $x$ given the target $y$.

A model of the form $p(y|x)$ is discriminative, since it can only be used to discriminate between targets.
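To make the distinction concrete, here is a minimal numpy sketch (the 1D Gaussian class-conditionals and all parameter values are illustrative assumptions, not from the text): the generative model can sample $(x,y)$ pairs from the joint, while the discriminative quantity $p(y|x)$ only maps features to class probabilities.

```python
# Minimal sketch: x|y=c ~ N(mu_c, 1), with illustrative parameters.
import numpy as np

rng = np.random.default_rng(0)

prior = np.array([0.6, 0.4])   # p(y)
mu = np.array([-2.0, 2.0])     # class-conditional means for p(x|y)

def sample_joint(n):
    """Generate (x, y) pairs from p(x, y) = p(y) p(x|y)."""
    y = rng.choice(2, size=n, p=prior)
    x = rng.normal(mu[y], 1.0)
    return x, y

def posterior(x):
    """Discriminative quantity p(y|x), obtained here via Bayes' rule."""
    lik = np.exp(-0.5 * (x[:, None] - mu) ** 2)  # unnormalized N(x; mu_c, 1)
    joint = prior * lik
    return joint / joint.sum(axis=1, keepdims=True)

x, y = sample_joint(5)   # only possible with a generative model
print(posterior(x))      # both model types can produce p(y|x)
```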

9.4.1 Advantages of Discriminative Classifiers

  • Better predictive accuracy, because $p(y|x)$ is often much simpler to learn than $p(x,y)$.


  • Can handle arbitrary feature preprocessing, for example polynomial expansion of the input features or replacing strings with embeddings. This is hard to do with generative models, since the new features can be correlated in complex ways.
  • Well-calibrated probabilities. Some generative models, like the NBC, make strong and often invalid independence assumptions, leading to extreme posterior class probabilities (near 0 and 1); see the calibration sketch after this list.
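A small experiment sketching the calibration point, assuming scikit-learn is available (the duplicated-feature setup is my own illustration): the NBC treats correlated copies of a feature as independent evidence and pushes its posteriors toward 0 and 1, while logistic regression learns to share weight across the copies.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)
signal = rng.normal(2.0 * y - 1.0, 2.0, size=n)                 # one informative signal
X = np.column_stack([signal] * 5) + rng.normal(0, 0.1, (n, 5))  # five near-copies of it

nb = GaussianNB().fit(X, y)
lr = LogisticRegression(max_iter=1000).fit(X, y)

# NBC double-counts the five correlated copies, so its posteriors
# pile up near 0 and 1; LR stays closer to the true uncertainty.
print("NBC mean |p(y=1|x) - 0.5|:", np.abs(nb.predict_proba(X)[:, 1] - 0.5).mean())
print("LR  mean |p(y=1|x) - 0.5|:", np.abs(lr.predict_proba(X)[:, 1] - 0.5).mean())
```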

9.4.2 Advantages of Generative Classifiers

  • Easy to fit: the NBC only requires counting and averaging, whereas logistic regression has to solve a convex optimization problem, and neural nets non-convex ones (see the fitting sketch after this list).
  • Can easily handle missing values, by filling them in using the fitted generative model (e.g., with the empirical mean of each variable).
  • Can fit classes separately: we estimate each class-conditional density independently, so adding a new class does not require refitting the others. With discriminative models, however, all parameters interact, so we need to retrain the model if we add new classes.
  • Can handle unlabeled training data: it is easy to use generative models for semi-supervised learning, in which we combine labeled data $\mathcal{D}_{xy}=\{(x_n,y_n)\}$ and unlabeled data $\mathcal{D}_x=\{x_n\}$. Discriminative classifiers have no uniquely optimal way to leverage $\mathcal{D}_x$.
  • May be more robust to spurious features: a discriminative model $p(y|x)$ may pick features that discriminate $y$ well on the training set but generalize poorly beyond it. By contrast, generative models $p(x|y)$ may be better at capturing the causal mechanism underlying the data-generating process, and hence be more robust to distribution shifts.
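The first and third points can be seen in a minimal Gaussian NBC sketch (the class structure, parameters, and data below are illustrative assumptions): fitting each class-conditional is just per-class counting and averaging, and a new class can be added without touching the existing ones.

```python
import numpy as np

class GaussianNBC:
    """Naive Bayes with Gaussian class-conditionals, fit one class at a time."""
    def __init__(self):
        self.stats = {}   # class -> (count, mean, var)

    def fit_class(self, c, Xc):
        # Fitting p(x|y=c) is just averaging; it never looks at other classes.
        self.stats[c] = (len(Xc), Xc.mean(axis=0), Xc.var(axis=0) + 1e-6)

    def log_joint(self, X):
        # log p(y=c) + sum_d log N(x_d; mu_cd, var_cd), per class.
        n_total = sum(n for n, _, _ in self.stats.values())
        out = {}
        for c, (n, mu, var) in self.stats.items():
            log_lik = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var)
            out[c] = np.log(n / n_total) + log_lik.sum(axis=1)
        return out

rng = np.random.default_rng(0)
model = GaussianNBC()
model.fit_class(0, rng.normal(0.0, 1.0, size=(100, 3)))
model.fit_class(1, rng.normal(3.0, 1.0, size=(100, 3)))
model.fit_class(2, rng.normal(-3.0, 1.0, size=(50, 3)))   # new class, no retraining

scores = model.log_joint(np.array([[2.5, 3.2, 2.8]]))
print(max(scores, key=lambda c: scores[c][0]))            # -> 1
```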

9.4.3 Handling missing features

With a generative model, we can easily deal with missing parts of $x$ during training or testing, assuming the features are missing at random (MAR).

For example, suppose we don't have access to $x_1$. We then have to compute:

\begin{align}
p(y=c|x_{2:D},\theta) &\propto p(y=c|\pi)\, p(x_{2:D}|y=c,\theta) \\
&= p(y=c|\pi) \sum_{x_1} p(x_1, x_{2:D}|y=c,\theta)
\end{align}

If we make the naive Bayes assumption, this simplifies to:

\begin{align}
\sum_{x_1} p(x_1, x_{2:D}|y=c,\theta) &= \sum_{x_1} p(x_1|y=c,\theta) \prod_{d=2}^D p(x_d|y=c,\theta) \\
&= \prod_{d=2}^D p(x_d|y=c,\theta),
\end{align}

where the last step uses $\sum_{x_1} p(x_1|y=c,\theta)=1$.
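A minimal sketch of this computation for a discrete naive Bayes model (the binary features and all parameter values are illustrative assumptions): under the NB factorization, summing out a missing feature just drops its factor from the product, so we only multiply the likelihoods of the observed dimensions.

```python
import numpy as np

log_prior = np.log(np.array([0.5, 0.5]))   # log p(y=c)
# theta[c, d] = p(x_d = 1 | y = c) for D = 3 binary features
theta = np.array([[0.9, 0.2, 0.7],
                  [0.1, 0.8, 0.3]])

def log_posterior(x):
    """p(y=c | observed dims of x); np.nan marks a missing feature."""
    obs = ~np.isnan(x)
    # Missing dims are simply skipped: their factor sums to 1.
    log_lik = np.where(x[obs] == 1,
                       np.log(theta[:, obs]),
                       np.log(1 - theta[:, obs])).sum(axis=1)
    log_joint = log_prior + log_lik
    return log_joint - np.logaddexp.reduce(log_joint)   # normalize over classes

x = np.array([np.nan, 1.0, 0.0])   # x_1 missing, as in the text
print(np.exp(log_posterior(x)))    # posterior over classes using x_{2:D} only
```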