18.2 Ensemble learning
We saw that decision trees can be quite unstable, in the sense that their predictions may vary a lot under small perturbations of the input data: they are high-variance estimators.
A simple way to reduce the variance is to average the predictions of multiple models; this is called ensemble learning. The resulting model has the form

$$f(y|x) = \frac{1}{M} \sum_{m=1}^{M} f_m(y|x)$$

where $f_m$ is the $m$'th base model.
The ensemble will have similar bias to the base models, but lower variance, generally resulting in better overall performance.
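To make this concrete, here is a minimal sketch (not from the text) of averaging regression trees, assuming scikit-learn is available; the synthetic dataset and the bootstrap resampling used to make the base trees differ are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

M = 10  # number of base models
preds = []
for m in range(M):
    # Resample the training set so the fitted trees differ from one another
    # (illustrative choice; any way of producing diverse base models works).
    rng = np.random.default_rng(m)
    idx = rng.integers(0, len(X_train), len(X_train))
    tree = DecisionTreeRegressor(random_state=m).fit(X_train[idx], y_train[idx])
    preds.append(tree.predict(X_test))

# Ensemble prediction: f(x) = (1/M) * sum_m f_m(x)
f_ens = np.mean(preds, axis=0)
print("single tree MSE:", np.mean((preds[0] - y_test) ** 2))
print("ensemble MSE:   ", np.mean((f_ens - y_test) ** 2))
```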
Averaging is a sensible way to combine predictions from regression models. For classifiers, we instead take a majority vote of the outputs (this is called a committee method).
To see why this can help, suppose each base model is a binary classifier with accuracy $\theta$, and suppose class 1 is the correct class. Let $Y_m \in \{0, 1\}$ be the prediction of the $m$'th model and $S = \sum_{m=1}^{M} Y_m$ be the number of votes for class 1.
We define the final predictor to be the majority vote, i.e. class 1 if $S > M/2$ and class 0 otherwise. The probability that the ensemble picks class 1 is

$$p = \Pr(S > M/2) = 1 - B(M/2; M, \theta)$$

where $B(x; M, \theta)$ is the cdf of the $\mathrm{Bin}(M, \theta)$ distribution evaluated at $x$.
For $\theta = 0.51$ and $M = 1000$, we get $p \approx 0.73$. With $M = 10{,}000$ we get $p \approx 0.97$.
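These numbers can be checked numerically from the binomial cdf; the sketch below assumes scipy is available and, as above, that the voters' errors are independent.

```python
from scipy.stats import binom

def prob_majority_correct(M, theta):
    """P(S > M/2) where S ~ Bin(M, theta) counts the votes for class 1."""
    return 1.0 - binom.cdf(M // 2, M, theta)

for M in (1000, 10_000):
    print(M, prob_majority_correct(M, theta=0.51))
# roughly 0.73 for M = 1000 and 0.97 for M = 10000
```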
This dramatic improvement of the voting approach relies on the assumption that each predictor makes independent errors. In practice their mistakes may be correlated, but as long as we ensemble sufficiently diverse models, we can still come out ahead.
18.2.1 Stacking
An alternative to an unweighted average or majority vote is to learn how to combine the base models, using stacking or “stacked generalization”:

$$f(y|x) = \sum_{m=1}^{M} w_m f_m(y|x)$$

We need to learn the combination weights $w_m$ on a separate dataset, otherwise all their mass would be put on the best-performing base model.
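As an illustration, here is a minimal stacking sketch (the dataset and base learners are arbitrary choices, assuming scikit-learn): the base models are fit on one split, and a logistic-regression combiner learns the weights from their predictions on a separate held-out split. scikit-learn's StackingClassifier automates the same idea using internal cross-validation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.5, random_state=0)

# Fit the base models on the first split.
base_models = [DecisionTreeClassifier(max_depth=3, random_state=0), GaussianNB()]
for m in base_models:
    m.fit(X_train, y_train)

# Their predicted probabilities on the held-out split become features
# for the combiner, whose coefficients act as the combination weights w_m.
Z_hold = np.column_stack([m.predict_proba(X_hold)[:, 1] for m in base_models])
combiner = LogisticRegression().fit(Z_hold, y_hold)
print("learned combination weights:", combiner.coef_)
```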
18.2.2 Ensembling is not Bayes model averaging
Note that an ensemble of models is not the same as using BMA. An ensemble considers a larger hypothesis class of the form

$$p(y|x, \mathbf{w}, \boldsymbol{\theta}) = \sum_{m=1}^{M} w_m \, p(y|x, \theta_m)$$

whereas BMA uses

$$p(y|x, \mathcal{D}) = \sum_{m=1}^{M} p(m|\mathcal{D}) \, p(y|x, m, \mathcal{D})$$
The key difference is that the BMA weights $p(m|\mathcal{D})$ sum to one, and in the limit of infinite data only a single model will be chosen (the MAP model). In contrast, the ensemble weights are arbitrary and do not collapse onto a single model in this way.
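As a small numerical illustration (my own, assuming numpy and scipy) of this collapse: with two fixed candidate Bernoulli models and a uniform prior, the BMA weights $p(m|\mathcal{D})$ sum to one and concentrate on the better model as the dataset grows, whereas ensemble weights face no such constraint.

```python
import numpy as np
from scipy.stats import bernoulli

rng = np.random.default_rng(0)
thetas = np.array([0.6, 0.7])              # two fixed candidate models
for N in (10, 100, 1000):
    data = rng.binomial(1, 0.7, size=N)    # data truly comes from the second model
    log_lik = np.array([bernoulli.logpmf(data, t).sum() for t in thetas])
    log_post = log_lik - np.logaddexp.reduce(log_lik)  # uniform prior over models
    print(N, np.exp(log_post))             # weights sum to one; they concentrate on model 2
```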