Proba ML
18. Trees, Forests, Bagging and Boosting

18.1 Classification and regression tree (CART)

Classification and regression trees (also called CART models or decision trees) are defined by recursively partitioning the input space and defining a local model in each resulting region.

The overall model can be represented by a tree, with one leaf per region.

18.1.1 Model definition

We start by considering regression trees.

Trees consist of a set of nested decision rules. At each node $i$, the feature $d_i$ of the input vector $\bold{x}$ is compared to a threshold $t_i$: if it is lower, the input flows down the left branch, otherwise down the right.

At the leaves of the tree, the model specifies the predicted output for any input that falls in that region.

[Figure: a regression tree on two inputs $x_1, x_2$, and the corresponding axis-parallel partition of the 2d input space into 5 regions.]

In the above example, the first region of space is defined by:

$$R_1=\{\bold{x}:x_1\leq t_1, x_2 \leq t_2\}$$

The overall result is that we partition the 2d input space into 5 regions. We can now associate a mean response with each of these regions, resulting in a piecewise constant surface.

The output from region 1 can be estimated using:

$$w_1=\frac{\sum_{n=1}^N y_n \,\mathbb{I}(\bold{x}_n\in R_1)}{\sum_{n=1}^N\mathbb{I}(\bold{x}_n\in R_1)}$$

Formally, a regression tree can be defined as:

$$f(\bold{x};\bold{\theta})=\sum_{j=1}^J w_j\, \mathbb{I}(\bold{x}\in R_j)$$

where $w_j$ is the predicted output for region $j$, and $\bold{\theta}=\{(R_j,w_j): j=1:J\}$, where $J$ is the number of leaf nodes. We can express the first region above as $R_1=[(d_1\leq t_1),(d_2\leq t_2)]$.

For classification problems, the leaves contain a distribution over the class labels rather than the mean response.
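To make this concrete, here is a minimal sketch (not taken from the book or any library) of a regression tree represented as nested decision rules, with a routine that routes an input to its leaf; the `Node` structure and field names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # Internal node: compare feature `feature` to `threshold`.
    # Leaf node: `value` holds the predicted output w_j for that region.
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: Optional[float] = None

def predict(node: Node, x) -> float:
    """Route the input x down the tree until it reaches a leaf."""
    while node.value is None:  # still at an internal node
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value

# Toy tree: split on x_1 at t_1=0.5, then on x_2 at t_2=0.3 in the left branch.
tree = Node(feature=0, threshold=0.5,
            left=Node(feature=1, threshold=0.3,
                      left=Node(value=1.0), right=Node(value=2.0)),
            right=Node(value=3.0))

print(predict(tree, [0.2, 0.1]))  # lands in region R_1 -> 1.0
```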

18.1.2 Model fitting

To fit the model, we need to minimize the following loss:

$$\mathcal{L}(\bold{\theta})=\sum_{n=1}^N\ell(y_n,f(\bold{x}_n;\bold{\theta}))= \sum_{j=1}^J\sum_{\bold{x}_n \in R_j} \ell(y_n, w_j)$$

Unfortunately, this is not differentiable because of the need to learn the tree structure. Finding the optimal partitioning of the data is NP-complete.

The standard practice is to use a greedy procedure to grow the tree one node at a time. This approach is used by CART, C4.5, and ID3, three popular implementations of the method. Scikit-learn uses an optimized version of CART.
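For instance, a CART regression tree can be fit with scikit-learn's `DecisionTreeRegressor`; the synthetic data below is only for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))                        # 200 points in the 2d unit square
y = np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=200)  # noisy response depending on x_1 only

# Greedily grown regression tree (squared-error criterion by default).
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))  # the learned nested decision rules
```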

The idea is the following. Suppose we are at node $i$; let $\mathcal{D}_i=\{(\bold{x}_n, y_n)\in N_i\}$ be the set of examples that reach this node. We consider how to split this node into a left branch and a right branch to minimize the error in each child subtree.

If the $j$th feature is a real-valued scalar, we can partition $\mathcal{D}_i$ by comparing it to a threshold $t$. The set of possible thresholds $\mathcal{T}_j$ for feature $j$ can be obtained by sorting the unique values of $x_{nj}$.

For each possible threshold we define the left and right splits:

$$\begin{aligned} \mathcal{D}_i^L(j,t)&=\{ (\bold{x}_n,y_n) \in N_i:x_{n,j}\leq t\} \\ \mathcal{D}_i^R(j,t)&=\{ (\bold{x}_n,y_n) \in N_i:x_{n,j}>t\} \end{aligned}$$

If the $j$th feature is categorical, with $K_j$ possible values, we check whether the feature is equal to each of those values or not. This defines a set of $K_j$ possible binary splits:

$$\begin{aligned} \mathcal{D}_i^L(j,t)&=\{ (\bold{x}_n,y_n) \in N_i:x_{n,j}= t\} \\ \mathcal{D}_i^R(j,t)&=\{ (\bold{x}_n,y_n) \in N_i:x_{n,j}\neq t\} \end{aligned}$$

Alternatively, we could allow multi-way splits, but this can lead to data fragmentation, with too little data falling into each subtree, which in turn causes overfitting. Therefore it is more common to use binary splits.

Once we have computed $\mathcal{D}_i^L(j,t)$ and $\mathcal{D}_i^R(j,t)$ for each $j$ and $t$ at node $i$, we choose the best feature $j_i$ to split on, and the best value $t_i$ for this feature, as follows:

$$(j_i,t_i)=\arg \min_{j\in\{1,\dots,D\}} \min_{t\in\mathcal{T}_j} \frac{|\mathcal{D}_i^L(j,t)|}{|\mathcal{D}_i|}c(\mathcal{D}_i^L(j,t))+ \frac{|\mathcal{D}_i^R(j,t)|}{|\mathcal{D}_i|}c(\mathcal{D}_i^R(j,t))$$

We then partition the data and call the fitting algorithm recursively on each subset of the data.
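Here is a sketch of this exhaustive search over $(j, t)$ for numeric features, assuming NumPy arrays and a generic `cost` function like the ones defined below (the function name and interface are mine, not from a library):

```python
import numpy as np

def best_split(X, y, cost):
    """Return the (feature j, threshold t) minimizing the weighted cost of the two children."""
    n, d = X.shape
    best_j, best_t, best_score = None, None, np.inf
    for j in range(d):
        # Candidate thresholds: the sorted unique values of feature j.
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            right = ~left
            if left.all() or right.all():
                continue  # skip degenerate splits that leave a child empty
            # Weighted cost |D_L|/|D_i| * c(D_L) + |D_R|/|D_i| * c(D_R).
            score = left.mean() * cost(y[left]) + right.mean() * cost(y[right])
            if score < best_score:
                best_j, best_t, best_score = j, t, score
    return best_j, best_t
```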

We now discuss the cost function $c(\mathcal{D}_i)$, which is used to evaluate the cost of node $i$.

For regression, we could use the mean squared error:

$$c(\mathcal{D}_i)=\frac{1}{|\mathcal{D}_i|}\sum_{n \in \mathcal{D}_i}(\bar{y}-y_n)^2$$

where $\bar{y}=\frac{1}{|\mathcal{D}_i|}\sum_{n\in\mathcal{D}_i}y_n$ is the mean of the response variable for examples reaching node $i$.

For classification, we begin by computing the empirical distribution over class labels for this node:

$$\hat{\pi}_{i,c}=\frac{1}{|\mathcal{D}_i|}\sum_{n\in\mathcal{D}_i}\mathbb{I}(y_n=c)$$

Given this, we can compute the Gini index:

$$G_i=\sum_{c=1}^C \hat{\pi}_{i,c}(1-\hat{\pi}_{i,c})=1-\sum_{c=1}^C \hat{\pi}_{i,c}^2$$

This is the expected error rate, since $\hat{\pi}_{i,c}$ is the probability that a random entry in the leaf belongs to class $c$, and $1-\hat{\pi}_{i,c}$ is the probability that it would be misclassified.

Alternatively we can define the cost as the entropy or deviance of the node:

$$H_i=\mathbb{H}(\hat{\pi}_{i})=-\sum_{c=1}^C \hat{\pi}_{i,c}\log\hat{\pi}_{i,c}$$

A node that is pure (i.e., only has examples of one class) will have 0 entropy.
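These node costs are straightforward to compute; a minimal sketch, assuming `y` is a 1d NumPy array (of responses for regression, of class labels for classification):

```python
import numpy as np

def mse_cost(y):
    """Regression cost: mean squared error around the node mean."""
    return np.mean((y - y.mean()) ** 2)

def class_proportions(y):
    """Empirical class distribution pi_hat at the node."""
    _, counts = np.unique(y, return_counts=True)
    return counts / counts.sum()

def gini_cost(y):
    """Gini index: 1 - sum_c pi_c^2."""
    p = class_proportions(y)
    return 1.0 - np.sum(p ** 2)

def entropy_cost(y):
    """Entropy: -sum_c pi_c log pi_c (classes absent from the node contribute 0)."""
    p = class_proportions(y)
    return -np.sum(p * np.log(p))
```

Plugging `gini_cost` (or `mse_cost` for regression) into the `best_split` sketch above gives the standard greedy CART split.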

18.1.3 Regularization

If we let the tree become deep enough, it can reach an error of 0 (assuming no label noise) by partitioning the input space into sufficiently small regions. This leads to overfitting, and there are two main regularization approaches:

  1. Stop the tree growing process using some heuristic, such as reaching a maximum depth or having too few samples at a node.
  2. Grow the tree to its maximum depth and then prune it back, by merging split subtrees back into their parent. This can partially overcome the greedy nature of top-down tree growing, but growing and then pruning is slower than the greedy top-down approach alone (see the sketch below).
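Both strategies are available in scikit-learn, for example via `max_depth`/`min_samples_leaf` for early stopping and `ccp_alpha` for cost-complexity pruning; the dataset and hyperparameter values below are just illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 1. Stop growing early with simple heuristics.
stopped = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10).fit(X, y)

# 2. Grow more fully, then prune back via cost-complexity pruning.
pruned = DecisionTreeClassifier(ccp_alpha=0.01).fit(X, y)

print(stopped.get_depth(), pruned.get_depth())
```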

18.1.4 Handling missing input features

In general, it is hard for discriminative models to handle missing input features. However, for trees, simple heuristics can work well.

  1. Surrogate splits leverage highly correlated features, inducing a similar partition when the chosen variable is missing at test time. This can be thought of as learning a local joint model of the inputs. It has the advantage over a generative model of not modeling the entire joint distribution of inputs, but the disadvantage of being entirely ad hoc.
  2. A simpler approach is to create a new “missing” category and then treat the data as fully observed (see the snippet below).
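For the second option, with pandas this can be as simple as adding an explicit “missing” level before encoding; the column name and values below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", None, "blue", "red", None]})

# Treat missing values as their own category, then encode as usual.
df["color"] = df["color"].fillna("missing")
print(pd.get_dummies(df["color"]))  # one-hot encoding with a "missing" column
```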

18.1.5 Pros and cons

Tree models are popular for several reasons:

  • They are easy to interpret
  • They can easily handle mixed discrete and continuous inputs
  • They are insensitive to monotone transformations of the inputs (since splits are based on ranking the data points) so there is no need to standardize the data
  • They perform automatic variable selection
  • They are relatively robust to outliers
  • They are fast to fit, and scale well to large data sets.
  • They can handle missing input features

However, the primary disadvantage of trees is that they do not predict very accurately compared to other kinds of models, due to the greedy nature of the tree construction.

A related problem is that they are unstable: small changes of the input data can have large effects on the structure of the tree, due to its hierarchical nature.

In the example below, omitting a single training point can dramatically change the decision surface, due to the use of axis-parallel splits.

[Figure: decision surfaces of a tree fit to the full training set and of a tree fit to the same data with a single point removed.]
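A quick way to reproduce this instability (a sketch on synthetic data, not the data from the figure) is to fit the same tree with and without one training point and compare the two decision surfaces on a grid.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(size=(60, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)            # a simple diagonal boundary

full = DecisionTreeClassifier().fit(X, y)
loo = DecisionTreeClassifier().fit(X[1:], y[1:])   # same data minus one point

# Compare the two axis-parallel decision surfaces on a dense grid.
xx, yy = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
grid = np.column_stack([xx.ravel(), yy.ravel()])
print("fraction of grid points whose label changed:",
      np.mean(full.predict(grid) != loo.predict(grid)))
```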

We’ll later see how this instability can be turned into a virtue.