
18.6 Interpreting tree ensembles

Trees are popular because they are interpretable. Unfortunately, ensembles of trees lose that property.

Fortunately, there are some simple methods for interpreting the function that has been learned.

18.6.1 Feature importance

For a single decision tree $T$, we can consider the following measure of feature importance for feature $k$:

$$R_k(T)=\sum_{j=1}^{J-1} G_j\,\mathbb{I}(v_j=k)$$

where the sum is over all the non-leaf (internal) nodes, $G_j$ is the gain in accuracy (reduction in cost) at node $j$, and $v_j=k$ if node $j$ uses feature $k$.
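To make the formula concrete, here is a minimal sketch (not from the book) that computes $R_k(T)$ for a single fitted scikit-learn decision tree by summing the weighted impurity reduction over the internal nodes that split on feature $k$; the helper name `single_tree_importance` and the dataset are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

def single_tree_importance(tree, n_features):
    """R_k(T): sum the gain G_j over internal nodes j that split on feature k."""
    t = tree.tree_
    N = t.weighted_n_node_samples[0]
    R = np.zeros(n_features)
    for j in range(t.node_count):
        left, right = t.children_left[j], t.children_right[j]
        if left == -1:  # leaf node: no split here
            continue
        # weighted impurity reduction (the gain G_j) at internal node j
        G_j = (t.weighted_n_node_samples[j] * t.impurity[j]
               - t.weighted_n_node_samples[left] * t.impurity[left]
               - t.weighted_n_node_samples[right] * t.impurity[right]) / N
        R[t.feature[j]] += G_j  # I(v_j = k): credit the gain to the splitting feature
    return R

X, y = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
print(single_tree_importance(tree, X.shape[1]))  # matches tree.feature_importances_ up to normalization
```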

We can get a more accurate estimate by averaging over all $M$ trees of the ensemble:

$$R_k=\frac{1}{M}\sum_{m=1}^M R_k(T_m)$$

We can then normalize the scores so that the highest is 100%.
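As a sketch of the ensemble version (the dataset and random-forest settings below are illustrative, not from the book), scikit-learn's `feature_importances_` already averages $R_k(T_m)$ over the trees; we only rescale so the top feature scores 100%:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ is the per-tree gain-based importance averaged over the M trees;
# rescale so that the most important feature scores 100%.
relative = 100.0 * forest.feature_importances_ / forest.feature_importances_.max()

for name, score in sorted(zip(X.columns, relative), key=lambda t: -t[1])[:5]:
    print(f"{name:25s} {score:5.1f}%")
```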


However, there are two limitations of impurity-based feature importances:

  • impurity-based importances are biased towards high cardinality features;
  • impurity-based importances are computed on training-set statistics and therefore do not reflect a feature's ability to make predictions that generalize to the test set (when the model has enough capacity).

Instead, scikit-learn suggests using permutation importance.
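A minimal sketch of permutation importance on a held-out split, using the same illustrative dataset and forest as above (the choices are ours, not from the book):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle each feature column on the test set and measure the drop in score,
# which avoids the training-set and cardinality biases mentioned above.
result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"{X.columns[i]:25s} {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```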

18.6.2 Partial dependency plot (PDP)

After we have identified the most relevant input features, we can try to assess the impact they have on the output.

A partial dependency plot for feature $k$ has the form:

$$\bar{f}_k(x_k)=\frac{1}{N}\sum_{i=1}^N f(\bold{x}_{i,-k},x_k)$$

We then plot $\bar{f}_k$ versus $x_k$; thus we marginalize out all features except $k$.

In the case of a binary classifier, we can convert the output to log-odds before plotting.
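A minimal sketch of the formula above, assuming a fitted binary classifier `model` with a `predict_proba` method and a NumPy feature matrix `X` (all names here are hypothetical):

```python
import numpy as np

def partial_dependence_1d(model, X, k, grid):
    """bar{f}_k(x_k): average the model output over the data with feature k clamped."""
    pdp = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, k] = v                       # fix feature k, keep x_{i,-k} as observed
        p = model.predict_proba(X_mod)[:, 1]  # f(x_{i,-k}, x_k = v)
        p = np.clip(p, 1e-12, 1 - 1e-12)
        pdp.append(np.log(p / (1 - p)).mean())  # average log-odds over the N instances
    return np.array(pdp)

# Example usage (hypothetical): plot bar{f}_k over a grid of feature-k values.
# grid = np.linspace(X[:, k].min(), X[:, k].max(), 50)
# plt.plot(grid, partial_dependence_1d(model, X, k, grid))
```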

(Figure: partial dependence of the spam probability, in log-odds, on selected word/character frequencies; panel a) shows single-feature PDPs, panel b) a joint PDP for “hp” and “!”.)

In figure a), we see that the probability of spam increases as the frequency of “!” and “remove” increases.

Conversely, this probability decreases when the frequency of “edu” or “hp” increases.

We can also try to capture interaction effects between features $j$ and $k$ by computing:

$$\bar{f}_{jk}(x_j,x_k)=\frac{1}{N}\sum_{i=1}^N f(\bold{x}_{i,-jk},x_j,x_k)$$

In figure b), we see that the probability of spam increases when the frequency of “!” increases, but much more so when the word “hp” is missing.

See the scikit-learn documentation for computation methods.
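For reference, a minimal sketch using scikit-learn's built-in PDP tooling, including a two-feature interaction plot; the dataset, model, and feature choices are illustrative and not the book's spam example:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
clf = GradientBoostingClassifier(random_state=0).fit(X, y)

# One-way PDPs for two features, plus a two-way PDP capturing their interaction.
PartialDependenceDisplay.from_estimator(
    clf, X, features=["mean radius", "mean texture", ("mean radius", "mean texture")]
)
plt.show()
```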