5.4 Empirical risk minimization
How do we apply frequentist decision theory to supervised learning?
5.4.1 Empirical risk
In supervised learning, the true state of nature is the data distribution $p^*(x, y)$, and the estimator is a prediction function $f(x)$.
We define the population risk as:

$$R(f) = \mathbb{E}_{p^*(x, y)}\big[\ell(y, f(x))\big]$$
We can approximate $p^*$ using its empirical distribution:

$$p_{\mathcal{D}}(x, y \mid \mathcal{D}) = \frac{1}{N} \sum_{n=1}^N \delta(x - x_n)\, \delta(y - y_n)$$
Plugging this in gives the empirical risk:

$$R(f, \mathcal{D}) = \frac{1}{N} \sum_{n=1}^N \ell(y_n, f(x_n))$$
$R(f, \mathcal{D})$ is a random variable since it depends on the training set. We choose an estimator by minimizing the empirical risk over a specific hypothesis space of functions $\mathcal{H}$:

$$\hat{f}_{\text{ERM}} = \operatorname*{argmin}_{f \in \mathcal{H}} R(f, \mathcal{D})$$
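As a concrete illustration, here is a minimal sketch of ERM on a toy problem; the 1-D data, the threshold-classifier hypothesis space, and all names are hypothetical choices for the example, not from the source:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D binary classification data standing in for draws from p*(x, y).
x = rng.uniform(0, 1, size=100)
y = (x > 0.6).astype(int)  # true decision boundary at 0.6

def empirical_risk(f, x, y):
    # R(f, D) = (1/N) sum_n loss(y_n, f(x_n)), with the 0-1 loss here
    return np.mean(f(x) != y)

# Finite hypothesis space H: classifiers f_t(x) = 1[x > t] over a grid of thresholds.
thresholds = np.linspace(0, 1, 21)
hypotheses = [lambda u, t=t: (u > t).astype(int) for t in thresholds]

# ERM: pick the hypothesis with the smallest empirical risk.
risks = [empirical_risk(f, x, y) for f in hypotheses]
best = int(np.argmin(risks))
print(f"ERM threshold: {thresholds[best]:.2f}, empirical risk: {risks[best]:.3f}")
```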
Approximation error vs generalization error
Let’s define:
- $f^{**} = \operatorname*{argmin}_f R(f)$, the function that achieves the minimal possible population risk
- $f^* = \operatorname*{argmin}_{f \in \mathcal{H}} R(f)$, the best function in our hypothesis space
- $f^*_N = \operatorname*{argmin}_{f \in \mathcal{H}} R(f, \mathcal{D})$, the function that minimizes the empirical risk, since we can’t compute the population risk
One can show that the risk of our chosen predictor, compared to the best possible predictor, admits a two-term decomposition into the approximation error and the generalization error:

$$\mathbb{E}\big[R(f^*_N) - R(f^{**})\big] = \underbrace{\mathbb{E}\big[R(f^*) - R(f^{**})\big]}_{\text{approximation error}} + \underbrace{\mathbb{E}\big[R(f^*_N) - R(f^*)\big]}_{\text{generalization error}}$$
We can approximate the generalization error by the difference between the test and training set errors:

$$\mathbb{E}\big[R(f^*_N) - R(f^*)\big] \approx R(f, \mathcal{D}_{\text{test}}) - R(f, \mathcal{D}_{\text{train}})$$
We can reduce the approximation error with a more complex model, but this may result in overfitting and an increased generalization error.
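The sketch below illustrates this train/test estimate of the generalization error, using the same kind of toy threshold-classifier setup as above (all names and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=200)
y = (x > 0.6).astype(int)
x_train, y_train = x[:100], y[:100]   # D_train
x_test, y_test = x[100:], y[100:]     # D_test

def risk(f, x, y):
    # empirical 0-1 risk R(f, D)
    return np.mean(f(x) != y)

# Fit by ERM on the training split only (threshold classifiers as before).
thresholds = np.linspace(0, 1, 21)
t_hat = min(thresholds,
            key=lambda t: risk(lambda u: (u > t).astype(int), x_train, y_train))
f_hat = lambda u: (u > t_hat).astype(int)

# Generalization error estimate: test risk minus training risk.
gap = risk(f_hat, x_test, y_test) - risk(f_hat, x_train, y_train)
print(f"train: {risk(f_hat, x_train, y_train):.3f}, "
      f"test: {risk(f_hat, x_test, y_test):.3f}, gap: {gap:.3f}")
```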
Regularized risk
We add a complexity penalty $C(\cdot)$ to the objective function. Since we usually work with parametric functions, we apply the regularizer to the parameters themselves:

$$R_\lambda(\theta, \mathcal{D}) = R(\theta, \mathcal{D}) + \lambda C(\theta)$$
with $\lambda \geq 0$ a hyperparameter.
If the risk is the log loss $\ell(y, f(x; \theta)) = -\log p(y \mid x, \theta)$ and the penalty is the negative log prior $C(\theta) = -\log p(\theta)$, minimizing the regularized empirical risk is equivalent to MAP estimation:

$$\operatorname*{argmin}_\theta\; \frac{1}{N} \sum_{n=1}^N -\log p(y_n \mid x_n, \theta) - \lambda \log p(\theta) = \hat{\theta}_{\text{MAP}}$$
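As a sanity check of this correspondence, here is a hypothetical sketch for linear regression with Gaussian noise and a Gaussian prior: the MAP estimate found numerically matches the closed-form minimizer of the L2-regularized risk. All names and constants are made up for the example:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)
lam = 0.5  # plays the role of the prior precision

def neg_log_posterior(theta):
    # -log p(y|X, theta) - log p(theta), dropping constants, with unit noise
    # variance and prior p(theta) = N(0, (1/lam) I): this is exactly the
    # regularized risk 0.5 ||y - X theta||^2 + 0.5 lam ||theta||^2.
    return 0.5 * np.sum((y - X @ theta) ** 2) + 0.5 * lam * theta @ theta

theta_map = minimize(neg_log_posterior, np.zeros(D)).x

# Closed-form minimizer of the same L2-regularized (ridge) risk.
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

print(np.allclose(theta_map, theta_ridge, atol=1e-5))  # True: same estimate
```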
5.4.2 Structural risk
We estimate the hyperparameters with bilevel optimization:

$$\hat{\lambda} = \operatorname*{argmin}_\lambda \min_\theta R_\lambda(\theta, \mathcal{D})$$
However, this won’t work, since this technique will always pick the least amount of regularization, $\hat{\lambda} = 0$.
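A tiny sketch makes the failure visible: the inner minimum of the regularized training risk can only grow with $\lambda$, so the outer minimization always drives $\lambda$ to zero (the ridge setup is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ rng.normal(size=5) + rng.normal(size=30)

def inner_min(lam):
    # argmin_theta of R_lambda(theta, D) = ||y - X theta||^2 + lam ||theta||^2,
    # evaluated at its own minimizer: the best achievable *training* objective.
    theta = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
    return np.sum((y - X @ theta) ** 2) + lam * theta @ theta

for lam in (0.0, 0.1, 1.0, 10.0):
    print(f"lambda={lam:<4}: min regularized train risk = {inner_min(lam):.4f}")
# Monotonically increasing in lambda, so argmin over lambda is always 0.
```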
If we knew the population risk, we could minimize the structural risk and find the right complexity (the value of $\lambda$).
We can estimate the population risk via cross-validation or statistical learning theory.
5.4.3 Cross validation
Let $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{valid}}$ be two partitions of $\mathcal{D}$, and define:

$$\hat{\theta}_\lambda(\mathcal{D}) = \operatorname*{argmin}_\theta R_\lambda(\theta, \mathcal{D})$$
For each model $\lambda$, we fit it to the training set to get $\hat{\theta}_\lambda = \hat{\theta}_\lambda(\mathcal{D}_{\text{train}})$.
We then use the unregularized risk on the validation set as an estimate of the population risk:

$$R_\lambda^{\text{val}} = R_0\big(\hat{\theta}_\lambda(\mathcal{D}_{\text{train}}), \mathcal{D}_{\text{valid}}\big)$$
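A minimal sketch of this selection procedure, assuming a ridge-style regularized risk and squared-error loss (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 60, 10
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.5 * rng.normal(size=N)

X_tr, y_tr = X[:40], y[:40]   # D_train
X_va, y_va = X[40:], y[40:]   # D_valid

def fit(lam, X, y):
    # theta_hat_lambda = argmin_theta R_lambda(theta, D_train), in closed form
    # for the ridge risk ||y - X theta||^2 + lam ||theta||^2
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def unregularized_risk(theta, X, y):
    # R_0(theta, D): mean squared error, no penalty term
    return np.mean((y - X @ theta) ** 2)

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0]
scores = {lam: unregularized_risk(fit(lam, X_tr, y_tr), X_va, y_va)
          for lam in lambdas}
best_lam = min(scores, key=scores.get)
print(scores, "-> chosen lambda:", best_lam)
```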
5.4.4 Statistical learning theory (SLT)
Cross-validation is slow, so in SLT we derive analytical upper bounds on the population risk (or more precisely, on the generalization error). If the bound holds, we can be confident that minimizing the empirical risk will lead to low population risk.
For binary classifiers, when such a bound holds we say the hypothesis is probably approximately correct (PAC), and the hypothesis class is PAC learnable.
Bounding the generalization error
Let us use a finite hypothesis space $\mathcal{H}$ and some $\epsilon > 0$. For any dataset $\mathcal{D}$ of size $N$ drawn from $p^*$:

$$P\Big(\max_{h \in \mathcal{H}} \big|R(h) - R(h, \mathcal{D})\big| > \epsilon\Big) \leq 2\,|\mathcal{H}|\, e^{-2N\epsilon^2}$$

where:
- $R(h)$ is the population risk
- $R(h, \mathcal{D})$ is the empirical risk
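Setting the right-hand side to a failure probability $\delta$ and solving for $\epsilon$ gives $\epsilon = \sqrt{\log(2|\mathcal{H}|/\delta)/(2N)}$. A small numerical sketch (the sizes are arbitrary):

```python
import numpy as np

def pac_epsilon(H_size, N, delta=0.05):
    # smallest eps such that 2 |H| exp(-2 N eps^2) <= delta
    return np.sqrt(np.log(2 * H_size / delta) / (2 * N))

for N in (100, 1_000, 10_000):
    print(N, round(pac_epsilon(H_size=1000, N=N), 4))
# The guaranteed gap shrinks as O(sqrt(log|H| / N)): more data tightens the bound.
```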
VC dimension
If the hypothesis space is infinite, we need an estimate of the degrees of freedom of the hypothesis class. This is called the VC dimension. Unfortunately, it is hard to compute, and the bounds it yields are very loose.
However, various estimates of the generalization error, such as PAC-Bayesian bounds, have recently been developed, especially for DNNs.