11.3 Ridge regression
Maximum likelihood estimation can result in overfitting. A simple solution is to use MAP estimation with a zero-mean Gaussian prior:
$$p(\boldsymbol{w}) = \mathcal{N}(\boldsymbol{w} \mid \boldsymbol{0}, \tau^2 \mathbf{I})$$
We compute the MAP estimate as:
$$\hat{\boldsymbol{w}}_{\text{map}} = \arg\min_{\boldsymbol{w}} \frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - \boldsymbol{w}^\top \boldsymbol{x}_n)^2 + \frac{1}{2\tau^2} \|\boldsymbol{w}\|_2^2 = \arg\min_{\boldsymbol{w}} \mathrm{RSS}(\boldsymbol{w}) + \lambda \|\boldsymbol{w}\|_2^2, \qquad \lambda \triangleq \frac{\sigma^2}{\tau^2}$$
Therefore we are penalizing weights that become too large in magnitude. This is called $\ell_2$ regularization or weight decay.
We don’t penalize the offset term $w_0$, since it doesn’t contribute to overfitting.
11.3.1 Computing the MAP estimate
The MAP estimate corresponds to minimizing the objective:
$$J(\boldsymbol{w}) = (\boldsymbol{y} - \mathbf{X}\boldsymbol{w})^\top (\boldsymbol{y} - \mathbf{X}\boldsymbol{w}) + \lambda \|\boldsymbol{w}\|_2^2$$
Setting the gradient with respect to $\boldsymbol{w}$ to zero, we have:
$$\nabla_{\boldsymbol{w}} J(\boldsymbol{w}) = 2\left(\mathbf{X}^\top \mathbf{X} \boldsymbol{w} - \mathbf{X}^\top \boldsymbol{y} + \lambda \boldsymbol{w}\right) = \boldsymbol{0}$$
Hence:
$$\hat{\boldsymbol{w}}_{\text{map}} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}_D)^{-1} \mathbf{X}^\top \boldsymbol{y}$$
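As a concrete check, here is a minimal NumPy sketch of this closed-form estimate (the function name `ridge_map` and the synthetic data are ours, purely for illustration):

```python
import numpy as np

def ridge_map(X, y, lam):
    """MAP (ridge) estimate w = (X^T X + lam * I)^{-1} X^T y.
    Assumes X and y are already centered, so no offset term is needed."""
    D = X.shape[1]
    A = X.T @ X + lam * np.eye(D)
    # Solve the linear system rather than explicitly inverting A.
    return np.linalg.solve(A, X.T @ y)

# Tiny usage example on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=50)
w_hat = ridge_map(X, y, lam=1.0)
```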
11.3.1.1 Solving using QR
Naively performing the matrix inversion can be slow and numerically unstable. Instead, we can convert the problem into a standard least squares problem, to which we can apply the QR decomposition as previously seen.
We assume the prior has the form $p(\boldsymbol{w}) = \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Lambda}^{-1})$, where $\boldsymbol{\Lambda}$ is the precision matrix (for ridge, $\boldsymbol{\Lambda} = (1/\tau^2)\mathbf{I}$).
We can emulate this prior by augmenting our training data as:
$$\tilde{\mathbf{X}} = \begin{pmatrix} \mathbf{X}/\sigma \\ \sqrt{\boldsymbol{\Lambda}} \end{pmatrix}, \qquad \tilde{\boldsymbol{y}} = \begin{pmatrix} \boldsymbol{y}/\sigma \\ \boldsymbol{0}_{D \times 1} \end{pmatrix}$$
where $\sqrt{\boldsymbol{\Lambda}}$ is a Cholesky-type factor of the precision matrix, i.e. $\boldsymbol{\Lambda} = \sqrt{\boldsymbol{\Lambda}}^\top \sqrt{\boldsymbol{\Lambda}}$.
We now show that the RSS on this expanded data is equivalent to the penalized RSS on the original data:
$$\begin{aligned} f(\boldsymbol{w}) &= (\tilde{\boldsymbol{y}} - \tilde{\mathbf{X}}\boldsymbol{w})^\top (\tilde{\boldsymbol{y}} - \tilde{\mathbf{X}}\boldsymbol{w}) \\ &= \frac{1}{\sigma^2}(\boldsymbol{y} - \mathbf{X}\boldsymbol{w})^\top(\boldsymbol{y} - \mathbf{X}\boldsymbol{w}) + (\sqrt{\boldsymbol{\Lambda}}\boldsymbol{w})^\top(\sqrt{\boldsymbol{\Lambda}}\boldsymbol{w}) \\ &= \frac{1}{\sigma^2}(\boldsymbol{y} - \mathbf{X}\boldsymbol{w})^\top(\boldsymbol{y} - \mathbf{X}\boldsymbol{w}) + \boldsymbol{w}^\top \boldsymbol{\Lambda} \boldsymbol{w} \end{aligned}$$
Hence, the MAP estimate is given by:
$$\hat{\boldsymbol{w}}_{\text{map}} = (\tilde{\mathbf{X}}^\top \tilde{\mathbf{X}})^{-1} \tilde{\mathbf{X}}^\top \tilde{\boldsymbol{y}}$$
We can then solve this using the standard OLS method, by computing the QR decomposition of $\tilde{\mathbf{X}}$. This takes $O((N+D)D^2)$ time.
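A minimal NumPy/SciPy sketch of this data-augmentation trick, assuming the simple ridge prior $\boldsymbol{\Lambda} = \lambda \mathbf{I}$ and $\sigma = 1$ (so $\sqrt{\boldsymbol{\Lambda}} = \sqrt{\lambda}\,\mathbf{I}$); the function name is ours:

```python
import numpy as np
from scipy.linalg import solve_triangular

def ridge_via_qr(X, y, lam):
    """Ridge via data augmentation + QR. Assumes sigma = 1 and
    Lambda = lam * I, so the Cholesky factor is sqrt(lam) * I."""
    N, D = X.shape
    X_tilde = np.vstack([X, np.sqrt(lam) * np.eye(D)])  # shape (N + D, D)
    y_tilde = np.concatenate([y, np.zeros(D)])          # shape (N + D,)
    Q, R = np.linalg.qr(X_tilde)                        # reduced QR: R is D x D upper triangular
    return solve_triangular(R, Q.T @ y_tilde)           # back-substitution
```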
11.3.1.2 Solving using SVD
In this section, we assume $D \gg N$, which is the setting where ridge regression is typically most useful. In this case, it is faster to compute an SVD than a QR decomposition.
Let $\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^\top$ be the (economy-size) SVD of $\mathbf{X}$, where $\mathbf{S}$ is the $N \times N$ diagonal matrix of singular values, with:
- $\mathbf{U} \in \mathbb{R}^{N \times N}$, so that $\mathbf{U}^\top \mathbf{U} = \mathbf{U}\mathbf{U}^\top = \mathbf{I}_N$
- $\mathbf{V} \in \mathbb{R}^{D \times N}$, so that $\mathbf{V}^\top \mathbf{V} = \mathbf{I}_N$
One can show that:
$$\hat{\boldsymbol{w}}_{\text{map}} = \mathbf{V}(\mathbf{Z}^\top \mathbf{Z} + \lambda \mathbf{I}_N)^{-1} \mathbf{Z}^\top \boldsymbol{y}, \qquad \text{where } \mathbf{Z} \triangleq \mathbf{U}\mathbf{S} \in \mathbb{R}^{N \times N}$$
In other words, we can replace the $D$-dimensional vectors $\boldsymbol{x}_i$ with the $N$-dimensional vectors $\boldsymbol{z}_i$ and perform our penalized fit as before.
The resulting complexity is $O(DN^2)$, which is less than $O(ND^2)$ if $D > N$.
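A minimal NumPy sketch of this SVD reduction (the function name is ours):

```python
import numpy as np

def ridge_via_svd(X, y, lam):
    """Ridge for D >> N via the economy SVD: w = V (Z^T Z + lam I_N)^{-1} Z^T y,
    where Z = U S is N x N, so the linear solve is only N-dimensional."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # U: (N, N), s: (N,), Vt: (N, D) when D > N
    Z = U * s                                         # equivalent to U @ np.diag(s)
    w_z = np.linalg.solve(Z.T @ Z + lam * np.eye(len(s)), Z.T @ y)
    return Vt.T @ w_z                                 # map the N-dim solution back to D dims
```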
11.3.2 Connection between ridge regression and PCA
The ridge predictions on the training set are given by:
$$\hat{\boldsymbol{y}} = \mathbf{X}\hat{\boldsymbol{w}}_{\text{map}} = \mathbf{U}\tilde{\mathbf{S}}\mathbf{U}^\top \boldsymbol{y}$$
with:
$$\tilde{\mathbf{S}} \triangleq \mathbf{S}(\mathbf{S}^2 + \lambda \mathbf{I})^{-1}\mathbf{S}, \qquad \tilde{S}_{jj} = \frac{\sigma_j^2}{\sigma_j^2 + \lambda}$$
Hence:
$$\hat{\boldsymbol{y}} = \sum_{j} \boldsymbol{u}_j \frac{\sigma_j^2}{\sigma_j^2 + \lambda} \boldsymbol{u}_j^\top \boldsymbol{y}$$
In contrast, the least squares predictions are:
$$\hat{\boldsymbol{y}} = \mathbf{X}\hat{\boldsymbol{w}}_{\text{mle}} = \mathbf{U}\mathbf{U}^\top \boldsymbol{y} = \sum_{j} \boldsymbol{u}_j \boldsymbol{u}_j^\top \boldsymbol{y}$$
If $\sigma_j^2 \ll \lambda$, the direction $\boldsymbol{u}_j$ will have a small impact on the prediction. This is what we want, since small singular values correspond to directions with high posterior variance; these are the directions ridge shrinks the most.
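To make the shrinkage explicit, here is a small NumPy sketch that computes the per-direction factors $\sigma_j^2 / (\sigma_j^2 + \lambda)$ and the corresponding training predictions (function names are ours):

```python
import numpy as np

def shrinkage_factors(X, lam):
    """Ridge shrinks the component u_j^T y by sigma_j^2 / (sigma_j^2 + lam);
    least squares corresponds to a factor of 1 for every direction."""
    s = np.linalg.svd(X, compute_uv=False)  # singular values sigma_j
    return s**2 / (s**2 + lam)

def ridge_train_predictions(X, y, lam):
    """Training-set predictions y_hat = U @ diag(sigma_j^2/(sigma_j^2 + lam)) @ U^T y."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U @ ((s**2 / (s**2 + lam)) * (U.T @ y))
```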
There is a related technique called principal components regression: first use PCA to reduce the dimensionality to $K$, then run regression on these low-dimensional features. However, this is usually less accurate than ridge, since it uses only $K$ features (a hard selection), whereas ridge uses a soft weighting of all the dimensions.
11.3.3 Choosing the strength of the regularizer
To find the optimal $\lambda$, we can run cross-validation over a finite set of candidate values and estimate the expected loss for each.
This approach can be expensive when the set of hyper-parameter values is large, but fortunately we can often warm-start the optimization procedure, using $\hat{\boldsymbol{w}}(\lambda_k)$ as an initializer for $\hat{\boldsymbol{w}}(\lambda_{k+1})$.
If we order the values so that $\lambda_1 = \lambda_{\max}$ and $\lambda_{k+1} < \lambda_k$, we start from a high amount of regularization and gradually diminish it.
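A minimal sketch of the cross-validation search over a grid of $\lambda$ values (here with the closed-form ridge fit, so warm-starting is not actually needed; with an iterative solver one would pass the previous solution as the initializer). All names are ours:

```python
import numpy as np

def _ridge_fit(X, y, lam):
    """Closed-form ridge fit (see Section 11.3.1)."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

def cv_choose_lambda(X, y, lambdas, n_folds=5, seed=0):
    """Pick the lambda minimizing the average held-out squared error over K folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    losses = np.zeros(len(lambdas))
    for test in folds:
        train = np.setdiff1d(idx, test)
        for i, lam in enumerate(lambdas):
            w = _ridge_fit(X[train], y[train], lam)
            losses[i] += np.mean((y[test] - X[test] @ w) ** 2)
    return lambdas[int(np.argmin(losses))]

# Usage: lam_best = cv_choose_lambda(X, y, lambdas=np.logspace(2, -4, 20))
```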
We can also use empirical Bayes to choose $\lambda$, by computing:
$$\hat{\lambda} = \arg\max_{\lambda} \log p(\mathcal{D} \mid \lambda)$$
where $p(\mathcal{D} \mid \lambda)$ is the marginal likelihood or evidence.
This gives essentially the same result as the CV estimate; however, the Bayesian approach only has to fit a single model, and $p(\mathcal{D} \mid \lambda)$ is a smooth function of $\lambda$, so we can use gradient-based optimization instead of a discrete search.
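As an illustration of the empirical Bayes route, here is a sketch of the standard log marginal likelihood for Bayesian linear regression with prior $\mathcal{N}(\boldsymbol{w} \mid \boldsymbol{0}, \alpha^{-1}\mathbf{I})$ and noise precision $\beta$ (so $\lambda = \alpha/\beta$); maximizing it over $\alpha$, e.g. by gradient ascent on $\log\alpha$, yields $\hat{\lambda}$. The parameterization and function name are ours, not taken from the text:

```python
import numpy as np

def log_evidence(X, y, alpha, beta):
    """Log marginal likelihood log p(y | X, alpha, beta) for Bayesian linear
    regression with prior N(w | 0, alpha^{-1} I) and noise precision beta."""
    N, D = X.shape
    A = alpha * np.eye(D) + beta * X.T @ X      # posterior precision
    m = beta * np.linalg.solve(A, X.T @ y)      # posterior mean
    E = 0.5 * beta * np.sum((y - X @ m) ** 2) + 0.5 * alpha * m @ m
    _, logdet_A = np.linalg.slogdet(A)
    return (0.5 * D * np.log(alpha) + 0.5 * N * np.log(beta)
            - E - 0.5 * logdet_A - 0.5 * N * np.log(2 * np.pi))
```

This quantity is a smooth, differentiable function of $\alpha$, which is what enables the gradient-based search mentioned above.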