Proba ML
11. Linear Regression

11.2 Least squares linear regression

11.2.1 Terminology

We consider models of the form:

$$p(y|\bold{x},\theta)=\mathcal{N}(y|w_0+\bold{w}^\top\bold{x},\sigma^2)$$

The vector $\bold{w}_{1:D}$ is known as the weights. Each coefficient $w_d$ specifies the change in output we expect if we change the corresponding input $x_d$ by 1 unit.

The term $w_0$ is the bias and corresponds to the output when all inputs are 0. This captures the unconditional mean of the response, $w_0=\mathbb{E}[y]$.

By assuming $\bold{x}$ can be written as $[1,x_1,\dots,x_D]$, we can absorb the term $w_0$ into $\bold{w}$.

If the output is multi-dimensional, $\bold{y}\in\mathbb{R}^J$, this is called multivariate linear regression:

$$p(\bold{y}|\bold{x},W)=\prod_{j=1}^J\mathcal{N}(y_j|\bold{w}_j^\top \bold{x},\sigma_j^2)$$

Generally, a straight line will underfit most datasets, but we can apply nonlinear transformations to the input features, $\bold{x}'=\phi(\bold{x})$.

As long as the parameters of $\phi$ are fixed, the model remains linear in the parameters, even if it’s not linear in the raw inputs.

Polynomial expansion is an example of a nonlinear transformation: if the input is 1d and we use an expansion of degree $d$, we get $\phi(x)=[1,x,x^2,\dots,x^d]$.
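As a minimal sketch of this idea, here is a degree-3 polynomial fit in NumPy; the toy sine dataset, the degree, and the noise level are illustrative choices, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1d dataset (hypothetical): a noisy sine wave.
x = np.linspace(0, 2 * np.pi, 50)
y = np.sin(x) + 0.1 * rng.standard_normal(50)

# Degree-3 polynomial expansion: phi(x) = [1, x, x^2, x^3].
d = 3
Phi = np.vander(x, N=d + 1, increasing=True)  # shape (50, 4)

# The model is still linear in w, so ordinary least squares applies.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ w
```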

11.2.2 Least squares estimation

To fit a linear regression model to the data, we minimize the negative log-likelihood on the training set:

$$
\begin{align}
\mathrm{NLL}(\bold{w},\sigma^2)&=-\sum_{n=1}^N \log \Bigg[\Big(\frac{1}{2\pi \sigma^2}\Big)^{1/2} \exp\Big(- \frac{1}{2\sigma^2} (y_n-\bold{w}^\top \bold{x}_n)^2\Big)\Bigg]\\
&= \frac{N}{2} \log (2\pi \sigma^2)+\frac{1}{2\sigma^2}\sum_{n=1}^N (y_n-\bold{w}^\top \bold{x}_n)^2
\end{align}
$$

If we only focus on optimizing $\bold{w}$ and not $\sigma^2$, minimizing the NLL is equivalent to minimizing the residual sum of squares (RSS):

$$\mathrm{RSS}(\bold{w})=\frac{1}{2}\sum_{n=1}^N (y_n-\bold{w}^\top \bold{x}_n)^2=\frac{1}{2}||X\bold{w}-\bold{y}||^2_2$$

11.2.2.1 Ordinary least squares

We can show that the gradient is given by:

$$\nabla_\bold{w}\mathrm{RSS}(\bold{w})=X^\top X \bold{w}-X^\top\bold{y}$$

We set it to zero and obtain the ordinary least squares solution:

$$\hat{\bold{w}}=(X^\top X)^{-1}X^\top \bold{y}$$

The quantity $X^\dag =(X^\top X)^{-1}X^\top$ is the left pseudo-inverse of the (non-square) matrix $X$.

We check that the solution is unique by showing that the Hessian is positive definite:

$$\mathrm{H}(\bold{w})=X^\top X$$

If $X$ is full rank (i.e., its columns are linearly independent), then $\mathrm{H}$ is positive definite, hence the least squares objective has a unique global minimum.
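A quick numerical check, on made-up data, that the closed-form solution above agrees with a library least squares solver:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical design matrix with a leading column of ones (absorbing w0).
N, D = 100, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, D))])
w_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(N)

# Normal-equations solution (fine here, but see the algorithmic caveats below).
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Library solver based on the pseudo-inverse.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(w_normal, w_lstsq)
```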


11.2.2.2 Geometric interpretation of least squares

We assume $N>D$, so there are more equations than unknowns: the system is over-determined.

We seek a vector $\hat{\bold{y}}\in\mathbb{R}^N$ that lies in the linear subspace spanned by the columns of $X$ and is as close as possible to $\bold{y}$:

$$\argmin_{\hat{\bold{y}}\in \mathrm{span}(\{\bold{x}_{:,1},\dots,\bold{x}_{:,D}\})} ||\bold{y}-\hat{\bold{y}}||_2$$

Since $\hat{\bold{y}}\in\mathrm{span}(X)$, there exists some weight vector $\bold{w}$ such that:

$$\hat{\bold{y}}=X\bold{w}$$

To minimize the norm of the residual $\bold{y}-\hat{\bold{y}}$, we want the residual to be orthogonal to every column of $X$:

$$X^\top (\bold{y}-X\bold{w})=\bold{0}\Rightarrow \bold{w}=(X^\top X)^{-1}X^\top \bold{y}$$

Hence, our projected value is given by:

$$\hat{\bold{y}}=X(X^\top X)^{-1}X^\top \bold{y}$$

Figure: geometric interpretation of least squares, where $\hat{\bold{y}}$ is the orthogonal projection of $\bold{y}$ onto the column space of $X$.
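A small sketch, on random hypothetical data, verifying the orthogonality property that defines the projection:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical over-determined system: N > D.
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)

# Orthogonal projection of y onto the column space of X.
w = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ w

# The residual is orthogonal to every column of X.
residual = y - y_hat
print(X.T @ residual)  # numerically zero vector
```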

11.2.2.3 Algorithmic issues

Even if we can theoretically invert $X^\top X$ to compute the pseudo-inverse $X^\dag$, we shouldn’t do so in practice, since $X^\top X$ could be ill-conditioned or singular.

A more general approach is to compute the pseudo-inverse using the SVD. sklearn.linear_model.LinearRegression uses scipy.linalg.lstsq, which in turn calls DGELSD, an SVD-based solver implemented by the LAPACK library, written in Fortran.

However, if $N\gg D$, it can be faster to use QR decomposition:

$$
\begin{align}
(QR)\bold{w}&=\bold{y} \\
Q^\top QR\bold{w}&=Q^\top \bold{y}\\
\bold{w}&=R^{-1}(Q^\top \bold{y})
\end{align}
$$

Since $R$ is upper triangular, we can solve it using back substitution, avoiding matrix inversion.
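A sketch of the QR route on a tall, randomly generated design matrix; scipy.linalg.qr with mode="economic" gives the thin factorization, and solve_triangular performs the back substitution:

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

rng = np.random.default_rng(3)

# Hypothetical tall design matrix (N >> D).
X = rng.standard_normal((10_000, 5))
y = rng.standard_normal(10_000)

# Economy-size QR: Q is (N, D) with orthonormal columns, R is (D, D) upper triangular.
Q, R = qr(X, mode="economic")

# Solve R w = Q^T y by back substitution instead of inverting R.
w = solve_triangular(R, Q.T @ y, lower=False)

assert np.allclose(w, np.linalg.lstsq(X, y, rcond=None)[0])
```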

An alternative to matrix decomposition is to use an iterative solver, such as the conjugate gradient method (assuming $X$ is symmetric positive definite) or GMRES (the generalized minimal residual method) in the general case.

11.2.2.4 Weighted least squares

In some cases, we want to associate a weight with each example. For example, in heteroskedastic regression, the variance depends on the input:

$$p(y|\bold{x},\theta)=\mathcal{N}(y|\bold{w}^\top \bold{x},\sigma^2(\bold{x}))$$

thus:

$$p(\bold{y}|X,\theta)=\mathcal{N}(\bold{y}|X\bold{w},\Lambda^{-1})$$

where $\Lambda=\mathrm{diag}(1/\sigma^2(\bold{x}_n))$.

One can show that the MLE is given by:

$$\hat{\bold{w}}=(X^\top \Lambda X)^{-1}X^\top \Lambda\bold{y}$$
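A minimal weighted least squares sketch, assuming the per-example noise standard deviations are known; the data-generating choices are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical heteroskedastic data: the noise level grows with |x|.
N = 200
X = np.column_stack([np.ones(N), rng.uniform(-3, 3, N)])
sigma = 0.1 + 0.3 * np.abs(X[:, 1])  # assumed-known per-example noise std
y = X @ np.array([1.0, 2.0]) + sigma * rng.standard_normal(N)

# Precision matrix Lambda = diag(1 / sigma^2) (dense here for clarity only).
Lam = np.diag(1.0 / sigma**2)

# Weighted least squares MLE.
w_wls = np.linalg.solve(X.T @ Lam @ X, X.T @ Lam @ y)
```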

11.2.3 Other approaches to computing the MLE

11.2.3.1 Solving for offset and slope separately

Above, we computed $(w_0,\bold{w})$ at the same time by adding a constant $\bold{1}$ column to $X$.

Alternatively, we can solve for $w_0$ and $\bold{w}$ separately:

$$
\begin{align}
\hat{\bold{w}}&=(X^\top_cX_c)^{-1}X^\top_c \bold{y}_c \\
&= \Big[\sum_{n=1}^N (\bold{x}_n-\bar{\bold{x}})(\bold{x}_n-\bar{\bold{x}})^\top\Big]^{-1}\Big[\sum_{n=1}^N (\bold{x}_n-\bar{\bold{x}})(y_n-\bar{y}) \Big] \\
\hat{w}_0&= \bar{y} -\bar{\bold{x}}^\top\hat{\bold{w}}
\end{align}
$$

Thus, we can first estimate $\hat{\bold{w}}$ on the centered data $(X_c,\bold{y}_c)$, then compute $\hat{w}_0$.
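In code, this two-step recipe might look as follows; the data are hypothetical and the centering is the only essential ingredient:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical data, with no column of ones in X.
X = rng.standard_normal((100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + 0.1 * rng.standard_normal(100)

# Center inputs and outputs, solve for the slopes, then recover the offset.
x_bar, y_bar = X.mean(axis=0), y.mean()
Xc, yc = X - x_bar, y - y_bar

w_hat = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
w0_hat = y_bar - x_bar @ w_hat
```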

11.2.3.2 1d linear regression

In the case of 1d inputs, the above formula reduces to:

$$
\begin{align}
\hat{w}_1&=\frac{\sum_{n=1}^N (x_n-\bar{x})(y_n-\bar{y})}{\sum_{n=1}^N (x_n-\bar{x})^2}=\frac{C_{xy}}{C_{xx}} \\
\hat{w}_0 &= \bar{y}-\hat{w}_1 \bar{x} =\mathbb{E}[y]-\hat{w}_1\mathbb{E}[x]
\end{align}
$$

where $C_{xy}=\mathrm{Cov}[X,Y]$ and $C_{xx}=\mathrm{Cov}[X,X]=\mathbb{V}[X]$.

11.2.3.3 Partial regression

We can compute the regression coefficient in the 1d case as:

$$R_{XY}\triangleq \frac{\partial }{\partial x}\mathbb{E}[Y|X=x]=\hat{w}_1$$

Consider now the 2d case: $Y=w_0+w_1 X_1 +w_2 X_2+\epsilon$, where $\mathbb{E}[\epsilon]=0$.

The optimal regression coefficient for $w_1$ is $R_{YX_1\cdot X_2}$, where we keep $X_2$ constant:

$$w_1=R_{YX_1\cdot X_2}=\frac{\partial}{\partial x}\mathbb{E}[Y|X_1=x,X_2]$$

We can extend this approach to all coefficients, by fixing all inputs but one.

11.2.3.4 Online computation of the MLE

We can update the means online:

$$\bar{x}^{(n+1)}=\bar{x}^{(n)}+\frac{1}{n+1}(x_{n+1}-\bar{x}^{(n)})$$

One can show that we can also update the covariance online as:

$$C_{xy}^{(n+1)}=\frac{1}{n+1}\Big[x_{n+1} y_{n+1}+nC_{xy}^{(n)}+n\bar{x}^{(n)}\bar{y}^{(n)}-(n+1)\bar{x}^{(n+1)}\bar{y}^{(n+1)}\Big]$$

We find similar results for $C_{xx}^{(n+1)}$.
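A small sketch of these recursive updates in the 1d case, on synthetic streaming data; after each step the current MLE can be read off as $\hat{w}_1=C_{xy}/C_{xx}$ and $\hat{w}_0=\bar{y}-\hat{w}_1\bar{x}$:

```python
import numpy as np

def update_stats(n, x_bar, y_bar, Cxy, Cxx, x, y):
    """One-step online update of the means and (co)variances after seeing (x, y)."""
    x_bar_new = x_bar + (x - x_bar) / (n + 1)
    y_bar_new = y_bar + (y - y_bar) / (n + 1)
    Cxy_new = (x * y + n * Cxy + n * x_bar * y_bar
               - (n + 1) * x_bar_new * y_bar_new) / (n + 1)
    Cxx_new = (x * x + n * Cxx + n * x_bar * x_bar
               - (n + 1) * x_bar_new * x_bar_new) / (n + 1)
    return n + 1, x_bar_new, y_bar_new, Cxy_new, Cxx_new

rng = np.random.default_rng(6)
xs = rng.standard_normal(1_000)
ys = 3.0 * xs + 1.0 + 0.1 * rng.standard_normal(1_000)

n, x_bar, y_bar, Cxy, Cxx = 0, 0.0, 0.0, 0.0, 0.0
for x, y in zip(xs, ys):
    n, x_bar, y_bar, Cxy, Cxx = update_stats(n, x_bar, y_bar, Cxy, Cxx, x, y)

w1_hat = Cxy / Cxx               # running slope estimate
w0_hat = y_bar - w1_hat * x_bar  # running offset estimate
```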

Figure: online estimates of the regression coefficients over time.

To extend this approach to $D$-dimensional inputs, the easiest approach is to use SGD. The resulting algorithm is called least mean squares (LMS).
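A minimal LMS sketch: plain SGD on the squared error, one example at a time; the step size and the synthetic stream are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data stream, with the bias column absorbed into X.
N, D = 5_000, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, D - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + 0.1 * rng.standard_normal(N)

# Least mean squares: one SGD step per example on 0.5 * (x_n^T w - y_n)^2.
w = np.zeros(D)
lr = 0.01
for x_n, y_n in zip(X, y):
    err = x_n @ w - y_n
    w -= lr * err * x_n  # gradient of the per-example squared error is err * x_n
```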

11.2.3.5 Deriving the MLE from a generative perspective

Linear regression is a discriminative model of the form $p(y|\bold{x})$, but we can also use generative models for regression.

The goal is to compute the conditional expectation:

$$f(\bold{x})=\mathbb{E}[y|\bold{x}]=\int y\,p(y|\bold{x})dy=\frac{\int y\,p(\bold{x},y)dy}{\int p(\bold{x},y)dy}$$

Suppose we fit $p(\bold{x},y)$ using an MVN. The MLEs for the parameters of the joint distribution are the empirical means and covariances:

$$
\begin{align}
\mu_x&=\frac{1}{N}\sum_n \bold{x}_n \\
\mu_y&=\frac{1}{N}\sum_n y_n \\
\Sigma_{xx}&=\frac{1}{N}\sum_n (\bold{x}_n -\mu_x)(\bold{x}_n-\mu_x)^\top = \frac{1}{N} X_c^\top X_c \\
\Sigma_{xy}&=\frac{1}{N}\sum_n (\bold{x}_n -\mu_x)(y_n-\mu_y) = \frac{1}{N} X_c^\top \bold{y}_c
\end{align}
$$

Hence we have:

$$\mathbb{E}[y|\bold{x}]=\mu_y+\Sigma_{xy}^\top\Sigma_{xx}^{-1}(\bold{x}-\mu_x)$$

which we can rewrite as:

$$\mathbb{E}[y|\bold{x}]=w_0+\bold{w}^\top \bold{x}$$

using $\bold{w}=\Sigma_{xx}^{-1}\Sigma_{xy}$ and $w_0=\mu_y-\bold{w}^\top\mu_x$.

Thus we see that fitting the joint model and then conditioning it is the same as fitting the conditional model. However, this is only true for Gaussian models.
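A small numerical check of this equivalence on synthetic data: fit the joint Gaussian by computing the empirical moments, condition on $\bold{x}$, and compare with the discriminative least squares fit (all data below are made up):

```python
import numpy as np

rng = np.random.default_rng(8)

# Hypothetical data from a linear-Gaussian model.
N = 1_000
X = rng.standard_normal((N, 2))
y = X @ np.array([1.5, -2.0]) + 0.5 + 0.1 * rng.standard_normal(N)

# Generative route: empirical means and covariances of the joint, then condition.
mu_x, mu_y = X.mean(axis=0), y.mean()
Xc, yc = X - mu_x, y - mu_y
Sigma_xx = Xc.T @ Xc / N
Sigma_xy = Xc.T @ yc / N
w = np.linalg.solve(Sigma_xx, Sigma_xy)  # Sigma_xx^{-1} Sigma_xy
w0 = mu_y - mu_x @ w

# Discriminative route: least squares with a bias column.
X1 = np.column_stack([np.ones(N), X])
w_ols = np.linalg.lstsq(X1, y, rcond=None)[0]

assert np.allclose(np.r_[w0, w], w_ols)
```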

11.2.3.6 Deriving the MLE for $\sigma^2$

After estimating $\hat{\bold{w}}_{mle}$, it is easy to show that:

$$\hat{\sigma}_{mle}^2=\argmin_{\sigma^2}\mathrm{NLL}(\hat{\bold{w}},\sigma^2)=\frac{1}{N}\sum_n(y_n-\hat{\bold{w}}^\top \bold{x}_n)^2$$

which is the MSE on the residuals.

11.2.4 Measuring goodness of fit

11.2.4.1 Residuals plot

We can check the fit of the model by plotting the residuals $r_n=y_n-\hat{y}_n$ vs the input $\bold{x}_n$, for a chosen dimension.

Figure: residuals vs input for two models, (a) and (b).

The model assumes the residuals follow a $\mathcal{N}(0,\sigma^2)$ distribution, so the residual plot should show a cloud of points around 0 without any obvious trend. This is why (b) above is a better fit than (a).

We can also plot the prediction $\hat{\bold{y}}$ against $\bold{y}$, which should show a diagonal line.

Figure: predicted vs true outputs for models (a) and (b).
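A sketch of both diagnostic plots with matplotlib, using a deliberately under-fit linear model on synthetic quadratic data so that the residual trend is visible; everything here is illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(9)

# Hypothetical 1d data: the true relationship is quadratic, the model is linear.
x = np.linspace(-3, 3, 200)
y = 0.5 * x**2 + x + 0.3 * rng.standard_normal(200)
X = np.column_stack([np.ones_like(x), x])
w = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ w

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs input: the U-shaped trend reveals the missing quadratic term.
ax1.scatter(x, y - y_hat, s=10)
ax1.axhline(0.0, color="k", lw=1)
ax1.set(xlabel="x", ylabel="residual")

# Predicted vs true: a good fit hugs the diagonal.
ax2.scatter(y, y_hat, s=10)
ax2.axline((0, 0), slope=1, color="k", lw=1)
ax2.set(xlabel="y (true)", ylabel="y (predicted)")
plt.show()
```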

11.2.4.2 Prediction accuracy and $R^2$

We can assess the fit quantitatively using the RSS or the RMSE:

$$
\begin{align}
\mathrm{RSS}&=\sum_n (y_n-\bold{w}^\top \bold{x}_n)^2\\
\mathrm{RMSE}&\triangleq\sqrt{\frac{1}{N}\mathrm{RSS}(\bold{w})}
\end{align}
$$

The $R^2$, or coefficient of determination, is a more interpretable measure:

$$R^2\triangleq 1-\frac{\sum_n (\hat{y}_n-y_n)^2}{\sum_n(\bar{y}-y_n)^2}=1-\frac{\mathrm{RSS}}{\mathrm{TSS}}\leq 1$$

It measures the variance in the predictions relative to a simple constant prediction $\hat{y}_n=\bar{y}$.
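As a small helper sketch, the RMSE and $R^2$ can be computed directly from the residuals; the function names are our own:

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared error: sqrt(RSS / N)."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - RSS / TSS."""
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - rss / tss
```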