Proba ML
20. Dimensionality Reduction

20.1 Principal component analysis (PCA)

The simplest and most widely used form of dimensionality reduction is PCA. The basic idea is to find a linear and orthogonal projection of the high-dimensional data $\bold{x} \in \R^D$ to a low-dimensional subspace $\bold{z} \in \R^L$, such that the low-dimensional representation is a “good approximation” to the original data.

More specifically, if we project or encode $\bold{x}$ to get $\bold{z} = W^\top \bold{x}$, and then unproject or decode $\bold{z}$ to get $\hat{\bold{x}} = W\bold{z}$, then we want $\bold{x}$ and $\hat{\bold{x}}$ to be close in the $\ell_2$ distance.

We define the following reconstruction error or distortion:

$$\mathcal{L}(W) \triangleq \frac{1}{N}\sum_{n=1}^N \|\bold{x}_n - \mathrm{decode}(\mathrm{encode}(\bold{x}_n; W); W)\|_2^2$$

where the encoder and decoder stages are both linear maps.

We can minimize this objective by setting $\hat{W} = U_L$, where $U_L$ contains the $L$ eigenvectors with the largest eigenvalues of the empirical covariance matrix:

$$\hat{\Sigma} = \frac{1}{N}\sum_{n=1}^N (\bold{x}_n - \bar{\bold{x}})(\bold{x}_n - \bar{\bold{x}})^\top = \frac{1}{N} X_c^\top X_c$$

where $X_c$ is the centered version of the $N \times D$ design matrix.

This is equivalent to maximizing the likelihood of a latent linear Gaussian model known as probabilistic PCA.
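
To make this concrete, here is a minimal NumPy sketch (not from the book) of the recipe above on synthetic data: center $X$, take the top-$L$ eigenvectors of $\hat{\Sigma}$ as $W$, then encode, decode, and measure the distortion. All names and the data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, L = 500, 10, 3
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))  # correlated synthetic data

x_bar = X.mean(axis=0)
Xc = X - x_bar                                   # centered design matrix
Sigma_hat = (Xc.T @ Xc) / N                      # empirical covariance

evals, evecs = np.linalg.eigh(Sigma_hat)         # eigenvalues in ascending order
W = evecs[:, ::-1][:, :L]                        # top-L eigenvectors, shape (D, L)

Z = Xc @ W                                       # encode: z_n = W^T (x_n - mean)
X_hat = Z @ W.T + x_bar                          # decode: x_hat_n = W z_n + mean
distortion = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(distortion)                                # reconstruction error L(W)
```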

20.1.1 Examples

Below, we project 2d data onto a 1d line; this direction captures the most variation in the data.

Figure: 2d data projected onto its first principal direction.

We show PCA applied to the Olivetti face image dataset, which is a set of 64 x 64 grayscale images.

We project these to a 3d subspace and display the basis vectors (the columns of the projection matrix $W$), known as eigenfaces. We see that the main modes of variation of the data are related to overall lighting, and then to differences around the eyebrows.

Figure: eigenfaces (columns of $W$) learned from the Olivetti dataset.

If we use enough dimensions (but fewer than the original 4096), we can use the representation $\bold{z} = W^\top \bold{x}$ as input to a nearest-neighbor classifier to perform face recognition. This is faster and more reliable than working in pixel space.
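
Below is a hedged scikit-learn sketch of this pipeline (the dataset loader, the choice of $L=100$, and the train/test split are my own assumptions, not the book's): project the Olivetti faces with PCA and classify with 1-nearest-neighbor in the latent space.

```python
import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

faces = fetch_olivetti_faces()                   # 400 images of 64 x 64 = 4096 pixels
X, y = faces.data, faces.target                  # X: (400, 4096), y: person id
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

pca = PCA(n_components=100)                      # L = 100 << 4096 (illustrative choice)
Z_tr = pca.fit_transform(X_tr)                   # z = W^T (x - mean)
Z_te = pca.transform(X_te)

knn = KNeighborsClassifier(n_neighbors=1)
print("pixel space:", knn.fit(X_tr, y_tr).score(X_te, y_te))
print("PCA space:  ", knn.fit(Z_tr, y_tr).score(Z_te, y_te))
```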

20.1.2 Derivation of the algorithm

Suppose we have an unlabeled dataset $\mathcal{D} = \{\bold{x}_n : n = 1:N\}$, where $\bold{x}_n \in \R^D$. We can represent this as an $X \in \R^{N\times D}$ data matrix.

We assume $\bar{\bold{x}} = \frac{1}{N}\sum_{n=1}^N \bold{x}_n = \bold{0}$, which can be ensured by centering the data.

We would like to approximate each $\bold{x}_n$ by a latent vector $\bold{z}_n \in \R^L$. These latent variables are collectively called the latent factors.

We assume each $\bold{x}_n$ can be “explained” in terms of a weighted combination of basis functions $\bold{w}_1, \dots, \bold{w}_L$, where each $\bold{w}_k \in \R^D$ and the weights are $\bold{z}_n \in \R^L$, i.e., we assume:

$$\bold{x}_n \approx \sum_{k=1}^L z_{n,k}\bold{w}_k$$

We can measure the reconstruction error produced by this approximation as:

$$\begin{align} \mathcal{L}(W,Z) &= \frac{1}{N}\|X - ZW^\top\|_F^2 \\ &= \frac{1}{N}\|X^\top - WZ^\top\|_F^2 \\ &= \frac{1}{N}\sum_{n=1}^N \|\bold{x}_n - W\bold{z}_n\|_2^2 \end{align}$$

We want to minimize this subject to the constraint that $W$ is an orthogonal matrix.

We show below that the optimal solution is obtained by setting $\hat{W} = U_L$, where $U_L$ contains the $L$ eigenvectors with the largest eigenvalues of the empirical covariance matrix.

20.1.2.1 Base case

We start by estimating the best 1d solution $\bold{w}_1 \in \R^D$. We will find the remaining basis vectors later.

Let the latent coefficients for each data point associated with the first basis vector be $\tilde{\bold{z}}_1 = [z_{1,1},\dots,z_{N,1}] \in \R^N$.

The reconstruction error is given by:

$$\begin{align} \mathcal{L}(\bold{w}_1,\tilde{\bold{z}}_1) &= \frac{1}{N}\sum_{n=1}^N \|\bold{x}_n - z_{n,1}\bold{w}_1\|_2^2 \\ &= \frac{1}{N}\sum_{n=1}^N (\bold{x}_n - z_{n,1}\bold{w}_1)^\top(\bold{x}_n - z_{n,1}\bold{w}_1) \\ &= \frac{1}{N}\sum_{n=1}^N \left[\bold{x}_n^\top \bold{x}_n - 2 z_{n,1}\bold{w}_1^\top \bold{x}_n + z_{n,1}^2 \bold{w}_1^\top \bold{w}_1\right] \end{align}$$

where $\bold{w}_1^\top \bold{w}_1 = 1$ by the orthonormality assumption.

Taking the derivative w.r.t. $z_{n,1}$ gives:

$$\begin{align} \frac{\partial}{\partial z_{n,1}}\mathcal{L}(\bold{w}_1,\tilde{\bold{z}}_1) &= -2\bold{w}_1^\top\bold{x}_n + 2 z_{n,1} = 0 \\ &\implies z_{n,1} = \bold{w}_1^\top\bold{x}_n \end{align}$$

So the optimal embedding is obtained by orthogonally projecting the data onto $\bold{w}_1$.

Plugging this back into the loss gives:

$$\begin{align} \mathcal{L}(\bold{w}_1,\tilde{\bold{z}}_1^*) &= \frac{1}{N}\sum_{n=1}^N \left[\bold{x}_n^\top\bold{x}_n - z_{n,1}^2\right] \\ &= C - \frac{1}{N}\sum_{n=1}^N z_{n,1}^2 \\ &= C - \frac{1}{N}\sum_{n=1}^N \bold{w}_1^\top \bold{x}_n \bold{x}_n^\top \bold{w}_1 \\ &= C - \bold{w}_1^\top \hat{\Sigma}\bold{w}_1 \end{align}$$

where $\hat{\Sigma}$ is the empirical covariance matrix (since we assumed the data is centered) and $C$ is a constant.

We could trivially minimize this by letting $\|\bold{w}_1\|_2 \rightarrow \infty$, so we impose the constraint $\|\bold{w}_1\|_2 = 1$ and instead optimize:

$$\tilde{\mathcal{L}}(\bold{w}_1,\tilde{\bold{z}}_1^*) = -\bold{w}_1^\top\hat{\Sigma}\bold{w}_1 + \lambda_1(\bold{w}_1^\top \bold{w}_1 - 1)$$

where $\lambda_1$ is a Lagrange multiplier.

Taking derivatives and equating to zero we have:

$$\begin{align} \frac{\partial}{\partial \bold{w}_1}\tilde{\mathcal{L}}(\bold{w}_1,\tilde{\bold{z}}_1^*) &= -2\hat{\Sigma}\bold{w}_1 + 2\lambda_1\bold{w}_1 = 0 \\ &\implies \hat{\Sigma}\bold{w}_1 = \lambda_1 \bold{w}_1 \end{align}$$

Hence, the optimal direction onto which we should project the data is an eigenvector of the covariance matrix.

Left-multiplying by $\bold{w}_1^\top$ and using $\bold{w}_1^\top\bold{w}_1 = 1$, we have:

$$\bold{w}_1^\top \hat{\Sigma} \bold{w}_1 = \lambda_1$$

Hence we can minimize the loss $\mathcal{L}(\bold{w}_1,\tilde{\bold{z}}_1^*) = C - \bold{w}_1^\top\hat{\Sigma}\bold{w}_1$ by picking the eigenvector corresponding to the largest eigenvalue.

20.1.2.2 Optimal weight vector maximizes the variance of the projected data

Before continuing, we observe that, since the data has been centered:

$$\mathbb{E}[z_{n,1}] = \mathbb{E}[\bold{w}_1^\top\bold{x}_n] = \bold{w}_1^\top\mathbb{E}[\bold{x}_n] = 0$$

Hence the variance of the projected data is given by:

$$\mathbb{V}[\tilde{\bold{z}}_1] = \mathbb{E}[z_{n,1}^2] = \frac{1}{N}\sum_{n=1}^N z_{n,1}^2 = -\mathcal{L}(\bold{w}_1) + C$$

From this we see that minimizing the reconstruction error is equivalent to maximizing the variance of the projected data.

This is why it is often said that PCA finds the direction of maximal variance.
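
A quick numerical check of this equivalence (a sketch on synthetic data, not from the book): for any unit vector $\bold{w}$, the 1d reconstruction error plus the variance of the projected data is the same constant $C$, so minimizing one maximizes the other.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))
Xc = X - X.mean(axis=0)                          # centered data

for _ in range(3):                               # try a few random unit directions
    w = rng.normal(size=Xc.shape[1])
    w /= np.linalg.norm(w)
    z = Xc @ w                                   # projections z_n = w^T x_n
    recon_err = np.mean(np.sum((Xc - np.outer(z, w)) ** 2, axis=1))
    proj_var = z.var()                           # = w^T Sigma_hat w (mean of z is 0)
    print(recon_err + proj_var)                  # same constant C for every w
```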


20.1.2.3 Induction step

Now let us find another direction $\bold{w}_2$ to further minimize the reconstruction error, subject to $\bold{w}_1^\top \bold{w}_2 = 0$ and $\bold{w}_2^\top \bold{w}_2 = 1$:

$$\mathcal{L}(\bold{w}_1,\tilde{\bold{z}}_1,\bold{w}_2,\tilde{\bold{z}}_2) = \frac{1}{N}\sum_{n=1}^N \|\bold{x}_n - z_{n,1}\bold{w}_1 - z_{n,2}\bold{w}_2\|_2^2$$

Optimizing w.r.t. $\bold{w}_1$ and $\tilde{\bold{z}}_1$ gives the same solution as before.

We can show that $\frac{\partial \mathcal{L}}{\partial z_{n,2}} = 0$ yields $z_{n,2} = \bold{w}_2^\top\bold{x}_n$.

Substituting this solution yields:

$$\begin{align} \mathcal{L}(\bold{w}_2) &= \frac{1}{N}\sum_{n=1}^N \left[\bold{x}_n^\top \bold{x}_n - \bold{w}_1^\top \bold{x}_n \bold{x}_n^\top \bold{w}_1 - \bold{w}_2^\top \bold{x}_n \bold{x}_n^\top \bold{w}_2\right] \\ &= C - \bold{w}_2^\top \hat{\Sigma}\bold{w}_2 \end{align}$$

Dropping the constant term, plugging in the optimal $\bold{w}_1$, and adding the constraints yields:

$$\tilde{\mathcal{L}}(\bold{w}_2) = -\bold{w}_2^\top\hat{\Sigma}\bold{w}_2 + \lambda_2(\bold{w}_2^\top\bold{w}_2 - 1) + \lambda_{12}(\bold{w}_1^\top\bold{w}_2 - 0)$$

One can show that $\lambda_{12} = 0$, and that the solution is given by the eigenvector with the second largest eigenvalue:

$$\hat{\Sigma}\bold{w}_2 = \lambda_2\bold{w}_2$$

The proof continues in this way, showing that $\hat{W} = U_L$.
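
The following sketch (synthetic data, illustrative names) mirrors the induction step numerically: after removing the component along $\bold{w}_1$, the direction that best explains the residual is the eigenvector with the second largest eigenvalue, assuming the eigenvalues are distinct.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 6)) @ rng.normal(size=(6, 6))
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / len(Xc)

evals, evecs = np.linalg.eigh(Sigma)
w1, w2 = evecs[:, -1], evecs[:, -2]              # top two eigenvectors

R = Xc - np.outer(Xc @ w1, w1)                   # residual after removing the w1 component
Sigma_r = R.T @ R / len(R)                       # "deflated" covariance
top_residual_dir = np.linalg.eigh(Sigma_r)[1][:, -1]
print(np.allclose(abs(top_residual_dir @ w2), 1.0))  # coincides with w2 (up to sign)
```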

20.1.3 Computational issues

20.1.3.1 Covariance matrix vs Correlation matrix

We have been working with the covariance matrix, but it is often better to work with the correlation matrix, because PCA can be “misled” by directions in which the variance is high merely because of the measurement scale.
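
Equivalently, one can standardize (z-score) each feature before running PCA: the covariance of the standardized data is exactly the correlation matrix of the original data. A small NumPy check of this fact (synthetic data):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 0.1, 100.0])  # very different scales

Xs = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize each feature
corr = np.corrcoef(X, rowvar=False)              # correlation matrix of the raw data
cov_Xs = np.cov(Xs, rowvar=False, ddof=0)        # covariance of the standardized data
print(np.allclose(corr, cov_Xs))                 # True: PCA on Xs uses the correlation matrix
```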


20.1.3.2 Dealing with high-dimensional data

We have introduced PCA as finding the eigenvectors of the covariance matrix $\hat{\Sigma} \propto X^\top X \in \R^{D\times D}$.

When $N < D$, it is faster to work with the Gram matrix $XX^\top \in \R^{N\times N}$ instead. We now show how to use it.

Let $U$ be the eigenvectors of $XX^\top$ with corresponding eigenvalues $\Lambda$. By definition, we have:

$$(XX^\top) U = U\Lambda$$

so by left-multiplying by $X^\top$ we get:

$$(X^\top X)(X^\top U) = (X^\top U)\Lambda$$

from which we see that the eigenvectors of $X^\top X$ are $V = X^\top U$, with the same eigenvalues $\Lambda$.

However, these eigenvectors are not normalized, since:

$$\|\bold{v}_j\|^2 = \bold{u}_j^\top X X^\top \bold{u}_j = \lambda_j \bold{u}_j^\top\bold{u}_j = \lambda_j$$

The normalized eigenvectors are given by:

$$V = X^\top U \Lambda^{-1/2}$$

This provides an alternative way to compute the PCA basis; it also allows us to use the kernel trick.
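
Here is a small NumPy sketch (synthetic data) of the trick: eigendecompose the $N \times N$ Gram matrix, map the eigenvectors back with $V = X^\top U \Lambda^{-1/2}$, and check that they match the eigenvectors of $X^\top X$.

```python
import numpy as np

rng = np.random.default_rng(4)
N, D, L = 20, 100, 5                              # N < D regime
X = rng.normal(size=(N, D))
X -= X.mean(axis=0)                               # centered

lam, U = np.linalg.eigh(X @ X.T)                  # Gram matrix (N x N), ascending order
lam, U = lam[::-1][:L], U[:, ::-1][:, :L]         # keep the top-L pairs
V = X.T @ U / np.sqrt(lam)                        # normalized eigenvectors of X^T X

V_direct = np.linalg.eigh(X.T @ X)[1][:, ::-1][:, :L]   # direct (D x D) route
print(np.allclose(np.abs(V.T @ V_direct), np.eye(L), atol=1e-6))  # same directions up to sign
```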

20.1.3.3 Computing PCA using SVD

We now show the equivalence between the PCA computed using eigenvectors and the truncated SVD.

Let $U_\Sigma \Lambda_\Sigma U_\Sigma^\top$ be the top-$L$ eigendecomposition of the covariance matrix $\Sigma \propto X^\top X$ (assuming $X$ is centered).

We know that the optimal estimate of the projection is $W = U_\Sigma$.

Now let $U_X S_X V_X^\top$ be the $L$-truncated SVD approximation of the matrix $X$.

We know that the right singular vectors of $X$ are the eigenvectors of $X^\top X$, so $V_X = U_\Sigma = W$.

In addition, the singular values of $X$ are related to the eigenvalues of $\hat{\Sigma}$ via:

$$\lambda_k = s_k^2 / N$$

Now suppose we are interested in the principal components (the projected points); we have:

$$Z = XW = U_X S_X V_X^\top V_X = U_X S_X$$

Finally, if we want to reconstruct the data:

$$\hat{X} = ZW^\top = U_X S_X V_X^\top$$

This is precisely the truncated SVD approximation of $X$.

Thus, we see that we can perform PCA either using an eigendecomposition of $X^\top X$ or an SVD of $X$. The latter is often preferable for computational reasons.

For very high-dimensional problems, we can use a randomized SVD algorithm.
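
A minimal NumPy sketch of the SVD route (synthetic data; randomized solvers such as scikit-learn's are only mentioned as an option): take the SVD of the centered $X$, read off $W$ from the right singular vectors, and check that $\lambda_k = s_k^2/N$ and $Z = U_X S_X$.

```python
import numpy as np

rng = np.random.default_rng(5)
N, D, L = 300, 8, 3
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
X -= X.mean(axis=0)

U_X, s, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:L].T                                      # principal directions, shape (D, L)
Z = X @ W                                         # scores
print(np.allclose(Z, U_X[:, :L] * s[:L]))         # Z = U_X S_X

lam = np.linalg.eigvalsh(X.T @ X / N)[::-1][:L]   # top eigenvalues of the covariance
print(np.allclose(lam, s[:L] ** 2 / N))           # lambda_k = s_k^2 / N
```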

20.1.4 Choosing the number of latent dimensions

20.1.4.1 Reconstruction error

Let us define the reconstruction error on some dataset $\mathcal{D}$ when using $L$ dimensions:

$$\mathcal{L}_L = \frac{1}{|\mathcal{D}|}\sum_{n\in \mathcal{D}} \|\bold{x}_n - \hat{\bold{x}}_n\|_2^2$$

where the reconstruction is given by $\hat{\bold{x}}_n = W\bold{z}_n + \mu$ and $\bold{z}_n = W^\top(\bold{x}_n - \mu)$.

Figure: reconstruction error vs number of latent dimensions $L$, on the training and test sets.

We see that the training error drops off quite quickly, indicating that we capture most of the empirical correlation with a small number of factors.

Of course, if we choose $L = \mathrm{rank}(X)$, we get zero reconstruction error on the training set. To avoid overfitting, we compute the reconstruction error on the test set.

Here we see that the error continues to decrease even as the model gets more complex. Thus we do not see the typical U-shaped curve familiar from supervised learning.

The problem is that PCA is not a proper generative model of the data: if you give it more latent dimensions, it will always approximate the test data more accurately. A similar problem arises if we plot the reconstruction error of K-means clustering on the test set.
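
The sketch below (synthetic data, not the book's experiment) computes $\mathcal{L}_L$ on a held-out set for increasing $L$ and illustrates the monotone decrease: there is no U-shape to exploit for model selection.

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(20, 50))                     # true 20-dimensional structure
X_tr = rng.normal(size=(500, 20)) @ A + 0.5 * rng.normal(size=(500, 50))
X_te = rng.normal(size=(500, 20)) @ A + 0.5 * rng.normal(size=(500, 50))

mu = X_tr.mean(axis=0)
_, _, Vt = np.linalg.svd(X_tr - mu, full_matrices=False)

for L in [1, 5, 10, 20, 40]:
    W = Vt[:L].T
    Z = (X_te - mu) @ W                           # encode held-out points
    X_hat = Z @ W.T + mu                          # decode
    print(L, np.mean(np.sum((X_te - X_hat) ** 2, axis=1)))  # keeps decreasing with L
```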

20.1.4.2 Scree plot

A common alternative to plotting the reconstruction error vs $L$ is to plot the eigenvalues in decreasing order of magnitude; this is known as a scree plot.

We can show that:

$$\mathcal{L}_L = \sum_{j=L+1}^{D}\lambda_j$$

Thus, as the number of dimensions increases, the eigenvalues get smaller, and so does the reconstruction error.

A related quantity is the fraction of variance explained:

$$F_L = \frac{\sum_{j=1}^L \lambda_j}{\sum_{j=1}^{L^{\mathrm{max}}} \lambda_j}$$

Figure: scree plot of the eigenvalues and the fraction of variance explained.
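
The quantities in a scree plot are easy to compute directly; the sketch below (synthetic data) also verifies the identity $\mathcal{L}_L = \sum_{j>L}\lambda_j$ and computes $F_L$ as a cumulative sum.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 10)) @ rng.normal(size=(10, 10))
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / len(Xc)

lam, evecs = np.linalg.eigh(Sigma)
lam, evecs = lam[::-1], evecs[:, ::-1]            # descending order (scree plot values)

L = 3
W = evecs[:, :L]
recon_err = np.mean(np.sum((Xc - Xc @ W @ W.T) ** 2, axis=1))
print(np.isclose(recon_err, lam[L:].sum()))       # L_L equals the sum of discarded eigenvalues

F = np.cumsum(lam) / lam.sum()                    # fraction of variance explained, F_1..F_D
print(np.round(F, 3))
```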

20.1.4.3 Profile likelihood

Although there is no U-shape in the reconstruction error plot, there is sometimes an “elbow” where the error suddenly changes from relatively large to relatively small.

The idea is that, for $L < L^*$, where $L^*$ is the “true” latent dimensionality, the rate of decrease of the error will be high, whereas for $L > L^*$ the gains will be smaller, since the model is already complex enough to capture the true distribution.

One way to automate the detection of this change in the gradient of the curve is to compute the profile likelihood.

Let $\lambda_L$ be some measure of the error incurred by a model of size $L$, such that $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_{L^{\mathrm{max}}}$. In PCA, these are the eigenvalues, but the method can also be applied to the reconstruction error from K-means clustering.

We partition these values into two groups, depending on whether $k \leq L$ or $k > L$, where $L$ is some threshold.

To measure the quality of $L$, we use a changepoint model such that:

$$\begin{align} p(\lambda_k) &= \mathcal{N}(\lambda_k|\mu_1,\sigma^2) & \mathrm{if}\; k\leq L \\ p(\lambda_k) &= \mathcal{N}(\lambda_k|\mu_2,\sigma^2) & \mathrm{otherwise} \end{align}$$

It is important that $\sigma^2$ be the same in both regimes, to avoid overfitting when one regime has more data than the other. Within each of the two regimes, we assume the $\lambda_k$ are iid, which is obviously incorrect but adequate for our present purposes.

We can fit this model for each $L = 1 : L^{\mathrm{max}}$ by partitioning the data and computing the MLEs, using a pooled estimate of the variance:

$$\begin{align} \mu_1(L) &= \frac{\sum_{k\leq L}\lambda_k}{L} \\ \mu_2(L) &= \frac{\sum_{k>L}\lambda_k}{L^{\mathrm{max}}-L} \\ \sigma^2(L) &= \frac{\sum_{k\leq L}(\lambda_k-\mu_1(L))^2+\sum_{k> L}(\lambda_k-\mu_2(L))^2}{L^{\mathrm{max}}} \end{align}$$

We can then evaluate the profile log likelihood:

$$\ell(L) = \sum_{k\leq L}\log \mathcal{N}(\lambda_k|\mu_1(L),\sigma^2(L)) + \sum_{k>L}\log \mathcal{N}(\lambda_k|\mu_2(L),\sigma^2(L))$$

Figure: profile log likelihood $\ell(L)$ as a function of $L$.

We see that the peak $L^* = \argmax_L \ell(L)$ is well determined.
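
A sketch of the whole procedure in NumPy (the synthetic eigenvalue spectrum and helper names are my own): fit the two-regime model with a pooled variance for each candidate $L$ and return the profile log likelihood.

```python
import numpy as np

def gauss_logpdf(x, mu, sigma2):
    # log density of N(x | mu, sigma2), elementwise
    return -0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2)

def profile_loglik(lam):
    """lam: error measures (e.g. eigenvalues) sorted in decreasing order."""
    K = len(lam)
    ell = np.full(K, -np.inf)
    for L in range(1, K):                         # at least one point in each regime
        mu1, mu2 = lam[:L].mean(), lam[L:].mean()
        sigma2 = (np.sum((lam[:L] - mu1) ** 2) + np.sum((lam[L:] - mu2) ** 2)) / K
        ell[L - 1] = (gauss_logpdf(lam[:L], mu1, sigma2).sum()
                      + gauss_logpdf(lam[L:], mu2, sigma2).sum())
    return ell

rng = np.random.default_rng(8)
lam = np.sort(np.concatenate([[20.0, 15.0, 10.0],            # "signal" eigenvalues
                              0.5 + 0.1 * rng.random(17)]))[::-1]
ell = profile_loglik(lam)
print(np.argmax(ell) + 1)                         # L* = argmax_L ell(L); here 3
```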