
7.4 Eigenvalue decomposition (EVD)

7.4.1 Basics

For a square matrix $A\in \mathbb{R}^{n\times n}$, $\lambda\in\mathbb{R}$ is an eigenvalue of $A$ if:

$$A\bold{u}=\lambda \bold{u},\quad\bold{u}\neq\bold{0}$$

with $\bold{u}\in \mathbb{R}^n$ the corresponding eigenvector.

Intuitively, multiplying $\bold{u}$ by $A$ results in a vector in the same direction as $\bold{u}$, scaled by $\lambda$.

Since any rescaled vector $c\bold{u}$ is also an eigenvector associated with $\lambda$, we consider eigenvectors to be normalized to length 1.

The equation above gives us:

$$A\bold{u}-\lambda \bold{u}=(A-\lambda I)\bold{u}=\bold{0}$$

Therefore, this has a non-zero solution iff $A-\lambda I$ has a non-trivial nullspace, which is only the case if this matrix is singular:

$$\det(A-\lambda I)=0$$

Properties:

$$\begin{align} \mathrm{tr}(A)&=\sum_{i=1}^n\lambda_i \\ \mathrm{det}(A)&=\prod_{i=1}^n\lambda_i \end{align}$$
  • The rank of $A$ is equal to its number of non-zero eigenvalues
  • If $A$ is non-singular, $1/\lambda_i$ is an eigenvalue of $A^{-1}$ with the same eigenvector: $A^{-1}\bold{u}_i=(1/\lambda_i) \bold{u}_i$
  • The eigenvalues of a triangular matrix are just its diagonal entries.
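These properties are easy to check numerically. Below is a minimal NumPy sketch (my own illustration, not from the original text) verifying the trace, determinant, rank, and triangular-matrix properties on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))

eigvals = np.linalg.eigvals(A)          # eigenvalues (possibly complex)

# trace(A) = sum of eigenvalues, det(A) = product of eigenvalues
assert np.isclose(np.trace(A), eigvals.sum().real)
assert np.isclose(np.linalg.det(A), eigvals.prod().real)

# rank(A) = number of non-zero eigenvalues (a random A is generically full rank)
assert np.linalg.matrix_rank(A) == np.sum(~np.isclose(eigvals, 0))

# the eigenvalues of a triangular matrix are its diagonal entries
T = np.triu(A)
assert np.allclose(np.sort(np.linalg.eigvals(T)), np.sort(np.diag(T)))
```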

7.4.2 Diagonalization

We can write all eigenvector equations simultaneously:

$$AU=U\Lambda$$

with $U\in\mathbb{R}^{n\times n}$ the matrix whose columns are the eigenvectors $\bold{u}_i$, and $\Lambda=\mathrm{diag}(\lambda_1, \dots,\lambda_n)$.

If the eigenvectors are linearly independent, then UU is invertible and:

$$A=U\Lambda U^{-1}$$

A matrix that can be written in this form is called diagonalizable.
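As a small sketch (my own illustration, assuming NumPy), a generic random matrix is diagonalizable and can be reconstructed from its EVD:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))

lam, U = np.linalg.eig(A)                       # columns of U are the eigenvectors
assert np.allclose(A @ U, U @ np.diag(lam))     # AU = UΛ
A_rec = U @ np.diag(lam) @ np.linalg.inv(U)     # A = U Λ U⁻¹
assert np.allclose(A_rec, A)
```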

7.4.3 Eigenvalues of symmetric matrices

When $A$ is real and symmetric, it can be shown that its eigenvalues are real and its eigenvectors can be chosen orthonormal:

$$\begin{align} \bold{u}^\top_i \bold{u}_j&=0, \;i\neq j \\ \bold{u}^\top_i \bold{u}_i&=1 \end{align}$$

In matrix notation: $U^\top U=UU^\top=I$, hence $U$ is an orthogonal matrix.

We can represent AA as:

$$A=U\Lambda U^\top=\sum_{i=1}^n\lambda_i \bold{u}_i\bold{u}_i^\top$$

Thus multiplying a vector by a symmetric matrix $A$ can be viewed as rotating it with $U^\top$, scaling it with $\Lambda$, and rotating back with $U$.

Since $U^\top=U^{-1}$, $A$ is easily invertible (provided all $\lambda_i\neq 0$):

$$A^{-1}=U\Lambda^{-1}U^\top=\sum_{i=1}^n \lambda_i^{-1} \bold{u}_i \bold{u}_i^\top$$
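A minimal NumPy sketch of these identities, using `np.linalg.eigh` for symmetric matrices (the function choice is my assumption, not specified in the text):

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
A = B + B.T                              # make A symmetric

lam, U = np.linalg.eigh(A)               # real eigenvalues, orthonormal eigenvectors
assert np.allclose(U.T @ U, np.eye(4))                   # U is orthogonal
assert np.allclose(A, U @ np.diag(lam) @ U.T)            # A = U Λ Uᵀ
# sum of rank-one terms λ_i u_i u_iᵀ
assert np.allclose(A, sum(l * np.outer(u, u) for l, u in zip(lam, U.T)))
# inverse via the EVD (valid when all eigenvalues are non-zero)
assert np.allclose(np.linalg.inv(A), U @ np.diag(1 / lam) @ U.T)
```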

7.4.4 Geometry of quadratic forms

A symmetric $A$ is positive definite iff $\lambda_i>0$ for all $i\in\{1,\dots,n\}$. Indeed, with the change of variable $\bold{y}=U^\top\bold{x}$:

$$f(\bold{x})=\bold{x}^\top A\bold{x} =\bold{x}^\top U \Lambda U^\top \bold{x}=\bold{y}^\top \Lambda \bold{y} =\sum_{i=1}^n \lambda_i y_i^2$$

$f(\bold{x})$ is called a quadratic form.

In 2D, the level sets of the quadratic form satisfy:

$$\lambda_1 y^2_1 +\lambda_2y^2_2=r$$

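Below is a short sketch (my own illustration) that checks positive definiteness via the eigenvalues and evaluates the quadratic form in the rotated coordinates $\bold{y}=U^\top\bold{x}$:

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((2, 2))
A = B @ B.T + 0.1 * np.eye(2)            # symmetric positive definite by construction

lam, U = np.linalg.eigh(A)
assert np.all(lam > 0)                   # all eigenvalues > 0 => positive definite

x = rng.standard_normal(2)
y = U.T @ x                              # rotated coordinates
f_direct = x @ A @ x                     # xᵀAx
f_rotated = np.sum(lam * y**2)           # Σ_i λ_i y_i²
assert np.isclose(f_direct, f_rotated)
```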

7.4.5 Standardizing and whitening data

We can standardize $X\in\mathbb{R}^{N\times D}$ so that each column has a mean of 0 and a variance of 1. To also remove the correlations between the columns, we need to whiten the data.

Let $\Sigma=\frac{1}{N}X^\top X=EDE^\top$ be the empirical covariance matrix and its diagonalization.

We can define the PCA whitening matrix as:

$$W_{pca}=D^{-1/2}E^\top$$

Let $\bold{y}=W_{pca} \bold{x}$; we can check that:

$$\mathrm{Cov}[\bold{y}]=W\,\mathbb{E}[\bold{x}\bold{x}^\top]W^\top=W \Sigma W^\top=D^{-1/2}E^\top (EDE^\top)E D^{-1/2}=I$$

The whitening matrix is not unique, since any rotation $W=RW_{pca}$, with $R$ an orthogonal matrix, will maintain the whitening property $\Sigma^{-1} =W^\top W$.

For example, with $R=E$:

$$W_{zca}=ED^{-1/2}E^\top=\Sigma^{-1/2}=VS^{-1}V^\top$$

with $U,S,V$ the SVD decomposition of $X$.

This is called Mahalanobis whitening or ZCA (zero-phase component analysis). The advantage of ZCA over PCA is that the whitened data stays as close as possible to the original data, in the least squares sense.

When applied to images, the ZCA vectors still look like images.
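A sketch of both whitening transforms on synthetic data, following the formulas above (the variable names and the use of NumPy are assumptions of this illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
N, D = 1000, 3
X = rng.standard_normal((N, D)) @ rng.standard_normal((D, D))  # correlated columns
X = X - X.mean(axis=0)                   # center the columns

Sigma = X.T @ X / N                      # empirical covariance
D_eig, E = np.linalg.eigh(Sigma)         # Σ = E D Eᵀ

W_pca = np.diag(D_eig ** -0.5) @ E.T     # PCA whitening: D^{-1/2} Eᵀ
W_zca = E @ np.diag(D_eig ** -0.5) @ E.T # ZCA whitening: E D^{-1/2} Eᵀ = Σ^{-1/2}

for W in (W_pca, W_zca):
    Y = X @ W.T                          # whiten each row: y = W x
    assert np.allclose(Y.T @ Y / N, np.eye(D))   # Cov[y] = I
```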


7.4.6 Power method

This method computes the eigenvector associated with the largest eigenvalue. It is useful for massive matrices, such as the transition matrices used in Google PageRank to find the stationary distribution.

Let $A$ be a matrix with orthonormal eigenvectors $\bold{u}_i$ and eigenvalues $|\lambda_1|>|\lambda_2|\geq\dots\geq|\lambda_m|\geq0$.

Let $v_{0}$ be an arbitrary vector in the range of $A$, such that $A\bold{x}=v_{0}$ for some $\bold{x}$.

We have:

$$v_{0}=U(\Lambda U^\top \bold{x})=a_1 \bold{u}_1 +\dots+a_m\bold{u}_m$$

By multiplying with $A$ and renormalizing at each iteration:

$$v_{t}\propto Av_{t-1}$$

we obtain:

$$\begin{align} v_{t}&\propto a_1 \lambda_1^t\bold{u}_1+\dots+a_m\lambda_m^t\bold{u}_m \\ &= \lambda_1^t\Big(a_1\bold{u}_1+\dots+a_m\Big(\frac{\lambda_m}{\lambda_1}\Big)^t\bold{u}_m\Big) \\ &\rightarrow \lambda_1^t a_1 \bold{u}_1 \end{align}$$

To compute $\lambda_1$, we define the Rayleigh quotient:

$$R(A,\bold{u}_i)\triangleq\frac{\bold{u}_i^\top A \bold{u}_i}{\bold{u}_i^\top \bold{u}_i}=\frac{\lambda_i\bold{u}^\top_i\bold{u}_i}{\bold{u}_i^\top\bold{u}_i}=\lambda_i$$

Therefore in our case:

$$R(A,v_{t})\rightarrow\lambda_1$$
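A minimal implementation sketch of the power method with a Rayleigh-quotient estimate of $\lambda_1$ (the function name and the use of NumPy are my own assumptions):

```python
import numpy as np

def power_method(A, num_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = A @ v                        # v_t ∝ A v_{t-1}
        v /= np.linalg.norm(v)           # renormalize at each iteration
    lam = v @ A @ v / (v @ v)            # Rayleigh quotient R(A, v_t)
    return lam, v

rng = np.random.default_rng(5)
B = rng.standard_normal((5, 5))
A = B + B.T                              # symmetric, so eigenvectors are orthonormal

lam1, u1 = power_method(A)
lam_true = np.linalg.eigvalsh(A)
assert np.isclose(lam1, lam_true[np.argmax(np.abs(lam_true))])  # largest |λ|
```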

7.4.7 Deflation

Suppose we have computed $\lambda_1$ and $\bold{u}_1$ using the power method; we now want to compute the subsequent $\lambda_i$.

We project out the $\bold{u}_1$ component from $A^{(1)}\triangleq A$:

$$A^{(2)}=(I-\bold{u}_1\bold{u}_1^\top)A^{(1)}=A^{(1)}-\lambda_1 \bold{u}_1\bold{u}_1^\top$$

This is called matrix deflation. We can now run the power method on $A^{(2)}$ to find $\lambda_2$, the largest eigenvalue on the subspace orthogonal to $\bold{u}_1$.

Deflation can be used to implement PCA, by computing the first $K$ eigenvectors of the covariance matrix.
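Here is a sketch of deflation on top of power iteration to recover the top-$K$ eigenpairs of a PSD matrix (function names are illustrative assumptions):

```python
import numpy as np

def power_method(A, num_iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = A @ v
        v /= np.linalg.norm(v)
    return v @ A @ v, v                  # Rayleigh quotient, eigenvector

def top_k_eigenpairs(A, K):
    A = A.copy()
    lams, us = [], []
    for _ in range(K):
        lam, u = power_method(A)
        lams.append(lam)
        us.append(u)
        A = A - lam * np.outer(u, u)     # deflation: A⁽ᵏ⁺¹⁾ = A⁽ᵏ⁾ - λ_k u_k u_kᵀ
    return np.array(lams), np.stack(us, axis=1)

rng = np.random.default_rng(6)
B = rng.standard_normal((6, 6))
A = B @ B.T                              # PSD, like a covariance matrix

lams, U = top_k_eigenpairs(A, K=3)
assert np.allclose(lams, np.linalg.eigvalsh(A)[::-1][:3], atol=1e-6)
```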

7.4.8 Eigenvectors optimize quadratic forms

Consider the following constrained optimization problem:

$$\max_{\bold{x}\in\mathbb{R}^n}\;\bold{x}^\top A\bold{x}, \quad\mathrm{subject\;to\;}\|\bold{x}\|^2_2=1$$

for a symmetric matrix $A\in\mathbb{S}^n$.

A standard way to solve a constrained optimization problem is by forming the Lagrangian:

$$\mathcal{L}(\bold{x},\lambda)=\bold{x}^\top A\bold{x}+\lambda(1-\bold{x}^\top \bold{x})$$

Therefore:

$$\nabla_{\bold{x}}\mathcal{L}(\bold{x},\lambda)=2A^\top\bold{x}-2\lambda \bold{x}=0$$

So the problem reduces to $A\bold{x}=\lambda \bold{x}$: the stationary points are the eigenvectors of $A$, and the maximum of $\bold{x}^\top A\bold{x}$ subject to $\bold{x}^\top\bold{x}=1$ is attained at the eigenvector associated with the largest eigenvalue.
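As a quick numerical check (my own sketch), the maximum of the quadratic form over the unit sphere equals the largest eigenvalue and is attained at its eigenvector:

```python
import numpy as np

rng = np.random.default_rng(7)
B = rng.standard_normal((4, 4))
A = B + B.T                              # symmetric

lam, U = np.linalg.eigh(A)               # eigenvalues in ascending order
x_star = U[:, -1]                        # eigenvector of the largest eigenvalue
assert np.isclose(x_star @ A @ x_star, lam[-1])

# random unit vectors never beat the top eigenvector
xs = rng.standard_normal((10000, 4))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
assert np.all(np.einsum('ij,jk,ik->i', xs, A, xs) <= lam[-1] + 1e-9)
```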