Proba ML
7. Linear Algebra

7.1 Introduction

7.1.1 Notation

Vector

A vector $\bold{x}\in \mathbb{R}^n$ is usually written as a column vector:

$$\bold{x}=\begin{bmatrix}x_1 \\ \vdots \\x_n\end{bmatrix}$$

The vector of all ones is $\bold{1}$, the vector of all zeros is $\bold{0}$.

The unit or one-hot vector $\bold{e}_i$ is a vector of all zeros except entry $i$, which is $1$:

$$\bold{e}_i=(0,\dots, 0,1,0,\dots,0)$$

Matrix

A matrix $\bold{A} \in \mathbb{R}^{n\times m}$ is defined similarly. If $n=m$, the matrix is square.

The transpose of a matrix flips its rows and columns: $(A^T)_{ij}=A_{ji}$. Some of its properties:

  • $(A^T)^T=A$
  • $(A+B)^T=A^T+B^T$
  • $(AB)^T=B^TA^T$

A symmetric square matrix satisfies $A^T=A$. The set of all symmetric matrices of size $n$ is denoted $\Bbb S^n$.

Tensor

A tensor is a generalization of a 2D array to an arbitrary number of dimensions.

The rank of a tensor is its number of dimensions.

We can flatten a matrix by stacking its rows one after another (C-contiguous, or row-major order) or by stacking its columns (F-contiguous, or column-major order).

7.1.2 Vector spaces

Spans

We can view a vector $\bold{x} \in \Bbb R^n$ as a point in $n$-dimensional Euclidean space. A vector space is a collection of such vectors, which can be added together and scaled by scalars $c\in\Bbb R$.

A set of vectors $\{\bold{x}_1,\dots,\bold{x}_n\}$ is linearly independent if no vector can be represented as a linear combination of the others.

The span of $\{\bold{x}_1,\dots,\bold{x}_n\}$ is the set of all vectors that can be obtained as linear combinations of them:

$$\mathrm{span}(\{\bold{x}_1,\dots,\bold{x}_n\})\triangleq \Big\{\bold{v}:\bold{v}=\sum_{i=1}^n \alpha_i \bold{x}_i,\ \alpha_i\in\Bbb R \Big\}$$

It can be shown that if $\{\bold{x}_1,\dots,\bold{x}_n\}$ is a set of $n$ linearly independent vectors, each $\bold{x}_i \in \mathbb{R}^n$, then:

$$\mathrm{span}(\{\bold{x}_1,\dots,\bold{x}_n\})=\mathbb{R}^n$$

and therefore any vector of $\mathbb{R}^n$ can be written as a linear combination of this set of vectors.

A basis $\mathcal{B}$ is a set of linearly independent vectors that spans the whole space:

$$\mathrm{span}(\mathcal{B})=\mathbb{R}^n$$

There are often multiple bases to choose from; a standard one is the set of coordinate vectors $\bold{e}_i$.

Linear maps

A linear map is a transformation $f:\mathcal{V}\rightarrow \mathcal{W}$ such that:

  • $f(v+w)=f(v)+f(w),\; \forall v,w\in\mathcal{V}$
  • $f(av)=af(v),\;\forall a\in\mathbb{R},\,v\in\mathcal{V}$

Once a basis of $\mathcal{V}$ is chosen, a linear map is completely determined by the images of the basis vectors, since any element of $\mathcal{V}$ is a linear combination of them.

Suppose $\mathcal{V}=\mathbb{R}^n$ and $\mathcal{W}=\mathbb{R}^m$. We can compute $f(\bold{x}_i)\in\mathbb{R}^m$ for each basis vector $\bold{x}_i$ and store these images as the columns of a matrix $\bold{A} \in \mathbb{R}^{m \times n}$.

We can then compute $\bold{y}=f(\bold{x})\in \mathbb{R}^m$ for any $\bold{x}\in\mathbb{R}^n$ by:

$$\bold{y}=\Big(\sum_{i=1}^n a_{1,i} x_i,\ \dots,\ \sum_{i=1}^n a_{m,i} x_i \Big)=\bold{A}\bold{x}$$

The range of $\bold{A}$ is the set of vectors that can be reached by applying $\bold{A}$:

$$\mathrm{range}(\bold{A}) \triangleq\{\bold{v}\in \mathbb{R}^m:\bold{v}=\bold{Ax},\;\bold{x}\in\mathbb{R}^n\}$$

The nullspace is the set of vectors mapped to zero:

$$\mathrm{nullspace}(\bold{A})\triangleq \{\bold{x}\in \mathbb{R}^n:\bold{Ax=0}\}$$
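
As a minimal NumPy sketch, we can recover orthonormal bases for both subspaces from the SVD: the left singular vectors with non-zero singular values span the range, and the remaining right singular vectors span the nullspace.

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [2., 4., 6.]])          # rank 1, so the nullspace has dimension 2

U, s, Vt = np.linalg.svd(A)
tol = max(A.shape) * np.finfo(float).eps * s.max()
r = int((s > tol).sum())              # numerical rank

range_basis = U[:, :r]                # orthonormal basis of range(A), in R^m
null_basis = Vt[r:].T                 # orthonormal basis of nullspace(A), in R^n

assert np.allclose(A @ null_basis, 0.0)
```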

Linear projection

The projection of $\bold{y} \in \mathbb{R}^m$ onto the span of $\{\bold{x}_1,\dots,\bold{x}_n\}$, with $\bold{x}_i \in\mathbb{R}^m$, is:

$$\bold{v}=\mathrm{Proj}(\bold{y},\{\bold{x}_1,\dots,\bold{x}_n\})=\argmin_{\bold{v} \in \mathrm{span}(\{\bold{x}_1,\dots,\bold{x}_n\})}||\bold{v}-\bold{y}||_2$$

For a full-rank matrix $A\in\mathbb{R}^{m\times n}$ with $m\geq n$, we can define the projection of $\bold{y}\in\mathbb{R}^m$ onto the range of $A$:

$$\bold{v}=\mathrm{Proj}(\bold{y},A)=\argmin_{\bold{v}\in \mathcal{R}(A)}||\bold{v-y}||_2=A(A^\top A)^{-1}A^\top \bold{y}$$
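
A small NumPy sketch of this projection; in practice we solve a least-squares problem rather than explicitly inverting $A^\top A$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))           # full column rank, m >= n
y = rng.normal(size=5)

# Projection of y onto range(A): v = A (A^T A)^{-1} A^T y.
v = A @ np.linalg.solve(A.T @ A, A.T @ y)

# Equivalently (and more stably), solve the least-squares problem A x ~ y.
x, *_ = np.linalg.lstsq(A, y, rcond=None)
assert np.allclose(v, A @ x)

# The residual y - v is orthogonal to the columns of A.
assert np.allclose(A.T @ (y - v), 0.0)
```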

7.1.3 Vector and matrix norms

Vector norms

A norm is any function $f:\mathbb{R}^n\rightarrow \mathbb{R}$ that satisfies the following properties:

  • non-negativity: $f(\bold{x})\geq0,\ \forall \bold{x} \in \mathbb{R}^n$
  • definiteness: $f(\bold{x})=0\iff \bold{x=0}$
  • absolute value homogeneity: $f(t\bold{x})=|t|f(\bold{x}),\ \forall t\in\mathbb{R}$
  • triangle inequality: $f(\bold{x}+\bold{y})\leq f(\bold{x})+f(\bold{y})$

Common norms:

  • $p$-norm: $||\bold{x}||_p=\big(\sum_{i=1}^N |x_i|^p\big)^{1/p}$, for $p\ge 1$

  • $\ell_1$ or Manhattan: $||\bold{x}||_1=\sum^N_{i=1}|x_i|$

  • $\ell_2$ or Euclidean: $||\bold{x}||_2=\sqrt{\sum_{i=1}^N x_i^2}$

    also note the squared Euclidean norm: $||\bold{x}||^2_2=\bold{x}^\top\bold{x}$

  • max-norm or Chebyshev: $||\bold{x}||_\infty=\max_i|x_i|$

  • zero-norm: $||\bold{x}||_0=\sum_{i=1}^N\mathbb{I}(|x_i|>0)$. This is a pseudo-norm, since it does not satisfy homogeneity.
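
A quick NumPy illustration of these vector norms (`np.linalg.norm` implements the $p$-norms directly):

```python
import numpy as np

x = np.array([3., -4., 0.])

print(np.linalg.norm(x, 1))       # l1 / Manhattan norm: 7.0
print(np.linalg.norm(x, 2))       # l2 / Euclidean norm: 5.0
print(np.linalg.norm(x, np.inf))  # max-norm / Chebyshev: 4.0
print(np.count_nonzero(x))        # "zero-norm": number of non-zero entries, 2
print(np.linalg.norm(x, 3))       # general p-norm with p = 3
```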

Matrix norms

i) Induced norms

Suppose we think of a matrix $A\in\mathbb{R}^{m\times n}$ as defining a linear function $f(\bold{x})=A\bold{x}$.

We define the induced norm of $A$ as the maximum amount by which $f$ can lengthen an input:

$$||A||_p=\max_{\bold{x}\ne\bold{0}} \frac{||A\bold{x}||_p}{||\bold{x}||_p}=\max_{||\bold{x}||_p=1}||A\bold{x}||_p$$

In the special case $p=2$ (the Euclidean norm), the induced norm is the spectral norm:

$$||A||_2= \sqrt{\lambda_{\max}(A^\top A)}=\max_i \sigma_i$$

where $\sigma_i$ is the $i$th singular value of $A$.
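
As a NumPy sketch, the three characterizations of the spectral norm agree:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))

sigma = np.linalg.svd(A, compute_uv=False)          # singular values of A
spec1 = sigma.max()                                 # max_i sigma_i
spec2 = np.sqrt(np.linalg.eigvalsh(A.T @ A).max())  # sqrt(lambda_max(A^T A))
spec3 = np.linalg.norm(A, 2)                        # induced 2-norm

assert np.allclose([spec1, spec2], spec3)
```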

ii) Element-wise norms

If we treat an $m\times n$ matrix as a vector of size $mn$, we can define the element-wise $p$-norm:

$$||A||_p=||\mathrm{vec}(A)||_p=\Big(\sum_i\sum_j |a_{ij}|^p\Big)^{1/p}$$

When $p=2$, we get the Frobenius norm:

$$||A||_F=\sqrt{\sum_i \sum_j a_{ij}^2}=\sqrt{\mathrm{tr}(A^\top A)}$$

If evaluating $A$ explicitly is expensive but computing matrix-vector products $A\bold{v}$ is cheap, we can approximate the Frobenius norm with the Hutchinson trace estimator:

$$||A||_F^2=\mathrm{tr}(A^\top A)=\mathbb{E}[\bold{v}^\top A^\top A \bold{v}]=\mathbb{E}[||A\bold{v}||_2^2]$$

with $\bold{v} \sim \mathcal{N}(\bold{0},\bold{I})$.
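
A minimal Monte Carlo sketch of this estimator, where `matvec` is a hypothetical stand-in for a routine that can only compute matrix-vector products:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 50))

def matvec(v):
    # Stand-in for a black-box routine that can only compute A @ v.
    return A @ v

# Monte Carlo estimate of ||A||_F^2 = E[||A v||_2^2], with v ~ N(0, I).
n_samples = 10_000
est = np.mean([np.sum(matvec(rng.normal(size=50)) ** 2) for _ in range(n_samples)])

exact = np.linalg.norm(A, "fro") ** 2
print(est, exact)   # the estimate converges to the exact value as n_samples grows
```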

iii) Schatten norms

The Schatten $p$-norms arise when applying the element-wise $p$-norm to the singular values of a matrix:

$$||A||_p= \Big(\sum_i \sigma_i^p(A)\Big)^{1/p}$$

When $p=1$, we get the nuclear norm (or trace norm):

$$||A||_*=\mathrm{tr}(\sqrt{A^\top A}) =\sum_i \sigma_i=\sum_i|\sigma_i|=||\sigma||_1$$

since $\sigma_i\geq0$.

Using this as a regularizer encourages many singular values to become zero, resulting in a low-rank matrix.

When $p=2$, we get the Frobenius norm.

When $p=\infty$, we get the spectral norm.
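
A NumPy sketch showing the Schatten norms as $p$-norms of the singular-value vector:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))
sigma = np.linalg.svd(A, compute_uv=False)    # singular values of A

nuclear   = sigma.sum()                       # Schatten p=1 (nuclear / trace norm)
frobenius = np.sqrt((sigma ** 2).sum())       # Schatten p=2 (Frobenius norm)
spectral  = sigma.max()                       # Schatten p=inf (spectral norm)

assert np.allclose(nuclear, np.linalg.norm(A, "nuc"))
assert np.allclose(frobenius, np.linalg.norm(A, "fro"))
assert np.allclose(spectral, np.linalg.norm(A, 2))
```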

7.1.4 Properties of a matrix

Trace

The trace of a square matrix $A\in \mathbb{R}^{n\times n}$ is:

$$\mathrm{tr}(A)\triangleq\sum_{i=1}^n a_{ii}=\sum_{i=1}^n \lambda_i$$

where $\lambda_i$ are the eigenvalues of $A$.

We have the following:

  • $\mathrm{tr}(A^\top)=\mathrm{tr}(A)$
  • $\mathrm{tr}(A+B)=\mathrm{tr}(A)+\mathrm{tr}(B)$
  • $\mathrm{tr}(cA)=c\,\mathrm{tr}(A)$
  • $\mathrm{tr}(AB)=\mathrm{tr}(BA)$
  • $\mathrm{tr}(ABC)=\mathrm{tr}(BCA)=\mathrm{tr}(CAB)$ (cyclic permutation property)

This last property allows us to define the trace trick, which reorders a scalar quadratic form as a trace:

$$\bold{x}^\top A\bold{x}=\mathrm{tr}(\bold{x}^\top A \bold{x})=\mathrm{tr}(\bold{x}\bold{x}^\top A)$$

In some cases, evaluating $A$ explicitly might be expensive while computing $A\bold{v}$ is cheap; we can then use the Hutchinson trace estimator:

$$\mathrm{tr}(A)=\mathrm{tr}(A\,\mathbb{E}[\bold{v}\bold{v}^\top])=\mathbb{E}[\mathrm{tr}(A\bold{v}\bold{v}^\top)]=\mathbb{E}[\mathrm{tr}(\bold{v}^\top A \bold{v})]=\mathbb{E}[\bold{v}^\top A \bold{v}]$$

with $\mathbb{E}[\bold{v}\bold{v}^\top]=\bold{I}$ (e.g., $\bold{v}\sim\mathcal{N}(\bold{0},\bold{I})$, or Rademacher vectors).
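
A minimal sketch of the Hutchinson estimator in NumPy, using Rademacher probe vectors (entries $\pm1$, which also satisfy $\mathbb{E}[\bold{v}\bold{v}^\top]=\bold{I}$):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 100))

# Hutchinson estimator: tr(A) = E[v^T A v] with E[v v^T] = I.
n_samples = 5_000
estimates = []
for _ in range(n_samples):
    v = rng.choice([-1.0, 1.0], size=100)
    estimates.append(v @ (A @ v))           # only needs the matrix-vector product A v

print(np.mean(estimates), np.trace(A))      # Monte Carlo estimate vs. exact trace
```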

Determinant

The determinant measures how much a square matrix changes a unit volume when viewed as a linear transformation.

For $A\in \mathbb{R}^{n\times n}$, we have:

  • $|A|=|A^\top|$
  • $|cA|=c^n|A|$
  • $|AB|=|A||B|$
  • $|A|=0$ iff $A$ is singular
  • If $A$ is not singular, $|A^{-1}|=1/|A|$
  • $|A|=\prod_{i=1}^n\lambda_i$, where $\lambda_i$ are the eigenvalues of $A$

For a positive definite matrix $A$, we can write $A=LL^\top$, where $L$ is the lower triangular Cholesky factor. Then:

$$|A|=|LL^\top|=|L|\,|L^\top|=|L|^2$$

so

$$\log |A|= 2 \log |L|= 2 \log \prod_i L_{ii} = 2\, \mathrm{tr}(\log(\mathrm{diag}(L)))$$
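
A NumPy sketch of this numerically stable log-determinant computation via the Cholesky factor:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
A = B @ B.T + 5 * np.eye(5)                 # a positive definite matrix

L = np.linalg.cholesky(A)                   # A = L L^T, with L lower triangular
logdet = 2.0 * np.sum(np.log(np.diag(L)))   # log|A| = 2 sum_i log L_ii

assert np.allclose(logdet, np.linalg.slogdet(A)[1])
```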

Rank

The column (resp. row) rank is the dimension of the space spanned by the columns (resp. rows) of a matrix, and it is a basic fact that they are equal, so we simply write $\mathrm{rank}(A)$. Some properties:

  • For $A \in \mathbb{R}^{n \times m}$, $\mathrm{rank}(A)\leq \min(n,m)$.

    If equality holds, $A$ is said to be full rank (it is rank deficient otherwise).

  • $\mathrm{rank}(A)=\mathrm{rank}(A^\top)=\mathrm{rank}(AA^\top)=\mathrm{rank}(A^\top A)$

  • For $B \in \mathbb{R}^{m\times p}$, $\mathrm{rank}(AB)\leq \min(\mathrm{rank}(A),\mathrm{rank}(B))$

  • For $A,B \in \mathbb{R}^{n\times m}$, $\mathrm{rank}(A+B)\leq \mathrm{rank}(A)+ \mathrm{rank}(B)$

  • A square matrix is invertible iff it is full rank.
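
A short NumPy illustration of some of these rank properties:

```python
import numpy as np

A = np.array([[1., 2.],
              [2., 4.],
              [0., 1.]])                    # the two columns are independent

print(np.linalg.matrix_rank(A))             # 2: full column rank (= min(n, m))
print(np.linalg.matrix_rank(A @ A.T))       # 2: rank(A A^T) = rank(A)

B = np.array([[1., 0.],
              [0., 0.],
              [0., 0.]])                    # rank 1
print(np.linalg.matrix_rank(A + B))         # 2, consistent with rank(A+B) <= rank(A) + rank(B)
```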

Condition number

All norms in what follows are $\ell_2$ norms.

The condition number of a matrix $A$ measures how numerically stable any computation involving $A$ will be:

$$\begin{aligned} \kappa(A)&\triangleq||A||\cdot||A^{-1}||\geq1 \\ &= \sigma_{\max}/\sigma_{\min}=\sqrt{\lambda_{\max}/\lambda_{\min}} \end{aligned}$$

If $\kappa$ is close to 1, the matrix is well-conditioned (and ill-conditioned otherwise).

A large condition number means $A$ is nearly singular. This is a better measure of nearness to singularity than the size of the determinant.

For example, $A=0.1 \bold{I}_{100\times100}$ has $|A|=10^{-100}$ but $\kappa(A)=1$: even though the determinant suggests near-singularity, $A$ is perfectly well conditioned, since $A\bold{x}$ simply scales the entries of $\bold{x}$ by 0.1.
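
Reproducing this example in NumPy:

```python
import numpy as np

A = 0.1 * np.eye(100)

sign, logdet = np.linalg.slogdet(A)         # |A| = 10^{-100}: tiny determinant
print(logdet / np.log(10))                  # ~ -100
print(np.linalg.cond(A))                    # kappa(A) = 1: perfectly conditioned

# By contrast, a genuinely near-singular matrix has a huge condition number.
B = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-12]])
print(np.linalg.cond(B))                    # ~ 4e12
```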

7.1.5 Special types of matrices

Diagonal matrices

All elements are 0 except the diagonal:

$$D=\mathrm{diag}(d_1,\dots,d_n)$$

The notation $\mathrm{diag}(\bold{d})$ converts a vector into a diagonal matrix, and $\mathrm{diag}(A)$ extracts the diagonal of a matrix as a vector.

Identity matrix

The identity matrix $\bold{I}$ has ones on the diagonal and zeros elsewhere, and satisfies the identity property:

$$A\bold{I}=\bold{I}A=A$$

Block diagonal matrix

Non-zero blocks only on the main diagonal, e.g.:

$$\begin{bmatrix} \bold{A} &\bold{0} \\ \bold{0} &\bold{B} \end{bmatrix}$$

Band-diagonal matrix

Non-zero entries only within $k$ bands on either side of the main diagonal.

Triangular matrix

An upper (resp. lower) triangular matrix has non-zero entries only on the diagonal and above (resp. below).

Useful property: the diagonal entries are the eigenvalues, therefore $|L|=\prod_i L_{ii}$.

Positive definite matrices

A symmetric matrix $A\in\mathbb{S}^n$ is positive definite iff, for all non-zero $\bold{x}\in\mathbb{R}^n$:

$$\bold{x}^\top A \bold{x} >0$$

If only the weak inequality $\bold{x}^\top A \bold{x} \geq 0$ holds, we call $A$ positive semi-definite.

The set of all positive definite matrices is denoted $\mathbb{S}^n_{++}$.

A matrix can also be negative definite ($\bold{x}^\top A \bold{x} <0$ for all non-zero $\bold{x}$), or indefinite, meaning there exist $\bold{x}_1$ and $\bold{x}_2$ such that:

$$\bold{x}_1^\top A\bold{x}_1 >0 \;\mathrm{and}\;\bold{x}_2^\top A \bold{x}_2 <0$$

Note that a matrix with all positive entries is not necessarily positive definite, e.g. $\begin{bmatrix}4 & 3 \\ 3 & 2\end{bmatrix}$.

Conversely, a positive definite matrix can have negative entries.

For a symmetric real matrix with positive diagonal entries, a sufficient condition for positive definiteness is that the diagonal coefficients “dominate” their rows (strict diagonal dominance):

$$\forall i,\quad a_{ii}> \sum_{j\neq i}|a_{ij}|$$

The Gram matrix $G=A^\top A$ is always positive semi-definite.

If $A\in\mathbb{R}^{m\times n}$ is full rank and $m\geq n$, then $G$ is positive definite.
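
A common practical test for positive definiteness is to attempt a Cholesky factorization, which only succeeds for positive definite matrices. A small NumPy sketch (the helper `is_positive_definite` is illustrative, not a library function):

```python
import numpy as np

def is_positive_definite(A):
    """Check positive definiteness of a symmetric matrix via Cholesky."""
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

# All-positive entries, but indefinite: returns False.
print(is_positive_definite(np.array([[4., 3.], [3., 2.]])))

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))                 # m >= n, full rank (with probability 1)
print(is_positive_definite(A.T @ A))        # True: Gram matrix of a full-rank A
```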

Orthogonal matrices

  • $\bold{x},\bold{y}\in\mathbb{R}^n$ are orthogonal if $\bold{x}^\top \bold{y}=0$

  • $\bold{x}$ is normalized if $||\bold{x}||_2=1$

  • A set of vectors that is pairwise orthogonal and normalized is called orthonormal

  • $U\in\mathbb{R}^{n\times n}$ is an orthogonal matrix if all its columns are orthonormal (note the difference in terminology)

    If the entries of $U$ are complex-valued, we use the term unitary instead of orthogonal

If $U$ is orthogonal:

$$U^\top U =\bold{I} =U U^\top$$

In other words, $U^\top=U^{-1}$.

One nice property is that multiplication by an orthogonal matrix preserves the norm of a vector:

$$||U\bold{x}||_2=||\bold{x}||_2$$

with $U\in \mathbb{R}^{n \times n}$ orthogonal and $\bold{x} \in \mathbb{R}^n$.

Similarly, the angle between two non-zero vectors is unchanged after transformation by an orthogonal matrix:

$$\cos(\alpha(\bold{x},\bold{y}))=\frac{\bold{x}^\top \bold{y}}{||\bold{x}||\,||\bold{y}||}=\frac{(U\bold{x})^\top (U\bold{y})}{||U\bold{x}||\,||U\bold{y}||}=\cos(\alpha(U\bold{x},U\bold{y}))$$

In summary, orthogonal matrices are generalizations of rotations (if $|U|=1$) and reflections (if $|U|=-1$), since they preserve angles and lengths.
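
A NumPy sketch using a 2D rotation matrix to check these properties:

```python
import numpy as np

theta = 0.3
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # a 2D rotation: orthogonal, |U| = 1

assert np.allclose(U.T @ U, np.eye(2))            # U^T U = I, so U^T = U^{-1}
assert np.allclose(np.linalg.det(U), 1.0)

x, y = np.array([1., 2.]), np.array([-3., 0.5])
assert np.allclose(np.linalg.norm(U @ x), np.linalg.norm(x))   # lengths preserved

cos_before = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
cos_after = (U @ x) @ (U @ y) / (np.linalg.norm(U @ x) * np.linalg.norm(U @ y))
assert np.allclose(cos_before, cos_after)                      # angles preserved
```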