Proba ML
7. Linear Algebra
7.8 Matrix Calculus

7.8 Matrix calculus

7.8.1 Derivatives

For a scalar-argument function f:RRf:\mathbb{R}\rightarrow \mathbb{R}, we define its derivative at a point xx the quantity:

f(x)=limh0f(x+h)f(x)hf'(x)=\lim_{h\rightarrow 0}\frac{f(x+h)-f(x)}{h}

7.8.2 Gradients

We can extend this definition to vector-argument function f:RnRf:\mathbb{R}^n\rightarrow \mathbb{R}, by defining the partial derivative of ff w.r.t i:i:

fxi=limh0f(x+hei)f(x)h\frac{\partial f}{\partial x_i}=\lim_{h\rightarrow 0} \frac{f(\bold{x}+h\bold{e}_i)-f(\bold{x})}{h}

where ei\bold{e}_i is the iith unit vector.

The gradient of a function is its vector of partial derivatives:

g=fx=f=(fx1fxn)g=\frac{\partial f}{\partial \bold{x}}=\nabla f = \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{pmatrix}

Where the operator \nabla maps a function f:RnRf:\mathbb{R}^n\rightarrow \mathbb{R} to another function g:RnRn.g:\mathbb{R}^n\rightarrow \mathbb{R}^n.

The point at which the gradient is evaluated is noted:

g(x)fxxg(\bold{x}^*)\triangleq \frac{\partial f}{\partial \bold{x}}\Bigg |_{\bold{x}^*}

7.8.3 Directional derivative

The directional derivative measure how much f:RnRf:\mathbb{R}^n\rightarrow\mathbb{R} changes along a direction in space v\bold{v}:

Dvf(x)=limh0f(x+vh)f(x)hD_\bold{v}f(\bold{x})=\lim_{h\rightarrow 0}\frac{f(\bold{x}+\bold{v}h)-f(\bold{x})}{h}

Note that:

Dvf(x)=f(x).vD_{\bold{v}}f(\bold{x})=\nabla f(\bold{x}).\bold{v}

7.8.4 Total derivative

Suppose the function has the form f(t,x(t),y(t))f(t,x(t),y(t)), we define the total derivative w.r.t tt as:

dfdt=ft+fxxdt+fyydt\frac{df}{dt}=\frac{\partial f}{\partial t}+\frac{\partial f}{\partial x}\frac{\partial x}{dt}+\frac{\partial f}{\partial y}\frac{\partial y}{dt}

Multiplying by dtdt, we get the total differential:

df=ftdt+fxx+fyydf=\frac{\partial f}{\partial t}dt+\frac{\partial f}{\partial x}\partial x+\frac{\partial f}{\partial y}\partial y

This represents how much ff changes when we change tt.

7.8.5 Jacobian

Consider f:RnRm\bold{f}:\mathbb{R}^n\rightarrow \mathbb{R}^m. The Jacobian of this function is an m×nm \times n matrix of partial derivatives:

Jf(x)=fx(f1x1f1xnfmx1fmxn)=(f1(x)fm(x))J_\bold{f}(\bold{x})=\frac{\partial \bold{f}}{\partial \bold{x}^\top}\triangleq \begin{pmatrix} \frac{\partial f_1}{\partial x_1}&\dots& \frac{\partial f_1}{\partial x_n}\\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1}&\dots& \frac{\partial f_m}{\partial x_n} \end{pmatrix}= \begin{pmatrix} \nabla f_1(\bold{x})^\top \\ \vdots \\ \nabla f_m(\bold{x})^\top \end{pmatrix}

7.8.5.1 Vector Product

The Jacobian vector product (JVP) is right multiplying JRm×nJ\in\mathbb{R}^{m\times n} by vRnv\in\mathbb{R^n}:

Jf(x).v=(f1(x).vfm(x).v)J_\bold{f}(\bold{x}).\bold{v}=\begin{pmatrix} \nabla f_1(\bold{x})^\top.\bold{v} \\ \vdots \\ \nabla f_m(\bold{x})^\top.\bold{v} \end{pmatrix}

Similarly, the vector Jacobian product (VJP) is left multiplying by uRmu\in\mathbb{R}^m:

uJf(x)=u(fx1fx1)=(ufx1fx1)\bold{u}^\top J_{\bold{f}}(\bold{x})= \bold{u}^\top\begin{pmatrix}\frac{\partial \bold{f}}{\partial x_1}&\dots & \frac{\partial \bold{f}}{\partial x_1} \end{pmatrix}= \begin{pmatrix}\bold{u}^\top\frac{\partial \bold{f}}{\partial x_1}&\dots & \frac{\partial \bold{f}}{\partial x_1} \end{pmatrix}

7.8.5.2 Composition of feature

The Jacobian of the composition of two features h(x)=g(f(x))h(\bold{x})=g(f(\bold{x})) is obtained with the chain rule:

Jh(x)=Jg(f(x))Jf(x)J_h(\bold{x})=J_g(f(\bold{x}))J_f(\bold{x})

Let f:RR2f:\mathbb{R}\rightarrow \mathbb{R}^2 and g:R2R2g:\mathbb{R}^2\rightarrow \mathbb{R}^2, we have:

gx=gffx=(g1f1g1f2g2f1g2f2)(f1xf2x)\frac{\partial \bold{g}}{\partial x}=\frac{\partial \bold{g}}{\partial \bold{f}}\frac{\partial \bold{f}}{\partial x}= \begin{pmatrix} \frac{\partial g_1}{\partial f_1} & \frac{\partial g_1}{\partial f_2} \\ \frac{\partial g_2}{\partial f_1} & \frac{\partial g_2}{\partial f_2} \end{pmatrix} \begin{pmatrix} \frac{\partial f_1}{\partial x} \\ \frac{\partial f_2}{\partial x} \end{pmatrix}

7.8.6 Hessian

For f:RnRf:\mathbb{R}^n\rightarrow\mathbb{R} that is twice differentiable, the Hessian is the n×nn\times n symmetric matrix of second partial derivatives:

Hf=2fx2=(2fx122fx1xn2fxnx12fxn2)H_{f}=\frac{\partial^2 f}{\partial \bold{x}^2}= \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \dots & \frac{\partial^2 f}{\partial x_1\partial x_n} \\ & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \dots & \frac{\partial^2 f}{\partial x_n^2} \end{pmatrix}

The Hessian is the Jacobian of the gradient.

7.8.7 Gradients of commonly used functions

7.8.7.2 Functions that map vectors to scalars

axx=abAxx=AbxAxx=(A+A)x\begin{align} \frac{\partial a^\top x}{\partial x}&=a \\ \frac{\partial b^\top A x}{\partial x}&=A^\top b \\ \frac{\partial x^\top Ax}{\partial x}&=(A+A^\top)x \end{align}

7.8.7.3 Functions that map matrices to scalar

Quadratic forms:

aXbX=abaXbX=ba\begin{align} \frac{\partial a^\top Xb}{\partial X}&=ab^\top \\ \frac{\partial a^\top X^\top b}{\partial X}&=ba^\top \end{align}

Traces:

Xtr(AXB)=ABXtr(XA)=AXtr(X1A)=XTAXTXtr(XAX)=(A+A)X\begin{align} \frac{\partial}{\partial X}\mathrm{tr}(AXB)&=A^\top B^\top \\ \frac{\partial}{\partial X}\mathrm{tr}(X^\top A) &=A \\ \frac{\partial}{\partial X}\mathrm{tr}(X^{-1}A)&= -X^{-T}AX^{-T} \\ \frac{\partial }{\partial X} \mathrm{tr}(X^\top AX)&=(A+A^\top)X \end{align}

Determinants:

Xdet(AXB)=det(AXB)XTXln(det(X))=XT\begin{align} \frac{\partial}{\partial X}\mathrm{det}(AXB)&=\mathrm{det}(AXB)X^{-T} \\ \frac{\partial}{\partial X}\ln(\mathrm{det(X))}&=X^{-T} \end{align}