3. Probability: Multivariate Models

3.1 Joint distributions for multiple random variables

$$\mathrm{Cov}[X,Y]=\mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])^\top]=\mathbb{E}[XY^\top]-\mathbb{E}[X]\mathbb{E}[Y]^\top$$

If $\bold{x}\in \mathbb{R}^D$:

$$\mathrm{Cov}[\bold{x}]=\mathbb{E}[(\bold{x}-\mathbb{E}[\bold{x}])(\bold{x}-\mathbb{E}[\bold{x}])^\top]=\bold{\Sigma}$$

$$\mathbb{E}[\bold{x}\bold{x}^\top]=\bold{\Sigma}+\bold{\mu}\bold{\mu}^\top$$

$$\mathrm{Cov}[\bold{Ax}+\bold{b}]=\bold{A\Sigma A^\top}$$

$$\rho=\mathrm{corr}(X,Y)=\frac{\mathrm{Cov}(X,Y)}{\sigma_X\sigma_Y}$$
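As a quick sanity check, here is a minimal NumPy sketch (the covariance matrix `true_Sigma` and the transform `A`, `b` are assumed values chosen for illustration) verifying $\mathrm{Cov}[\bold{Ax}+\bold{b}]=\bold{A\Sigma A^\top}$ and the correlation formula on sampled data:

```python
import numpy as np

rng = np.random.default_rng(0)
true_Sigma = np.array([[1.0, 0.6, 0.2],
                       [0.6, 2.0, 0.0],
                       [0.2, 0.0, 0.5]])
X = rng.multivariate_normal(np.zeros(3), true_Sigma, size=5000)

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])
b = np.array([0.5, -0.5])

Sigma = np.cov(X, rowvar=False)          # empirical Cov[x]
Y = X @ A.T + b                          # affine transform Ax + b
Sigma_Y = np.cov(Y, rowvar=False)

# Cov[Ax + b] = A Sigma A^T (holds exactly for the empirical covariance too)
print(np.allclose(Sigma_Y, A @ Sigma @ A.T))

# correlation between the first two coordinates
rho = Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])
print(rho, np.corrcoef(X, rowvar=False)[0, 1])
```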

3.2 Multivariate Gaussian (MVN)

3.2.2 Mahalanobis distance

pdf of the MVN:

$$\mathcal{N}(\bold{y}|\bold{\mu},\bold{\Sigma})=\frac{1}{(2\pi)^{D/2}|\bold{\Sigma}|^{1/2}}\exp\Big[-\frac{1}{2}(\bold{y}-\bold{\mu})^\top\bold{\Sigma}^{-1}(\bold{y}-\bold{\mu})\Big]$$

with $\bold{\mu}\in\mathbb{R}^D$ and $\bold{\Sigma}\in\mathbb{R}^{D\times D}$.
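A small sketch, using assumed values for $\bold{\mu}$, $\bold{\Sigma}$ and $\bold{y}$, that evaluates the density formula above directly and checks it against `scipy.stats.multivariate_normal`:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
y = np.array([1.0, 0.0])

D = len(mu)
diff = y - mu
quad = diff @ np.linalg.solve(Sigma, diff)      # (y - mu)^T Sigma^{-1} (y - mu)
pdf_manual = np.exp(-0.5 * quad) / ((2 * np.pi) ** (D / 2)
                                    * np.sqrt(np.linalg.det(Sigma)))

print(pdf_manual, multivariate_normal(mu, Sigma).pdf(y))
```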

In 2D the pdf becomes the bivariate Gaussian distribution, with covariance matrix:

$$\bold{\Sigma}=\begin{bmatrix}\sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2\end{bmatrix}$$

with $\rho=\frac{\sigma_{12}}{\sigma_1\sigma_2}$, where $\sigma_{12}=\mathrm{Cov}(Y_1,Y_2)$.

Mahalanobis distance:

$$\Delta^2=(\bold{y}-\bold{\mu})^\top\bold{\Sigma}^{-1}(\bold{y}-\bold{\mu})$$

Eigendecomposition of the inverse Cov matrix:

$$\bold{\Sigma}^{-1}=\sum_{i=1}^D\frac{1}{\lambda_i}\bold{u}_i\bold{u}_i^\top$$

We can express the Mahalanobis distance through a transformation:

$$(\bold{y}-\bold{\mu})^\top\bold{\Sigma}^{-1}(\bold{y}-\bold{\mu})=(\bold{y}-\bold{\mu})^\top\Big(\sum_{i=1}^D\frac{1}{\lambda_i}\bold{u}_i\bold{u}_i^\top\Big)(\bold{y}-\bold{\mu})=\sum_{i=1}^D\frac{z_i^2}{\lambda_i}$$

with $\bold{z}=\bold{U}(\bold{y}-\bold{\mu})$, i.e. $z_i=\bold{u}_i^\top(\bold{y}-\bold{\mu})$.

Thus the Mahalanobis distance can be seen as a Euclidean distance computed in a transformed coordinate system: shift by $\bold{\mu}$, rotate by $\bold{U}$, and scale by the eigenvalues $\lambda_i$.
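A minimal sketch (parameters assumed for illustration) that computes $\Delta^2$ both from the direct quadratic form and from the eigendecomposition form $\sum_i z_i^2/\lambda_i$:

```python
import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])
y = np.array([2.0, -1.0])

# direct quadratic form
d2_direct = (y - mu) @ np.linalg.solve(Sigma, y - mu)

# eigendecomposition form: z_i = u_i^T (y - mu), distance = sum_i z_i^2 / lambda_i
lam, U = np.linalg.eigh(Sigma)        # columns of U are the eigenvectors u_i
z = U.T @ (y - mu)
d2_eig = np.sum(z ** 2 / lam)

print(d2_direct, d2_eig)              # equal up to floating-point error
```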

3.2.3 Marginals and conditionals

Let $\bold{y}=(\bold{y}_1,\bold{y}_2)$ be jointly Gaussian.

Its parameters are:

$$\bold{\mu}=\begin{bmatrix}\bold{\mu}_1 \\ \bold{\mu}_2\end{bmatrix}\quad\text{and}\quad\bold{\Sigma}=\begin{bmatrix}\bold{\Sigma}_{11} & \bold{\Sigma}_{12} \\ \bold{\Sigma}_{21} & \bold{\Sigma}_{22}\end{bmatrix}$$

Marginals are given by:

$$p(\bold{y}_1)=\mathcal{N}(\bold{y}_1|\bold{\mu}_1,\bold{\Sigma}_{11}) \\ p(\bold{y}_2)=\mathcal{N}(\bold{y}_2|\bold{\mu}_2,\bold{\Sigma}_{22})$$

Posterior conditional is given by:

$$p(\bold{y}_1|\bold{y}_2)=\mathcal{N}(\bold{y}_1|\bold{\mu}_{1|2},\bold{\Sigma}_{1|2}) \\ \bold{\mu}_{1|2}=\bold{\mu}_1+\bold{\Sigma}_{12}\bold{\Sigma}_{22}^{-1}(\bold{y}_2-\bold{\mu}_2) \\ \bold{\Sigma}_{1|2}=\bold{\Sigma}_{11}-\bold{\Sigma}_{12}\bold{\Sigma}_{22}^{-1}\bold{\Sigma}_{21}$$

Example in the 2d case:

$$\bold{\Sigma}=\begin{bmatrix}\sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2\end{bmatrix}$$

Suppose we observe $Y_2=y_2$; the conditional is obtained by slicing the joint distribution along the $Y_2=y_2$ line:

$$p(y_1|y_2)=\mathcal{N}\Big(y_1\,\Big|\,\mu_1+\frac{\rho\sigma_1\sigma_2}{\sigma_2^2}(y_2-\mu_2),\ \sigma_1^2-\frac{(\rho\sigma_1\sigma_2)^2}{\sigma_2^2}\Big)$$
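A short sketch with assumed values for $\mu$, $\rho$, $\sigma_1$, $\sigma_2$ and the observation $y_2$, applying the conditioning formulas above in the 2d case:

```python
import numpy as np

mu = np.array([1.0, 2.0])
rho, s1, s2 = 0.8, 1.0, 2.0
Sigma = np.array([[s1**2,         rho * s1 * s2],
                  [rho * s1 * s2, s2**2        ]])
y2 = 3.0                                   # observed value of Y_2

# block formulas specialised to 2d
mu_1_given_2 = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (y2 - mu[1])
var_1_given_2 = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]

print(mu_1_given_2, var_1_given_2)
```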

3.3 Linear Gaussian System

Let $\bold{z}\in\mathbb{R}^L$ be an unknown vector of values, and $\bold{y}\in\mathbb{R}^D$ some noisy measurements of $\bold{z}$.

These variables are related as follows:

$$p(\bold{z})=\mathcal{N}(\bold{z}|\bold{\mu}_z,\bold{\Sigma}_z)$$

$$p(\bold{y}|\bold{z})=\mathcal{N}(\bold{y}|\bold{Wz}+\bold{b},\bold{\Sigma}_y)$$

with $\bold{W}\in\mathbb{R}^{D\times L}$.

The joint distribution of this linear Gaussian system is an $(L+D)$-dimensional Gaussian:

$p(\bold{z},\bold{y})=p(\bold{z})p(\bold{y}|\bold{z})$, with mean and covariance given by

$$\bold{\mu}=\begin{bmatrix}\bold{\mu}_z \\ \bold{W\mu}_z+\bold{b}\end{bmatrix} \\ \bold{\Sigma}=\begin{bmatrix}\bold{\Sigma}_z & \bold{\Sigma}_z\bold{W}^\top \\ \bold{W\Sigma}_z & \bold{\Sigma}_y+\bold{W\Sigma}_z\bold{W}^\top\end{bmatrix}$$

Bayes' rule for Gaussians gives the posterior over the latent as:

$$p(\bold{z}|\bold{y})=\mathcal{N}(\bold{z}|\bold{\mu}_{z|y},\bold{\Sigma}_{z|y})$$

$$\bold{\Sigma}_{z|y}^{-1}=\bold{\Sigma}_z^{-1}+\bold{W}^\top\bold{\Sigma}_y^{-1}\bold{W}$$

$$\bold{\mu}_{z|y}=\bold{\Sigma}_{z|y}\big[\bold{W}^\top\bold{\Sigma}_y^{-1}(\bold{y}-\bold{b})+\bold{\Sigma}_z^{-1}\bold{\mu}_z\big]$$
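The following sketch (with an assumed $\bold{W}$, $\bold{b}$, noise level and measurement $\bold{y}$) implements the precision-form update above to get $p(\bold{z}|\bold{y})$:

```python
import numpy as np

mu_z = np.zeros(2)                          # prior mean over z (L = 2)
Sigma_z = np.eye(2)                         # prior covariance
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])                  # D x L observation matrix (D = 3)
b = np.zeros(3)
Sigma_y = 0.1 * np.eye(3)                   # measurement noise covariance
y = np.array([0.9, -0.2, 0.8])              # one noisy measurement

# posterior precision and covariance
prec_post = np.linalg.inv(Sigma_z) + W.T @ np.linalg.inv(Sigma_y) @ W
Sigma_post = np.linalg.inv(prec_post)

# posterior mean
mu_post = Sigma_post @ (W.T @ np.linalg.inv(Sigma_y) @ (y - b)
                        + np.linalg.inv(Sigma_z) @ mu_z)

print(mu_post)
print(Sigma_post)
```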

3.4 The exponential family

The exponential family consists of distributions whose pdf (or pmf) can be written as:

$$p(\bold{y}|\bold{\phi})=h(\bold{y})\exp\big[f(\bold{\phi})^\top\mathcal{T}(\bold{y})-A(f(\bold{\phi}))\big]$$

When $f(\bold{\phi})=\bold{\phi}$ (canonical form) and $\mathcal{T}(\bold{y})=\bold{y}$ (natural exponential family), this becomes:

$$p(\bold{y}|\bold{\phi})=h(\bold{y})\exp\big[\bold{\phi}^\top\bold{y}-A(\bold{\phi})\big]$$

The first and second cumulants of a distribution are $\mathbb{E}[Y]$ and $\mathbb{V}[Y]$, whereas its first and second moments are $\mathbb{E}[Y]$ and $\mathbb{E}[Y^2]$.

An important property of the exponential family is that the derivatives of the log partition function yield the cumulants of the sufficient statistics:

$$\nabla A(\bold{\phi})=\mathbb{E}[\mathcal{T}(\bold{y})]$$

$$\nabla^2 A(\bold{\phi})=\mathrm{Cov}[\mathcal{T}(\bold{y})]$$

Since the Hessian of $A$ is positive definite, $A(\bold{\phi})$ is convex; the log likelihood has the form $\bold{\phi}^\top\mathcal{T}(\bold{y})-A(\bold{\phi})$ (linear minus convex), so it is concave and hence the MLE has a unique global maximum.
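As a concrete illustration, here is a sketch for the Bernoulli distribution written in natural-parameter form (the choice of Bernoulli and the numerical differencing are assumptions made purely for illustration), where $A(\phi)=\log(1+e^\phi)$ and the derivatives of $A$ recover $\mathbb{E}[y]=\mu$ and $\mathbb{V}[y]=\mu(1-\mu)$:

```python
import numpy as np

mu = 0.3
phi = np.log(mu / (1 - mu))                 # natural parameter of Ber(mu)

A = lambda p: np.log1p(np.exp(p))           # log partition function A(phi)
eps = 1e-4
dA = (A(phi + eps) - A(phi - eps)) / (2 * eps)              # ~ grad A(phi)
d2A = (A(phi + eps) - 2 * A(phi) + A(phi - eps)) / eps**2   # ~ Hessian of A

print(dA, mu)                # first cumulant:  E[y] = mu
print(d2A, mu * (1 - mu))    # second cumulant: V[y] = mu (1 - mu)
```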

3.5 Mixture models

With $K$ components, the distribution has the form:

$$p(\bold{y}|\bold{\theta})=\sum_{k=1}^K\pi_k\,p_k(\bold{y})$$

with $\sum_{k=1}^K\pi_k=1$.

This model can be expressed as a hierarchical model in which we introduce a latent variable $z\in\{1,...,K\}$ that specifies which distribution to use to model $\bold{y}$.

By marginalizing out $z$:

$$p(\bold{y}|\bold{\theta})=\sum_{k=1}^K p(\bold{y}|z=k,\bold{\theta})\,p(z=k|\bold{\theta})=\sum_{k=1}^K \pi_k\,p(\bold{y}|\bold{\theta}_k)$$

Gaussian mixture:

$$p(\bold{y}|\bold{\theta})=\sum_k\pi_k\,\mathcal{N}(\bold{y}|\bold{\mu}_k,\bold{\Sigma}_k)$$

Using a GMM involves two phases:

  1. Fit the model, i.e. compute the MLE $\hat{\theta}=\mathrm{argmax}_{\theta}\log p(\mathcal{D}|\theta)$
  2. Associate each $\bold{y}_n\in\mathcal{D}$ with a latent variable $z_n\in\{1,...,K\}$ specifying the identity of the mixture component (or cluster) used to generate $\bold{y}_n$.

    We can compute the posterior over these latent identities (see the code sketch below):

    $$p(z_n=k|\bold{y}_n,\bold{\theta})=\frac{p(\bold{y}_n|z_n=k,\bold{\theta})\,p(z_n=k|\bold{\theta})}{\sum_{k'}p(\bold{y}_n|z_n=k',\bold{\theta})\,p(z_n=k'|\bold{\theta})}$$

    $$\hat{z}_n=\operatorname{argmax}_k\big[\log p(\bold{y}_n|z_n=k,\bold{\theta})+\log p(z_n=k|\bold{\theta})\big]$$

With a uniform prior over $z_n$ and spherical Gaussians with $\bold{\Sigma}_k=\bold{I}$, the problem becomes K-means clustering:

$$z_n=\operatorname{argmin}_k\|\bold{y}_n-\hat{\bold{\mu}}_k\|_2^2$$
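The sketch referenced above (mixture parameters assumed for illustration) computes the responsibilities $p(z_n=k|\bold{y}_n,\bold{\theta})$, the MAP assignment $\hat{z}_n$, and the K-means special case:

```python
import numpy as np
from scipy.stats import multivariate_normal

pis = np.array([0.5, 0.5])                      # mixing weights pi_k
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.eye(2)]
y = np.array([2.5, 2.0])                        # one observation y_n

lik = np.array([multivariate_normal(m, S).pdf(y) for m, S in zip(mus, Sigmas)])
resp = pis * lik / np.sum(pis * lik)            # p(z_n = k | y_n, theta)
z_hat = np.argmax(np.log(pis) + np.log(lik))    # MAP cluster assignment

# K-means special case: uniform prior and Sigma_k = I reduce the MAP rule
# to the nearest-centroid rule
z_kmeans = np.argmin([np.sum((y - m) ** 2) for m in mus])
print(resp, z_hat, z_kmeans)
```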

Bernoulli Mixture models

$$p(\bold{y}|z=k,\bold{\theta})=\prod_{d=1}^{D}\mathrm{Ber}(y_d|\mu_{dk})=\prod_{d=1}^{D}\mu_{dk}^{y_d}(1-\mu_{dk})^{1-y_d}$$

where $\mu_{dk}$ is the probability that bit $d$ turns on in cluster $k$.
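A tiny sketch (with assumed per-bit probabilities $\mu_{dk}$) evaluating this class-conditional likelihood for one binary vector under one component:

```python
import numpy as np

mu_k = np.array([0.9, 0.1, 0.8])    # per-bit "on" probabilities for cluster k
y = np.array([1, 0, 1])             # one binary observation

lik = np.prod(mu_k ** y * (1 - mu_k) ** (1 - y))   # p(y | z = k, theta)
print(lik)
```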

3.6 Probabilistic Graphical Models

When the model is a Directed Acyclic Graph (DAG), we call it a Bayesian network (even though there is nothing inherently Bayesian about it).

Ordered Markov property: a node is conditionally independent of its other predecessors given its parents:

$$Y_i \perp \bold{Y}_{\mathrm{pred}(i)\setminus\mathrm{parents}(i)}\mid \bold{Y}_{\mathrm{parents}(i)}$$

The joint distribution can be represented as:

$$p(\bold{Y}_{1:V})=\prod_{i=1}^V p(Y_i|\bold{Y}_{\mathrm{parents}(i)})$$

For a Markov chain (autoregressive model of order 1), the function $p(y_t|y_{t-1})$ is called the transition function or Markov kernel:

$$p(\bold{y}_{1:T})=p(y_1)\prod_{t=2}^T p(y_t|y_{t-1})$$

Generalization to order $M$:

$$p(\bold{y}_{1:T})=p(\bold{y}_{1:M})\prod_{t=M+1}^T p(y_t|\bold{y}_{t-M:t-1})$$
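A minimal sketch that samples $\bold{y}_{1:T}$ from a first-order Markov chain, using an assumed 2-state initial distribution and transition matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
p0 = np.array([0.6, 0.4])               # initial distribution p(y_1)
P = np.array([[0.9, 0.1],               # transition kernel p(y_t | y_{t-1})
              [0.3, 0.7]])
T = 10

y = [rng.choice(2, p=p0)]               # sample y_1
for _ in range(T - 1):
    y.append(rng.choice(2, p=P[y[-1]])) # sample y_t | y_{t-1}
print(y)
```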

If the parameters of the Conditional Probability Distributions (CPDs) are unknown, we can view them as additional random variables and treat them as hidden variables to be inferred.

$\theta \sim p(\theta)$: some unspecified prior over the parameters

$\bold{y}_n \sim p(\bold{y}|\theta)$: some specified likelihood

The joint distribution has the form:

$p(\mathcal{D},\theta)=p(\theta)p(\mathcal{D}|\theta)$, where $\mathcal{D}=(\bold{y}_1,...,\bold{y}_N)$ and

$$p(\mathcal{D}|\theta)=\prod_{n=1}^N p(\bold{y}_n|\theta)$$
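As a sketch, assuming a Beta prior and a Bernoulli likelihood purely for illustration, the log of this joint splits into the log prior plus a sum of per-datapoint log likelihoods:

```python
import numpy as np
from scipy.stats import beta, bernoulli

theta = 0.7                                  # a single parameter value
data = np.array([1, 0, 1, 1])                # D = (y_1, ..., y_N)

# log p(D, theta) = log p(theta) + sum_n log p(y_n | theta)
log_joint = beta(2, 2).logpdf(theta) + bernoulli(theta).logpmf(data).sum()
print(log_joint)
```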

As an example, the joint distribution of a GMM with priors on its parameters (usually drawn as a plate diagram) factorizes as:

$$p(\bold{y}_{1:N},z_{1:N},\bold{\theta})=p(\bold{\pi})\Big[\prod_{k=1}^K p(\bold{\mu}_k)p(\bold{\Sigma}_k)\Big]\Big[\prod_{n=1}^N p(z_n|\bold{\pi})\,p(\bold{y}_n|z_n,\bold{\mu}_{1:K},\bold{\Sigma}_{1:K})\Big]$$