Let $z \in \mathbb{R}^L$ be an unknown vector of values, and $y \in \mathbb{R}^D$ some noisy measurements of $z$.
These variables are related as follows:
$p(z) = \mathcal{N}(z \mid \mu_z, \Sigma_z)$
$p(y \mid z) = \mathcal{N}(y \mid Wz + b, \Sigma_y)$
with $W \in \mathbb{R}^{D \times L}$.
The joint distribution of this linear Gaussian system is an $(L + D)$-dimensional Gaussian:
$p(z, y) = p(z)\, p(y \mid z)$, with mean and covariance given by
$\mu = \begin{bmatrix} \mu_z \\ W\mu_z + b \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \Sigma_z & \Sigma_z W^\top \\ W\Sigma_z & \Sigma_y + W\Sigma_z W^\top \end{bmatrix}$
Bayes' rule for Gaussians gives the posterior over the latent as:
$p(z \mid y) = \mathcal{N}(z \mid \mu_{z\mid y}, \Sigma_{z\mid y})$
$\Sigma_{z\mid y}^{-1} = \Sigma_z^{-1} + W^\top \Sigma_y^{-1} W$
$\mu_{z\mid y} = \Sigma_{z\mid y}\left[ W^\top \Sigma_y^{-1} (y - b) + \Sigma_z^{-1}\mu_z \right]$
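A minimal numerical sketch of these two updates, assuming numpy (the function name `gaussian_posterior` is just illustrative):

```python
import numpy as np

def gaussian_posterior(mu_z, Sigma_z, W, b, Sigma_y, y):
    """Posterior p(z | y) for the linear Gaussian system above."""
    Sigma_z_inv = np.linalg.inv(Sigma_z)
    Sigma_y_inv = np.linalg.inv(Sigma_y)
    # Posterior precision: Sigma_{z|y}^{-1} = Sigma_z^{-1} + W^T Sigma_y^{-1} W
    Sigma_post = np.linalg.inv(Sigma_z_inv + W.T @ Sigma_y_inv @ W)
    # Posterior mean: Sigma_{z|y} [ W^T Sigma_y^{-1} (y - b) + Sigma_z^{-1} mu_z ]
    mu_post = Sigma_post @ (W.T @ Sigma_y_inv @ (y - b) + Sigma_z_inv @ mu_z)
    return mu_post, Sigma_post
```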
Example 1: Inferring an unknown scalar
We draw $N$ noisy observations $y_i$ from a latent random variable $z$. Let us assume the measurement noise has fixed precision $\lambda_y = 1/\sigma^2$, so the likelihood is:
$p(y_i \mid z) = \mathcal{N}(y_i \mid z, \lambda_y^{-1})$
The Gaussian prior on the source is:
$p(z) = \mathcal{N}(z \mid \mu_0, \lambda_0^{-1})$
Let also:
$y = (y_1, \ldots, y_N)$
$W = 1_N$, a column vector of ones
$\Sigma_y^{-1} = \lambda_y I_N$
The posterior is:
$p(z \mid y) = \mathcal{N}(z \mid \mu_N, \lambda_N^{-1})$
$\lambda_N = \lambda_0 + N\lambda_y$
$\mu_N = \dfrac{N\lambda_y \bar{y} + \lambda_0 \mu_0}{\lambda_N}$
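A quick numerical sketch of this scalar update, assuming numpy (all values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
z_true, lam_y = 2.0, 4.0                                  # latent value, measurement precision
mu_0, lam_0 = 0.0, 1.0                                    # prior mean and precision
y = z_true + rng.normal(0.0, 1.0 / np.sqrt(lam_y), 50)    # N = 50 noisy observations

N, y_bar = len(y), y.mean()
lam_N = lam_0 + N * lam_y                                 # posterior precision
mu_N = (N * lam_y * y_bar + lam_0 * mu_0) / lam_N         # posterior mean
print(mu_N, 1.0 / lam_N)                                  # mu_N -> y_bar as N grows
```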
Example 2: Inferring an unknown vector
We have an unknown quantity of interest $z \in \mathbb{R}^D$ with a Gaussian prior $p(z) = \mathcal{N}(\mu_z, \Sigma_z)$. If we know nothing about $z$, we can set $\Sigma_z = \infty I$, and by symmetry $\mu_z = 0$.
We make $N$ noisy, independent measurements $y_n \sim \mathcal{N}(z, \Sigma_y)$.
The likelihood is:
$p(\mathcal{D} \mid z) = \prod_{n=1}^N \mathcal{N}(y_n \mid z, \Sigma_y) = \mathcal{N}\left(\bar{y} \mid z, \tfrac{1}{N}\Sigma_y\right)$
(we can replace the $N$ observations by their average, provided we scale down the covariance). Setting $W = I$ and $b = 0$, we apply Bayes' rule for Gaussians:
$p(z \mid y_1, \ldots, y_N) = \mathcal{N}(z \mid \mu_N, \Sigma_N)$
$\Sigma_N^{-1} = \Sigma_z^{-1} + N\Sigma_y^{-1}$
$\mu_N = \Sigma_N\left( \Sigma_y^{-1}(N\bar{y}) + \Sigma_z^{-1}\mu_z \right)$
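A small sketch of this vector case (with $W = I$, $b = 0$), assuming numpy; the helper name `posterior_unknown_vector` is made up:

```python
import numpy as np

def posterior_unknown_vector(mu_z, Sigma_z, Sigma_y, Y):
    """Y has shape (N, D); returns the posterior mean and covariance of z."""
    N = Y.shape[0]
    y_bar = Y.mean(axis=0)
    Sigma_z_inv = np.linalg.inv(Sigma_z)
    Sigma_y_inv = np.linalg.inv(Sigma_y)
    # Sigma_N^{-1} = Sigma_z^{-1} + N Sigma_y^{-1}
    Sigma_N = np.linalg.inv(Sigma_z_inv + N * Sigma_y_inv)
    # mu_N = Sigma_N ( Sigma_y^{-1} (N y_bar) + Sigma_z^{-1} mu_z )
    mu_N = Sigma_N @ (Sigma_y_inv @ (N * y_bar) + Sigma_z_inv @ mu_z)
    return mu_N, Sigma_N
```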
Example 3: Sensor fusion
With $N$ measurements from each of $M$ sensors, the model has the form:
$p(z, y) = p(z) \prod_{n=1}^N \prod_{m=1}^M \mathcal{N}(y_{n,m} \mid z, \Sigma_m)$
Our goal is to combine the evidence to compute the posterior.
Suppose $M = 2$; we can combine $y_1$ and $y_2$ into $y = [y_1, y_2]$, so that:
$p(y \mid z) = \mathcal{N}(y \mid Wz, \Sigma_y)$
with $\Sigma_y = \begin{bmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{bmatrix}$ and $W = \begin{bmatrix} I \\ I \end{bmatrix}$.
We can then apply Bayes' rule for Gaussians to get $p(z \mid y)$, with $y = [\bar{y}_1, \bar{y}_2]$.
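A sketch of the two-sensor case, assuming numpy: we stack the two measurements, build the block-diagonal noise covariance, and reuse the generic Gaussian posterior (the helper name `fuse_two_sensors` is made up):

```python
import numpy as np

def fuse_two_sensors(mu_z, Sigma_z, y1, Sigma_1, y2, Sigma_2):
    """Posterior over z given one (averaged) measurement from each of two sensors."""
    D = len(mu_z)
    W = np.vstack([np.eye(D), np.eye(D)])              # W = [I; I]
    Sigma_y = np.block([[Sigma_1, np.zeros((D, D))],
                        [np.zeros((D, D)), Sigma_2]])  # block-diagonal sensor noise
    y = np.concatenate([y1, y2])
    Sigma_z_inv = np.linalg.inv(Sigma_z)
    Sigma_y_inv = np.linalg.inv(Sigma_y)
    Sigma_post = np.linalg.inv(Sigma_z_inv + W.T @ Sigma_y_inv @ W)
    mu_post = Sigma_post @ (W.T @ Sigma_y_inv @ y + Sigma_z_inv @ mu_z)
    return mu_post, Sigma_post
```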
3.4 The exponential family
$p(y \mid \phi) = h(y)\, e^{f(\phi)^\top T(y) - A(f(\phi))}$
with:
$T(y)$ the sufficient statistics
$f(\phi)$ the natural (canonical) parameters
$A(f(\phi))$ the log partition function
canonical form: $f(\phi) = \phi$
natural exponential family (NEF): $T(y) = y$
When both hold, this becomes:
$p(y \mid \phi) = h(y)\, e^{\phi^\top y - A(\phi)}$
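As a quick worked example (the standard Bernoulli case, which is an NEF member):
$p(y \mid \mu) = \mu^y (1 - \mu)^{1 - y} = \exp\left[ y \log\tfrac{\mu}{1 - \mu} + \log(1 - \mu) \right]$
so $\phi = \log\tfrac{\mu}{1 - \mu}$, $T(y) = y$, $A(\phi) = \log(1 + e^\phi)$ and $h(y) = 1$.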
The first and second cumulants of a distribution are $\mathbb{E}[Y]$ and $\mathbb{V}[Y]$; its first and second moments are $\mathbb{E}[Y]$ and $\mathbb{E}[Y^2]$.
Important property of the exponential family: derivatives of the log partition function yield the cumulants of the sufficient statistics:
$\nabla A(\phi) = \mathbb{E}[T(y)]$
$\nabla^2 A(\phi) = \mathrm{Cov}[T(y)]$
Since the Hessian $\nabla^2 A(\phi)$ is positive definite, $A$ is convex, so the log likelihood $\phi^\top T(y) - A(\phi)$ is concave and hence the MLE has a unique global maximum.
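A small numerical check of the cumulant property on the Bernoulli example above, assuming numpy:

```python
import numpy as np

phi = 0.7                                                   # an arbitrary natural parameter
A = lambda p: np.log1p(np.exp(p))                           # Bernoulli log partition function
eps = 1e-4
dA  = (A(phi + eps) - A(phi - eps)) / (2 * eps)             # numerical gradient of A
d2A = (A(phi + eps) - 2 * A(phi) + A(phi - eps)) / eps**2   # numerical Hessian of A

mu = 1.0 / (1.0 + np.exp(-phi))                             # E[y] = sigmoid(phi)
print(np.allclose(dA, mu), np.allclose(d2A, mu * (1 - mu))) # True, True
```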
3.5 Mixture models
With $K$ components, the distribution has the form:
$p(y \mid \theta) = \sum_{k=1}^K \pi_k\, p_k(y)$
with $\sum_{k=1}^K \pi_k = 1$
This model can be expressed as a hierarchical model in which we introduce a latent variable $z \in \{1, \ldots, K\}$ that specifies which distribution to use to model $y$.
The prior on this latent variable is $p(z = k \mid \theta) = \pi_k$.
For a mixture of Bernoullis, $\mu_{dk}$ is the probability that bit $d$ turns on in cluster $k$.
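A minimal sketch of ancestral sampling from such a Bernoulli mixture, assuming numpy (the mixing weights and the $\mu_{dk}$ matrix are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.3, 0.7])                       # mixing weights, sum to 1
mu = np.array([[0.9, 0.1, 0.8],                 # mu[k, d]: prob. that bit d is on in cluster k
               [0.2, 0.7, 0.3]])

def sample(n):
    z = rng.choice(len(pi), size=n, p=pi)       # latent component indicators, p(z = k) = pi_k
    y = rng.random((n, mu.shape[1])) < mu[z]    # Bernoulli draw for each bit
    return z, y.astype(int)

z, y = sample(5)
print(z)
print(y)
```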
3.6 Probabilistic Graphical Models
When the model is a Directed Acyclic Graph (DAG), we call it a Bayesian Network (even though there is nothing inherently Bayesian about it).
Ordered Markov Property: a node, given its parents, is independent of all its other predecessors:
$Y_i \perp Y_{\mathrm{pred}(i) \setminus \mathrm{parents}(i)} \mid Y_{\mathrm{parents}(i)}$
The joint distribution can be represented as:
$p(Y_{1:V}) = \prod_{i=1}^V p(Y_i \mid Y_{\mathrm{parents}(i)})$
A Markov chain, or autoregressive model of order 1; the conditional $p(y_t \mid y_{t-1})$ is called the transition function or Markov kernel:
$p(y_{1:T}) = p(y_1) \prod_{t=2}^T p(y_t \mid y_{t-1})$
Generalization to order $M$:
$p(y_{1:T}) = p(y_{1:M}) \prod_{t=M+1}^T p(y_t \mid y_{t-M:t-1})$
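A short sketch of sampling a discrete first-order Markov chain, assuming numpy (the initial distribution and transition matrix are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
init = np.array([0.5, 0.5])                     # p(y_1)
trans = np.array([[0.9, 0.1],                   # trans[i, j] = p(y_t = j | y_{t-1} = i)
                  [0.2, 0.8]])

T = 10
y = np.empty(T, dtype=int)
y[0] = rng.choice(2, p=init)
for t in range(1, T):
    y[t] = rng.choice(2, p=trans[y[t - 1]])     # Markov kernel: depends only on y_{t-1}
print(y)
```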
If the parameters of the Conditional Probability Distributions (CPDs) are unknown, we can view them as additional random variables and treat them as hidden variables to be inferred:
$\theta \sim p(\theta)$, some unspecified prior over the parameters.