Proba ML
8. Optimization
8.7 Bound Optimization

8.7.1 The general algorithm

Our goal is to maximize some function $LL(\theta)$, such as the log-likelihood. We construct a tight lower-bound surrogate function $Q$ such that:

$$Q(\theta,\theta^t)\leq LL(\theta),\quad Q(\theta^t,\theta^t)= LL(\theta^t)$$

If these conditions are met, we use the update:

$$\theta^{t+1}=\argmax_\theta Q(\theta,\theta^t)$$

This guarantees a monotonic increase of the original objective:

$$LL(\theta^t)=Q(\theta^t,\theta^t)\leq Q(\theta^{t+1},\theta^t)\leq LL(\theta^{t+1})$$

This iterative procedure converges to a local maximum of $LL$.


If $Q$ is quadratic, then bound optimization is similar to Newton’s method, which repeatedly fits and then optimizes a quadratic approximation.

The difference is that optimizing $Q$ is guaranteed to improve the objective, even if it is not convex, whereas Newton’s method may overshoot or decrease the objective, since it uses a quadratic approximation rather than a bound.
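A classic instance of this idea is MM for logistic regression: the Hessian of the negative log-likelihood is dominated everywhere by the fixed matrix $B=\frac{1}{4}X^\top X$, so a Newton-like step using $B$ maximizes a valid quadratic lower bound on $LL$ at each iteration. The sketch below is illustrative (the function names and data are my own, not from the text):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loglik(theta, X, y):
    """Logistic regression log-likelihood LL(theta)."""
    a = X @ theta
    return np.sum(y * a - np.logaddexp(0, a))

def mm_logreg(X, y, num_iters=50):
    """Bound optimization (MM) for logistic regression.

    The NLL Hessian satisfies H(theta) <= B = (1/4) X^T X for every theta,
    so each step exactly maximizes a quadratic lower bound Q(theta, theta^t)
    and the objective can never decrease. Illustrative sketch only.
    """
    theta = np.zeros(X.shape[1])
    B_inv = np.linalg.inv(0.25 * X.T @ X)   # fixed curvature bound, inverted once
    history = [loglik(theta, X, y)]
    for _ in range(num_iters):
        g = X.T @ (sigmoid(X @ theta) - y)  # gradient of the NLL
        theta = theta - B_inv @ g           # Newton-like step with bounded curvature
        history.append(loglik(theta, X, y))
    return theta, history
```

Unlike Newton’s method, $B$ never has to be recomputed, and the recorded `history` is guaranteed to be non-decreasing.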


8.7.2 The EM Algorithm

8.7.2.1 Lower bound

The goal of EM is to maximize the log-likelihood:

$$LL(\theta)=\sum_{n=1}^N \log p(y_n|\theta)=\sum^N_{n=1}\log \Big[\sum_{z_n} p(y_n,z_n|\theta)\Big]$$

where $y_n$ are the visible variables and $z_n$ the hidden variables.

To push the $\log$ inside the sum, we consider arbitrary distributions $q_n(z_n)$:

$$\begin{align}
LL(\theta) &= \sum^N_{n=1}\log \Big[\sum_{z_n} q_n(z_n)\frac{p(y_n,z_n|\theta)}{q_n(z_n)}\Big] \\
&\geq \sum^N_{n=1} \sum_{z_n} q_n(z_n)\log \frac{p(y_n,z_n|\theta)}{q_n(z_n)} \\
&= \sum_{n=1}^N \mathbb{E}_{q_n}[\log p(y_n,z_n|\theta)] + \mathbb{H}(q_n) \\
&= \sum_{n=1}^N \frak{L}(\theta,q_n|y_n) = \frak{L}(\theta,q_{\{1:N\}})
\end{align}$$

Eq. (2) uses Jensen’s inequality, since $\log$ is a concave function and $\sum_{z_n} q_n(z_n)=1$.

$\frak{L}(\theta,q_{\{1:N\}})$ is called the Evidence Lower Bound (ELBO); optimizing it is the basis of variational inference.

8.7.2.2 E step

We choose the distributions $q_n$ over the hidden variables. For each data point:

$$\begin{align}
\frak{L}(\theta,q_n) &=\sum_{z_n}q_n(z_n) \log \frac{p(y_n,z_n|\theta)}{q_n(z_n)} \\
&= \sum_{z_n}q_n(z_n) \log \frac{p(z_n|y_n,\theta)p(y_n|\theta)}{q_n(z_n)} \\
&= \sum_{z_n} q_n(z_n)\log \frac{p(z_n|y_n,\theta)}{q_n(z_n)}+\sum_{z_n}q_n(z_n) \log p(y_n|\theta) \\
&= -D_{KL}\big(q_n(z_n)\,\|\,p(z_n|y_n,\theta)\big)+\log p(y_n|\theta)
\end{align}$$

We can maximize this lower bound by setting $q_n^*=p(z_n|y_n,\theta)$, which makes the KL term vanish. This is called the E step.
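The decomposition above is easy to verify numerically on a toy discrete model. The sketch below (hypothetical numbers, illustrative function name) checks that $\frak{L}(\theta,q_n) = \log p(y_n|\theta) - D_{KL}(q_n \,\|\, p(z_n|y_n,\theta))$ for an arbitrary $q$:

```python
import numpy as np

def elbo_decomposition(joint, q):
    """Check L(theta, q) = log p(y) - KL(q || p(z|y)) on a toy discrete model.

    `joint[z]` holds p(y, z | theta) for one fixed observation y over a
    discrete latent z; `q[z]` is an arbitrary distribution over z.
    """
    log_py = np.log(joint.sum())             # evidence log p(y | theta)
    post = joint / joint.sum()               # posterior p(z | y, theta)
    elbo = np.sum(q * np.log(joint / q))     # sum_z q(z) log[p(y, z)/q(z)]
    kl = np.sum(q * np.log(q / post))        # KL(q || posterior)
    return elbo, log_py - kl

joint = np.array([0.10, 0.25, 0.05])   # hypothetical p(y, z) for z in {0, 1, 2}
q = np.array([0.2, 0.5, 0.3])          # arbitrary q(z)
elbo, check = elbo_decomposition(joint, q)
```

Since the KL is non-negative, the ELBO never exceeds $\log p(y|\theta)$, and equality holds exactly when $q$ is the posterior — which is why the E step sets $q_n^*=p(z_n|y_n,\theta)$.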

We can define:

$$Q(\theta,\theta^t)=\frak{L}\big(\theta,p(z_n|y_n,\theta^t)\big)$$

Then we have $Q(\theta,\theta^t)\leq LL(\theta)$ and $Q(\theta^t,\theta^t)=LL(\theta^t)$, as required.

8.7.2.3 M Step

We compute the MLE by maximizing $\frak{L}(\theta,q_{\{1:N\}}^t)$ w.r.t. $\theta$, where $q_{1:N}^t$ are the distributions computed during the E step at iteration $t$.

Since $\mathbb{H}(q_n)$ is constant w.r.t. $\theta$, we can drop it. We are left with:

$$\theta^{t+1}=\argmax_\theta\sum_n \mathbb{E}_{q_n^t}[\log p(y_n,z_n|\theta)]$$
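Putting the two steps together, a minimal sketch of EM for a 1D Gaussian mixture follows (illustrative only; the function name, initialization scheme, and data are assumptions, not from the text). The E step computes the responsibilities $p(z_n|y_n,\theta^t)$; the M step maximizes the expected complete-data log-likelihood in closed form:

```python
import numpy as np

def em_gmm_1d(y, K=2, num_iters=50, seed=0):
    """EM for a 1D Gaussian mixture model (illustrative sketch).

    E step: set q_n(z_n) to the posterior responsibilities p(z_n | y_n, theta^t).
    M step: maximize sum_n E_{q_n}[log p(y_n, z_n | theta)] in closed form.
    Also records the observed-data log-likelihood, which the EM bound
    guarantees is non-decreasing across iterations.
    """
    rng = np.random.default_rng(seed)
    N = len(y)
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(y, size=K, replace=False)   # init means at random data points
    var = np.full(K, np.var(y))
    lls = []
    for _ in range(num_iters):
        # E step: log p(z_n = k, y_n | theta) for each n, k
        log_p = (np.log(pi)
                 - 0.5 * np.log(2 * np.pi * var)
                 - 0.5 * (y[:, None] - mu) ** 2 / var)
        log_norm = np.logaddexp.reduce(log_p, axis=1)
        lls.append(log_norm.sum())              # observed-data log-likelihood
        r = np.exp(log_p - log_norm[:, None])   # responsibilities r[n, k]
        # M step: closed-form maximizers of the expected complete-data LL
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r * y[:, None]).sum(axis=0) / Nk
        var = (r * (y[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var, lls
```

The returned `lls` sequence never decreases, which is exactly the monotonicity guarantee of the bound-optimization view of EM.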
