Proba ML
8. Optimization
8.2 First-Order Methods

First-order methods compute first-order derivatives of the objective function, but ignore curvature information.

All these algorithms require the user to choose a starting point $\theta_0$ before iterating:

$$\theta_{t+1}=\theta_t+\eta_t \bold{d}_t$$

where $\eta_t$ is the learning rate and $\bold{d}_t$ is the descent direction, such as the negative of the gradient $\bold{g}_t=\nabla_\theta \mathcal{L}(\theta)|_{\theta_t}$.

These update steps are performed until reaching a stationary point.
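As a minimal sketch of this iteration (the 1d quadratic objective, starting point, and step size below are illustrative choices, not from the text):

```python
import numpy as np

def gradient_descent(grad, theta0, eta=0.1, n_steps=100):
    """Repeat the first-order update theta <- theta + eta * d, with d = -grad."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - eta * grad(theta)  # d_t = -g_t
    return theta

# Minimize L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_star = gradient_descent(lambda t: 2 * (t - 3.0), theta0=[0.0])
# theta_star ends up close to the stationary point at 3
```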

8.2.2 Step size

The sequence of step sizes $\{\eta_t\}$ is called the learning rate schedule. We describe some methods to pick it.

8.2.2.1 Constant step size

The simplest method is to use a constant step size $\eta_t=\eta$. However, if it is set too large the method will fail to converge, and if it is set too small, the method will converge very slowly.

We can derive a theoretical upper bound for the step size. For a quadratic objective function:

$$\mathcal{L}(\theta)=\frac{1}{2}\theta^\top A\theta+\bold{b}^\top\theta+c$$

with $A$ symmetric positive definite, steepest descent converges iff the step size satisfies:

$$\eta<\frac{2}{\lambda_{\max}(A)}$$

This ensures the step does not overshoot along the direction of greatest curvature, which the largest eigenvalue measures.

More generally:

$$\eta<\frac{2}{L}$$

where $L$ is the Lipschitz constant of the gradient; this constant is usually unknown.
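We can check the $2/\lambda_{\max}$ threshold numerically. The diagonal matrix and the two step sizes below are illustrative assumptions:

```python
import numpy as np

# Quadratic L(theta) = 0.5 * theta^T A theta, with eigenvalues 1 and 10.
A = np.diag([1.0, 10.0])
lam_max = 10.0

def final_norm(eta, n_steps=200):
    theta = np.array([1.0, 1.0])
    for _ in range(n_steps):
        theta = theta - eta * (A @ theta)  # gradient of this quadratic is A @ theta
    return np.linalg.norm(theta)

converged = final_norm(0.9 * 2 / lam_max)  # just below the bound: shrinks to 0
diverged = final_norm(1.1 * 2 / lam_max)   # just above the bound: blows up
```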

8.2.2.2 Line search

The optimal step size can be found by solving the optimization:

$$\eta_t=\argmin_{\eta>0}\mathcal{L}(\theta_t+\eta\bold{d}_t)$$

If the loss is convex, this subproblem is also convex because the update term is affine.

If we take the quadratic loss as an example, we solve:

$$\frac{d}{d\eta}\Big[\frac{1}{2}(\theta+\eta\bold{d})^\top A(\theta+\eta\bold{d})+\bold{b}^\top(\theta+\eta\bold{d})+c\Big]=0$$

and find:

$$\eta=-\frac{\bold{d}^\top(A\theta+\bold{b})}{\bold{d}^\top A\bold{d}}$$
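Note that the derivative above equals $\eta\,\bold{d}^\top A\bold{d}+\bold{d}^\top(A\theta+\bold{b})$, so for the steepest-descent direction $\bold{d}=-(A\theta+\bold{b})$ the minimizing step is $\eta=\bold{d}^\top\bold{d}/(\bold{d}^\top A\bold{d})>0$. A quick numerical check of the closed form (the matrix, vector, and starting point are illustrative choices):

```python
import numpy as np

# Illustrative quadratic: symmetric positive definite A, arbitrary b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
theta = np.array([0.5, 0.5])
d = -(A @ theta + b)  # steepest-descent direction

# Closed-form minimizer of eta -> L(theta + eta * d)
eta = -(d @ (A @ theta + b)) / (d @ A @ d)

def loss(th):
    return 0.5 * th @ A @ th + b @ th

# The closed-form eta should beat nearby step sizes along the same direction.
assert loss(theta + eta * d) <= loss(theta + (eta + 1e-3) * d)
assert loss(theta + eta * d) <= loss(theta + (eta - 1e-3) * d)
```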

However, it is often not necessary to be so precise; instead, we can start from the current step size and reduce it by a factor $0<\beta<1$ at each step until it satisfies the Armijo-Goldstein test:

$$\mathcal{L}(\theta_t+\eta\bold{d}_t)\leq\mathcal{L}(\theta_t)+c\eta\,\bold{d}_t^\top\nabla\mathcal{L}(\theta_t)$$

where $c\in[0,1]$, typically $c=10^{-4}$.
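A sketch of this backtracking procedure (the function name and toy objective are illustrative):

```python
import numpy as np

def backtracking(loss, grad, theta, d, eta0=1.0, beta=0.5, c=1e-4):
    """Shrink eta by a factor beta until the Armijo-Goldstein test passes."""
    eta = eta0
    f0, g0 = loss(theta), grad(theta)
    while loss(theta + eta * d) > f0 + c * eta * d @ g0:
        eta *= beta
    return eta

# Example on L(theta) = theta^T theta, descending along the negative gradient.
loss = lambda t: t @ t
grad = lambda t: 2 * t
theta = np.array([2.0, -1.0])
eta = backtracking(loss, grad, theta, -grad(theta))
```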

8.2.3 Convergence rates

For some convex problems with a Lipschitz-continuous gradient, gradient descent converges at a linear rate:

$$|\mathcal{L}(\theta_{t+1})-\mathcal{L}(\theta_*)|\leq\mu\,|\mathcal{L}(\theta_t)-\mathcal{L}(\theta_*)|$$

where $\mu\in[0,1]$ is the convergence rate.

For simple problems like the quadratic objective function, we can derive $\mu$ explicitly. If we use steepest descent with exact line search, the convergence rate is given by:

$$\mu=\Big(\frac{\lambda_{\max}-\lambda_{\min}}{\lambda_{\max}+\lambda_{\min}}\Big)^2$$

Intuitively, the condition number $\kappa=\frac{\lambda_{\max}}{\lambda_{\min}}$ measures the skewness of the space, i.e. how far the level sets are from a symmetric bowl.
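The rate formula can be evaluated directly to see how ill-conditioning slows convergence (the eigenvalues below are illustrative):

```python
def rate(lam_max, lam_min):
    """Convergence rate of steepest descent with exact line search on a quadratic."""
    return ((lam_max - lam_min) / (lam_max + lam_min)) ** 2

# A perfectly conditioned bowl (kappa = 1) gives rate 0: one step suffices...
assert rate(1.0, 1.0) == 0.0
# ...while a large condition number pushes the rate toward 1 (slow progress).
assert rate(2.0, 1.0) < rate(100.0, 1.0) < 1.0
```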

More generally, the objective around the optimum is locally quadratic, hence the convergence rate depends on the condition number of the Hessian at that point, $\kappa(H)$. We can improve the speed of convergence by using a surrogate model whose Hessian is close to the Hessian of the objective.

Line search often exhibits inefficient zig-zag paths; instead, we can use conjugate gradient descent.

8.2.3.1 Conjugate gradient descent

Two vectors $\bold{u},\bold{v}$ are conjugate w.r.t. $A$ iff:

$$\bold{u}^\top A\bold{v}=0$$

Let $P=\{\bold{p}_1,\dots,\bold{p}_n\}$ be a set of mutually conjugate vectors, i.e. $\bold{p}_i^\top A\bold{p}_j=0$ for $i\neq j$.

$P$ forms a basis for $\mathbb{R}^n$, and we can write:

$$\theta_*=\sum_{i=1}^n\eta_i\bold{p}_i$$

The gradient of the quadratic objective function is:

$$\nabla\mathcal{L}(\theta)=A\theta+\bold{b}$$

thus we choose our first vector as:

$$\bold{p}_0=\bold{r}_0=-(A\theta_0+\bold{b})$$

and we obtain the subsequent conjugate vectors with the following updates:

$$
\begin{align}
\eta_t &= \frac{\bold{r}_t^\top\bold{r}_t}{\bold{p}_t^\top A\bold{p}_t} \\
\theta_{t+1} &= \theta_t+\eta_t\bold{p}_t \\
\bold{r}_{t+1} &= \bold{r}_t-\eta_t A\bold{p}_t \\
&\text{exit if }\bold{r}_{t+1}\text{ small enough} \\
\beta_t &= \frac{\bold{r}_{t+1}^\top\bold{r}_{t+1}}{\bold{r}_t^\top\bold{r}_t} \\
\bold{p}_{t+1} &= \bold{r}_{t+1}+\beta_t\bold{p}_t \\
t &= t+1
\end{align}
$$
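These updates translate directly into code. The sketch below assumes a symmetric positive definite $A$; the example matrix and vector are illustrative:

```python
import numpy as np

def conjugate_gradient(A, b, theta0, tol=1e-10):
    """Minimize 0.5 theta^T A theta + b^T theta for symmetric positive definite A."""
    theta = np.asarray(theta0, dtype=float)
    r = -(A @ theta + b)  # residual = negative gradient
    p = r.copy()
    for _ in range(len(b)):  # exact convergence in at most n steps
        if r @ r < tol:      # exit if r is small enough
            break
        eta = (r @ r) / (p @ A @ p)
        theta = theta + eta * p
        r_new = r - eta * (A @ p)
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return theta

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([-1.0, -2.0])
theta = conjugate_gradient(A, b, np.zeros(2))
# At the minimizer the gradient A @ theta + b vanishes
```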

8.2.4 Momentum methods

8.2.4.1 Momentum

Momentum accelerates gradient descent by accumulating past gradients. It can be implemented as follows:

$$
\begin{align}
\bold{m}_t &= \beta\bold{m}_{t-1}+\bold{g}_{t-1} \\
\theta_t &= \theta_{t-1}-\mu_t\bold{m}_t
\end{align}
$$

with $\beta\in[0,1]$, usually $\beta=0.9$.

The momentum term $\bold{m}_t$ plays the role of an exponentially weighted moving average of the gradients:

$$\bold{m}_t=\sum_{k=0}^{t-1}\beta^k\bold{g}_{t-k-1}$$
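We can verify this moving-average identity on a toy gradient sequence (the gradient values are arbitrary):

```python
import numpy as np

beta = 0.9
grads = [np.array([1.0]), np.array([2.0]), np.array([0.5])]  # g_0, g_1, g_2

# Momentum recursion: m_t = beta * m_{t-1} + g_{t-1}
m = np.zeros(1)
for g in grads:
    m = beta * m + g

# Closed form: m_t = sum_k beta^k * g_{t-k-1}
expected = sum(beta ** k * grads[-(k + 1)] for k in range(len(grads)))
assert np.allclose(m, expected)
```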

8.2.4.2 Nesterov momentum

An issue with standard momentum is that it may not slow down enough at the bottom of a valley, causing oscillations.

To mitigate this, Nesterov accelerated gradient adds an extrapolation step to gradient descent:

$$
\begin{align}
\tilde{\theta}_{t+1} &= \theta_t+\beta_t(\theta_t-\theta_{t-1}) \\
\theta_{t+1} &= \tilde{\theta}_{t+1}-\mu_t\nabla\mathcal{L}(\tilde{\theta}_{t+1})
\end{align}
$$

This can be rewritten in the same form as standard momentum:

$$
\begin{align}
\bold{m}_t &= \beta\bold{m}_{t-1}-\mu_{t-1}\nabla\mathcal{L}(\theta_{t-1}+\beta\bold{m}_{t-1}) \\
\theta_t &= \theta_{t-1}+\bold{m}_t
\end{align}
$$

This method can be faster than standard momentum because it measures the gradient at the predicted next location.
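A sketch of this momentum-style form on a toy quadratic (the matrix, hyperparameters, and iteration count are illustrative choices):

```python
import numpy as np

# L(theta) = 0.5 * theta^T A theta, so grad(theta) = A @ theta.
A = np.diag([1.0, 10.0])
grad = lambda t: A @ t

beta, mu = 0.9, 0.05
theta = np.array([1.0, 1.0])
m = np.zeros(2)
for _ in range(300):
    m = beta * m - mu * grad(theta + beta * m)  # gradient at the look-ahead point
    theta = theta + m
# theta approaches the minimizer at the origin
```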

In practice, Nesterov momentum can be unstable if $\beta$ or $\mu_t$ are misspecified.