
8.4 Stochastic gradient descent

The goal is to minimize the average of the loss function:

$$\mathcal{L}(\theta)=\mathbb{E}_{q(\bold{z})}[\mathcal{L}(\theta,\bold{z})]$$

where $\bold{z}$ is a training sample. In other words, at each iteration:

$$\mathcal{L}_t(\theta)=\mathcal{L}(\theta,\bold{z}_t)$$

where $\bold{z}_t\sim q$.

The resulting algorithm is SGD:

$$\theta_{t+1}=\theta_t-\mu_t \nabla \mathcal{L}(\theta_t, \bold{z}_t)=\theta_t-\mu_t \bold{g}_t$$

As long as the gradient estimate is unbiased, this converges to a stationary point, provided the step size $\mu_t$ decays appropriately.
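A minimal sketch of this update loop, assuming hypothetical helpers `draw_sample()` (returns $\bold{z}_t \sim q$), `grad_loss(theta, z)` (returns $\nabla\mathcal{L}(\theta,\bold{z})$), and a step-size schedule `lr_schedule(t)`:

```python
import numpy as np

def sgd(grad_loss, draw_sample, theta0, lr_schedule, num_steps=1_000):
    """Plain SGD: theta <- theta - mu_t * grad L(theta, z_t)."""
    theta = np.asarray(theta0, dtype=float)
    for t in range(num_steps):
        z_t = draw_sample()              # z_t ~ q
        g_t = grad_loss(theta, z_t)      # stochastic gradient g_t
        theta = theta - lr_schedule(t) * g_t
    return theta
```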

8.4.1 Finite sum problem

SGD is widely used in ML, because many models are based on empirical risk minimization, which minimizes the following loss:

$$\mathcal{L}(\theta_t)=\frac{1}{N}\sum_{n=1}^N \mathcal{L}_n(\theta_t)=\frac{1}{N}\sum_{n=1}^N \ell (y_n, f(x_n,\theta_t))$$

This is called a finite-sum problem. Computing the gradient requires summing over all $N$ examples and can be slow. Fortunately, we can approximate it by sampling a minibatch of size $B \ll N$:

$$\bold{g}_t \approx \frac{1}{|B_t|}\sum_{n\in B_t}\nabla_\theta \mathcal{L}_n(\theta_t)$$

where $B_t$ is a random subset of example indices. This minibatch gradient is an unbiased estimate of the full empirical gradient.

Although SGD converges at a slower rate than batch GD, each step is much cheaper. Moreover, computing the full gradient is usually wasteful, especially early on when the parameters are still poorly estimated.
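A sketch of this minibatch estimate, assuming a hypothetical per-example gradient `grad_n(theta, n)` that returns $\nabla_\theta\mathcal{L}_n(\theta)$:

```python
import numpy as np

def minibatch_gradient(grad_n, theta, N, batch_size, rng):
    """Unbiased estimate of the full gradient from a random minibatch B_t."""
    batch = rng.choice(N, size=batch_size, replace=False)      # B_t
    return np.mean([grad_n(theta, n) for n in batch], axis=0)  # (1/|B_t|) sum_n grad L_n
```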

8.4.2 Example using the LMS

The objective of linear regression is:

$$\mathcal{L}(\theta)=\frac{1}{2N}\sum_{n=1}^N (\bold{x}_n^\top \theta-y_n)^2$$

And its gradient:

$$\bold{g}_t=\frac{1}{N}\sum_{n=1}^N (\theta_t^\top \bold{x}_n - y_n)\bold{x}_n$$

With SGD of batch size 1, the update is:

$$\theta_{t+1} = \theta_t -\mu_t (\theta^\top_t \bold{x}_n-y_n)\bold{x}_n$$

where $n=n(t)$ is the example index chosen at iteration $t$.

The LMS algorithm may need multiple passes through the data to find the optimum.
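A minimal NumPy sketch of this batch-size-1 LMS loop on synthetic data (the data, step size, and number of passes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=N)

theta = np.zeros(D)
mu = 0.05
for epoch in range(20):                  # multiple passes through the data
    for n in rng.permutation(N):         # n = n(t)
        residual = theta @ X[n] - y[n]   # theta^T x_n - y_n
        theta -= mu * residual * X[n]    # LMS update
print(theta)                             # close to theta_true
```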


8.4.3 Choosing the step size (learning rate)

A learning rate that is too small leads to slow progress and underfitting, while one that is too large leads to instability: in both cases, the model fails to converge to the optimum.


Rather than using a single constant, we can use a learning rate schedule. A sufficient condition for SGD to converge is that the schedule satisfies the Robbins-Monro conditions:

$$\mu_t\rightarrow 0 ,\quad \frac{\sum_{t=1}^\infty \mu_t^2}{\sum^\infty_{t=1} \mu_t}\rightarrow 0$$

Some common examples of learning rate schedules:

$$
\begin{align}
\mu_t &=\mu_i \quad \text{if } t_i \leq t\leq t_{i+1} && \text{(piecewise constant)}\\
\mu_t&=\mu_0 e^{-\lambda t} && \text{(exponential decay)} \\
\mu_t &= \mu_0 (\beta t+1)^{-\alpha} && \text{(polynomial decay)}
\end{align}
$$


The piecewise constant schedule reduces the learning rate whenever a time threshold is reached, or whenever the loss has plateaued; the latter is called reduce-on-plateau.

Exponential decay is typically too fast; polynomial decay is usually preferred, for example with $\beta=1$ and $\alpha=0.5$, which gives $\mu_t = \mu_0/\sqrt{t+1}$.
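These schedules can be written directly as functions of the step $t$; the constants below are illustrative:

```python
import numpy as np

def piecewise_constant(t, thresholds=(0, 50, 100), values=(1e-1, 1e-2, 1e-3)):
    """mu_t = mu_i for t_i <= t < t_{i+1}."""
    i = int(np.searchsorted(thresholds, t, side="right")) - 1
    return values[i]

def exponential_decay(t, mu0=1e-1, lam=0.05):
    """mu_t = mu_0 * exp(-lambda * t)."""
    return mu0 * np.exp(-lam * t)

def polynomial_decay(t, mu0=1e-1, beta=1.0, alpha=0.5):
    """mu_t = mu_0 * (beta * t + 1)^(-alpha)."""
    return mu0 * (beta * t + 1) ** (-alpha)
```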

Another scheduling strategy is to quickly increase the learning rate and then reduce it again: the initial small steps let us navigate a poorly conditioned loss landscape, after which the larger steps make rapid progress. This is the one-cycle learning rate schedule.

We can also repeat this pattern in a cyclical learning rate schedule to escape local minima.


8.4.4 Iterate averaging

The parameter estimates produced by SGD can be unstable. To reduce their variance, we can compute the running average:

$$\bar{\theta}_t=\frac{1}{t}\sum^t_{i=1} \theta_i=\frac{1}{t}\theta_t+\frac{t-1}{t}\bar{\theta}_{t-1}$$

This is called iterate averaging or Polyak-Ruppert averaging. It has been proven that this method yields the best asymptotic convergence for SGD, matching second-order methods.
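The running average can be maintained in constant memory; a small sketch (with $t$ starting at 1):

```python
def polyak_average(theta_bar, theta_t, t):
    """theta_bar_t = (1/t) * theta_t + ((t-1)/t) * theta_bar_{t-1}, for t >= 1."""
    return theta_t / t + (t - 1) / t * theta_bar
```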

**Stochastic Weight Averaging (SWA)** exploits the flatness of the objective to find solutions that generalize better, rather than aiming for fast convergence. It does so using an equal (unweighted) average of iterates combined with a modified learning rate schedule.

8.4.5 Variance reduction

These methods reduce the variance of the gradients in SGD, rather than the parameters themselves, and can improve the convergence rate from sublinear to linear.

8.4.5.1 Stochastic Variance Reduced Gradient (SVRG)

For each epoch, we take a snapshot of our parameters, $\tilde{\theta}$, and compute its full-batch gradient $\nabla \mathcal{L}(\tilde{\theta})$. This will act as a baseline, and we also compute the minibatch gradient at this snapshot, $\nabla \mathcal{L}_t(\tilde{\theta})$, at each minibatch iteration $t$.

The gradient becomes:

$$\bold{g}_t=\nabla \mathcal{L}_t(\theta_t)-\nabla \mathcal{L}_t(\tilde{\theta})+\nabla \mathcal{L}(\tilde{\theta})$$

This is unbiased since:

$$\mathbb{E}[\nabla\mathcal{L}_t(\tilde{\theta})]=\nabla \mathcal{L}(\tilde{\theta})$$

Iterations of SVRG are faster than those of full-batch GD, but SVRG can still match the theoretical convergence rate of GD.
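A sketch of SVRG under these definitions, again assuming a hypothetical per-example gradient `grad_n(theta, n)`:

```python
import numpy as np

def svrg(grad_n, theta0, N, lr, num_epochs, inner_steps, rng):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_epochs):
        snapshot = theta.copy()                                # tilde{theta}
        full_grad = np.mean([grad_n(snapshot, n) for n in range(N)], axis=0)
        for _ in range(inner_steps):
            n = rng.integers(N)
            # variance-reduced gradient: grad L_n(theta) - grad L_n(tilde) + grad L(tilde)
            g = grad_n(theta, n) - grad_n(snapshot, n) + full_grad
            theta = theta - lr * g
    return theta
```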

8.4.5.2 Stochastic Average Gradient Accelerated (SAGA)

Unlike SVRG, SAGA only needs to compute the full batch gradient once at the start of the algorithm. It “pays” for this speed-up by keeping $N$ gradient vectors in memory.

We first initialize $\bold{g}_n^{local}=\nabla \mathcal{L}_n(\theta_0)$ and then compute the average $\bold{g}^{avg}=\frac{1}{N}\sum_{n=1}^N \bold{g}_n^{local}$.

Then at iteration $t$, we use the gradient estimate:

$$\bold{g}_t=\nabla \mathcal{L}_n(\theta_t)-\bold{g}_n^{local}+\bold{g}^{avg}$$

We then update $\bold{g}_n^{local} \leftarrow \nabla \mathcal{L}_n(\theta_t)$ and update $\bold{g}^{avg}$ by replacing the old local gradient with the new one.

If the features (and hence the gradients) are sparse, then the cost of storing the gradients is reasonable.
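A sketch of SAGA with the same hypothetical `grad_n(theta, n)` helper; note the $N$ stored local gradients:

```python
import numpy as np

def saga(grad_n, theta0, N, lr, num_steps, rng):
    theta = np.asarray(theta0, dtype=float)
    g_local = np.stack([grad_n(theta, n) for n in range(N)])  # one full pass at the start
    g_avg = g_local.mean(axis=0)
    for _ in range(num_steps):
        n = rng.integers(N)
        g_new = grad_n(theta, n)
        g_t = g_new - g_local[n] + g_avg             # unbiased gradient estimate
        theta = theta - lr * g_t
        g_avg = g_avg + (g_new - g_local[n]) / N     # keep the average consistent
        g_local[n] = g_new
    return theta
```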

8.4.5.3 Application to deep learning

Variance reduction techniques are widely used for fitting ML models with convex objectives, but they are difficult to apply in deep learning. This is because batch normalization, data augmentation, and dropout break the assumptions of SVRG, since the loss now varies randomly in ways that do not depend only on the parameters and the data index.

8.4.6 Preconditioned SGD

We consider the following update:

$$\theta_{t+1}=\theta_t-\mu_t M_t^{-1}\bold{g}_t$$

where $M_t$ is a positive-definite preconditioning matrix. The noise in the gradient estimates makes it difficult to estimate the Hessian, on which second-order methods rely.

Also, solving for the update direction with a full preconditioning matrix is expensive, so practitioners often use a diagonal $M_t$.

8.4.6.1 Adaptive Gradient (AdaGrad)

AdaGrad was initially designed for convex objectives where many elements of the gradient vector are zero. These correspond to features that are rarely present in the input, like rare words.

The update has the form:

$$\theta_{t+1,d}=\theta_{t,d}-\mu_t\frac{1}{\sqrt{s_{t,d}+\epsilon}}g_{t,d}$$

where $d\in[1,D]$ indexes the dimensions of the parameter vector, $\epsilon$ controls the amount of adaptivity, and

$$s_{t,d}=\sum_{i=1}^t g_{i,d}^2$$

Equivalently:

$$\Delta \theta_t=-\mu_t \frac{1}{\sqrt{\bold{s}_{t}+\epsilon}}\bold{g}_t$$

where the root and divisions are computed element-wise.

Here, we consider $M_t=\mathrm{diag}(\bold{s}_t+\epsilon)^{1/2}$.

We still have to choose the step size, but AdaGrad is less sensitive to it than vanilla GD. We usually fix $\mu_t=\mu_0$.
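A minimal sketch of one AdaGrad step, keeping the accumulator $\bold{s}$ as an array of the same shape as $\theta$:

```python
import numpy as np

def adagrad_step(theta, g, s, mu0=0.1, eps=1e-8):
    """theta, g, s are arrays of the same shape; s accumulates squared gradients."""
    s = s + g ** 2                              # s_{t,d} = sum_i g_{i,d}^2
    theta = theta - mu0 * g / np.sqrt(s + eps)  # element-wise division
    return theta, s
```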

8.4.6.2 RMSProp and AdaDelta

AdaGrad's denominator can grow too quickly, leading to vanishing learning rates. An alternative is to use an exponentially weighted moving average (EWMA):

$$s_{t+1,d}= \beta s_{t,d}+(1-\beta)g_{t,d}^2$$

usually with $\beta=0.9$, which approximates the root mean square (RMS):

$$\sqrt{s_{t,d}} \approx \mathrm{RMS}(\bold{g}_{1:t,d})=\sqrt{\frac{1}{t}\sum_{i=1}^t g^2_{i,d}}$$

Therefore the update becomes:

$$\Delta \theta_t = -\mu_t \frac{1}{\sqrt{\bold{s}_t+\epsilon}}\bold{g}_t$$
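A sketch of one RMSProp step corresponding to this update (the step size and $\beta$ value are illustrative defaults):

```python
import numpy as np

def rmsprop_step(theta, g, s, mu0=1e-3, beta=0.9, eps=1e-8):
    """EWMA of squared gradients instead of AdaGrad's running sum."""
    s = beta * s + (1 - beta) * g ** 2
    theta = theta - mu0 * g / np.sqrt(s + eps)
    return theta, s
```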

AdaDelta is similar to RMSProp, but additionally keeps an EWMA of the parameter updates $\Delta\theta_t$ themselves:

$$\Delta \theta_t = -\mu_t \frac{\sqrt{\delta_{t-1}+\epsilon}}{\sqrt{\bold{s}_t+\epsilon}}\bold{g}_t$$

with:

$$\delta_t=\beta \delta_{t-1}+(1-\beta)(\Delta \theta_t)^2$$

If we fix $\mu_t=\mu_0$, the effective learning rate is not guaranteed to decrease, so the method might not converge.

8.4.6.3 Adaptive Moment Estimation (Adam)

Let’s combine momentum with RMSProp:

$$
\begin{align}
\bold{m}_t &=\beta_1\bold{m}_{t-1}+(1-\beta_1)\bold{g}_t \\
\bold{s}_t &= \beta_2 \bold{s}_{t-1}+(1-\beta_2)\bold{g}_t^2
\end{align}
$$

This gives the following update:

$$\Delta\theta_t=-\mu_t\frac{1}{\sqrt{\bold{s}_t+\epsilon}}\bold{m}_t$$

with $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-6}$, and $\mu_t=\mu_0=10^{-4}$.

If we initialize $\bold{s}_0=\bold{m}_0=\bold{0}$, we bias our updates towards small gradient values. Instead, we can correct the bias using:

$$
\begin{align}
\hat{\bold{m}}_t &=\frac{\bold{m}_t}{1-\beta_1^t} \\
\hat{\bold{s}}_t &=\frac{\bold{s}_t}{1-\beta_2^t}
\end{align}
$$
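Putting the pieces together, a minimal sketch of one Adam step with bias correction, using the hyperparameter values above:

```python
import numpy as np

def adam_step(theta, g, m, s, t, mu0=1e-4, beta1=0.9, beta2=0.999, eps=1e-6):
    """One Adam step; t is the 1-indexed iteration counter."""
    m = beta1 * m + (1 - beta1) * g       # EWMA of gradients (momentum)
    s = beta2 * s + (1 - beta2) * g ** 2  # EWMA of squared gradients
    m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
    s_hat = s / (1 - beta2 ** t)          # bias-corrected second moment
    theta = theta - mu0 * m_hat / np.sqrt(s_hat + eps)
    return theta, m, s
```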


8.4.6.4 Issues with adaptive learning rates

Adaptive learning rates $\mu_0 M_t^{-1}$ still rely on choosing $\mu_0$. Moreover, since the EWMA is computed from noisy gradients, the effective learning rate can sometimes increase, and the optimization might fail to converge, even for convex problems.

To mitigate this, AMSGrad, Padam, and Yogi have been developed; the latter replaces:

$$\bold{s}_t=\beta_2 \bold{s}_{t-1}+(1-\beta_2)\bold{g}^2_t=\bold{s}_{t-1}+(1-\beta_2)(\bold{g}_t^2-\bold{s}_{t-1})$$

with:

$$\bold{s}_t = \bold{s}_{t-1} + (1-\beta_2)\bold{g}_t^2 \odot \mathrm{sign}(\bold{g}_t^2-\bold{s}_{t-1})$$

Because Adam updates $\bold{s}_t$ with a multiplicative EWMA, past gradient information decays fairly quickly. In sparse settings, where gradients are rarely non-zero, Adam therefore loses this information quickly.

In contrast, Yogi uses an additive update for $\bold{s}_t$.
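To make the contrast concrete, here is a sketch of the two second-moment updates side by side:

```python
import numpy as np

def second_moment_adam(s, g, beta2=0.999):
    """Adam / RMSProp: multiplicative EWMA of g^2."""
    return s + (1 - beta2) * (g ** 2 - s)

def second_moment_yogi(s, g, beta2=0.999):
    """Yogi: additive update; the change is at most (1 - beta2) * g^2 per step."""
    return s + (1 - beta2) * g ** 2 * np.sign(g ** 2 - s)
```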

8.4.6.5 Non-diagonal preconditioning matrices

The preceding methods do not address ill-conditioning caused by correlations between parameters, and hence do not always provide the speed boost over SGD that we might expect.

One way to get faster convergence is to use full-matrix AdaGrad:

$$M_t=[(G_t G_t^\top)^{1/2}+\epsilon I_D]^{-1}$$

where $G_t=[\bold{g}_t,\dots,\bold{g}_1]\in \mathbb{R}^{D\times t}$.

Unfortunately, $M_t\in\mathbb{R}^{D\times D}$ is expensive to store and invert.

The Shampoo algorithm makes a block-diagonal approximation to $M_t$, with one block per layer of the model, and then exploits the Kronecker product structure to invert it efficiently.


Here $\tilde{G}_t \times_i (H^i_t)^{-1/2k}$ denotes the tensor-matrix product used in the Shampoo update. This can be expressed as:

 
```python
# Excerpt of the Shampoo update for one parameter tensor `grad`, with optimizer
# `state`, damping `epsilon`, and `order = grad.dim()` (context from the full code).
import torch

if len(state) == 0:
    # one (dim x dim) preconditioner per dimension of the gradient tensor
    for dim_id, dim in enumerate(grad.size()):
        state[f"precond_{dim_id}"] = epsilon * torch.eye(dim)

for dim_id, dim in enumerate(grad.size()):
    precond = state[f"precond_{dim_id}"]

    # matricize the gradient along dimension dim_id: shape (dim, -1)
    grad = grad.transpose(0, dim_id).contiguous()
    transposed_size = grad.size()
    grad = grad.view(dim, -1)

    # accumulate second-moment statistics: H_t += G G^T  (dim, dim)
    precond += grad @ grad.T
    inv_precond = _matrix_power(precond, -1 / order)  # H^{-1/2k}, computed via SVD

    # precondition along this dimension (tensor-matrix product), then restore layout
    grad = inv_precond @ grad                         # (dim, -1)
    grad = grad.view(transposed_size).transpose(0, dim_id).contiguous()
```

See the full code for details.

However, Shampoo is impractical for large applications and is difficult to parallelize. Ani+20 introduces the following changes:

  • design a pipelined version of the optimizer, exploiting the heterogeneity of the coupled CPU-accelerator architecture
  • extend Shampoo to train very large layers (such as embedding layers)
  • replace the expensive SVD used for the preconditioners with iterative methods for computing roots of PSD matrices