Proba ML
8. Optimization
8.1 Intro

Optimization is at the core of machine learning: we try to find the values of a set of variables $\theta \in \Theta$ that minimize a scalar-valued loss function $\mathcal{L} : \Theta \rightarrow \mathbb{R}$:

$$\theta^* = \argmin_{\theta \in \Theta} \mathcal{L}(\theta)$$

with the parameter space $\Theta \subset \mathbb{R}^D$.

An algorithm minimizing this loss function is called a solver.
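A minimal sketch of a solver, using plain gradient descent on a hypothetical 1-d quadratic loss (real solvers are far more sophisticated):

```python
# Illustrative loss L(theta) = (theta - 3)^2, whose minimum is theta* = 3.

def loss(theta):
    return (theta - 3.0) ** 2

def grad(theta):
    return 2.0 * (theta - 3.0)

def solve(theta0, lr=0.1, steps=100):
    """A tiny gradient-descent solver: follow the negative gradient."""
    theta = theta0
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta

theta_star = solve(theta0=0.0)
# theta_star converges toward 3.0, where loss(theta_star) is near 0
```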

8.1.1 Local vs Global Optimization

Screen Shot 2023-04-06 at 23.02.06.png

The global optimum is the point satisfying this equation over all of $\Theta$, but finding it is generally computationally intractable.

Instead, we will try to find a local minimum. We say that $\theta^*$ is a local minimum if:

$$\exists\, \epsilon > 0,\ \forall\, \theta \in \Theta \ \text{s.t.}\ \|\theta - \theta^*\| < \epsilon,\quad \mathcal{L}(\theta^*) \leq \mathcal{L}(\theta)$$

A local minimum can be flat (multiple points of equal objective value) or strict.

A globally convergent algorithm will converge to a stationary point, although that point doesn’t have to be the global optimum.

Let $\bold{g}(\theta) = \nabla \mathcal{L}(\theta)$ be the gradient and $H(\theta) = \nabla^2 \mathcal{L}(\theta)$ the Hessian matrix.

  • Necessary condition: $\theta^*$ is a local minimum $\implies$ $\bold{g}^* = \bold{0}$ and $H^*$ is positive semi-definite.
  • Sufficient condition: $\bold{g}^* = \bold{0}$ and $H^*$ is positive definite $\implies$ $\theta^*$ is a local minimum.

The condition on $H^*$ is needed to rule out saddle points: there the gradient is also zero, but the eigenvalues of the Hessian are both positive and negative, so the Hessian is not positive semi-definite.
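The second-order conditions above can be checked numerically. A minimal sketch, using illustrative functions with known gradients and Hessians (not from the text), classifies a candidate point from the eigenvalues of its Hessian:

```python
import numpy as np

# f(x, y) = x^2 + y^2 has gradient (2x, 2y) and Hessian diag(2, 2):
# a local minimum at the origin.
# h(x, y) = x^2 - y^2 has gradient (2x, -2y) and Hessian diag(2, -2):
# a saddle point at the origin.

def classify(grad, hess):
    """Classify a candidate point from its gradient and Hessian."""
    if not np.allclose(grad, 0.0):
        return "not stationary"
    eigvals = np.linalg.eigvalsh(hess)
    if np.all(eigvals > 0):
        return "local minimum"     # sufficient condition holds
    if np.all(eigvals >= 0):
        return "possible minimum"  # only the necessary condition holds
    if np.all(eigvals < 0):
        return "local maximum"
    return "saddle point"          # mixed-sign eigenvalues

print(classify(np.zeros(2), np.diag([2.0, 2.0])))   # local minimum
print(classify(np.zeros(2), np.diag([2.0, -2.0])))  # saddle point
```

Note the "possible minimum" branch: a positive semi-definite (but singular) Hessian satisfies only the necessary condition, so higher-order information would be needed to decide.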

8.1.2 Constrained vs Unconstrained Optimization

In unconstrained optimization, we minimize the loss over any value in the parameter space $\Theta$.

We can represent inequality constraints as $g_j(\theta) \leq 0$ for $j \in \mathcal{I}$, and equality constraints as $h_k(\theta) = 0$ for $k \in \mathcal{E}$.

The feasible set is then:

$$\mathcal{C} = \{\theta : g_j(\theta) \leq 0 \ \text{for} \ j \in \mathcal{I},\ h_k(\theta) = 0 \ \text{for} \ k \in \mathcal{E}\} \subseteq \mathbb{R}^D$$

And the constrained optimization problem becomes:

$$\theta^* = \argmin_{\theta \in \mathcal{C}} \mathcal{L}(\theta)$$

Adding constraints can create new minima or maxima; with too many constraints, the feasible set may become empty.

Screen Shot 2023-04-07 at 09.22.26.png

A common strategy for solving constrained optimization problems is to combine the unconstrained objective with penalty terms that measure how much the constraints are violated. The Lagrangian is a special case of such a combined objective.
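A quadratic-penalty sketch of this idea, on an assumed toy problem (minimize $\theta^2$ subject to $\theta \geq 1$, i.e. $g(\theta) = 1 - \theta \leq 0$):

```python
def penalized_loss(theta, rho):
    """Unconstrained objective plus a penalty on constraint violation."""
    violation = max(0.0, 1.0 - theta)  # how much g(theta) <= 0 is violated
    return theta ** 2 + rho * violation ** 2

def minimize_1d(f, lo=-5.0, hi=5.0, n=100001):
    # brute-force grid search: good enough for a 1-d illustration
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return min(xs, key=f)

for rho in [1.0, 10.0, 1000.0]:
    theta = minimize_1d(lambda t: penalized_loss(t, rho))
    # as rho grows, theta approaches the constrained optimum theta* = 1
```

For this problem the penalized minimizer is $\theta = \rho/(1+\rho)$, so the penalty weight $\rho$ must grow for the solution to approach the feasible set.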

8.1.3 Convex vs Nonconvex Optimization

In convex optimization, every local minimum is a global minimum; therefore, many machine learning models are designed so that their training objective is convex.

$\mathcal{S}$ is a convex set if for any $x, x' \in \mathcal{S}$ and $\lambda \in [0, 1]$ we have:

$$\lambda x + (1-\lambda) x' \in \mathcal{S}$$

Screen Shot 2023-04-07 at 09.22.29.png

We say ff is a convex function if its epigraph (the set of points above the function) defines a convex set.

Screen Shot 2023-04-07 at 09.34.47.png

Equivalently, $f(\bold{x})$ is convex if it is defined on a convex set and, for any $\bold{x}, \bold{y} \in \mathcal{S}$ and $\lambda \in [0, 1]$:

$$f(\lambda \bold{x} + (1-\lambda)\bold{y}) \leq \lambda f(\bold{x}) + (1-\lambda) f(\bold{y})$$

Screen Shot 2023-04-07 at 09.35.00.png

$f(\bold{x})$ is concave if $-f(\bold{x})$ is convex.

Suppose $f : \mathbb{R}^n \rightarrow \mathbb{R}$ is twice differentiable; then:

  • $f$ is convex iff $H = \nabla^2 f(\bold{x})$ is positive semi-definite for all $\bold{x}$.
  • $f$ is strictly convex if $H = \nabla^2 f(\bold{x})$ is positive definite for all $\bold{x}$.

For example, consider the quadratic form $f(\bold{x}) = \bold{x}^\top A \bold{x}$. It is convex if $A$ is positive semi-definite, and strictly convex if $A$ is positive definite.

Screen Shot 2023-04-07 at 09.42.25.png
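The quadratic-form example can be checked through the eigenvalues of $A$ (the Hessian of $f$ is $A + A^\top = 2A$ for symmetric $A$); the matrices below are arbitrary illustrations:

```python
import numpy as np

A_pd  = np.array([[2.0, 0.0], [0.0, 1.0]])   # positive definite -> strictly convex
A_ind = np.array([[1.0, 0.0], [0.0, -1.0]])  # indefinite -> not convex

def is_convex_quadratic(A):
    # f(x) = x^T A x is convex iff A (equivalently the Hessian 2A)
    # is positive semi-definite, i.e. all eigenvalues are >= 0
    return bool(np.all(np.linalg.eigvalsh(A) >= 0))

print(is_convex_quadratic(A_pd))   # True
print(is_convex_quadratic(A_ind))  # False
```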

Furthermore, $f$ is strongly convex with parameter $m > 0$ if for any $\bold{x}, \bold{y}$ in its domain:

$$(\nabla f(\bold{x}) - \nabla f(\bold{y}))^\top (\bold{x} - \bold{y}) \geq m \|\bold{x} - \bold{y}\|^2$$

If $f$ is twice differentiable, then it is strongly convex with parameter $m$ iff $\nabla^2 f(\bold{x}) \succeq mI$ for all $\bold{x}$ in its domain, meaning that $\nabla^2 f(\bold{x}) - mI$ is positive semi-definite.

On the real line, this condition becomes $f''(x) \geq m$.

In conclusion:

  • $f$ is convex iff $f''(x) \geq 0$
  • $f$ is strictly convex if $f''(x) > 0$
  • $f$ is strongly convex with parameter $m$ iff $f''(x) \geq m > 0$
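These 1-d conditions can be probed with a finite-difference estimate of $f''$ on a grid; the two test functions are assumed illustrations ($x^2$ is strongly convex with $m = 2$, while $x^4$ is convex but not strongly convex, since $f''(0) = 0$):

```python
def second_derivative(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h ** 2

grid = [i / 10.0 for i in range(-20, 21)]  # points in [-2, 2]

# smallest observed curvature on the grid ~ the strong-convexity parameter m
m_quadratic = min(second_derivative(lambda x: x ** 2, x) for x in grid)
m_quartic = min(second_derivative(lambda x: x ** 4, x) for x in grid)
# m_quadratic stays near 2 (bounded away from zero);
# m_quartic is near 0, so no positive lower bound m exists
```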

8.1.4 Smooth vs Non-smooth Optimization

In smooth optimization, the objective function is continuously differentiable, and we quantify the smoothness using the Lipschitz constant. In the 1d case:

$$|f(x_2) - f(x_1)| \leq L\, |x_1 - x_2|$$

The function output can’t change by more than $L$ if we change the input by $1$ unit.

Screen Shot 2023-04-09 at 11.20.58.png

Non-smooth functions have points at which the gradient is not defined:

Screen Shot 2023-04-09 at 11.24.55.png

In some problems we can partition the objective into a part containing only smooth terms and a part with nonsmooth (rough) terms:

$$\mathcal{L}(\theta) = \mathcal{L}_s(\theta) + \mathcal{L}_r(\theta)$$

In machine learning, these composite objectives are often formed with $\mathcal{L}_s(\theta)$ the training loss and $\mathcal{L}_r(\theta)$ a regularizer, such as the $\ell_1$ norm of $\theta$.
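A lasso-style instance of this split, on synthetic data (the matrix, targets, and regularization weight below are assumed for illustration):

```python
import numpy as np

# L(theta) = ||X theta - y||^2 / (2n)   (smooth part, L_s)
#          + lam * ||theta||_1          (non-smooth regularizer, L_r)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta_true = np.array([1.5, 0.0, -2.0])
y = X @ theta_true  # noiseless targets, so L_s(theta_true) = 0

def smooth_part(theta):
    r = X @ theta - y
    return 0.5 * r @ r / len(y)

def rough_part(theta, lam=0.1):
    return lam * np.abs(theta).sum()

def composite(theta):
    return smooth_part(theta) + rough_part(theta)

print(composite(theta_true))  # only the l1 term: 0.1 * (1.5 + 2.0)
```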

8.1.4.1 Subgradients

We generalize the notion of derivative to functions that are not differentiable everywhere.

For $f : \mathbb{R}^n \rightarrow \mathbb{R}$, we say $\bold{g} \in \mathbb{R}^n$ is a subgradient of $f$ at $\bold{x} \in \mathrm{dom}(f)$ if for all $\bold{z} \in \mathrm{dom}(f)$:

$$f(\bold{z}) \geq f(\bold{x}) + \bold{g}^\top (\bold{z} - \bold{x})$$

Screen Shot 2023-04-09 at 11.34.59.png

For example, the subdifferential of $f(x) = |x|$ is:

$$\partial f(x) = \begin{cases} \{-1\} & \text{if } x < 0 \\ [-1, 1] & \text{if } x = 0 \\ \{1\} & \text{if } x > 0 \end{cases}$$

Screen Shot 2023-04-09 at 11.35.11.png
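An illustrative sketch of using these subgradients: subgradient descent on $f(x) = |x|$ with a diminishing step size (at $x = 0$ we pick $g = 0$, a valid element of $[-1, 1]$):

```python
def subgradient_abs(x):
    """One element of the subdifferential of |x| at x."""
    if x < 0:
        return -1.0
    if x > 0:
        return 1.0
    return 0.0  # any g in [-1, 1] is a valid choice at x = 0

x = 5.0
for t in range(1, 200):
    x -= (1.0 / t) * subgradient_abs(x)  # diminishing step size 1/t
# x oscillates toward the minimum at 0, with shrinking amplitude
```

The diminishing step size matters: with a fixed step, the iterates would keep bouncing across $0$ with constant amplitude instead of converging.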