Proba ML
8. Optimization
8.5 Constrained Optimization

8.5 Constrained Optimization

We consider the following problem:

θ=arg minθCL(θ)\theta^*=\argmin_{\theta \in \mathcal{C} }\mathcal{L}(\theta)

where the feasible set is:

C={θRD:hi(θ)=0,iE, gj(θ)0,jI}\mathcal{C}= \{\theta \in \mathbb{R}^D:h_i(\theta)=0, i \in \mathcal{E},\space g_j(\theta)\leq 0,j\in \mathcal{I}\}

with E\mathcal{E} is the set of equality constraints and I\mathcal{I} the set of inequality constraints.

8.5.1 Lagrange Multipliers

Let’s have a single constrained h(θ)=0h(\theta)=0.

We know that the gradient of the loss and the gradient of the constraint are (anti-?)parallel at some point θ\theta^* minimizing L(θ)\mathcal{L}(\theta).

  • proof

    If we move by ϵ\epsilon on the constraint surface we have:

    h(θ+ϵ)h(θ)+ϵh(θ)h(\theta+\epsilon) \approx h(\theta)+\epsilon^\top \nabla h(\theta)

    As we have h(θ+ϵ)=h(θ)=0h(\theta+\epsilon)=h(\theta)=0, we must have ϵh(θ)\epsilon \perp \nabla h(\theta).

    We seek a point θ\theta^* on the constrained surface minimizing L(θ)\mathcal{L}(\theta). L(θ)\nabla \mathcal{L}(\theta) must also be orthogonal to the constraint surface, otherwise we could decrease the objective by just moving on the surface.


h(θ)=λL(θ)h(\theta^*)=\lambda \mathcal{L}(\theta^*)

for some λR\lambda\in \mathbb{R}.

This leads to the objective:

L(θ,λ)L(θ)+λh(θ)L(\theta,\lambda) \triangleq \mathcal{L}(\theta)+\lambda h(\theta)

At a stationary point, we have:

θ,λL(θ,λ)=0    λθh(θ)=L(θ),h(θ)=0\nabla_{\theta,\lambda}L(\theta,\lambda)=0 \iff \lambda\nabla_\theta h(\theta)=\nabla \mathcal{L(\theta)},h(\theta)=0

This is called a critical point.

If we have m>1m>1 constraints, we can add them:

L(θ,λ)=L(θ)+i=1mλihi(θ)L(\theta,\lambda)=\mathcal{L}(\theta)+\sum_{i=1}^m \lambda_i h_i(\theta)

We end up with D+mD+m equations for D+mD+m variables and can use unconstrained optimization methods to solve this system.

Exemple

We minimize L(θ1,θ2)=θ12+θ22\mathcal{L}(\theta_1,\theta_2)=\theta_1^2+\theta_2^2, subject to the constraint θ1+θ2=1\theta_1+\theta_2=1.

The Lagrangian is:

L(θ1,θ2,λ)=θ12+θ22+λ(θ1+θ21)L(\theta_1,\theta_2,\lambda)= \theta_1^2+\theta_2^2+\lambda(\theta_1+\theta_2-1)

Therefore we need to solve:

L(θ1,θ2,λ)θ1=2θ1+λ=0L(θ1,θ2,λ)θ2=2θ2+λ=0L(θ1,θ2,λ)λ=θ1+θ21=0\begin{align} \frac{\partial L(\theta_1,\theta_2,\lambda)}{\partial \theta_1} &=2\theta_1+ \lambda =0 \\ \frac{\partial L(\theta_1,\theta_2,\lambda)}{\partial \theta_2} &=2\theta_2+ \lambda =0 \\ \frac{\partial L(\theta_1,\theta_2,\lambda)}{\partial \lambda} &= \theta_1+\theta_2-1=0 \end{align}

8.5.2 The KKT conditions

We introduce inequality constraints g(θ)0g(\theta)\leq0:

L(θ)=L(θ)+I(g(θ)>0)L(\theta)=\mathcal{L}(\theta)+\infin I(g(\theta)>0)

However, this is a discontinuous function that is hard to optimize.

Instead, we can consider μ0:\mu\geq0:

L(θ)=L(θ)+μg(θ)L(\theta)=\mathcal{L}(\theta)+\mu g(\theta)

and our optimization problem becomes:

minθmaxμ0L(θ,μ)\min_\theta \max_{\mu \geq 0} L(\theta,\mu)

The general Lagrangian becomes:

L(θ,λ,μ)=L(θ)+i=1mλihi(θ)+j=1nμjgj(θ)L(\theta,\lambda,\mu)=\mathcal{L}(\theta)+\sum_{i=1}^m \lambda_i h_i(\theta) + \sum_{j=1}^n \mu_j g_j(\theta)

When L\mathcal{L} and gg are convex, then all critical points must satisfy the Karush-Kuhn-Tucker (KKT) conditions:

  • All constraints are satisfied (feasibility)

    g(θ)0,h(θ)=0g(\theta)\leq 0,h(\theta)=0
  • The solution is a stationary point:

    L(θ)+i=1mλihi(θ)+j=1nμjgj(θ)=0\nabla \mathcal{L}(\theta^*)+\sum_{i=1}^m \lambda_i \nabla h_i(\theta) + \sum_{j=1}^n \mu_j \nabla g_j(\theta)=0
  • The penalty for the inequality constraints points in the right direction (dual feasibility):

    μ0\mu \geq 0
  • The Lagrange multiplies pick up any slack in the inequality constraints:

μg=0\mu \odot g=0

If the L\mathcal{L} is convex and the constraints define a convex set, then the KKT conditions are sufficient to ensure global optimality.

8.5.3 Linear Programming

We optimize a linear function subject to linear constraints, represented as:

minθcθs.t.Aθb, θ0\min_\theta c^\top \theta \quad s.t. \quad A\theta \leq b,\space\theta \geq 0

The feasible set defines a convex polytope: a convex space defined as an intersection of half-spaces.

Screen Shot 2023-05-30 at 10.42.59.png

8.5.3.1 The simplex algorithm

It can be shown that the optima of an LP occur at the vertex of the polytope defining the feasible set.

The simplex algorithm solves LPs by moving from vertex to vertex, seeking the edge that most improve the objective.

8.5.3.2 Applications

Applications range from business to machine learning, in robust linear regression settings. It is also useful for state estimation in graphical models.

8.5.4 Quadratic programming

We optimize a quadratic function:

minθ12θHθ+cθ+ctes.t.Aθb,Aeqθ=beq\min_\theta \frac{1}{2}\theta^\top H\theta +c^\top \theta +cte \quad s.t. \quad A\theta\leq b, A_{eq}\theta=b_{eq}

If HH is semi-positive, then this is a convex problem.

Example

Let’s consider:

L(θ)=(θ132)2+(θ218)2=12θHθ+θc+cte\mathcal{L}(\theta)=(\theta_1-\frac{3}{2})^2+(\theta_2-\frac{1}{8})^2=\frac{1}{2}\theta^\top H \theta + \theta^\top c +cte

with H=2IH=2I and c=(3,1/4)c=-(3,1/4)

Subject to θ1+θ21|\theta_1|+|\theta_2|\leq 1

We can decompose the above constraint into:

θ1+θ21θ1θ21θ1+θ21θ1θ21\begin{align} \theta_1+\theta_2 &\leq 1 \\ \theta_1-\theta_2 &\leq 1 \\ -\theta_1+\theta_2 &\leq 1 \\ -\theta_1-\theta_2 &\leq 1 \\ \end{align}

We can write as Aθb0A\theta -b \leq 0, with

A=[11111111]A=\begin{bmatrix} 1 & 1 \\ 1 & -1 \\ -1 & 1 \\ -1 & -1 \end{bmatrix}

This is now in the standard QP form.

Screen Shot 2023-05-24 at 09.51.53.png

We see that the 2 edges of the diamond to the left are inactive because the objective is in the opposite directions.

This means g3>0g_3>0 and g4>0g_4>0 and by complementarity μ3=μ4=0\mu_3^*=\mu_4^*=0

From the KKT conditions, we know that:

Hθ+c+Aμ=0H\theta +c+A^\top\mu=0

This gives us:

[2011021111001100][θ1θ2μ1μ2]=[31/411]\begin{bmatrix} 2 & 0 & 1 & 1\\ 0 & 2 & 1 & -1 \\ 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ \end{bmatrix} \begin{bmatrix} \theta_1 \\ \theta_2 \\ \mu_1 \\ \mu_2 \\ \end{bmatrix}= \begin{bmatrix} 3 \\ 1/4 \\ 1 \\ 1 \\ \end{bmatrix}

The solution is θ=(1,0),μ=(0.625,0.375,0,0)\theta=(1, 0)^\top, \mu=(0.625, 0.375,0,0)^\top

8.5.4.1 Applications

QP programming is used in the lasso method for sparse linear regression, which optimizes:

L(β)=XβY22+λβ1\mathcal{L}(\beta)=||X\beta - Y ||^2_2+\lambda ||\beta||_1

It can also be used for SVM.