Proba ML
10. Logistic Regression

10.2 Binary Logistic Regression

Binary logistic regression corresponds to the following model:

$$p(y|\bold{x},\theta)=\mathrm{Ber}(y|\sigma(w^\top \bold{x} + b))$$

where $\theta=(w,b)$ are the weights and bias. In other words:

$$p(y|\bold{x},\theta)=\frac{1}{1+e^{-(w^\top \bold{x}+b)}}$$

and we call $a$ the log-odds:

$$a=w^\top \bold{x} +b=\log \Big(\frac{p}{1-p}\Big)$$
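As a concrete illustration, here is a minimal numpy sketch (the weights, bias, and input are made-up values) showing how the linear score maps to a probability through the sigmoid, and how the log-odds are recovered from that probability:

```python
import numpy as np

def sigmoid(a):
    """Map log-odds a to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical weights, bias, and a single input x.
w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.0, 3.0])

a = w @ x + b                  # log-odds
p = sigmoid(a)                 # p(y=1 | x, theta)
print(p, np.log(p / (1 - p)))  # the second value recovers a
```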

10.2.1 Linear classifiers

If the loss for misclassifying each class is the same, the predicted class is given by:

$$\begin{align} f(\bold{x})&=\mathbb{I}\Big(p(y=1 | \bold{x},\theta) > p(y=0 | \bold{x},\theta)\Big) \\ &= \mathbb{I}\Big(\log \frac{p(y=1|\bold{x},\theta)}{p(y=0|\bold{x},\theta)}>0\Big) \\ &= \mathbb{I}(a>0) \end{align}$$

with:

$$f(\bold{x})=w^\top \bold{x}+b$$


For $\bold{x}$ and $\bold{x}_0$ on the decision plane, with $w$ the normal vector to that plane at $\bold{x}_0$, we have:

$$w^\top(\bold{x}-\bold{x}_0)=0 \Rightarrow w^\top \bold{x}+b=0$$

with $b=-w^\top \bold{x}_0$.

This defines the decision boundary between the two classes.

10.2.2 Nonlinear classifiers

We can often make a problem linearly separable by preprocessing the input.

Let:

  • $\phi(x_1,x_2)=[1,x_1^2,x_2^2]$
  • $w=[-R^2,1,1]$

Then:

$$w^\top \phi(\bold{x})=x_1^2+x_2^2-R^2 =0$$

which defines a circular decision boundary of radius $R$.


The resulting function is still linear in the parameters, which simplifies learning, and we gain expressive power by also learning the parameters of $\phi(\bold{x})$. We can likewise use a polynomial expansion up to degree $K$: as $K$ grows, the model becomes more complex and risks overfitting.
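As a sketch of this idea, the following uses scikit-learn's `PolynomialFeatures` and `LogisticRegression` on synthetic circular data; the dataset, degree, and regularization strength are illustrative choices, not from the text:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic data: label 1 inside a circle of radius 1, label 0 outside.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(int)

# Degree-2 expansion phi(x) makes the circular boundary linear in feature space.
clf = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression(C=10.0))
clf.fit(X, y)
print(clf.score(X, y))  # should be close to 1.0 on this nearly separable problem
```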

10.2.3 Maximum Likelihood Estimation

10.2.3.1 Objective function

$$\begin{align} \mathrm{NLL}(w)&=-\frac{1}{N}\log p(\mathcal{D}|w)=-\frac{1}{N} \log \prod_n \mathrm{Ber}(y_n|\mu_n) \\ &= -\frac{1}{N} \sum_n \big[y_n \log \mu_n + (1 - y_n) \log (1 - \mu_n)\big] \\ &= \frac{1}{N} \sum_n \mathbb{H}(y_n,\mu_n) \end{align}$$

where $\mu_n=\sigma(w^\top x_n)$ is the probability of class 1.
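A minimal numpy sketch of this objective (the clipping constant `eps` is an added numerical safeguard, not part of the formula):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll(w, X, y, eps=1e-12):
    """Average negative log-likelihood (binary cross-entropy)."""
    mu = sigmoid(X @ w)             # mu_n = p(y_n = 1 | x_n, w)
    mu = np.clip(mu, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y * np.log(mu) + (1 - y) * np.log(1 - mu))
```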

10.2.3.3 Deriving the gradient

$$\nabla_w \mathrm{NLL}(w)=-\frac{1}{N}\sum_n \Big[y_n \frac{\nabla_w \mu_n}{\mu_n}-(1-y_n)\frac{\nabla_w \mu_n}{1-\mu_n}\Big]$$

Then, we compute:

$$\begin{align} \frac{\partial \mu_n}{\partial w_d}=\frac{\partial \mu_n} {\partial a_n} \frac{\partial a_n} {\partial w_d}&=\sigma(a_n)(1-\sigma(a_n))x_{nd}\\&=\mu_n(1-\mu_n)x_{nd} \end{align}$$

We plug it back into the previous equation:

$$\begin{align} \nabla_w \mathrm{NLL}(w)&=-\frac{1}{N}\sum_n \big[y_n (1-\mu_n)x_n-(1-y_n)\mu_n x_n\big] \\ &= -\frac{1}{N}\sum_n (y_n - \mu_n)x_n \\ &= \frac{1}{N}\big(1_N^\top(\mathrm{diag}(\mu-y)X)\big)^\top \end{align}$$
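The per-example form and the matrix form of the gradient agree; a small numpy sketch with made-up data can check this numerically:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll_grad(w, X, y):
    """Gradient of the average NLL: (1/N) X^T (mu - y)."""
    mu = sigmoid(X @ w)
    return X.T @ (mu - y) / len(y)

# Sanity check against the explicit sum over examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)
w = rng.normal(size=3)

mu = sigmoid(X @ w)
g_sum = -np.mean((y - mu)[:, None] * X, axis=0)
assert np.allclose(nll_grad(w, X, y), g_sum)
```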

10.2.3.4 Deriving the Hessian

Gradient-based optimizers will find a stationary point where $g(w)=\nabla_w\mathrm{NLL}(w)=0$; this can be either a global or a local optimum.

To be sure the optimum is global, we must show that the objective is convex, which we do by showing that the Hessian is positive semi-definite:

$$\mathrm{H}(w)= \nabla_w\nabla_w^\top \mathrm{NLL}(w)=\frac{1}{N}\sum_{n=1}^N (\mu_n(1-\mu_n)x_n)x_n^\top=\frac{1}{N}X^\top SX$$

with $S\triangleq \mathrm{diag}(\mu_1(1-\mu_1),\dots,\mu_N(1-\mu_N))$.

We see that $\mathrm{H}$ is positive definite, since for any non-zero vector $v$ we have:

$$v^\top X^\top S X v=(v^\top X^\top S^{1/2})(S^{1/2}Xv)=||S^{1/2}Xv||^2_2>0$$

This follows since $0<\mu_n<1$ for all $n$ because of the sigmoid function, and assuming $X$ has full column rank so that $Xv\neq 0$.

Consequently, the NLL is strictly convex. However, in practice, values of $\mu_n$ close to 0 or 1 can make the Hessian close to singular. We can avoid this by using $\ell_2$ regularization.
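A short numpy sketch of the Hessian $\frac{1}{N}X^\top S X$, with a numerical check that its eigenvalues are non-negative (the data are made up for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll_hessian(w, X):
    """Hessian of the average NLL: (1/N) X^T S X with S = diag(mu_n (1 - mu_n))."""
    mu = sigmoid(X @ w)
    s = mu * (1 - mu)                       # diagonal of S
    return (X * s[:, None]).T @ X / X.shape[0]

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
w = rng.normal(size=4)
H = nll_hessian(w, X)
print(np.linalg.eigvalsh(H) >= -1e-10)      # all eigenvalues (numerically) non-negative
```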

10.2.4 Stochastic gradient descent

We can use SGD to solve the optimization problem:

$$\hat{w}=\argmin_w \mathcal{L}(w)$$

where the loss here is the NLL.

If we use a mini-batch of size 1, we have:

$$\begin{align} w_{t+1}&=w_t-\eta_t \nabla_w\mathrm{NLL}(w_t) \\ &= w_t-\eta_t (\mu_n-y_n)x_n \end{align}$$

Since we know the objective is convex we know this will converge to the global optimum, provided we decay the learning rate appropriately.
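A minimal SGD loop implementing this update; the decay schedule and hyperparameters are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sgd_logreg(X, y, n_epochs=100, eta0=0.1):
    """Fit w by SGD with mini-batches of size 1 and a decaying step size."""
    N, D = X.shape
    w = np.zeros(D)
    t = 0
    for epoch in range(n_epochs):
        for n in np.random.permutation(N):
            eta = eta0 / (1 + 1e-3 * t)          # simple decay schedule
            mu_n = sigmoid(X[n] @ w)
            w -= eta * (mu_n - y[n]) * X[n]      # single-example gradient step
            t += 1
    return w
```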

10.2.5 Perceptron Algorithm

A perceptron is a binary classifier:

$$f(x_n,\theta)=\mathbb{I}(w^\top x_n+b >0)$$

This can be seen as a modified logistic regression, where the sigmoid is replaced by a Heaviside function. Since the Heaviside is not differentiable, we can’t use gradient-based optimization to fit the model.

Rosenblatt proposed the perceptron learning algorithm, in which we only make updates for wrong predictions (the soft probabilities $\mu_n$ are replaced by the hard labels $\hat{y}_n$ in the SGD update):

$$w_{t+1}=w_t-\eta_t (\hat{y}_n-y_n)x_n$$

  • If $\hat{y}_n=0$ and $y_n=1$, the update is $w_t+\eta_t x_n$
  • If $\hat{y}_n=1$ and $y_n=0$, the update is $w_t-\eta_t x_n$
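A sketch of the perceptron training loop under these rules, assuming labels $y_n\in\{0,1\}$ (the function and parameter names are illustrative):

```python
import numpy as np

def perceptron(X, y, n_epochs=50, eta=1.0):
    """Rosenblatt's update: only change (w, b) on misclassified examples."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for epoch in range(n_epochs):
        for n in range(N):
            y_hat = float(X[n] @ w + b > 0)
            if y_hat != y[n]:                    # wrong prediction: update
                w -= eta * (y_hat - y[n]) * X[n]
                b -= eta * (y_hat - y[n])
            # correct predictions leave (w, b) unchanged
    return w, b
```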

The advantage of the perceptron is that we don't need to compute probabilities, which is useful when the label space is vast.

The disadvantage is that this method will only converge if the data is linearly separable, whereas SGD for minimizing the NLL will always converge to the globally optimal MLE.

In section 13.2, we will generalize perceptrons to nonlinear functions, enhancing their usefulness.

10.2.6 Iteratively reweighted least squares (IRLS)

Being a first-order method, SGD can be slow, especially when some directions point steeply downhill, while others have a shallow gradient.

In this situation, it can be much faster to use second-order methods that take the curvature of the space into account. We focus on the full batch setting (with NN small) since it is harder to make second-order methods work in the stochastic setting.

The Newton update is $w_{t+1}=w_t-\eta_t\mathrm{H}_t^{-1}g_t$, where the Hessian is assumed to be positive-definite to ensure the update is well-defined, with:

$$\mathrm{H}_t= \nabla^2\mathcal{L}(w_t)=\mathrm{H}(w_t)$$

If the Hessian is exact, the learning rate can be set to $\eta_t=1$:

$$\begin{align} w_{t+1}&=w_t- \mathrm{H}_t ^{-1}g_t \\ &= w_t+(X^\top S_tX)^{-1}X^\top(y-\mu_t) \\ &= (X^\top S_t X)^{-1}X^\top S_t z_t \end{align}$$

with:

$$z_t\triangleq X w_t+S^{-1}_t(y-\mu_t)$$

and:

$$S_t=\mathrm{diag}(\mu_{t,n}(1-\mu_{t,n}))$$

Since $S_t$ is diagonal, we can rewrite the working responses element-wise:

$$z_{t,n}=w_{t}^\top x_n+\frac{y_n-\mu_{t,n}}{\mu_{t,n}(1-\mu_{t,n})}$$

The update $w_{t+1}$ is the minimizer of the weighted least squares objective:

$$\sum_{n=1}^NS_{t,n}(z_{t,n}-w^\top x_n)^2$$
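A compact numpy sketch of the IRLS iteration derived above; the small `jitter` term is an added assumption to keep $X^\top S_t X$ invertible when some $\mu_n$ are close to 0 or 1:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logreg(X, y, n_iters=20, jitter=1e-8):
    """Newton / IRLS updates: w <- (X^T S X)^{-1} X^T S z."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        mu = sigmoid(X @ w)
        s = mu * (1 - mu)                              # diagonal of S_t
        z = X @ w + (y - mu) / np.maximum(s, jitter)   # working responses z_t
        A = (X * s[:, None]).T @ X + jitter * np.eye(D)
        w = np.linalg.solve(A, (X * s[:, None]).T @ z)
    return w
```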


Fisher scoring is the same as IRLS except we replace the Hessian with its expectation i.e. we use the Fisher information matrix. Since the Fisher information matrix is independent of the data, it can be precomputed. This can be faster for problems with many parameters.

10.2.7 MAP estimation

Logistic regression can overfit when the number of parameters is large compared to the number of samples. For the decision boundary to become very wiggly, the weight values have to be large (up to $\pm 50$, compared to around $0.5$ when $K=1$).

One way to reduce overfitting is to prevent the weights from becoming so large, by performing MAP estimation with a zero-mean Gaussian prior:

$$p(w)=\mathcal{N}(w|0,C\mathrm{I})$$

The new objective, gradient, and Hessian take the form:

$$\begin{align} \mathrm{PNLL}(w)&=\mathrm{NLL}(w)+\lambda w^\top w \\ \nabla_w\mathrm{PNLL}(w)&=g(w)+2\lambda w \\ \nabla_w^2\mathrm{PNLL}(w) &=\mathrm{H}(w)+2\lambda \mathrm{I} \end{align}$$
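A minimal sketch of the penalized objective and its gradient, reusing the NLL pieces from above ($\lambda$ is an illustrative hyperparameter):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def pnll(w, X, y, lam, eps=1e-12):
    """Penalized NLL: NLL(w) + lambda * ||w||^2."""
    mu = np.clip(sigmoid(X @ w), eps, 1 - eps)
    nll = -np.mean(y * np.log(mu) + (1 - y) * np.log(1 - mu))
    return nll + lam * w @ w

def pnll_grad(w, X, y, lam):
    """Gradient: (1/N) X^T (mu - y) + 2 * lambda * w."""
    mu = sigmoid(X @ w)
    return X.T @ (mu - y) / len(y) + 2 * lam * w
```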

10.2.8 Standardization

Using the isotropic prior $\mathcal{N}(w|0,C\mathrm{I})$ from the previous section assumes that all weights have the same magnitude, which in turn assumes that all features have comparable magnitudes.

To enforce this assumption, we can standardize the inputs:

$$\begin{align} \mathrm{standardize}(x_{n,d})&=\frac{x_{n,d}-\hat{\mu}_d}{\hat{\sigma}_d}\\ \hat{\mu}_d&=\frac{1}{N}\sum_{n=1}^N x_{n,d}\\ \hat{\sigma}^2_d &= \frac{1}{N}\sum_{n=1}^N(x_{n,d}-\hat{\mu}_d)^2 \end{align}$$

An alternative is min-max scaling, ensuring that all inputs lie within the interval $[0, 1]$.
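A short numpy sketch of both transformations; computing the statistics on the training set only (and reusing them at test time) is an assumption worth keeping in mind:

```python
import numpy as np

def standardize(X, eps=1e-12):
    """Zero-mean, unit-variance features, column by column."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / (sigma + eps)

def min_max_scale(X, eps=1e-12):
    """Rescale each feature to lie in [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo + eps)
```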