10.4 Robust Logistic Regression

When the data contain outliers, for example due to label noise, robust logistic regression helps limit their adverse effect on the model.

10.4.1 Mixture model for the likelihood

One of the simplest ways to achieve robust logistic regression is to use a mixture likelihood:

$$p(y|\bold{x})=\pi\,\mathrm{Ber}(y|0.5)+(1-\pi)\,\mathrm{Ber}(y|\sigma(\bold{w}^\top \bold{x}))$$

This models each label as being generated uniformly at random with probability $\pi$, and otherwise by the regular conditional model.

This approach also applies to DNNs, and the model can be fit with standard methods such as SGD or with Bayesian inference methods such as MCMC.
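As a rough illustration of the mixture likelihood, here is a minimal sketch of the corresponding negative log-likelihood in plain Python. The function names (`sigmoid`, `robust_nll`), the toy data, and the value `pi=0.1` are all made up for this example; in practice $\pi$ would be tuned or learned.

```python
import math

def sigmoid(a):
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def robust_nll(w, X, y, pi):
    """Average negative log-likelihood under the mixture model.

    w: weight vector, X: list of feature vectors, y: labels in {0, 1},
    pi: assumed probability that a label was drawn uniformly at random.
    """
    total = 0.0
    for x_n, y_n in zip(X, y):
        mu = sigmoid(sum(w_d * x_d for w_d, x_d in zip(w, x_n)))
        p_model = mu if y_n == 1 else 1.0 - mu
        # Mixture of a uniform Bernoulli and the usual logistic model.
        total -= math.log(pi * 0.5 + (1.0 - pi) * p_model)
    return total / len(y)

# Toy data (hypothetical): the last point is deliberately mislabeled.
X = [[1.0, 2.0], [1.0, 1.5], [1.0, -2.0], [1.0, 8.0]]
y = [1, 1, 0, 0]
w = [0.0, 3.0]

standard = robust_nll(w, X, y, pi=0.0)  # pi = 0 is ordinary logistic regression
robust = robust_nll(w, X, y, pi=0.1)
print(standard, robust)
```

With $\pi>0$, the loss of any single example can never exceed $-\log(\pi/2)$, so the mislabeled point contributes far less to the objective than it does under the ordinary model.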

10.4.2 Bi-tempered loss

Mislabeled examples that lie far from the decision boundary have an undue adverse effect on the model when the loss is convex, because a convex loss keeps growing with the size of the error.

Tempered loss

This can be overcome by replacing the cross-entropy loss with a “tempered” version, which uses a temperature parameter $0\leq t_1<1$ to ensure that the loss from outliers is bounded.

The standard cross-entropy loss is:

$$\mathcal{L}(\bold{y},\hat{\bold{y}})=\mathbb{H}(\bold{y},\hat{\bold{y}})=-\sum_c y_c\log \hat{y}_c$$

The tempered cross-entropy is:

$$\mathcal{L}(\bold{y}, \hat{\bold{y}})=\sum_c \left[y_c (\log_{t_1} y_c-\log_{t_1} \hat{y}_c) - \frac{1}{2-t_1}\left(y_c^{2-t_1}-\hat{y}_c^{2-t_1}\right)\right]$$

When all the mass of $\bold{y}$ is on class $c$ (one-hot encoding), this simplifies to:

$$\mathcal{L}(c, \hat{\bold{y}})=-\log_{t_1} \hat{y}_c - \frac{1}{2-t_1}\left(1-\sum_{c'=1}^C \hat{y}_{c'}^{2-t_1}\right)$$

Here $\log_t$ is the tempered logarithm:

$$\log_t(x)\triangleq \frac{1}{1-t}(x^{1-t}-1)$$

which is monotonically increasing and concave, and reduces to the standard logarithm in the limit $t\rightarrow 1$.

The tempered logarithm is bounded below by $-\frac{1}{1-t}$ for $0\leq t<1$, so the tempered cross-entropy is bounded above.
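To make the boundedness concrete, the following sketch evaluates the one-hot tempered cross-entropy in plain Python. The names `log_t` and `tempered_ce`, the example probabilities, and the choice `t1=0.5` are illustrative assumptions rather than anything prescribed by the text.

```python
import math

def log_t(x, t):
    """Tempered logarithm; falls back to math.log when t == 1."""
    if t == 1.0:
        return math.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

def tempered_ce(c, y_hat, t1):
    """Tempered cross-entropy for a one-hot label whose true class index is c."""
    correction = (1.0 - sum(p ** (2.0 - t1) for p in y_hat)) / (2.0 - t1)
    return -log_t(y_hat[c], t1) - correction

# A confidently wrong prediction: the true class (index 0) gets almost no mass.
y_hat = [1e-8, 1.0 - 1e-8]

standard = tempered_ce(0, y_hat, t1=1.0)  # t1 = 1 is the ordinary cross-entropy
tempered = tempered_ce(0, y_hat, t1=0.5)
bound = 1.0 / (1.0 - 0.5)                 # upper bound 1 / (1 - t1)
print(standard, tempered, bound)
```

The ordinary cross-entropy grows without limit as the predicted probability of the true class shrinks, whereas the tempered version stays below $\frac{1}{1-t_1}$.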

Transfer function

To handle mislabeled observations that lie near the decision boundary, we need a transfer function $\mathbb{R}^C\rightarrow[0,1]^C$ with heavier tails than the softmax.

The standard softmax is:

$$\hat{y}_c=\frac{e^{a_c}}{\sum_{c'=1}^C e^{a_{c'}}}=\exp\left[a_c-\log \sum_{c'=1}^C e^{a_{c'}}\right]=\exp[a_c-\mathrm{LSE}(\bold{a})]$$

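The tempered softmax, with $t_2>1>t_1$, is:

$$\hat{y}_c=\exp_{t_2}(a_c-\lambda_{t_2}(\bold{a}))$$

where $\exp_t$ is the tempered exponential, with $[z]_+=\max(z,0)$:

$$\exp_t(x)\triangleq [1+(1-t)x]_+^{1/(1-t)}$$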

As $t\rightarrow 1$, $\exp_t$ reduces to the standard exponential, so we recover the standard softmax.

Finally, we need to compute the normalizer $\lambda_{t_2}(\bold{a})$, which must satisfy:

$$\sum_{c=1}^C \exp_{t_2}(a_c-\lambda_{t_2}(\bold{a}))=1$$
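One way to see what this normalization involves is to solve for $\lambda_{t_2}(\bold{a})$ numerically. The sketch below uses plain bisection, a generic root finder chosen here for simplicity and not necessarily the procedure used in the original reference. The function names, the iteration count, and the example logits are assumptions made for illustration.

```python
import math

def exp_t(x, t):
    """Tempered exponential; falls back to math.exp when t == 1."""
    if t == 1.0:
        return math.exp(x)
    base = 1.0 + (1.0 - t) * x
    if base <= 0.0:
        # [.]_+ clips the base at zero: the value is 0 for t < 1 and diverges for t > 1.
        return 0.0 if t < 1.0 else math.inf
    return base ** (1.0 / (1.0 - t))

def tempered_softmax(a, t2, iters=100):
    """Tempered softmax for t2 > 1, with lambda found by bisection."""
    if t2 <= 1.0:
        raise ValueError("this sketch assumes t2 > 1")
    C = len(a)
    # At lambda = max(a) the sum is >= 1; at the upper bracket every term is
    # at most 1/C, so the sum is <= 1. The sum decreases as lambda grows.
    lo = max(a)
    hi = lo + (C ** (t2 - 1.0) - 1.0) / (t2 - 1.0)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(exp_t(a_c - mid, t2) for a_c in a) > 1.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return [exp_t(a_c - lam, t2) for a_c in a]

def softmax(a):
    """Standard softmax, shifted by max(a) for numerical stability."""
    m = max(a)
    z = [math.exp(a_c - m) for a_c in a]
    s = sum(z)
    return [z_c / s for z_c in z]

a = [5.0, 0.0, -5.0]  # example logits (made up)
p_soft = softmax(a)
p_temp = tempered_softmax(a, t2=2.0)
print(p_soft, p_temp)
```

For the same logits, the tempered softmax assigns noticeably more probability to the low-scoring classes than the standard softmax does, which is the heavier-tail behavior described above.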

Combining the tempered loss with the tempered transfer function gives bi-tempered logistic regression.