
4.4 Other estimation methods

4.4.1 The method of moments (MOM)

MOM is a simpler approach than computing the MLE: we solve $K$ equations, where $K$ is the number of parameters.

  • The theoretical moments are given by $\mu_k=\mathbb{E}[Y^k]$
  • The empirical moments are given by $\hat{\mu}_k=\frac{1}{N}\sum_n y_n^k$

We solve $\mu_k=\hat{\mu}_k$ for every $k$.

Example with the univariate Gaussian:

$$\mu_1=\mu=\bar{y}$$

$$\mu_2=\sigma^2+\mu^2=s^2 \quad (\text{so } \sigma^2=s^2-\mu^2)$$

so:

$$\hat{\mu}=\bar{y}$$

$$\hat{\sigma}^2=s^2-\bar{y}^2$$

So in this case, MOM recovers the same estimates as the MLE.
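
A minimal NumPy sketch of this (the synthetic data and parameters below are arbitrary, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=10_000)  # arbitrary ground truth

# Empirical moments
mu1_hat = np.mean(y)     # \hat{\mu}_1
mu2_hat = np.mean(y**2)  # \hat{\mu}_2 = s^2

# Invert the moment equations
mu_mom = mu1_hat
sigma2_mom = mu2_hat - mu1_hat**2

# For the Gaussian these match the MLE (np.var uses 1/N, i.e. the MLE)
assert np.isclose(mu_mom, np.mean(y))
assert np.isclose(sigma2_mom, np.var(y))
```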

Example with the uniform distribution:

$$p(y|\theta)=\frac{1}{\theta_2-\theta_1}\,\mathbb{I}(\theta_1\leq y\leq\theta_2)$$

$$\mu_1=\mathbb{E}[Y]=\frac{\theta_1+\theta_2}{2}$$

$$\mu_2=\mathbb{E}[Y^2]=\frac{\theta_1^2+\theta_1\theta_2+\theta_2^2}{3}$$

Inverting these equations (using $\mu_2-\mu_1^2=\frac{(\theta_2-\theta_1)^2}{12}$) gives:

$$\hat{\theta}_1=\hat{\mu}_1-\sqrt{3(\hat{\mu}_2-\hat{\mu}_1^2)},\qquad \hat{\theta}_2=\hat{\mu}_1+\sqrt{3(\hat{\mu}_2-\hat{\mu}_1^2)}$$

These estimates can be invalid: the resulting interval sometimes fails to contain all the observed data points, assigning them zero likelihood.
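
A small NumPy sketch of this failure mode (the toy dataset below is contrived to trigger it):

```python
import numpy as np

def mom_uniform(y):
    """Method-of-moments estimate of (theta1, theta2) for a uniform."""
    mu1, mu2 = np.mean(y), np.mean(y**2)
    half_width = np.sqrt(3.0 * (mu2 - mu1**2))
    return mu1 - half_width, mu1 + half_width

# Contrived data: a cluster at 0.5 plus a single point at 1.0
y = np.array([0.5] * 9 + [1.0])
theta1, theta2 = mom_uniform(y)
print(theta1, theta2)     # ~ (0.29, 0.81)
print(y.max() <= theta2)  # False: the estimated support excludes y = 1.0
```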

To compute the MLE, we sort the data values. The likelihood is:

$$p(\mathcal{D}|\theta_1,\theta_2)=\Big(\frac{1}{\theta_2-\theta_1}\Big)^N\,\mathbb{I}(\theta_1\leq y_1)\,\mathbb{I}(y_N\leq\theta_2)$$

where $y_1$ and $y_N$ are the smallest and largest of the sorted values.

Then, if we set $\theta=\theta_2-\theta_1$:

$$\frac{d}{d\theta}\mathrm{NLL}(\theta)=\frac{N}{\theta}>0$$

so the NLL grows with the width of the interval, and is minimized by the tightest interval containing the data: $\hat{\theta}_1=y_1$ and $\hat{\theta}_2=y_N$, as one would expect.
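
For contrast with the MOM sketch above, the MLE always covers the data:

```python
import numpy as np

def mle_uniform(y):
    """MLE for a uniform: the sample minimum and maximum."""
    return np.min(y), np.max(y)

y = np.array([0.5] * 9 + [1.0])  # same toy data as above
print(mle_uniform(y))  # (0.5, 1.0): the tightest interval covering the data
```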

4.4.2 Online (recursive) estimation

When data arrives sequentially, we perform online learning.

Let $\hat{\theta}_{t-1}$ be our estimate (e.g. the MLE) given $\mathcal{D}_{1:t-1}$. To ensure the algorithm takes constant time per update, we need an update of the form:

$$\hat{\theta}_t=f(\hat{\theta}_{t-1},y_t)$$

Example for the mean of a Gaussian:

$$\hat{\mu}_t=\frac{1}{t}\sum_{n=1}^t y_n=\frac{1}{t}\Big((t-1)\hat{\mu}_{t-1}+y_t\Big)=\hat{\mu}_{t-1}+\frac{1}{t}(y_t-\hat{\mu}_{t-1})$$

This is a moving average; the size of the correction diminishes over time. However, if the distribution is changing, we might want to give more weight to recent data points.
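
A minimal sketch of this constant-time recursive update:

```python
import numpy as np

def running_mean(ys):
    """Recursive mean: mu_t = mu_{t-1} + (y_t - mu_{t-1}) / t."""
    mu = 0.0
    for t, y in enumerate(ys, start=1):
        mu += (y - mu) / t
        yield mu

ys = np.random.default_rng(1).normal(size=100)
*_, final = running_mean(ys)
assert np.isclose(final, np.mean(ys))  # matches the batch mean
```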

The exponentially-weighted moving average (EWMA) addresses this:

$$\hat{\mu}_t=\beta\hat{\mu}_{t-1}+(1-\beta)y_t$$

with $0<\beta<1$.

The contribution of a data point seen $k$ steps in the past is $\beta^k(1-\beta)$.

Since the initial estimate starts at $\hat{\mu}_0=0$, there is an initial bias toward zero, corrected by scaling:

$$\tilde{\mu}_t=\frac{\hat{\mu}_t}{1-\beta^t}$$
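
A sketch of the bias-corrected EWMA (the choice $\beta=0.9$ is arbitrary):

```python
import numpy as np

def ewma(ys, beta=0.9):
    """Exponentially-weighted moving average with bias correction."""
    mu = 0.0
    for t, y in enumerate(ys, start=1):
        mu = beta * mu + (1 - beta) * y
        yield mu / (1 - beta**t)  # corrects the bias from mu_0 = 0

# On a constant signal, the corrected estimate is exact from the first step
ys = np.full(10, 5.0)
print(list(ewma(ys)))  # [5.0, 5.0, ..., 5.0]
```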
