4.2 Maximum likelihood estimation (MLE)
4.2.1 Definition
Pick the parameter estimate that assigns the highest probability to the training data:
$$\hat{\theta}_{MLE} \triangleq \arg\max_\theta \, p(\mathcal{D}|\theta)$$
With the i.i.d. assumption this becomes:
$$\hat{\theta}_{MLE} = \arg\max_\theta \prod_{n=1}^N p(y_n|\theta)$$
Since most optimization algorithms are designed to minimize cost functions, we equivalently minimize the negative log likelihood (NLL):
$$NLL(\theta) \triangleq -\log p(\mathcal{D}|\theta) = -\sum_{n=1}^N \log p(y_n|\theta)$$
with
$$\hat{\theta}_{MLE} = \arg\min_\theta NLL(\theta)$$
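A minimal sketch of this "MLE = minimize the NLL" pattern, using an exponential likelihood as a stand-in model (the model, data, and numbers are illustrative assumptions, not from the text); the numerical minimizer of the NLL matches the closed-form MLE $1/\bar{y}$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=1000)   # i.i.d. training data (toy)

def nll(lam):
    # NLL(lambda) = -sum_n log p(y_n | lambda) for p(y|lambda) = lambda * exp(-lambda * y)
    return -(len(y) * np.log(lam) - lam * y.sum())

res = minimize_scalar(nll, bounds=(1e-6, 10.0), method="bounded")
print(res.x, 1.0 / y.mean())   # numerical MLE vs closed-form MLE 1 / y_bar
```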
4.2.2 Justification for MLE
MLE can be viewed as a point approximation to the Bayesian posterior: the MAP estimate is
$$\hat{\theta}_{MAP} = \arg\max_\theta \log p(\theta|\mathcal{D}) = \arg\max_\theta \big[\log p(\mathcal{D}|\theta) + \log p(\theta)\big]$$
which reduces to the MLE when the prior $p(\theta)$ is uniform.
Another way to see the MLE is that it makes the resulting predictive distribution as close as possible to the empirical distribution of the data.
If we define the empirical distribution by
$$p_{\mathcal{D}}(y) \triangleq \frac{1}{N}\sum_{n=1}^N \delta(y - y_n)$$
then minimizing the KL divergence between the empirical and the estimated distribution is equivalent to minimizing the NLL, and therefore to computing the MLE:
$$D_{KL}\big(p_{\mathcal{D}} \,\|\, p(\cdot|\theta)\big) = -\mathbb{H}(p_{\mathcal{D}}) - \frac{1}{N}\sum_{n=1}^N \log p(y_n|\theta) = \text{const} + \frac{1}{N} NLL(\theta)$$
The same logic applies to the supervised setting, with the empirical joint distribution
$$p_{\mathcal{D}}(x, y) \triangleq \frac{1}{N}\sum_{n=1}^N \delta(x - x_n)\,\delta(y - y_n)$$
and the conditional NLL $-\sum_{n=1}^N \log p(y_n|x_n, \theta)$.
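A minimal numerical check of this equivalence, assuming a discrete toy dataset and an arbitrary candidate categorical model $q_\theta$ (both made up here): the KL to the empirical distribution and $NLL(\theta)/N$ differ only by the entropy of $p_{\mathcal{D}}$, so they share the same minimizer.

```python
import numpy as np

y = np.array([0, 0, 1, 2, 2, 2, 1, 0, 2, 2])          # toy observations
p_emp = np.bincount(y, minlength=3) / len(y)            # empirical distribution p_D

def nll(theta):                                         # -sum_n log q_theta(y_n)
    return -np.sum(np.log(theta[y]))

def kl(theta):                                          # KL(p_D || q_theta)
    return np.sum(p_emp * (np.log(p_emp) - np.log(theta)))

theta = np.array([0.2, 0.3, 0.5])                       # any candidate model
entropy = -np.sum(p_emp * np.log(p_emp))
print(kl(theta), nll(theta) / len(y) - entropy)         # identical values
```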
4.2.3 MLE for the Bernoulli distribution
Let $\theta$ be the probability of heads in a coin toss, so $p(y|\theta) = \theta^{\mathbb{I}(y=1)}(1-\theta)^{\mathbb{I}(y=0)}$.
The MLE can be found by minimizing the NLL:
$$NLL(\theta) = -\big[N_1 \log\theta + N_0 \log(1-\theta)\big]$$
where $N_1$ and $N_0$ count heads and tails. Setting $\frac{d}{d\theta}NLL(\theta) = 0$ gives
$$\hat{\theta}_{MLE} = \frac{N_1}{N_0 + N_1}$$
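A minimal sketch (with made-up tosses) checking that the closed form $N_1/(N_0+N_1)$ agrees with a direct numerical minimization of the NLL:

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])  # coin tosses, 1 = heads
N1 = y.sum()
N0 = len(y) - N1

def nll(theta):
    return -(N1 * np.log(theta) + N0 * np.log(1 - theta))

theta_closed = N1 / (N0 + N1)
theta_numeric = minimize_scalar(nll, bounds=(1e-6, 1 - 1e-6), method="bounded").x
print(theta_closed, theta_numeric)   # both ~0.7
```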
4.2.4 MLE for the categorical distribution
Let $\theta_c$ be the probability of category $c$ and $N_c$ the number of times $c$ occurs in the data, so that $NLL(\theta) = -\sum_c N_c \log\theta_c$.
To compute the MLE, we have to minimize the NLL subject to the constraint $\sum_c \theta_c = 1$, using the following Lagrangian:
$$\mathcal{L}(\theta, \lambda) \triangleq -\sum_c N_c \log\theta_c - \lambda\Big(1 - \sum_c \theta_c\Big)$$
We get the MLE by taking the derivative of $\mathcal{L}$ with respect to $\lambda$ and $\theta_c$ and setting it to zero, which yields
$$\hat{\theta}_c = \frac{N_c}{\sum_{c'} N_{c'}} = \frac{N_c}{N}$$
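A minimal sketch of the categorical MLE on toy data (categories and counts assumed for illustration): the Lagrangian solution is just the normalized counts $N_c / N$.

```python
import numpy as np

C = 4
y = np.array([0, 2, 2, 1, 3, 2, 0, 2, 1, 2])  # observations in {0, ..., C-1}
N_c = np.bincount(y, minlength=C)              # counts per category
theta_mle = N_c / N_c.sum()                    # theta_c = N_c / N
print(theta_mle)                               # [0.2 0.2 0.5 0.1]
```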
4.2.5 MLE for the univariate Gaussian
Suppose $y_n \sim \mathcal{N}(\mu, \sigma^2)$. We again estimate the parameters using the MLE, by minimizing
$$NLL(\mu, \sigma^2) = \frac{1}{2\sigma^2}\sum_{n=1}^N (y_n - \mu)^2 + \frac{N}{2}\log(2\pi\sigma^2)$$
We find the stationary point by solving $\frac{\partial NLL}{\partial \mu} = 0$ and $\frac{\partial NLL}{\partial \sigma^2} = 0$, which gives
$$\hat{\mu}_{MLE} = \frac{1}{N}\sum_{n=1}^N y_n = \bar{y}, \qquad \hat{\sigma}^2_{MLE} = \frac{1}{N}\sum_{n=1}^N (y_n - \bar{y})^2$$
$\sum_n y_n$ and $\sum_n y_n^2$ are the sufficient statistics of the data, since they are sufficient to compute the MLE.
The unbiased estimator for the variance (not the MLE) is:
$$\hat{\sigma}^2_{unb} = \frac{1}{N-1}\sum_{n=1}^N (y_n - \hat{\mu})^2$$
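A minimal sketch (on synthetic data, parameters chosen arbitrarily) contrasting the MLE variance (divide by $N$) with the unbiased estimator (divide by $N-1$), via numpy's `ddof` argument:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=2.0, size=20)

mu_mle = y.mean()
var_mle = np.mean((y - mu_mle) ** 2)       # same as np.var(y, ddof=0)
var_unbiased = np.var(y, ddof=1)           # divides by N - 1 instead of N
print(mu_mle, var_mle, var_unbiased)
```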
4.2.6 MLE for MVN
The log-likelihood is
$$\ell(\mu, \Sigma) = \frac{N}{2}\log|\Lambda| - \frac{1}{2}\sum_{n=1}^N (y_n - \mu)^T \Lambda (y_n - \mu) + \text{const}$$
with $\Lambda \triangleq \Sigma^{-1}$ the precision matrix.
Using the identity $\frac{\partial}{\partial a}(a^T A a) = (A + A^T)a$ with the substitution $z_n = y_n - \mu$:
$$\frac{\partial \ell}{\partial \mu} = \sum_{n=1}^N \Lambda (y_n - \mu) = 0$$
Hence
$$\hat{\mu} = \frac{1}{N}\sum_{n=1}^N y_n = \bar{y}$$
So the MLE of $\mu$ is just the empirical mean.
Using the trace trick ($x^T A x = \mathrm{tr}(x x^T A)$), the log-likelihood can be rewritten as
$$\ell(\Lambda) = \frac{N}{2}\log|\Lambda| - \frac{1}{2}\mathrm{tr}\big(S_{\bar{y}} \Lambda\big)$$
with $S_{\bar{y}} \triangleq \sum_{n=1}^N (y_n - \bar{y})(y_n - \bar{y})^T$ the scatter matrix centered on $\bar{y}$.
Setting the derivative with respect to $\Lambda$ to zero gives
$$\frac{\partial \ell}{\partial \Lambda} = \frac{N}{2}\Lambda^{-1} - \frac{1}{2}S_{\bar{y}} = 0 \implies \hat{\Sigma} = \frac{1}{N}S_{\bar{y}} = \frac{1}{N}\sum_{n=1}^N (y_n - \bar{y})(y_n - \bar{y})^T$$
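A minimal sketch of the MVN MLE on synthetic data (the mean and covariance used to generate it are arbitrary): the empirical mean plus the scatter matrix divided by $N$; `np.cov` with `bias=True` computes the same covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.multivariate_normal(mean=[0.0, 1.0], cov=[[2.0, 0.5], [0.5, 1.0]], size=500)

mu_hat = Y.mean(axis=0)                           # empirical mean
Z = Y - mu_hat
Sigma_hat = (Z.T @ Z) / len(Y)                    # scatter matrix / N
print(np.allclose(Sigma_hat, np.cov(Y, rowvar=False, bias=True)))  # True
```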
4.2.7 MLE for linear regression
Let us suppose the model corresponds to $p(y|x;\theta) = \mathcal{N}(y|w^T x, \sigma^2)$. If we fix $\sigma^2$ to focus on $w$:
$$NLL(w) = \frac{1}{2\sigma^2}\sum_{n=1}^N (y_n - w^T x_n)^2 + \frac{N}{2}\log(2\pi\sigma^2)$$
Dropping the irrelevant constants, minimizing the NLL amounts to minimizing the residual sum of squares:
$$RSS(w) = \frac{1}{2}\sum_{n=1}^N (y_n - w^T x_n)^2$$
Note that $MSE(w) = \frac{1}{N}RSS(w)$ and $RMSE(w) = \sqrt{MSE(w)}$, so all three criteria have the same minimizer.
Writing the RSS in matrix notation:
$$RSS(w) = \frac{1}{2}\|Xw - y\|_2^2 = \frac{1}{2}(Xw - y)^T(Xw - y)$$
Setting the gradient $\nabla_w RSS(w) = X^T(Xw - y)$ to zero gives the ordinary least squares (OLS) solution:
$$\hat{w}_{OLS} = (X^T X)^{-1} X^T y$$
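A minimal sketch of the OLS solution on synthetic data (the design matrix, true weights, and noise level are made up), solving the normal equations and comparing against `np.linalg.lstsq`, which is the numerically preferred solver:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)

w_ols = np.linalg.solve(X.T @ X, X.T @ y)        # normal equations (X^T X) w = X^T y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # same solution via least squares
print(w_ols, w_lstsq)                            # both close to w_true
```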