Proba ML
16. Exemplar-Based Models

16.3 Kernel density estimation (KDE)

KDE is a form of non-parametric density estimation. It is also a form of generative model, since it defines a probability distribution $p(\bold{x})$ that can be evaluated pointwise, and which can be sampled from to generate new data.

16.3.1 Density kernels

A density kernel is a function $\mathcal{K}:\R \rightarrow \R_+$ such that $\int \mathcal{K}(x)dx=1$ and $\mathcal{K}(x)=\mathcal{K}(-x)$.

This symmetry property implies $\int x\mathcal{K}(x)dx=0$, and hence:

$$\int x\,\mathcal{K}(x-x_n)dx = x_n$$

A simple example of such a kernel is the boxcar kernel, which is the uniform distribution on the interval $[-1,1]$ around the origin:

$$\mathcal{K}(x)\triangleq 0.5\,\mathbb{I}(|x|\le 1)$$

Another example is the Gaussian kernel:

$$\mathcal{K}(x)\triangleq \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$$

We can control the width of the kernel by introducing a bandwidth parameter $h$:

$$\mathcal{K}_h(x)\triangleq \frac{1}{h}\mathcal{K}\Big(\frac{x}{h}\Big)$$

We can generalize to vector-valued inputs by defining a radial basis function (RBF) kernel:

$$\mathcal{K}_h(\bold{x})\propto \mathcal{K}_h(\lVert\bold{x}\rVert)$$

In the case of the Gaussian kernel this becomes:

$$\mathcal{K}_h(\bold{x})=\frac{1}{(2\pi)^{D/2}h^D}\prod_{d=1}^D \exp\Big({-\frac{x_d^2}{2h^2}}\Big)$$

Although the Gaussian kernel is popular, it has unbounded support. Compact kernels can be faster to compute.
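To make these definitions concrete, here is a minimal NumPy sketch (illustrative function names, not from any library) of the boxcar and Gaussian kernels together with the bandwidth-scaled kernel $\mathcal{K}_h$:

```python
import numpy as np

def boxcar_kernel(x):
    # K(x) = 0.5 * I(|x| <= 1): uniform density on [-1, 1]
    return 0.5 * (np.abs(x) <= 1)

def gaussian_kernel(x):
    # K(x) = (2*pi)^{-1/2} * exp(-x^2 / 2): standard normal density
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def scaled_kernel(kernel, x, h):
    # K_h(x) = (1/h) * K(x/h): the 1/h factor keeps the integral equal to 1
    return kernel(x / h) / h

# Sanity check: both kernels still integrate to ~1 after rescaling.
xs = np.linspace(-10, 10, 100_001)
for k in (boxcar_kernel, gaussian_kernel):
    print(np.trapz(scaled_kernel(k, xs, h=0.5), xs))  # ~1.0
```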


16.3.2 Parzen window density estimator

We now explain how to use kernels to define a nonparametric density estimate.

Recall the form of a Gaussian mixture model with fixed spherical Gaussian covariances and uniform mixture weights:

$$p(\bold{x}|\theta)=\frac{1}{K}\sum_{k=1}^K p(\bold{x}|\mu_k,\sigma^2 I)$$

One problem with this model is that it requires specifying the number $K$ of clusters and their positions $\mu_k$.

An alternative is to allocate one cluster center per data point:

$$p(\bold{x}|\theta)=\frac{1}{N}\sum_{n=1}^N p(\bold{x}|\bold{x}_n,\sigma^2I)$$

This can be generalized to:

$$p(\bold{x}|\theta)=\frac{1}{N}\sum_{n=1}^N \mathcal{K}_h(\bold{x}-\bold{x}_n)$$

This is called a Parzen window density estimator or kernel density estimator (KDE).

Its advantage over a parametric model is that no fitting is required (apart from choosing $h$), and there is no need to pick the number of clusters or their locations.

The drawback is that the model takes a lot of memory (it must store the entire training set) and a lot of time to evaluate.


With the boxcar kernel, the resulting model simply counts the number of points within a window of size $h$ around $\bold{x}$. The Gaussian kernel gives a smoother density.
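As an illustration (a sketch, not the book's code), the Parzen window estimator with the $D$-dimensional Gaussian kernel can be written in a few lines of NumPy; note the $O(NM)$ cost of evaluating the density at $M$ query points, which is the time/memory drawback mentioned above.

```python
import numpy as np

def parzen_density(X_query, X_train, h):
    """Evaluate p(x) = (1/N) * sum_n K_h(x - x_n) with a Gaussian kernel.

    X_query: (M, D) evaluation points; X_train: (N, D) stored datapoints; h: bandwidth."""
    D = X_train.shape[1]
    diff = X_query[:, None, :] - X_train[None, :, :]            # (M, N, D)
    sq_dist = np.sum(diff**2, axis=-1)                          # (M, N)
    # D-dimensional Gaussian kernel K_h(x - x_n), normalized to integrate to 1.
    kernel = np.exp(-0.5 * sq_dist / h**2) / ((2 * np.pi)**(D / 2) * h**D)
    return kernel.mean(axis=1)                                  # average over the N "centers"

# Example: density of 1d samples evaluated on a grid.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 1))
grid = np.linspace(-4, 4, 200)[:, None]
p_grid = parzen_density(grid, X_train, h=0.3)
```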

16.3.3 How to choose the bandwidth parameter

The bandwidth parameter $h$ controls the complexity of the model.

For 1d data, if we assume it has been generated from a Gaussian distribution, one can show that the bandwidth minimizing the frequentist risk is given by:

$$h=\sigma\Big(\frac{4}{3N}\Big)^{1/5}$$

We can compute a robust approximation to the standard deviation by first computing the median absolute deviation (MAD):

$$\mathrm{MAD}(x)=\mathrm{median}(|x-\mathrm{median}(x)|)$$

and then using $\hat{\sigma}=1.4826 \,\mathrm{MAD}$.

If we have $D$ dimensions, we can estimate $h_d$ separately for each dimension and then set:

$$h=\Big(\prod_{d=1}^D h_d\Big)^{1/D}$$
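A small sketch of this rule of thumb, using the MAD-based robust estimate of $\sigma$ in each dimension and combining the per-dimension bandwidths by their geometric mean (the function name is illustrative):

```python
import numpy as np

def bandwidth_rule_of_thumb(X):
    """h_d = sigma_hat_d * (4 / (3N))^(1/5) per dimension, combined by geometric mean.

    X: (N, D) data matrix."""
    N = X.shape[0]
    mad = np.median(np.abs(X - np.median(X, axis=0)), axis=0)   # (D,) median absolute deviation
    sigma_hat = 1.4826 * mad                                    # robust estimate of the std. dev.
    h_d = sigma_hat * (4.0 / (3.0 * N))**0.2                    # per-dimension bandwidths
    return np.exp(np.mean(np.log(h_d)))                         # geometric mean across dimensions

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
print(bandwidth_rule_of_thumb(X))  # roughly (4 / (3 * 500))**0.2 ~ 0.3 for unit-variance data
```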

16.3.4 From KDE to KNN classification

We previously discussed the $K$ nearest neighbor (KNN) classifier as a heuristic approach to classification. Interestingly, we can derive it as a generative classifier in which the class-conditional densities $p(\bold{x}|y=c)$ are modeled using a KDE.

Rather than using a fixed bandwidth and counting how many points fall within the hypercube centered on a datapoint, we allow the bandwidth (volume) to be different for each point.

We “grow” a volume around $\bold{x}$ until we capture $K$ points, regardless of their class labels. This is called a balloon kernel density estimator.

Let the resulting volume have size $V(\bold{x})$ (this was previously $h^D$), and let $N_c(\bold{x})$ be the number of examples of class $c$ in this volume. We can then estimate the class-conditional density:

$$p(\bold{x}|y=c)=\frac{N_c(\bold{x})}{V(\bold{x})N_c}$$

where $N_c$ is the total number of points with class label $c$ in the dataset.

If we take the class prior to be $N_c/N$, we have the posterior:

$$p(y=c|\bold{x},\mathcal{D})=\frac{\frac{N_c(\bold{x})}{V(\bold{x})N_c}\frac{N_c}{N}}{\sum_{c'}\frac{N_{c'}(\bold{x})}{V(\bold{x})N_{c'}}\frac{N_{c'}}{N}}=\frac{N_c(\bold{x})}{\sum_{c'} N_{c'}(\bold{x})}=\frac{N_c(\bold{x})}{K}$$
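The sketch below (a hypothetical helper, not library code) implements the resulting classifier directly: grow the neighborhood around $\bold{x}$ until it contains $K$ points, then return the class proportions $N_c(\bold{x})/K$ as the posterior.

```python
import numpy as np

def knn_posterior(x, X_train, y_train, K, num_classes):
    """p(y=c | x) = N_c(x) / K, where N_c(x) counts class-c points among the
    K nearest neighbors of x (the "balloon" that captures K points).

    x: (D,) query; X_train: (N, D); y_train: (N,) integer labels in {0, ..., C-1}."""
    dists = np.linalg.norm(X_train - x, axis=1)        # distances to every stored point
    nearest = np.argsort(dists)[:K]                    # indices of the K closest points
    counts = np.bincount(y_train[nearest], minlength=num_classes)  # N_c(x)
    return counts / K

# Example with two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(knn_posterior(np.array([3.0, 3.0]), X, y, K=5, num_classes=2))
```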

16.3.5 Kernel regression

Just as KDEs can be used to define generative classifiers, they can also be used to define generative models for regression.

16.3.5.1 Nadaraya-Watson estimator for the mean

In regression, our goal is to compute:

$$\mathbb{E}[y|\bold{x},\mathcal{D}]=\int y\, p(y|\bold{x},\mathcal{D})dy=\frac{\int y\,p(\bold{x},y|\mathcal{D})dy}{\int p(\bold{x},y|\mathcal{D})dy}$$

If we use an MVN for $p(\bold{x},y|\mathcal{D})$, we obtain a result that is equivalent to linear regression.

However, the Gaussian assumption on $p(\bold{x},y|\mathcal{D})$ is rather limiting. We can use KDE to more accurately approximate this joint density:

$$p(\bold{x},y|\mathcal{D})\approx \frac{1}{N}\sum_{n=1}^N\mathcal{K}_h(\bold{x}-\bold{x}_n)\mathcal{K}_h(y-y_n)$$

Hence, using the previous kernel properties:

$$\begin{aligned} \mathbb{E}[y|\bold{x},\mathcal{D}] &=\frac{\sum_{n=1}^N\mathcal{K}_h(\bold{x}-\bold{x}_n)\int y\,\mathcal{K}_h(y-y_n)dy}{\sum_{n=1}^N\mathcal{K}_h(\bold{x}-\bold{x}_n)\int \mathcal{K}_h(y-y_n)dy} \\ &= \sum_{n=1}^N y_n w_n(\bold{x}) \end{aligned}$$

where:

$$w_n(\bold{x})=\frac{\mathcal{K}_h(\bold{x}-\bold{x}_n)}{\sum_{{n'}=1}^N \mathcal{K}_h(\bold{x}-\bold{x}_{n'})}$$

We see that the prediction is just a weighted sum of the training labels, where the weights depend on the similarity between x\bold{x} and the stored training points.

This method is called kernel regression, kernel smoothing, or the Nadaraya-Watson (N-W) model.
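Here is a minimal NumPy sketch of the N-W estimator with a Gaussian kernel; since the weights $w_n(\bold{x})$ are normalized, the kernel's normalization constant cancels and is omitted.

```python
import numpy as np

def nadaraya_watson(X_query, X_train, y_train, h):
    """mu(x) = sum_n w_n(x) y_n, with w_n(x) = K_h(x - x_n) / sum_n' K_h(x - x_n').

    X_query: (M, D); X_train: (N, D); y_train: (N,)."""
    sq_dist = np.sum((X_query[:, None, :] - X_train[None, :, :])**2, axis=-1)  # (M, N)
    K = np.exp(-0.5 * sq_dist / h**2)        # unnormalized Gaussian kernel values
    W = K / K.sum(axis=1, keepdims=True)     # weights w_n(x); each row sums to 1
    return W @ y_train                       # (M,) weighted sums of training labels

# Example: smooth a noisy sine curve.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 100))[:, None]
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=100)
grid = np.linspace(0, 10, 200)[:, None]
mu = nadaraya_watson(grid, X, y, h=0.5)
```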


16.3.5.2 Estimator for the variance

Sometimes it can be useful to compute the predictive variance, as well as the predictive mean. We can do this by noting that:

$$\mathbb{V}[y|\bold{x},\mathcal{D}]=\mathbb{E}[y^2|\bold{x},\mathcal{D}]-\mu(\bold{x})^2$$

where $\mu(\bold{x})=\mathbb{E}[y|\bold{x},\mathcal{D}]$ is the Nadaraya-Watson estimate.

If we use a Gaussian kernel with variance $\sigma^2$, we can compute:

$$\begin{aligned} \mathbb{E}[y^2|\bold{x},\mathcal{D}] &=\frac{\sum_{n=1}^N \mathcal{K}_h(\bold{x}-\bold{x}_n)\int y^2\mathcal{K}_h(y-y_n)dy}{\sum_{n=1}^N \mathcal{K}_h(\bold{x}-\bold{x}_n)\int \mathcal{K}_h(y-y_n)dy}\\ &= \frac{\sum_{n=1}^N \mathcal{K}_h(\bold{x}-\bold{x}_n)(\sigma^2+y^2_n)}{\sum_{n=1}^N \mathcal{K}_h(\bold{x}-\bold{x}_n)} \end{aligned}$$

where we used the fact that:

$$\int y^2\mathcal{K}_h(y-y_n)dy=\sigma^2+y_n^2$$

Finally:

$$\mathbb{V}[y|\bold{x},\mathcal{D}]=\sigma^2+\sum_{n=1}^N y_n^2w_n(\bold{x})-\mu(\bold{x})^2$$
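Continuing the sketch above, the predictive variance reuses the same weights; here the kernel variance $\sigma^2$ is taken to be $h^2$, since $\mathcal{K}_h$ is a Gaussian with standard deviation $h$.

```python
import numpy as np

def nadaraya_watson_variance(X_query, X_train, y_train, h):
    """V[y|x] = sigma^2 + sum_n w_n(x) y_n^2 - mu(x)^2, with sigma^2 = h^2."""
    sq_dist = np.sum((X_query[:, None, :] - X_train[None, :, :])**2, axis=-1)
    K = np.exp(-0.5 * sq_dist / h**2)
    W = K / K.sum(axis=1, keepdims=True)
    mu = W @ y_train                          # predictive mean mu(x)
    second_moment = h**2 + W @ y_train**2     # E[y^2 | x]
    return second_moment - mu**2
```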

16.3.5.3 Locally weighted regression

We can drop the normalization term of the N-W estimator to get:

$$\mu(\bold{x})=\sum_{n=1}^N y_n \mathcal{K}_h(\bold{x}-\bold{x}_n)$$

Rather than interpolating the stored labels $y_n$, we can fit a locally linear model in the neighborhood of the evaluation point $\bold{x}$:

$$\mu(\bold{x})=\hat{\beta}(\bold{x})^\top\bold{x}, \quad \hat{\beta}(\bold{x})=\operatorname*{argmin}_{\beta}\sum_{n=1}^N (y_n-\beta^\top \bold{x}_n)^2\,\mathcal{K}_h(\bold{x}-\bold{x}_n)$$

This is called locally linear regression (LLR), or locally-weighted scatterplot smoothing (LOWESS or LOESS).

This is often used when annotating scatter plots with local trend lines.
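A sketch of locally linear regression along these lines: for each query point, solve a kernel-weighted least squares problem and evaluate the local model at that point. An intercept column is added here (a common choice for local trend lines, not spelled out in the formula above).

```python
import numpy as np

def locally_linear_regression(X_query, X_train, y_train, h):
    """For each query x, compute beta(x) = argmin_beta sum_n K_h(x - x_n) (y_n - beta^T x_n)^2
    by weighted least squares, then return mu(x) = beta(x)^T x."""
    X_aug = np.hstack([X_train, np.ones((len(X_train), 1))])    # append intercept column
    preds = []
    for x in X_query:
        w = np.exp(-0.5 * np.sum((X_train - x)**2, axis=1) / h**2)   # kernel weights K_h(x - x_n)
        # Weighted normal equations (X^T W X) beta = X^T W y, with a tiny ridge for stability.
        A = X_aug.T @ (X_aug * w[:, None]) + 1e-8 * np.eye(X_aug.shape[1])
        b = X_aug.T @ (w * y_train)
        beta = np.linalg.solve(A, b)
        preds.append(np.append(x, 1.0) @ beta)                  # evaluate the local model at x
    return np.array(preds)

# Example: local trend line through a noisy quadratic.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, 150))[:, None]
y = X[:, 0]**2 + 0.5 * rng.normal(size=150)
grid = np.linspace(-3, 3, 100)[:, None]
trend = locally_linear_regression(grid, X, y, h=0.5)
```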