Proba ML
21. Clustering

21.1 Introduction

Clustering is a very common form of unsupervised learning. There are two main kinds of methods, depending on the form of the input.

In the first approach, the input is a set of data samples $\mathcal{D}=\{\bold{x}_n : n=1:N\}$, where typically $\bold{x}_n \in \R^D$.

In the second approach, the input is a pairwise dissimilarity matrix $\bold{D} \in \R^{N\times N}$, where $D_{ij} \geq 0$.
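The two representations are related: given feature vectors, we can always derive a pairwise dissimilarity matrix. A minimal sketch, assuming Euclidean distance as the dissimilarity measure (the function name is illustrative):

```python
import numpy as np

def dissimilarity_matrix(X):
    """Pairwise Euclidean distances for an (N, D) data matrix X."""
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 <x_i, x_j>
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.sqrt(np.maximum(D2, 0))  # clip tiny negatives from round-off

X = np.random.default_rng(0).normal(size=(5, 3))  # N=5 points in R^3
D = dissimilarity_matrix(X)
assert D.shape == (5, 5) and np.allclose(np.diag(D), 0)
```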

In both cases, the goal is to assign similar points to the same cluster. As is often the case with unsupervised learning, it is hard to evaluate the quality of a clustering algorithm.

If we have labels for some points, we can use the similarity between the labels of two points as a metric for determining if the two inputs “should” be assigned to the same cluster or not.

If we don’t have labels, but the method is based on a generative model of the data, we can use the log likelihood as a metric.

21.1.1 Evaluating the output of clustering methods

If we use probabilistic models, we can always evaluate the likelihood of the data, but this has two drawbacks:

  1. It does not assess the clustering discovered by the model
  2. It does not apply to non-probabilistic methods

Hence, we review performance measures not based on likelihood.

Intuitively, similar points should be assigned to the same cluster. We can rely on some external source of information, such as labels or a reference clustering, under the assumption that objects with the same label are similar.

21.1.1.1 Purity

Let $N_{ij}$ be the number of objects in cluster $i$ that belong to class $j$, and let $N_i = \sum_{j=1}^C N_{ij}$ be the total number of objects in cluster $i$.

Define $p_{ij} = N_{ij}/N_i$, the empirical probability of class $j$ in cluster $i$.

We define the purity of a cluster as:

$$p_i \triangleq \max_j p_{ij}$$

and the overall purity of a clustering as:

$$\mathrm{purity} \triangleq \sum_i \frac{N_i}{N} p_i$$

(Figure: three example clusters of labeled objects, of sizes 6, 6, and 5, whose majority classes account for 5, 4, and 3 objects respectively.)

For the figure above, we have:

$$\mathrm{purity} = \frac{6}{17}\cdot\frac{5}{6} + \frac{6}{17}\cdot\frac{4}{6} + \frac{5}{17}\cdot\frac{3}{5} = \frac{12}{17} \approx 0.71$$
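As a sanity check, here is a minimal sketch of this computation, with the count matrix $N_{ij}$ filled in to match the example above (cluster sizes 6, 6, 5 and majority counts 5, 4, 3):

```python
import numpy as np

# N[i, j] = number of objects in cluster i that belong to class j,
# reconstructed from the worked example above.
N = np.array([[5, 1, 0],
              [1, 4, 1],
              [2, 0, 3]])

N_i = N.sum(axis=1)                # cluster sizes: [6, 6, 5]
p_i = N.max(axis=1) / N_i          # per-cluster purity: max_j p_ij
purity = np.sum(N_i / N.sum() * p_i)
print(purity)                      # 12/17 ≈ 0.7059
```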

However, this measure doesn’t penalize the number of clusters, since we can trivially achieve a purity of 1 by putting each object in its own cluster.

21.1.1.2 Rand index

Let $U=\{u_1,\dots,u_R\}$ and $V=\{v_1,\dots,v_C\}$ be two different partitions of the $N$ data points. For example, $U$ could be the estimated clustering and $V$ some reference clustering derived from the class labels.

A common statistic is the Rand index:

$$R \triangleq \frac{TP+TN}{TP+FP+FN+TN}$$

where $TP$ counts pairs of points that are in the same cluster in both $U$ and $V$, $TN$ counts pairs in different clusters in both, $FP$ counts pairs in the same cluster in $U$ but not in $V$, and $FN$ counts pairs in the same cluster in $V$ but not in $U$. This is analogous to the accuracy for classification.
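A minimal sketch of the Rand index by brute-force pair counting (the label vectors are illustrative):

```python
from itertools import combinations

def rand_index(u, v):
    """Rand index between two clusterings given as label vectors."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(u)), 2):
        same_u, same_v = u[i] == u[j], v[i] == v[j]
        if same_u and same_v:
            tp += 1      # same cluster in both partitions
        elif same_u:
            fp += 1      # same in U, different in V
        elif same_v:
            fn += 1      # different in U, same in V
        else:
            tn += 1      # different in both
    return (tp + tn) / (tp + fp + fn + tn)

u = [0, 0, 0, 1, 1, 1]   # estimated clustering
v = [0, 0, 1, 1, 2, 2]   # reference clustering
print(rand_index(u, v))  # fraction of pairs on which U and V agree: 10/15
```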

21.1.1.3 Mutual information

We can also quantify how much information one clustering carries about the other with the mutual information:

$$\mathbb{I}(U,V) = \sum_{i=1}^R \sum_{j=1}^C p_{UV}(i,j) \log \frac{p_{UV}(i,j)}{p_U(i)\,p_V(j)}$$

where:

  • $p_{UV}(i,j) = |u_i \cap v_j| / N$ is the probability that a randomly chosen object belongs to cluster $u_i$ in $U$ and to $v_j$ in $V$
  • $p_U(i) = |u_i| / N$ is the probability that a randomly chosen object belongs to cluster $u_i$ (and similarly for $p_V(j)$)

This lies between $0$ and $\min\{\mathbb{H}(U), \mathbb{H}(V)\}$.

Unfortunately, the maximum can be achieved by using a lot of small clusters, which have low entropy.

To compensate for this, we can use the normalized mutual information:

$$NMI(U,V) = \frac{\mathbb{I}(U,V)}{(\mathbb{H}(U)+\mathbb{H}(V))/2}$$

This lies between $0$ and $1$.
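A minimal sketch of NMI computed from two label vectors, building the joint distribution $p_{UV}(i,j) = |u_i \cap v_j|/N$ as a contingency table (using the natural log; the base cancels in the normalization):

```python
import numpy as np

def nmi(u, v):
    """Normalized mutual information between two clusterings (label vectors)."""
    u, v = np.asarray(u), np.asarray(v)
    n = len(u)
    # Joint distribution p_UV(i, j) = |u_i ∩ v_j| / N via a contingency table.
    _, u_idx = np.unique(u, return_inverse=True)
    _, v_idx = np.unique(v, return_inverse=True)
    p_uv = np.zeros((u_idx.max() + 1, v_idx.max() + 1))
    np.add.at(p_uv, (u_idx, v_idx), 1.0 / n)
    p_u, p_v = p_uv.sum(axis=1), p_uv.sum(axis=0)  # marginals p_U, p_V

    nz = p_uv > 0  # only nonzero cells contribute to the sum
    mi = np.sum(p_uv[nz] * np.log(p_uv[nz] / np.outer(p_u, p_v)[nz]))
    h_u = -np.sum(p_u * np.log(p_u))  # entropy H(U)
    h_v = -np.sum(p_v * np.log(p_v))  # entropy H(V)
    return mi / ((h_u + h_v) / 2)

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition up to relabeling -> 1.0
```

For reference, scikit-learn provides `sklearn.metrics.normalized_mutual_info_score`, which computes the same quantity (with a selectable averaging method in the denominator; the arithmetic mean used above is its default in recent versions).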