20.4 Manifold learning

We discuss the problem of recovering the underlying low-dimensional structure of a high-dimensional dataset. This structure is often assumed to be a curved manifold, so this problem is called manifold learning or nonlinear dimensionality reduction.

The key difference from methods like AE is that we will focus on non-parametric methods, in which we compute an embedding for each point in the training set, as opposed to learning a generic model that can embed any input vector.

That is, the methods we discuss don’t (easily) support out-of-sample generalization. However, they can be easier to fit and are quite flexible.

Such methods can be useful for unsupervised learning, data visualization, and as a preprocessing step for supervised learning.

20.4.1 What are manifolds?

Roughly speaking, a manifold is a topological space that is locally Euclidean. One of the simplest examples is the surface of the earth, which is a curved 2d surface embedded in a 3d space. At each local point on the surface, the earth seems flat.

More formally, a $d$-dimensional manifold $\mathcal{X}$ is a space in which each point $\bold{x}\in\mathcal{X}$ has a neighborhood which is topologically equivalent to a $d$-dimensional Euclidean space, called the tangent space, denoted $\mathcal{T}_x=T_x\mathcal{X}$.


A Riemannian manifold is a differentiable manifold that associates an inner product operator with the tangent space at each point $\bold{x}$; this inner product is assumed to depend smoothly on $\bold{x}$.

The collection of these inner products is called a Riemannian metric. It can be shown that any sufficiently smooth Riemannian manifold can be embedded into a Euclidean space of potentially higher dimension; the Riemannian inner product at a point then becomes the Euclidean inner product in the corresponding tangent space.

20.4.3 Approaches to manifold learning

Manifold learning methods can be categorized along several axes; one key distinction is whether they are parametric or nonparametric.


The term “nonparametric” refers to methods that learn a low dimensional embedding $\bold{z}_i$ for each datapoint $\bold{x}_i$, but do not learn a mapping function that can be applied to out-of-sample examples.

To compare methods, we use two different datasets: a set of 1000 3d points sampled from the 2d “Swiss roll” manifold, and a set of 1797 points from the UCI digits dataset.

In both cases, we learn a 2d embedding in order to visualize the data.
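
For concreteness, here is one way to generate comparable datasets with scikit-learn (a sketch; the exact sampling used for the figures is an assumption):

```python
from sklearn.datasets import load_digits, make_swiss_roll

# 1000 3d points lying on the 2d "Swiss roll" manifold
X_roll, t_roll = make_swiss_roll(n_samples=1000, random_state=0)

# 1797 8x8 digit images, flattened into 64d vectors (the UCI digits data)
X_digits, y_digits = load_digits(return_X_y=True)

print(X_roll.shape, X_digits.shape)  # (1000, 3) (1797, 64)
```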


20.4.4 Multi-dimensional scaling (MDS)

The simplest approach to manifold learning is multi-dimensional scaling (MDS). This tries to find a set of low dimensional vectors $\{\bold{z}_i\in\R^L:i=1:N\}$ whose pairwise dissimilarities are as similar as possible to the pairwise dissimilarities in the original high-dimensional space.

20.4.4.1 Classical MDS

We start with the data matrix $X\in\R^{N\times D}$. The centered Gram (similarity) matrix is:

$$\tilde{K}_{ij}=\langle \bold{x}_i-\bar{\bold{x}},\ \bold{x}_j-\bar{\bold{x}} \rangle$$

In matrix notation, we have $\tilde{K}=\tilde{X}\tilde{X}^\top$ with $\tilde{X}=C_N X$, where $C_N=I_N-\frac{1}{N}1_N1_N^\top$ is the centering matrix.

Now define the strain of a set of embeddings:

$$\mathcal{L}_{\mathrm{strain}}(Z)=\sum_{i,j}(\tilde{K}_{ij}-\langle \tilde{\bold{z}}_i ,\tilde{\bold{z}}_j\rangle)^2=||\tilde{K}-\tilde{Z}\tilde{Z}^\top||^2_F$$

where $\tilde{\bold{z}}_i=\bold{z}_i-\bar{\bold{z}}$ is the centered embedding vector.

Intuitively, this measures how well similarities in the high-dimensional space are preserved in the low-dimensional space. Minimizing this loss is called classical MDS.

We know that the best rank-$L$ approximation to a matrix is its truncated SVD representation $\tilde{K}=USV^\top$. Since $\tilde{K}$ is positive semi-definite, we have $U=V$.

Hence the optimal embedding satisfies:

$$\tilde{Z}\tilde{Z}^\top=USU^\top=(US^{\frac{1}{2}})(S^{\frac{1}{2}}U^\top)$$

Thus we can set the embedding vectors to be the rows of $\tilde{Z}=US^{\frac{1}{2}}$.

We can still apply MDS when we only have the Euclidean distance matrix $D$ instead of the raw data $X$.

First, we expand the squared Euclidean distance matrix:

$$\begin{aligned} D^{(2)}_{ij}&=||\bold{x}_i-\bold{x}_j||^2\\ &=||\bold{x}_i-\bar{\bold{x}}||^2+||\bold{x}_j-\bar{\bold{x}}||^2-2\langle \bold{x}_i-\bar{\bold{x}},\bold{x}_j-\bar{\bold{x}}\rangle \\ &=||\bold{x}_i-\bar{\bold{x}}||^2+||\bold{x}_j-\bar{\bold{x}}||^2-2 \tilde{K}_{ij} \end{aligned}$$

So we see that $D^{(2)}$ only differs from $\tilde{K}$ by a factor of $-2$, plus constants that are shared across each row and each column. Thus we can apply the double centering trick to $D^{(2)}$ to obtain $\tilde{K}$:

$$\tilde{K}=-\frac{1}{2}C_ND^{(2)} C_N$$

or equivalently:

$$\tilde{K}_{ij}=-\frac{1}{2}\Big(d_{ij}^2-\frac{1}{N}\sum_{k=1}^N d_{ik}^2-\frac{1}{N}\sum_{k=1}^N d_{kj}^2+\frac{1}{N^2}\sum_{k=1}^N\sum_{l=1}^N d_{kl}^2\Big)$$

We can then compute the embeddings as before.
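
A minimal numpy sketch of classical MDS from a distance matrix, using the double centering trick followed by a truncated eigendecomposition (the function and variable names are ours):

```python
import numpy as np

def classical_mds(D, L=2):
    """Classical MDS from an (N, N) matrix of pairwise Euclidean distances."""
    N = D.shape[0]
    C = np.eye(N) - np.ones((N, N)) / N       # centering matrix C_N
    K = -0.5 * C @ (D ** 2) @ C               # double centering -> centered Gram matrix
    evals, evecs = np.linalg.eigh(K)          # K is symmetric PSD, so SVD = eigendecomposition
    idx = np.argsort(evals)[::-1][:L]         # keep the L largest eigenvalues
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0))  # Z = U S^{1/2}

X = np.random.randn(100, 5)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Z = classical_mds(D, L=2)                     # (100, 2) embedding
```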

It turns out that classical MDS is equivalent to PCA.

To see this, let $\tilde{K}=U_LS_LU_L^\top$ be the rank-$L$ truncated SVD of the centered kernel matrix.

The MDS embeddings are given by $Z_{\mathrm{MDS}}=U_LS_L^{\frac{1}{2}}$.

Now consider the rank-$L$ SVD of the centered data matrix, $\tilde{X}=U_XS_XV_X^\top$. The PCA embedding is $Z_{\mathrm{PCA}}=U_XS_X$.

Now:

$$\tilde{K}=\tilde{X}\tilde{X}^\top=U_LS_LU_L^\top=U_XS_X^2U_X^\top$$

Therefore $U_L=U_X$, $S_L=S_X^2$, and $Z_{\mathrm{PCA}}=Z_{\mathrm{MDS}}$.
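
As a quick numerical check of this equivalence (a sketch; the two embeddings agree up to a sign flip of each column):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

Z_pca = PCA(n_components=2).fit_transform(X)        # U_X S_X

# classical MDS on the centered Gram matrix
Xc = X - X.mean(axis=0)
evals, evecs = np.linalg.eigh(Xc @ Xc.T)            # K_tilde = U S U^T
idx = np.argsort(evals)[::-1][:2]
Z_mds = evecs[:, idx] * np.sqrt(evals[idx])         # U_L S_L^{1/2}

print(np.allclose(np.abs(Z_pca), np.abs(Z_mds), atol=1e-6))  # True, up to per-column sign
```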

20.4.4.2 Metric MDS

Classical MDS assumes Euclidean distances. We can generalize it using any dissimilarity measure by defining the stress function:

$$\mathcal{L}_{\mathrm{stress}}(Z)=\sqrt{\frac{\sum_{i<j}(d_{ij}-\hat{d}_{ij})^2}{\sum_{ij} d^2_{ij}}}$$

where $\hat{d}_{ij}=||\hat{\bold{z}}_i-\hat{\bold{z}}_j||$.

Note that this is a different objective than the one used with classical MDS, so even using Euclidean distances the results will be different.

We can use gradient descent to solve this optimization problem; however, it is better to use a bound optimization algorithm called Scaling by MAjorizing a COmplicated Function (SMACOF).

This is the method implemented in scikit-learn.
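
A usage sketch with scikit-learn’s MDS estimator (the hyperparameters here are illustrative):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import MDS

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# metric MDS: minimize the stress objective with SMACOF
mds = MDS(n_components=2, metric=True, random_state=0)
Z = mds.fit_transform(X)
print(Z.shape, mds.stress_)   # final value of the (unnormalized) stress objective
```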


20.4.4.3 Non-metric MDS

Instead of trying to match the distances between points, we can try to match the ranking of how similar points are. Let $f(d)$ be a monotonic transformation from distances to ranks.

The loss is:

$$\mathcal{L}_{\mathrm{NM}}(Z)=\sqrt{\frac{\sum_{i<j}(f(d_{ij})-\hat{d}_{ij})^2}{\sum_{ij} \hat{d}^2_{ij}}}$$

where $\hat{d}_{ij}=||\hat{\bold{z}}_i-\hat{\bold{z}}_j||$. Minimizing this is known as non-metric MDS.

This objective can be optimized iteratively.

First, the function $f$ is approximated using isotonic regression; this finds the optimal monotonic transformation of the input distances to match the current embedding distances.

Then, for the given $f$, the embeddings $Z$ are optimized using gradient descent, and the process repeats.
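
In scikit-learn, this corresponds to using MDS with `metric=False` (a usage sketch):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import MDS

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# non-metric MDS: only the rank order of the dissimilarities is preserved
nmds = MDS(n_components=2, metric=False, n_init=1, random_state=0)
Z_nm = nmds.fit_transform(X)
```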

20.4.5 Isomap

If the high-dimensional data lies on or near a curved manifold, as in the Swiss roll example, then MDS might consider two points to be close (in Euclidean distance) even if their distance along the manifold is large.

One way to capture this is to create the $K$-nearest neighbor graph between datapoints (e.g., using scikit-learn), and then approximate the manifold distance between a pair of points by the length of the shortest path in the graph.


This can be computed efficiently using Dijkstra’s shortest path algorithm. Once we have computed this new distance metric, we use classical MDS (i.e. PCA).

This is a way to capture local structure while avoiding local optima. The overall method is called Isometric mapping (Isomap).
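
A usage sketch with scikit-learn (the number of neighbors is an illustrative choice):

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# KNN graph -> graph shortest-path distances -> classical MDS
iso = Isomap(n_neighbors=10, n_components=2)
Z = iso.fit_transform(X)
print(Z.shape)   # (1000, 2)
```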

The results are quite reasonable.


However, if the data is noisy, there can be “false“ edges in the nearest neighbor graph, resulting in “short circuits” that significantly distort the embedding.


This problem is known as topological instability. Choosing a very small neighborhood doesn’t solve the problem since this can fragment the manifold into a large number of disconnected regions.

See the scikit-learn documentation for more details.

20.4.6 Kernel PCA

We now consider nonlinear projections of the data. The key idea is to solve PCA by finding the eigenvectors of the inner product (Gram) matrix $K=XX^\top$, and then to use the kernel trick, which lets us replace the inner product $\bold{x}^\top_i\bold{x}_j$ with a kernel function $\mathcal{K}(\bold{x}_i,\bold{x}_j)$. This is known as kernel PCA.

Recall from Mercer’s theorem that the use of a kernel implies some underlying feature space, so we are implicitly replacing $\bold{x}_i$ by $\phi_i=\phi(\bold{x}_i)$.

Let $\Phi$ be the corresponding (notional) design matrix, and $S_\Phi =\frac{1}{N}\sum_{i}\phi_i \phi_i^\top$ be the covariance matrix in feature space (we assume the $\phi_i$ are centered).

From section 20.1.3.2, we know:

$$V_{\mathrm{kPCA}}=\Phi^\top U\Lambda^{-\frac{1}{2}}$$

where $U$ and $\Lambda$ contain the eigenvectors and eigenvalues of $K$.

We can’t compute $V_{\mathrm{kPCA}}$ directly since the $\phi_i$ might be infinite dimensional. However, we can compute the projection of a test vector $\bold{x}_*$ onto the principal directions in feature space:

$$\phi_*^\top V_{\mathrm{kPCA}}=\phi^\top_*\Phi^\top U\Lambda^{-\frac{1}{2}}=k_*^\top U\Lambda^{-\frac{1}{2}}$$

where $k_*=[\mathcal{K}(\bold{x}_*,\bold{x}_1),\dots,\mathcal{K}(\bold{x}_*,\bold{x}_N)]$.

If we apply kPCA with a linear kernel, we recover regular PCA, which is limited to $L\leq D$ embedding dimensions. If we use a non-degenerate kernel, we can use up to $N$ components, since $\Phi\in\R^{N\times D^*}$, where $D^*$ is the dimensionality of the embedded feature vectors (potentially infinite).
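
A usage sketch with scikit-learn’s KernelPCA (the data and the RBF bandwidth `gamma` are placeholders):

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))        # stand-in training data
X_new = rng.normal(size=(10, 2))     # stand-in test points

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.5)
Z = kpca.fit_transform(X)            # training embeddings, rows of U Lambda^{1/2}
Z_new = kpca.transform(X_new)        # k_*^T U Lambda^{-1/2} for new points
```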

The figure below gives an example of the method with $D=2$ using an RBF kernel.

We obtain the blue level sets as follows: for each plot, we project the points of a grid onto each component by computing $K_*$ and its product with $U\Lambda^{-\frac{1}{2}}$ (obtained from the data points). This gives a “density map” for the grid points, from which we draw a contour plot.

Figure: contours of the first kernel principal components of 2d data with three clusters, using an RBF kernel.

We see that the first two components separate the three clusters, while the remaining components split the clusters further.

The figure below shows that kPCA is not very useful for our dimensionality reduction problem. It can be shown that kPCA expands the features instead of reducing them.

Figure: kPCA embeddings (RBF kernel) of the Swiss roll and digits data.

20.4.7 Maximum variance unfolding (MVU)

kPCA with an RBF kernel might not result in a low dimensional embedding. This observation led to the development of the semidefinite embedding algorithm, aka maximum variance unfolding, which tries to learn an embedding $\bold{z}_i$ such that:

$$\max \sum_{i,j}||\bold{z}_i-\bold{z}_j||^2 \;\;\mathrm{s.t.}\;\; ||\bold{z}_i-\bold{z}_j||^2=||\bold{x}_i-\bold{x}_j||^2,\ (i,j)\in G$$

where $G$ is the nearest neighbor graph (as in Isomap).

This approach tries to “unfold” the manifold while respecting its nearest neighbor constraints.

This can be reformulated as a semidefinite programming (SDP) problem by defining the kernel matrix $K=ZZ^\top$ and maximizing:

$$\max \mathrm{tr}(K)\;\;\mathrm{s.t.}\;\;||\bold{z}_i-\bold{z}_j||^2=||\bold{x}_i-\bold{x}_j||^2,\ (i,j)\in G,\quad \sum_{ij} K_{ij}=0,\quad K \succeq 0$$

where $||\bold{z}_i-\bold{z}_j||^2=K_{ii}-2K_{ij}+K_{jj}$ is a linear function of $K$.

The resulting kernel is then passed to kPCA, and the resulting eigenvectors give the low dimensional embedding.
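
Since the constraints are linear in $K$ and $K$ is constrained to be PSD, the problem can be handed to a generic SDP solver. Below is a minimal sketch using the cvxpy package (an assumption, not part of scikit-learn); it is only practical for small $N$, since $K$ is a dense $N \times N$ matrix variable.

```python
import numpy as np
import cvxpy as cp
from sklearn.neighbors import kneighbors_graph

def mvu(X, n_neighbors=5, n_components=2):
    """Maximum variance unfolding sketch: SDP over the kernel matrix, then kPCA."""
    N = X.shape[0]
    G = kneighbors_graph(X, n_neighbors, mode="connectivity").toarray()
    G = np.maximum(G, G.T)                                   # symmetrize the KNN graph
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)      # squared distances

    K = cp.Variable((N, N), PSD=True)
    cons = [cp.sum(K) == 0]                                  # centering constraint
    for i in range(N):
        for j in range(i + 1, N):
            if G[i, j]:                                      # preserve local distances
                cons.append(K[i, i] - 2 * K[i, j] + K[j, j] == D2[i, j])
    cp.Problem(cp.Maximize(cp.trace(K)), cons).solve()

    # kernel PCA step on the learned kernel: top eigenvectors give the embedding
    evals, evecs = np.linalg.eigh(K.value)
    top = np.argsort(evals)[::-1][:n_components]
    return evecs[:, top] * np.sqrt(np.maximum(evals[top], 0))
```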

20.4.8 Local linear embedding (LLE)

The techniques discussed so far relied on the eigenvalue decomposition of a full matrix of pairwise similarities, either in the input space (PCA), in the feature space (kPCA) or along the KNN graph (Isomap).

In this section we discuss local linear embedding (LLE), a technique that solves a sparse eigenproblem, by focusing on local structure in the data.

LLE assumes the data manifold around each point $\bold{x}_i$ is locally linear. The best linear approximation can be found by predicting $\bold{x}_i$ as a linear combination of its $K$ nearest neighbors, using reconstruction weights $\bold{w}_i$ (aka the barycentric coordinates of $\bold{x}_i$):

$$\hat{W}=\argmin_W\sum_{i=1}^N \Big\|\bold{x}_i-\sum_{j=1}^N w_{ij}\bold{x}_j\Big\|^2 \quad\mathrm{s.t.}\quad \begin{cases} w_{ij}=0 &\mathrm{if} \;\bold{x}_j\notin \mathrm{Nbr}(\bold{x}_i,K)\\ \sum_{j=1}^N w_{ij}=1 &\mathrm{for}\;i=1:N \end{cases}$$

Any linear mapping of this hyperplane to a lower dimensional space preserves the reconstruction weights, and thus the local geometry.

Thus we can find the low dimensional embedding of each point by solving:

$$\hat{Z}=\argmin_Z \sum_{i=1}^N \Big\|\bold{z}_i-\sum_{j=1}^N\hat{w}_{ij}\bold{z}_j\Big\|^2$$

with the same constraints.

We can rewrite this loss as:

$$\mathcal{L}(Z)=||Z-WZ||_F^2=\mathrm{tr}\big(Z^\top(I-W)^\top(I-W)Z\big)$$

As shown in section 7.4.8, the solution is given by the eigenvectors of $(I-W)^\top(I-W)$ corresponding to the smallest non-zero eigenvalues.
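
A usage sketch with scikit-learn’s LocallyLinearEmbedding:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, method="standard")
Z = lle.fit_transform(X)
print(lle.reconstruction_error_)   # value of the embedding objective at the solution
```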


The results do not seem as good as the ones obtained with Isomap; however, this method tends to be more robust to short-circuiting (noise).

20.4.9 Laplacian eigenmaps (Spectral embeddings)

The key idea of Laplacian eigenmaps is to find a low-dimensional representation of the data in which the weighted distances between a datapoint and its $K$ nearest neighbors are minimized.

We put more weight on the first nearest neighbor than the second, etc.

20.4.9.1 Using eigenvectors of the graph Laplacian to compute embeddings

We want to find the embedding that minimizes:

$$\mathcal{L}(Z)=\sum_{(i,j)\in E} w_{ij}||\bold{z}_i-\bold{z}_j||^2_2$$

where $w_{ij}=\exp(-\frac{1}{2\sigma^2}||\bold{x}_i-\bold{x}_j||^2_2)$ if $(i,j)$ are neighbors in the KNN graph, and $0$ otherwise.

We add the constraint $Z^\top D Z=I$ to avoid the degenerate solution $Z=\bold{0}$, where $D$ is the diagonal weight matrix storing the degree of each node, $D_{ii}=\sum_j w_{ij}$.

We can rewrite the objective as follows:

$$\begin{aligned} \mathcal{L}(Z)&= \sum_{ij} w_{ij}(||\bold{z}_i||^2+||\bold{z}_j||^2-2\bold{z}_i^\top\bold{z}_j)\\ &= \sum_{i}D_{ii} ||\bold{z}_i||^2+\sum_{j}D_{jj} ||\bold{z}_j||^2-2\sum_{ij}w_{ij}\bold{z}_i^\top \bold{z}_j \\ &= 2\mathrm{tr}(Z^\top DZ)-2\mathrm{tr}(Z^\top WZ) \\ &=2\mathrm{tr}(Z^\top LZ) \end{aligned}$$

where $L\triangleq D-W$ is the graph Laplacian.

One can show that minimizing this objective is equivalent to solving the (generalized) eigenvalue problem:

$$L\bold{z}_i=\lambda_iD \bold{z}_i$$

for the $k$ smallest nonzero eigenvalues.
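
A dense numpy/scipy sketch of this recipe (the function name, the KNN construction, and the Gaussian bandwidth `sigma` are our choices; scikit-learn’s SpectralEmbedding implements a similar procedure):

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

def laplacian_eigenmaps(X, n_components=2, n_neighbors=10, sigma=1.0):
    """Laplacian eigenmaps via a dense generalized eigenproblem (fine for small N)."""
    A = kneighbors_graph(X, n_neighbors, mode="distance").toarray()
    A = np.maximum(A, A.T)                                   # symmetrized KNN graph
    W = np.where(A > 0, np.exp(-A ** 2 / (2 * sigma ** 2)), 0.0)
    D = np.diag(W.sum(axis=1))                               # degree matrix
    L = D - W                                                # graph Laplacian

    # generalized eigenproblem L z = lambda D z, eigenvalues in ascending order;
    # skip the first (constant) eigenvector, which has eigenvalue 0
    evals, evecs = eigh(L, D)
    return evecs[:, 1:n_components + 1]
```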


20.4.9.2 What is the graph Laplacian?

We saw above that we need to compute the eigenvectors of the graph Laplacian to compute a good embedding of the high dimensional points. We now give some intuitions on why this works.

We saw that the elements of $L$ can be computed as:

$$L_{ij}=\begin{cases} d_{ii} &\mathrm{if}\;i=j \\ -w_{ij} &\mathrm{o.w.} \end{cases}$$


Suppose we associate a value $f_i$ with each node of the graph.


Then we can use the graph Laplacian as a difference operator, to compute a discrete derivative of a function at a point:

$$(Lf)(i)=\sum_{j\in\mathrm{nbr}(i)} w_{ij}(f(i)-f(j))$$

where $\mathrm{nbr}(i)$ refers to the set of neighbors of node $i$.

We can also compute an overall measure of smoothness of the function by computing its Dirichlet energy:

$$\begin{aligned} f^\top Lf &=f^\top Df-f^\top W f \\ &= \sum_{i}d_if_i^2-\sum_{i,j} w_{ij}f_i f_j \\ &= \frac{1}{2}\Big(\sum_{i}d_if_i^2-2\sum_{i,j} w_{ij}f_i f_j+\sum_{j}d_jf_j^2\Big) \\ &=\frac{1}{2}\sum_{i,j} w_{ij}(f_i-f_j)^2 \end{aligned}$$
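
A tiny numerical check of this identity on a hand-built weighted graph (the graph itself is arbitrary):

```python
import numpy as np

W = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.0, 0.0]])        # symmetric edge weights
D = np.diag(W.sum(axis=1))             # degree matrix
L = D - W                              # graph Laplacian

f = np.array([1.0, -2.0, 0.5])         # a function on the 3 nodes

lhs = f @ L @ f                        # Dirichlet energy
rhs = 0.5 * sum(W[i, j] * (f[i] - f[j]) ** 2 for i in range(3) for j in range(3))
print(np.isclose(lhs, rhs))            # True
```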

Applying linear algebra to study the adjacency matrix of a graph is called spectral graph theory.

For example, we see that $L$ is positive semi-definite, since $f^\top Lf\geq 0$ for all $f\in \R^N$.

Consequently, $L$ has $N$ non-negative, real-valued eigenvalues $0\leq \lambda_1\leq \dots \leq \lambda_N$.

The corresponding eigenvectors form an orthogonal basis for functions $f$ defined on the graph, in order of decreasing smoothness.

There are many applications of the graph Laplacian in ML, such as normalized cuts (a way of learning a clustering based on pairwise similarity) and RL (where the eigenvectors of the state transition matrix are used).

20.4.10 t-SNE

We describe a very popular nonconvex technique for learning low dimensional embeddings, called t-SNE. This is an extension of the earlier stochastic neighbor embedding (SNE).

20.4.10.1 Stochastic neighbor embedding (SNE)

The basic idea is to convert high-dimensional Euclidean distances into conditional probabilities. More precisely, we define $p_{j|i}$ to be the probability that point $i$ would pick point $j$ as its neighbor:

$$p_{j|i}=\frac{\exp(-\frac{1}{2\sigma_i^2}||\bold{x}_i-\bold{x}_{j}||^2)}{\sum_{k\neq i}\exp(-\frac{1}{2\sigma_i^2}||\bold{x}_i-\bold{x}_k||^2)}$$

Here $\sigma^2_i$ is the variance for data point $i$, which can be used to magnify the scale of points in dense regions of input space, and diminish the scale in sparser regions.

Let $\bold{z}_i$ be the low-dimensional embedding representing $\bold{x}_i$. We define similarities in the low-dimensional embedding space in a similar way:

$$q_{j|i}=\frac{\exp(-||\bold{z}_i-\bold{z}_j||^2)}{\sum_{k\neq i}\exp(-||\bold{z}_i-\bold{z}_k||^2)}$$

If the embedding is good, then $q_{j|i}$ should match $p_{j|i}$. Therefore, SNE defines the objective:

$$\mathcal{L}=\sum_{i}D_{\mathbb{KL}}(P_i\,||\,Q_i)=\sum_i\sum_j p_{j|i}\log \frac{p_{j|i}}{q_{j|i}}$$

where $P_i$ is the conditional distribution over all other data points given $\bold{x}_i$.

Note that this is an asymmetric objective: the cost is high if a small $q_{j|i}$ is used to model a large $p_{j|i}$. This objective will prefer pulling distant points together rather than pushing nearby points apart.

We can get a better sense of the geometry by computing the gradient of the objective at $\bold{z}_i$:

$$\nabla_{\bold{z}_i}\mathcal{L}(Z)=2\sum_{j}(\bold{z}_i-\bold{z}_j)(p_{j|i}-q_{j|i}+p_{i|j}-q_{i|j})$$

Thus points are pulled together if the pp’s are bigger than the qq’s, and repelled otherwise.

This objective is not convex, but it can be minimized using SGD nevertheless. In practice, it is helpful to add Gaussian noise to the embeddings and anneal the amount of noise.

20.4.10.2 Symmetric t-SNE

We can simplify SNE by minimizing a single KL divergence between the joint distributions $P$ and $Q$:

$$\mathcal{L}=D_{\mathbb{KL}}(P\,||\,Q)=\sum_{i<j}p_{ij}\log \frac{p_{ij}}{q_{ij}}$$

This is called symmetric SNE.

We can define $p_{ij}$ as:

$$p_{ij}=\frac{\exp(-\frac{1}{2\sigma^2}||\bold{x}_i-\bold{x}_j||^2)}{\sum_{k<l} \exp(-\frac{1}{2\sigma^2}||\bold{x}_k-\bold{x}_l||^2)}$$

This corresponds to the gradient:

$$\nabla_{\bold{z}_i}\mathcal{L}(Z)=2\sum_{j}(\bold{z}_i-\bold{z}_j)(p_{ij}-q_{ij})$$

This gives similar results to regular SNE when $L\ll D$.

20.4.10.3 t-SNE

A fundamental problem with SNE and many other embedding techniques is that they tend to squeeze points that are far apart in the high dimensional space close together in the low dimensional embedding space; this is called the crowding problem. It is due to the use of squared errors (or Gaussian probabilities).

One solution to this is to use a probability distribution in latent space that has heavier tails, which eliminates the unwanted force of attraction between distant points in the high dimensional space.

We can choose the Student-t distribution. In t-SNE, the degrees of freedom are set to $\nu=1$, so the distribution is equivalent to a Cauchy:

$$q_{ij}=\frac{(1+||\bold{z}_i-\bold{z}_j||^2)^{-1}}{\sum_{k<l}(1+||\bold{z}_k-\bold{z}_l||^2)^{-1}}$$

This uses the same objective as the symmetric SNE, and the gradient turns out to be:

$$\nabla_{\bold{z}_i}\mathcal{L}(Z)=4\sum_j(\bold{z}_i-\bold{z}_j)(p_{ij}-q_{ij})(1+||\bold{z}_i-\bold{z}_j||^2)^{-1}$$

The new term acts like an inverse square law. This means data points in the embedding space act like stars and galaxies, forming well-separated clusters, each of which has many stars tightly packed inside.

This can be useful for separating different classes of data in an unsupervised way.
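
A usage sketch on the digits data with scikit-learn’s TSNE (perplexity 30 is the library default):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)          # 1797 points in 64 dimensions

tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
Z = tsne.fit_transform(X)                    # (1797, 2); clusters roughly follow the digit labels
```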


20.4.10.4 Choosing the length scale

The local bandwidth $\sigma_i^2$ is an important parameter in t-SNE, and is usually chosen so that $P_i$ has a perplexity chosen by the user, defined as $2^{\mathbb{H}(P_i)}$ where:

$$\mathbb{H}(P_i)=-\sum_j p_{j|i}\log_2 p_{j|i}$$

This can be interpreted as a smooth measure of the effective number of neighbors.

Unfortunately, the results of t-SNE can be quite sensitive to the perplexity parameter, so it is often advised to run the algorithm with different values.
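
The bandwidths $\sigma_i$ are typically found by a per-point binary search on the perplexity; here is a minimal sketch (the function names are ours, and real implementations differ in details such as numerical safeguards):

```python
import numpy as np

def conditional_probs(sq_dists_i, sigma):
    """p_{j|i} for one point, given squared distances to the other points."""
    logits = -(sq_dists_i - sq_dists_i.min()) / (2 * sigma ** 2)  # shift for stability
    p = np.exp(logits)
    return p / p.sum()

def sigma_for_perplexity(sq_dists_i, target_perplexity, n_iter=50):
    """Binary search for the sigma_i whose conditional distribution has the target perplexity."""
    lo, hi = 1e-10, 1e10
    for _ in range(n_iter):
        sigma = 0.5 * (lo + hi)
        p = conditional_probs(sq_dists_i, sigma)
        perplexity = 2.0 ** (-np.sum(p * np.log2(p + 1e-12)))    # 2^{H(P_i)}
        if perplexity > target_perplexity:
            hi = sigma          # too many effective neighbors: shrink the bandwidth
        else:
            lo = sigma
    return sigma
```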

Figure: t-SNE embeddings of the same dataset for different perplexity values.

If the perplexity is too small, the algorithm will find structure within the clusters that is not really present. At perplexity 30 (the default in scikit-learn), the clusters seem equidistant in the embedding space.

20.4.10.5 Computational issues

The naive implementation of t-SNE takes $O(N^2)$ time. A faster version can be created by leveraging an analogy to $N$-body simulation in physics.

In particular, the gradient requires computing the forces of $N$ points on each of $N$ points. However, points that are far away can be grouped into clusters (computationally speaking), and their effective force can be approximated by a few representative points per cluster.

We can then approximate the forces using the Barnes-Hut algorithm, which takes $O(N\log N)$ time.

Unfortunately, this only works well for low dimensional embeddings, such as $L=2$ (which is well suited to data visualization).

20.4.11 UMAP

Various extensions of t-SNE have been proposed to improve its speed, the quality of its embeddings, or the ability to embed into more than two dimensions.

Uniform Manifold Approximation and Projection (UMAP) is a popular recent extension. At a high level, it is similar to t-SNE, but it tends to preserve global structure better, and it is much faster. This makes it easier to try different hyperparameters.
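
A usage sketch with the third-party umap-learn package (assumed to be installed separately; the hyperparameters shown are illustrative):

```python
import umap                                   # pip install umap-learn
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=0)
Z = reducer.fit_transform(X)                  # (1797, 2) embedding
```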