UMAP: Uniform Manifold Approximation and Projection

In partial fulfilment of CS460: Machine Learning | By Jyotirmaya Shivottam[a][b]



Table of Contents

Introduction
How does UMAP work?
How does UMAP work? (Detailed)
Code & Visualization
References & Additional Resources

If you encounter errors of any sort, please open an issue here: GitHub

Introduction

Dimensionality reduction techniques form a core part of Data Science and Applied Machine Learning. These methods help in visualizing data, as well as in pre-processing it for various machine learning pipelines. Due to computational constraints, we usually want to know which of the features in our dataset are actually important to what we are studying, and dimensionality reduction is a standard approach to uncovering these relevant or "latent" features. As a bonus, these methods give us a low-dimensional embedding of the data that is useful for visualization, with minimal to no loss of contextual information.

Many dimensionality reduction algorithms exist today, broadly falling under either Matrix Factorization methods, such as Principal Component Analysis, Autoencoders or Word2Vec, or methods that make use of Neighbor Graphs, like Laplacian Eigenmaps or t-Distributed Stochastic Neighbor Embedding (tSNE). The difference between the two types comes down to which aspects of the data's overall structure they preserve: the former algorithms preserve pairwise distances between datapoints, while the latter preserve either the local or the global structure. tSNE, which prioritizes local over global structure, has been the state-of-the-art algorithm for almost a decade, only recently being supplanted by UMAP, which claims to preserve the complete local structure and most of the global structure.

UMAP, an acronym for Uniform Manifold Approximation and Projection, is a recent unsupervised ML technique that has rapidly grown in popularity and usage and is now considered state-of-the-art, owing to the massive speed improvements and scalability it delivers over tSNE and other algorithms. Moreover, the original paper (referenced below) also lays out a solid mathematical foundation for the Neighbor Graphs approach, which is helpful for further research and applications. For example, HDBSCAN is a clustering algorithm that has shown better results when applied to outputs from UMAP.

The original paper does not shy away from the mathematical details. This is a good thing, because, as mentioned earlier, it provides a framework for Neighbor Graph algorithms. However, for most of us, the full treatment can get rather arcane, so it is worth distilling the algorithm down to the high-level components that shape it, with just enough descriptive math. So here, I offer you a choice:

Pick either of the two sections below: the high-level overview, or the detailed walkthrough.

How does UMAP work?

Broadly speaking, the algorithm can be understood as a 3-step process:

Step 1: Creation of Fuzzy Complexes


The algorithm takes each datapoint and identifies its \(n\) nearest neighbors. It then builds a set, or complex, around each point containing these neighbors. These sets are "fuzzy": membership is not a yes/no question of whether a point falls in the neighborhood, but a probabilistic weight between \(0\) and \(1\). Here, \(n\) is a hyperparameter of the algorithm.

Let us consider an example of a noisy sine-wave distribution to better visualize this. Also, for simplicity's sake, let us assume these "fuzzy sets or complexes" to be circles.

[Figure: Noisy sine-wave distribution]

Fixing \(n = 1\), which for this toy example corresponds to a circle of radius \(1\), we arrive at the following result:

[Figure: Fuzzy complexes (circles) around each datapoint]
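To make Step 1 concrete, here is a minimal sketch (not umap-learn's actual implementation) of finding each point's nearest neighbors and turning the neighbor distances into fuzzy membership weights. It assumes NumPy and scikit-learn are available, the dataset and variable names are purely illustrative, and the exponential-decay weighting is a simplification of what the library really does.

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)

# Toy noisy sine-wave dataset, similar to the one pictured above
x = rng.uniform(0, 4 * np.pi, 300)
data = np.column_stack([x, np.sin(x) + rng.normal(0, 0.2, 300)])

n = 15  # the n_neighbors hyperparameter
nn = NearestNeighbors(n_neighbors=n).fit(data)
dists, idx = nn.kneighbors(data)  # both of shape (n_points, n)

# Fuzzy membership: the weight decays with distance, so a neighbor is not
# simply "in" or "out" of the set, but in it with some probability in (0, 1]
weights = np.exp(-dists / dists.mean())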

Step 2: Graph Generation


The next step uses the fuzzy sets to generate a graph, in which the datapoints act as nodes, edges are formed only between the \(n\) nearest neighbors, and the weights are assigned based on the distance function chosen during the first step, which, in our case, is the pairwise distance, giving a probabilistic output between \(0\) and \(1\).

[Figure: Weighted neighbor graph over the sine-wave data]

Remember, this is just another representation of the second image above, but it is computationally far more convenient to work with (as it is just a graph). Another notable point is that this representation already does a reasonable job of capturing the fundamental structure of the dataset. The reason for this is rooted in a theorem called the Nerve Theorem.
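Continuing the illustrative sketch from Step 1 (same caveats apply; it reuses data, n, idx and weights from that snippet), the neighbor indices and fuzzy weights can be assembled into a sparse weighted graph:

from scipy.sparse import coo_matrix

n_points = data.shape[0]
rows = np.repeat(np.arange(n_points), n)  # each point connects to its n neighbors
cols = idx.ravel()
graph = coo_matrix((weights.ravel(), (rows, cols)), shape=(n_points, n_points)).tocsr()
# graph[i, j] is the fuzzy weight of the (directed) edge from point i to its neighbor j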

Step 3: Force-Directed Layout


Now, we want to find a low-dimensional representation of our data. In UMAP, as with kNN, the closeness of two points is taken as a measure of how "related" they are. As such, the weights on the edges play an important role in clustering related datapoints. UMAP treats each weight as a "physical force" of sorts, pushing or pulling on each point, and feeds the weights into a loss function that accounts for these forces and yields a stable output. It then optimizes this loss function to obtain the most suitable low-dimensional representation of our data. Since the weights can be interpreted as Bernoulli (edge-existence) probabilities, UMAP uses Binary Cross Entropy as the loss function. Its mathematical form is given below: \[ \boxed{\sum_{e \in E} w_h(e) \log\left(\frac{w_h(e)}{w_l(e)}\right) + (1 - w_h(e))\log\left(\frac{1 - w_h(e)}{1 - w_l(e)}\right)} \] Here, \(E\) is the edge-set and \(w_h\) and \(w_l\) represent the weights in the high- and low-dimensional cases, respectively. The first term provides the "attractive force" across an edge: whenever the high-dimensional weight \(w_h\) is large, this term is minimized by making \(w_l\) large as well, which corresponds to the smallest possible distance between the datapoints in the embedding. On the other hand, the second term provides a "repulsive force", which is minimized by smaller values of \(w_l\), implying a larger distance between datapoints. So, in a sense, the first term helps group related data (local structure), while the second term helps space out unrelated data (global structure).
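As an illustration (not the library's code), the loss can be evaluated directly from two arrays of per-edge weights; w_h and w_l below are hypothetical weights in the original space and in a candidate embedding, respectively.

import numpy as np

def fuzzy_set_cross_entropy(w_h, w_l, eps=1e-12):
    """Binary cross entropy between high- and low-dimensional edge weights."""
    w_h = np.clip(w_h, eps, 1 - eps)
    w_l = np.clip(w_l, eps, 1 - eps)
    attractive = w_h * np.log(w_h / w_l)                    # pulls connected points together
    repulsive = (1 - w_h) * np.log((1 - w_h) / (1 - w_l))   # pushes unrelated points apart
    return np.sum(attractive + repulsive)

# Example with three edges
print(fuzzy_set_cross_entropy(np.array([0.9, 0.5, 0.1]), np.array([0.8, 0.5, 0.3])))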

From a machine learning perspective, note that the loss function is "local", as it depends on pairs of datapoints; each weight depends on at most \(2k - 1\) points, implying that the cost of each update does not grow with dataset size. Finally, optimizing this push-and-pull process through Stochastic Gradient Descent or similar algorithms, we arrive at a stable low-dimensional representation of our data that captures the overall structure reasonably accurately. Notice the clean separation between the yellower and bluer points, as in the original data.

[Figure: UMAP output for the noisy sine-wave data]

Now, let us take some real-world datasets and apply UMAP to better understand the hyperparameters of this algorithm.


How does UMAP work? (Detailed)

Broadly speaking, UMAP assumes an underlying manifold structure to the data and uses local manifold approximations to obtain fuzzy simplicial sets, which are patched together to construct a topological representation of the high-dimensional data; in practice, this representation is a weighted graph. A similar process can be used to construct a low-dimensional representation, and UMAP then minimizes the Binary Cross Entropy between the two representations (essentially a force-directed graph layout) to obtain a stable low-dimensional embedding.

Sidenote: A simplicial complex is just a collection of simplices glued together. See the picture below: smaller simplices have been attached to form larger simplicial complexes.

[Figure: A simplicial complex built from smaller simplices]

Assume that the dataset has a topological structure to it. Then, as with most dimensionality reduction techniques, we want to accurately capture the topology of the datapoint-space. An initial step would be to generate an Open Cover. Let us take a noisy sine-wave distribution, assumed to lie on a manifold.

[Figure: Noisy sine-wave distribution]

Also, for simplicity's sake, let us take the open sets of the cover to be unit circles.

[Figure: Fuzzy complexes (unit circles around each datapoint)]

Recall that the presence of a manifold implies the presence of a metric, and this helps us define local distances. We can use these to form simplicial complexes, specifically the Vietoris–Rips Complex, whose 0-, 1-, and 2-simplices (all in 2D here) appear as points, lines, and triangles.

[Figure: Vietoris–Rips complex built on the sine-wave data]

Note that this captures the topology really well: the gaps and the locally dense clusters are represented by an absence and an over-abundance of edges, respectively. The reason for this is rooted in the Nerve Theorem.
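As a toy illustration (under the same simplifying assumptions, and not how umap-learn itself builds its complex), the 1-skeleton of a Vietoris–Rips complex can be built by connecting every pair of points whose distance is at most a chosen threshold:

import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
x = rng.uniform(0, 4 * np.pi, 100)
points = np.column_stack([x, np.sin(x) + rng.normal(0, 0.2, 100)])

radius = 1.0  # radius of the open cover; two balls of radius r intersect iff their centers are within 2r
dist = squareform(pdist(points))
edges = np.argwhere((dist <= 2 * radius) & (dist > 0))
edges = edges[edges[:, 0] < edges[:, 1]]  # keep each undirected edge once
# `edges` now lists the 1-simplices of the Vietoris–Rips complex at scale 2 * radius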

While all this sounds feasible, applying the same approach to actual data raises several practical questions. These are detailed below:

How do we choose the correct radius for our cover?

Usually, the answer is trial and error. But we know that algorithms like Laplacian Eigenmaps overcome this issue with a simple uniformity assumption on the data. This assumption may not hold all the time, but it leads us to ask: what do we know about a uniform dataset?

Enter Riemannian Geometry.

General Relativity folks might know the idea: expand and contract the datapoint-space so that the uniform-distance assumption becomes true, i.e., define local metric spaces so that the neighborhood distances remain constant.

[Figure: Ball plot in Euclidean 2D space]

In the image above, the distances seem to vary, but that is a result of projecting from the Riemannian to the Euclidean space. With the local distance functions (metrics) in place, the distances are, in fact, the same. As a bonus, these functions help us determine the edge-weights in the simplicial complex. We can go one step further and make these weights fuzzy, i.e., the certainty that a datapoint lies within a circle of a given radius decays as we move away from the center of the circle.

[Figure: Fuzzy ball cover]

What about datapoints that are completely isolated in the higher dimensions?

Here, UMAP makes an assumption of local connectivity, i.e., the nearest-neighbor distance is prioritized over absolute distances. This effectively bypasses the Curse of Dimensionality, as neighborhood distances are forced to be small and constant, due to the Riemannian metric.
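The local metric plus the local-connectivity constraint can be sketched as follows: for each point, the distance to its nearest neighbor (\(\rho\)) is subtracted off, and the remaining distances are rescaled by a per-point factor (\(\sigma\)). This is a simplified, illustrative version; in umap-learn, \(\sigma\) is found by a binary search so that the weights sum to roughly \(\log_2(k)\), whereas the sketch below just uses the mean distance as a stand-in.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_fuzzy_weights(data, k=15):
    """Per-point fuzzy membership weights with local connectivity (simplified)."""
    dists, idx = NearestNeighbors(n_neighbors=k + 1).fit(data).kneighbors(data)
    dists, idx = dists[:, 1:], idx[:, 1:]      # drop each point itself
    rho = dists[:, 0:1]                        # distance to the nearest neighbor
    sigma = np.maximum(dists.mean(axis=1, keepdims=True) - rho, 1e-8)
    # Local connectivity: the nearest neighbor always gets weight exactly 1
    weights = np.exp(-np.maximum(dists - rho, 0.0) / sigma)
    return weights, idx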

But are the local metrics not incompatible?

Yes, they are. But recall that, for our use-case, simplicial complexes are essentially graphs. To overcome this issue, we can make our graph directed, with mutual weights assigned to the pair of edges between neighboring points.

[Figure: Directed graph with incompatible weights]

Now, we need to symmetrize, and UMAP does this via a probabilistic interpretation of the weights (say, \(a\) and \(b\) for the two directed edges between a pair of points), giving the combined weight as: \[\boxed{a + b - a \cdot b}.\] Think of \(a\) and \(b\) as the probabilities that each directed edge exists; the combined weight is then the probability that at least one of the two edges exists, i.e., the sum of the probabilities minus their product. Applying this process to all the points, we can combine all the fuzzy simplicial sets, ending up with a single complex or, analogously, a weighted undirected graph.

[Figure: Graph with symmetrized weights]
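In matrix form, if W holds the directed fuzzy weights (row i gives the weights point i assigns to its neighbors), the symmetrization is one line. This sketch uses a toy dense NumPy array for clarity, whereas umap-learn performs the same operation on sparse matrices.

import numpy as np

W = np.array([[0.0, 0.9, 0.2],
              [0.4, 0.0, 1.0],
              [0.0, 0.7, 0.0]])   # toy directed fuzzy weights

# Probabilistic combination: P(edge i-j exists) = a + b - a*b
W_sym = W + W.T - W * W.T
print(W_sym)  # symmetric: W_sym[i, j] == W_sym[j, i]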

How do we find the low-dimensional topological embedding?

For the low-dimensional representation, we assume that the data can be embedded into a low-dimensional Euclidean space, which provides us with a global Euclidean distance measure. Now, we need to match our fuzzy topological representation with a good low-dimensional one, and this can be framed as an optimization problem. UMAP chooses Binary Cross Entropy to measure the dissimilarity between the two representations. Its mathematical form is given below: \[ \boxed{\sum_{e \in E} w_h(e) \log\left(\frac{w_h(e)}{w_l(e)}\right) + (1 - w_h(e))\log\left(\frac{1 - w_h(e)}{1 - w_l(e)}\right)} \] Here, \(E\) is the edge-set and \(w_h\) and \(w_l\) represent the weights in the high- and low-dimensional cases, respectively. The first term provides an "attractive force" across an edge: whenever the high-dimensional weight \(w_h\) is large, this term is minimized by making \(w_l\) large as well, which corresponds to the smallest possible distance between the datapoints in the embedding. On the other hand, the second term provides a "repulsive force", which is minimized by smaller values of \(w_l\), implying a larger distance between datapoints. So, in a sense, the first term helps group related data (local structure), while the second term helps space out unrelated data (global structure).

From a machine learning perspective, note that this loss function is "local", as it depends on pairs of datapoints; each weight depends on at most \(2k - 1\) points, implying that the cost of each update does not grow with dataset size. Finally, optimizing this push-and-pull process, we arrive at a stable low-dimensional representation of our data that captures the overall structure reasonably accurately. Notice the clean separation between the yellower and bluer points, as in the original data.

[Figure: UMAP output for the noisy sine-wave data]

Sidenote: In practice, UMAP uses Random Projection Trees and Nearest Neighbor Descent to efficiently obtain the nearest neighbors and the fuzzy simplicial complexes. For the optimization, it uses Stochastic Gradient Descent with negative sampling to quickly reach a stable, low-dimensional representation.
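The optimization loop can be sketched as follows. This is a heavily simplified, illustrative version (it assumes a low-dimensional similarity of \(w_l = 1/(1 + d^2)\), a linearly decaying learning rate, and plain attractive/repulsive moves), not the library's actual negative-sampling code; all names are hypothetical.

import numpy as np

def sgd_layout(edges, w_h, n_points, n_epochs=200, n_neg=5, lr=1.0, seed=0):
    """Toy force-directed layout: attract along sampled edges, repel negative samples."""
    rng = np.random.default_rng(seed)
    emb = rng.normal(size=(n_points, 2))           # random 2D initialization
    for epoch in range(n_epochs):
        alpha = lr * (1 - epoch / n_epochs)        # decaying learning rate
        for (i, j), w in zip(edges, w_h):
            if rng.random() > w:                   # sample each edge with probability w_h
                continue
            diff = emb[i] - emb[j]
            step = -2 * diff / (1 + diff @ diff)   # descent direction of the attractive term log(1 + d^2)
            emb[i] += alpha * np.clip(step, -4, 4)
            emb[j] -= alpha * np.clip(step, -4, 4)
            for _ in range(n_neg):                 # negative sampling: push random points away
                k = rng.integers(n_points)
                diff = emb[i] - emb[k]
                d2 = diff @ diff + 1e-3
                step = 2 * diff / (d2 * (1 + d2))  # descent direction of the repulsive term -log(1 - w_l)
                emb[i] += alpha * np.clip(step, -4, 4)
    return emb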

Now, let us take some real-world datasets and apply UMAP to better understand the hyperparameters of this algorithm.


Code & Visualization

For this part, we will make use of umap-learn, a Python library from the authors of the original paper. To run the code on your machine, please go through the requirements of the package and then install it through conda or python -m pip install umap-learn[plot]. Here, [plot] installs the dependencies for the plotting module. Also, all the code snippets here can be found in a Jupyter Notebook.

Now, let us import the required libraries and load or fetch some datasets: Digits, MNIST and FashionMNIST. We will use the latter two to showcase how UMAP embeds them into a 2D space, while the analysis of UMAP hyperparameters will be done on the smaller Digits dataset. Also, note that while umap-learn supports supervised learning through an optional parameter, we will be focusing on the unsupervised aspect.

import umap as u
# umap.plot may raise errors as of umap-learn-0.4.6
from umap import plot as uplot

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets

# Datasets
digits = datasets.load_digits()
# as_frame=False keeps the data as NumPy arrays (newer scikit-learn versions may return DataFrames)
mnist = datasets.fetch_openml('mnist_784', as_frame=False)
fmnist = datasets.fetch_openml('Fashion-MNIST', as_frame=False)
Having loaded in the datasets, let us define a small function to help visualize what these datasets look like.
def plot_data(data):
    """
    Plots part of the data (100 datapoints)
    Adapted from: https://umap-learn.readthedocs.io/en/latest/basic_usage.html

    Parameters
    ----------
    data: numpy.ndarray
        Image (2D) data to plot

    """
    fig, ax_array = plt.subplots(10, 10)
    axes = ax_array.flatten()

    for i, ax in enumerate(axes):
        if data[i].ndim != 2:
            size = int(np.sqrt(data[i].shape[0]))
            ax.imshow(data[i].reshape((size, size)), cmap='gray_r')
        else:
            ax.imshow(data[i], cmap='gray_r')

    plt.setp(axes, xticks=[], yticks=[], frame_on=False)
    plt.tight_layout(pad=0., h_pad=0.1, w_pad=0.)

plot_data(digits.images)
plot_data(mnist.data)
plot_data(fmnist.data)
[Figures: sample images from the Digits, MNIST, and FashionMNIST datasets]
As we can see, Digits is just a lower-resolution handwritten-digits dataset compared to MNIST, and FashionMNIST is a fashion-item counterpart to MNIST with similar attributes. umap provides a class, UMAP, with fit() and fit_transform() methods to create an embedding in a low-dimensional space and to return the embedded data, respectively. To visualize the effect of UMAP, we will use UMAP().fit() (with default hyperparameters) on MNIST and FashionMNIST and plot the output as an interactive graph. Since bokeh, the Python library used for the interactive plotting, does not support an arbitrarily large number of labels, we will be using a subset of the datasets.
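For reference, a minimal non-interactive usage looks like the following (embedding Digits into 2D with fit_transform); the variable names here are just illustrative.

reducer = u.UMAP(n_components=2, random_state=42)
embedding = reducer.fit_transform(digits.data)   # array of shape (n_samples, 2)
plt.scatter(embedding[:, 0], embedding[:, 1], c=digits.target, cmap='Spectral', s=3)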
# MNIST: first 30000 samples
map_mnist30k = u.UMAP().fit(mnist.data[:30000])
hover_data_mnist = pd.DataFrame({'index':np.arange(30000), 'label':mnist.target[:30000]})
hover_data_mnist['item'] = hover_data_mnist.label.map(
    {
        '0':'0',
        '1':'1',
        '2':'2',
        '3':'3',
        '4':'4',
        '5':'5',
        '6':'6',
        '7':'7',
        '8':'8',
        '9':'9',
    }
)

labels_mnist = mnist.target[:30000]

p_mnist30k = uplot.interactive(map_mnist30k, labels=labels_mnist, hover_data=hover_data_mnist, point_size=2, theme="fire")
# Styling code is in Jupyter Notebook
uplot.show(p_mnist30k)

# FashionMNIST: first 30000 samples
map_fmnist30k = u.UMAP().fit(fmnist.data[:30000])
hover_data_fmnist = pd.DataFrame({'index':np.arange(30000), 'label':fmnist.target[:30000]})
hover_data_fmnist['item'] = hover_data_fmnist.label.map(
    {
        '0':'T-Shirt/top',
        '1':'Trouser',
        '2':'Pullover',
        '3':'Dress',
        '4':'Coat',
        '5':'Sandal',
        '6':'Shirt',
        '7':'Sneaker',
        '8':'Bag',
        '9':'Ankle Boot',
    }
)

labels_fmnist = fmnist.target[:30000]

p_fmnist30k = uplot.interactive(map_fmnist30k, labels=labels_fmnist, hover_data=hover_data_fmnist, point_size=2, theme="fire")
# Styling code is in Jupyter Notebook
uplot.show(p_fmnist30k)
These interactive plots are a nice way to play around with the output and see how UMAP groups related points together. However, one might observe that some datapoints seem to merge together, while a number of points seem to be mis-grouped. This is an artefact of choosing a smaller subset. If we take the entire dataset, we achieve far better results.
map_mnist = u.UMAP().fit(mnist.data)
map_fmnist = u.UMAP().fit(fmnist.data)
uplot.points(map_mnist, labels=mnist.target, width=400, height=400, theme="fire")
uplot.points(map_fmnist, labels=fmnist.target, width=400, height=400, theme="fire")
# Styling code is in Jupyter Notebook
[Figures: 2D embeddings of MNIST and FashionMNIST]
From the plots, we can see how UMAP accurately groups together "related" datapoints for each category, while also maintaining a sufficiently large gap between "less related" datapoints, thereby correctly capturing the local information. More specifically, note that, for MNIST, fours and nines lie close together, yet far from sixes and zeros, which use quite different strokes. This is an example of how UMAP successfully captures the global information.

Similar properties can be observed in the FashionMNIST plot. Footwear items (Sneakers, Boots, Sandals) are spaced away from topwear items (T-Shirts, Dresses, Coats, Pullovers), while locally, related items are huddled together.

Now that we have gained a qualitative understanding of UMAP's output, let us tweak its hyperparameters on the Digits dataset and observe the changes. The most important hyperparameter is certainly the number of neighbors, \(n\), but the local separation between points also becomes important for making sense of an embedding. This is controlled by tweaking min_dist and spread. As the names suggest, these control the minimum distance between points (local) and the overall spread of the data (global), respectively, with min_dist \(\le\) spread. Here, we will keep spread fixed at its default value of \(1.0\) and change only n_neighbors and min_dist, as these two effectively control the balance between local and global structure.

First, let us plot the embedding for Digits, using default UMAP hyperparameters (n_neighbors \(= 15\) and min_dist \(= 0.1\)):
map_digits = u.UMAP().fit(digits.data)
hover_data_digits = pd.DataFrame({'index':np.arange(digits.data.shape[0]), 'label':digits.target})

labels_digits = digits.target

p_digits = uplot.interactive(map_digits, labels=labels_digits, hover_data=hover_data_digits, point_size=2, theme="fire")
# Styling code is in Jupyter Notebook
uplot.show(p_digits)
Now, let us take combinations of various n_neighbors and min_dist values and plot them together.
import itertools

n_n = [5, 30]
m_d = [0.1, 0.9]
combinations = list(itertools.product(n_n, m_d))

for n, md in combinations:
    embed = u.UMAP(n_neighbors=n, min_dist=md).fit(digits.data)
    uplot.connectivity(embed, labels=digits.target, show_points=True, width=400, height=400, background="black", color_key_cmap="rainbow", edge_cmap="seismic_r")
[Figures: Embeddings of the Digits dataset with (n_neighbors, min_dist) = (5, 0.1), (5, 0.9), (30, 0.1), and (30, 0.9)]
A few things jump out from these plots. With a small n_neighbors, UMAP focuses on very local structure and the embedding breaks into many small, tight clumps; with a larger n_neighbors, more of the global arrangement of the digit classes is preserved. A small min_dist packs the points within a cluster tightly together, while a large min_dist spreads them out, making the clusters looser but easier to inspect. An excellent resource for experimenting with these UMAP hyperparameters on the web is Understanding UMAP by Andy Coenen and Adam Pearce. They also provide a comparison between UMAP and tSNE, the former state-of-the-art. Some other resources for similar visualizations and comparisons can be found in the next section. Meanwhile, consider the following tables from the original UMAP paper:
[Tables from the UMAP paper: runtime comparison with other algorithms, and scaling with input size]
These massive runtime and scalability improvements make UMAP a powerful data preprocessing tool. However, one must understand the hyperparameters of the algorithm before using it, and since the algorithm is stochastic, reruns are needed to properly interpret the output. Also, even though UMAP is state-of-the-art and can form a useful part of many ML pipelines, it is not a panacea. For example, for linear problems, algorithms like PCA may be faster and provide a far better understanding of the data. UMAP may also struggle with very sparse datasets, although this is typical for neighbor-graph-based algorithms. Understanding UMAP and the math behind it is recommended, as it provides excellent insight into a whole class of dimensionality reduction algorithms and has led to newer algorithms that are better suited than UMAP to particular applications (such as DensMAP).


References & Additional Resources