Introduction
How UMAP Works
How UMAP Works (Detailed)
Code & Visualization
References & Additional Resources
Introduction

Dimensionality reduction techniques form a core part of data science and applied machine learning. These methods help in visualizing data, as well as pre-processing it for various machine learning pipelines. Due to computational constraints, we usually want to know which features in our dataset actually matter to what we are studying, and dimensionality reduction is a standard approach to uncovering these relevant or "latent" features. As a bonus, these methods give us a low-dimensional embedding of the data that is useful for visualization, with minimal to no loss of contextual information.
Many dimensionality reduction algorithms exist today, broadly falling under either matrix factorization methods, such as Principal Component Analysis, autoencoders, or word2vec, or neighbor-graph methods, like Laplacian Eigenmaps or t-Distributed Stochastic Neighbor Embedding (tSNE). The difference between the two families comes down to how they preserve the overall structure of the data: the former preserve pairwise distances between datapoints, while the latter preserve local or global structure. tSNE, which prioritizes local over global structure, has been the state-of-the-art algorithm for almost a decade, and has only recently been supplanted by UMAP, which claims to preserve the complete local structure and most of the global structure.
UMAP, an acronym for Uniform Manifold Approximation and Projection, is a recent unsupervised ML technique that has rapidly grown in popularity and is now considered state-of-the-art, owing to the massive speed improvements and scalability it delivers over tSNE and other algorithms. Moreover, the original paper (referenced below) lays out a solid mathematical foundation for the neighbor-graph approach, which is helpful for further research and applications. For example, HDBSCAN is a clustering algorithm that has been shown to produce better results when applied to outputs from UMAP.
If you read the original paper, you will find that it does not shy away from the mathematical details. This is a good thing because, as mentioned earlier, it provides a framework for neighbor-graph algorithms. However, for most of us, the full treatment can get rather arcane. So here, I offer you a choice: a quick high-level tour of the components that shape the algorithm, or a more descriptive walk through the math behind it.
How UMAP Works

Broadly speaking, the algorithm can be understood as a 3-step process:

1. Construct a weighted neighbor graph (a fuzzy simplicial set) from the high-dimensional data, connecting each datapoint to its nearest neighbors.
2. Construct an analogous graph over candidate points in a low-dimensional space.
3. Optimize the low-dimensional representation to be as structurally similar to the high-dimensional one as possible, by minimizing the cross entropy between the two.
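In umap-learn, this entire pipeline is wrapped in a single estimator. Below is a minimal usage sketch on placeholder random data; the hyperparameter values shown are the library defaults:

import numpy as np
import umap

X = np.random.rand(500, 10)  # placeholder data: 500 points in 10 dimensions
# n_neighbors, min_dist and n_components below are the library defaults
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2).fit_transform(X)
print(embedding.shape)  # (500, 2)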
Now, let us take some real-world datasets and apply UMAP to better understand the hyperparameters of this algorithm.
How UMAP Works (Detailed)

Broadly speaking, UMAP assumes an underlying manifold structure to the data and uses local manifold approximations to obtain fuzzy simplicial sets, which are patched together to construct a topological representation of the high-dimensional data, essentially a weighted graph. A similar process can be used to construct a low-dimensional representation, and UMAP then minimizes the binary cross entropy between the two representations (via a force-directed graph layout) to obtain a stable low-dimensional embedding.
Sidenote: A simplicial complex is just a collection of simplices glued together. See the picture below - smaller simplices have been attached to form larger simplicial complexes.
Assume that the dataset has a topological structure to it. Then, as with most dimensionality reduction techniques, we want to accurately capture the topology of the datapoint-space. An initial step would be to generate an open cover of the data. Let us take a noisy sine-wave distribution, assumed to be associated with a manifold.
Also, for simplicity's sake, let us assume the open sets of the cover to be unit circles.
Recall that the presence of a manifold implies the presence of a metric, and this helps us define local distances. We can utilize these distances to form simplicial complexes, specifically the Vietoris–Rips complex, whose 0-, 1-, and 2-simplices (all in 2D) appear as points, line segments, and triangles.
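As a minimal sketch of this construction (not UMAP's actual code), we can build the 1-skeleton of a Vietoris–Rips complex over a noisy sine wave by connecting every pair of points whose covering balls overlap, i.e., whose distance is at most twice the radius:

import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(42)
x = np.linspace(0, 4 * np.pi, 100)
points = np.column_stack([x, np.sin(x) + rng.normal(0, 0.1, x.size)])

radius = 1.0  # unit balls, matching the cover assumed above
dists = squareform(pdist(points))
# 1-simplices (edges): pairs of distinct points whose balls intersect
i, j = np.where((dists > 0) & (dists <= 2 * radius))
edges = [(a, b) for a, b in zip(i, j) if a < b]  # keep each pair once
print(f"{len(points)} 0-simplices, {len(edges)} 1-simplices")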
Note that this captures the topology really well: the gaps and locally dense clusters are represented by an absence and an over-abundance of edges, respectively. The reason this works is rooted in the Nerve Theorem, which guarantees that the simplicial complex built this way from a (suitably nice) cover is topologically equivalent to the union of the cover's open sets.
While all this sounds feasible, applying it to actual data raises some practical questions, detailed below:
How do we choose the correct radius for our cover?
Usually, the answer is trial and error. But we know that algorithms like Laplacian Eigenmaps overcome this issue with a simple uniformity assumption on the data. This assumption may not hold all the time, but it leads us to ask: what can we say about the data if we assume it to be uniform?
What about datapoints that are completely isolated in the higher dimensions?

Here, UMAP makes an assumption of local connectivity, i.e., the nearest-neighbor distance is prioritized over absolute distances. This effectively bypasses the Curse of Dimensionality, as neighborhood distances are forced to be small and constant due to the Riemannian metric.
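A simplified sketch of the resulting edge weights, following the form given in the paper: each point's nearest-neighbor distance, rho, is subtracted off before exponentiating, so the closest neighbor always receives weight 1 (local connectivity). The real algorithm binary-searches a per-point normalization sigma; it is fixed here for brevity:

import numpy as np

def local_weights(dists_to_neighbors, sigma=1.0):
    # rho: distance to the nearest neighbor; guarantees local connectivity
    rho = dists_to_neighbors.min()
    return np.exp(-np.maximum(0.0, dists_to_neighbors - rho) / sigma)

print(local_weights(np.array([0.3, 0.7, 2.5])))  # first weight is exactly 1.0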
But are the local metrics not incompatible with each other?

Yes, they are. But recall that, for our use-case, simplicial complexes are essentially graphs. To overcome this issue, we first make the graph directed, with each point assigning its own weights to the edges toward its neighbors, and then combine each pair of opposing edge weights into a single undirected weight.
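UMAP performs this combination with a fuzzy union: the two directed weights a = w(i -> j) and b = w(j -> i) merge into a + b - ab, so an edge stays strong if it is strong in either direction. A one-line sketch:

def fuzzy_union(a, b):
    # probabilistic t-conorm: the "fuzzy OR" of the two directed weights
    return a + b - a * b

print(fuzzy_union(0.9, 0.2))  # 0.92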
How do we find the low-dimensional topological embedding?

From the manifold assumption about the data, we know that the data lies in a low-dimensional Euclidean space, which is exactly the space we are trying to embed into. This provides us a global Euclidean distance measure. Now, we need to match our fuzzy topology with a good low-dimensional topology, and this can be framed as an optimization problem. UMAP chooses the binary cross entropy to measure the dissimilarity between the two representations. Its mathematical form is given below: \[ \boxed{\sum_{e \in E} w_h(e) \log\left(\frac{w_h(e)}{w_l(e)}\right) + (1 - w_h(e))\log\left(\frac{1 - w_h(e)}{1 - w_l(e)}\right)} \] Here, \(E\) is the edge-set, and \(w_h\) and \(w_l\) represent the weights in the high- and low-dimensional cases, respectively. The first term provides an "attractive force" across an edge whenever the high-dimensional weight is large, because the term is minimized by making \(w_l\) as large as possible, which signifies the smallest possible distance between the datapoints. On the other hand, the second term provides a "repulsive force" that is minimized by smaller values of \(w_l\), implying a larger distance between the datapoints. So, in a sense, the first term helps group related data (local structure), while the second term helps space out unrelated data (global structure).
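To make this concrete, here is a minimal numeric sketch of the objective above over a shared edge set; the weight values are made up for illustration:

import numpy as np

def cross_entropy(w_h, w_l, eps=1e-12):
    w_l = np.clip(w_l, eps, 1 - eps)  # guard against log(0)
    attract = w_h * np.log(w_h / w_l)                   # attractive term
    repel = (1 - w_h) * np.log((1 - w_h) / (1 - w_l))   # repulsive term
    return np.sum(attract + repel)

w_h = np.array([0.9, 0.1, 0.8])                          # high-dimensional weights
print(cross_entropy(w_h, np.array([0.85, 0.15, 0.75])))  # similar layout: small loss
print(cross_entropy(w_h, np.array([0.10, 0.90, 0.20])))  # dissimilar layout: large loss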
Now, let us take some real-world datasets and apply UMAP to better understand the hyperparameters of this algorithm.

Code & Visualization

For this part, we will make use of umap-learn, a Python library from the authors of the original paper. To run the code on your machine, please go through the requirements of the package and then install it through conda or python -m pip install umap-learn[plot]. Here, [plot] installs the dependencies for the plotting module. Also, all the code snippets here can be found in a Jupyter Notebook.
Now, let us import the required libraries and load or fetch some datasets: Digits, MNIST, and FashionMNIST. We will be using the latter two to showcase how UMAP embeds them into a 2D space, while the analysis of UMAP hyperparameters will be done on the smaller Digits dataset. Also, note that while umap-learn supports supervised learning through an optional parameter, we will be focusing on the unsupervised aspect.
import numpy as np
import pandas as pd
import umap as u
# umap.plot may raise errors as of umap-learn-0.4.6
from umap import plot as uplot
import matplotlib.pyplot as plt
from sklearn import datasets

# Datasets (as_frame=False returns numpy arrays rather than DataFrames)
digits = datasets.load_digits()
mnist = datasets.fetch_openml('mnist_784', as_frame=False)
fmnist = datasets.fetch_openml('Fashion-MNIST', as_frame=False)
Having loaded in the datasets, let us define a small function to help visualize what these datasets look like.
def plot_data(data):
    """
    Plots part of the data (100 datapoints).
    Adapted from: https://umap-learn.readthedocs.io/en/latest/basic_usage.html

    Parameters
    ----------
    data: numpy.ndarray
        Image (2D) data to plot
    """
    fig, ax_array = plt.subplots(10, 10)
    axes = ax_array.flatten()
    for i, ax in enumerate(axes):
        if data[i].ndim != 2:
            # Flattened image: infer the square side length and reshape
            size = int(np.sqrt(data[i].shape[0]))
            ax.imshow(data[i].reshape((size, size)), cmap='gray_r')
        else:
            ax.imshow(data[i], cmap='gray_r')
    plt.setp(axes, xticks=[], yticks=[], frame_on=False)
    plt.tight_layout(pad=0., h_pad=0.1, w_pad=0.)
plot_data(digits.images)
plot_data(mnist.data)
plot_data(fmnist.data)
umap provides a class, UMAP, that has fit() and fit_transform() methods to create an embedding in low-dimensional space and to return the embedded data, respectively. To visualize the effect of UMAP, we will use UMAP().fit() (with default hyperparameters) on MNIST and FashionMNIST and plot the output as an interactive graph. Since bokeh, the Python library used for interactive plotting, does not support an arbitrarily large number of labels, we will be using a subset of the datasets.
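As a quick aside on the difference between the two methods, here is a sketch using the Digits data loaded above:

# fit() returns the trained UMAP object itself, which umap.plot consumes;
# fit_transform() returns the embedded coordinates directly.
mapper = u.UMAP().fit(digits.data)            # UMAP object, for uplot.points() etc.
coords = u.UMAP().fit_transform(digits.data)  # ndarray of shape (n_samples, 2)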
# MNIST: first 30000 datapoints
map_mnist30k = u.UMAP().fit(mnist.data[:30000])
hover_data_mnist = pd.DataFrame({'index': np.arange(30000), 'label': mnist.target[:30000]})
hover_data_mnist['item'] = hover_data_mnist.label  # digit labels are their own item names
labels_mnist = mnist.target[:30000]
p_mnist30k = uplot.interactive(map_mnist30k, labels=labels_mnist, hover_data=hover_data_mnist, point_size=2, theme="fire")
# Styling code is in Jupyter Notebook
uplot.show(p_mnist30k)
# FashionMNIST: first 30000 datapoints
map_fmnist30k = u.UMAP().fit(fmnist.data[:30000])
hover_data_fmnist = pd.DataFrame({'index': np.arange(30000), 'label': fmnist.target[:30000]})
hover_data_fmnist['item'] = hover_data_fmnist.label.map(
    {
        '0': 'T-Shirt/top',
        '1': 'Trouser',
        '2': 'Pullover',
        '3': 'Dress',
        '4': 'Coat',
        '5': 'Sandal',
        '6': 'Shirt',
        '7': 'Sneaker',
        '8': 'Bag',
        '9': 'Ankle Boot',
    }
)
labels_fmnist = fmnist.target[:30000]
p_fmnist30k = uplot.interactive(map_fmnist30k, labels=labels_fmnist, hover_data=hover_data_fmnist, point_size=2, theme="fire")
# Styling code is in Jupyter Notebook
uplot.show(p_fmnist30k)
map_mnist = u.UMAP().fit(mnist.data)
map_fmnist = u.UMAP().fit(fmnist.data)
uplot.points(map_mnist, labels=mnist.target, width=400, height=400, theme="fire")
uplot.points(map_fmnist, labels=fmnist.target, width=400, height=400, theme="fire")
# Styling code is in Jupyter Notebook
Apart from n_neighbors, two other hyperparameters worth knowing are min_dist and spread. As the names suggest, these control the minimum distance between the points (local) and the overall spread of the data (global), respectively, with min_dist \(\le\) spread. Here, we will keep spread fixed at its default value of \(1.0\) and change only n_neighbors and min_dist, as these two effectively control the balance between local and global structures.
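For reference, these are passed straight to the UMAP constructor; the values below are just an illustration (they are the defaults):

# min_dist must not exceed spread; both are shown at their default values
reducer = u.UMAP(n_neighbors=15, min_dist=0.1, spread=1.0)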
First, let us look at the Digits dataset embedded with the default hyperparameters (n_neighbors \(= 15\) and min_dist \(= 0.1\)):
map_digits = u.UMAP().fit(digits.data)  # default hyperparameters
hover_data_digits = pd.DataFrame({'index': np.arange(digits.data.shape[0]), 'label': digits.target})
labels_digits = digits.target
p_digits = uplot.interactive(map_digits, labels=labels_digits, hover_data=hover_data_digits, point_size=2, theme="fire")
# Styling code is in Jupyter Notebook
uplot.show(p_digits)
Now, let us vary the n_neighbors and min_dist values and plot the resulting embeddings together.
import itertools

def vary_hyperparameters(n_neighbors, min_dist):
    # Helper: fit UMAP on Digits with the given hyperparameters
    return u.UMAP(n_neighbors=n_neighbors, min_dist=min_dist).fit(digits.data)

n_n = [5, 30]
m_d = [0.1, 0.9]
combinations = list(itertools.product(n_n, m_d))
for n, md in combinations:
    embed = vary_hyperparameters(n_neighbors=n, min_dist=md)
    uplot.connectivity(embed, labels=digits.target, show_points=True, width=400, height=400, background="black", color_key_cmap="rainbow", edge_cmap="seismic_r")
To summarize the two hyperparameters:

n_neighbors is used to construct the high-dimensional graph, and thus controls how local vs. global structures are balanced. As with kNN, lower values constrain UMAP to consider local information in fine detail while foregoing most of the global information, and vice-versa.

min_dist controls how tightly points are clumped together in the low-dimensional space. Larger values imply that UMAP will prioritize the overall structure, with less emphasis on local clustering.

References & Additional Resources

McInnes, L., Healy, J., & Melville, J. (2018). "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction". arXiv:1802.03426.
umap GitHub repository: https://github.com/lmcinnes/umap
"How UMAP Works" - umap documentation (Read the Docs): https://umap-learn.readthedocs.io/en/latest/how_umap_works.html
"Basic UMAP Usage" - umap documentation (Read the Docs): https://umap-learn.readthedocs.io/en/latest/basic_usage.html