1. Introduction

Uniform Manifold Approximation and Projection (UMAP) belongs to a class of algorithms known as dimensionality reduction methods. They map high-dimensional data into lower-dimensional spaces.

In this tutorial, we’ll review UMAP and how to interpret UMAP plots.

2. Overview of Manifolds

Before we delve into the inner workings of UMAP, let’s briefly review manifolds.

In simple terms, a manifold is a topological space that resembles Euclidean space:

A diagram showing manifold structures

So, each data point in the space belongs to a neighborhood where the distance between neighboring points is measurable. Measurability means we can compute how far points are from each other in a neighborhood.

3. What Is UMAP?

Leland McInnes, John Healy, and James Melville first introduced UMAP as an alternative method to t-SNE for dimensionality reduction. According to its developers, UMAP yields better results than t-SNE in visualizing data’s inherent structure.

Consequently, the algorithm has two core components: manifold approximations and fuzzy simplicial set representations. Manifold approximations refer to finding the manifold space in which the high-dimensional data lies.

A simplicial set representation provides a simplified way to represent data, connecting neighboring points with graph edges. Introducing the concept of “fuzzy” means that the connections between data points are more vague and flexible, so no point is strictly connected to just one point.

3.1. Pseudocode

Let X = \{x_1, x_2, \ldots, x_n\} be the dataset in the original high-dimensional space.

First, we approximate the manifold with the neighborhood graph M in which X lies. Then, we construct a fuzzy simplicial set representation, F, based on M.

Once constructed, the graph is mapped onto a lower dimensional space Y = \{y_1, y_2, \ldots, y_n\}, with usually 2 or 3 dimensions. In Y, every Y_i represents a point X_i in X.

UMAP then optimizes X and Y by minimizing the cross-entropy between the two.

Here’s the pseudocode:

algorithm UMAP(X): 
    // INPUT 
    //    X = {x_1, x_2, ...., x_n} = the original data 
    // OUTPUT 
    //    Y= {y_1, y_2, ...., y_n} = the transformed data
   
    D <- calculate pairwise distances between the elements of X 
    M <- construct the neighborhood graph, using D 
    F <- using M, construct a fuzzy simplicial set, the representation of the high-dimensional data
    Y <- an initial random representation of F in a low-dimensional space
    while not converged:
        update Y by minimizing the cross entropy of F and Y
    return Y

3.2. Approximating a Manifold

A manifold describes the underlying structure of the high-dimensional data.

To approximate the manifold, we first construct a neighborhood graph. We do this by computing pairwise distances between the data points in the high-dimensional space:

Diagram pairwise distances between datapoints

Subsequently, we identify the nearest neighbors for each data point based on the calculated distances and the predefined number of neighbors to consider. Then, we identify the neighborhoods to which each data point belongs by connecting data points to their nearest neighbors, thus creating a neighborhood graph.

3.3. Constructing a Fuzzy Simplicial Complex

Similarly, another important step is to construct a fuzzy simplicial complex. The neighborhood graph is transformed into a fuzzy simplicial complex based on the distances between the individual data points.

The simplicial set allows a representation of the data that captures the relationships between all the data points. This is achieved through fuzzy connections, which allow flexible connections between individual data points.

3.4. Transforming to a Low-Dimensional Space

Once the high-dimensional is encoded in the fuzzy simplicial set, it can be mapped to a lower-dimensional space. We start with an initial random lower-dimensional representation, Y, of the high-dimensional data, X. Subsequently, we compute and minimize the cross entropy loss between X and Y.

We do this iteratively until we obtain a representation in  Y that maintains the topological structure of the data in X.

4. The Hyperparameters of UMAP

To optimize UMAP’s performance, we must tune four main hyperparameters: the number of neighbors, the minimum distance, the number of components, and the metric.

Firstly, the number of neighbors refers to the number of data points to consider when constructing a neighborhood. A few neighbors emphasize the local structures, while many emphasize the global structure.

The minimum distance describes the minimum distance required to consider two points as neighbors.

The number of components refers to the dimensionality of the space to which we reduce the initial high-dimensional data. This target space has fewer dimensions than the original one, usually 2 or 3.

Lastly, the metric hyperparameter determines the method used to calculate the distance between data points. In most cases, it’s either the Euclidean or Manhattan distances.

5. How Do We Interpret a UMAP Plot?

UMAP emphasizes the data’s global structure while preserving the dataset’s local relationships.

Transforming data into a lower-dimensional space (2D or 3D) makes it interpretable to the human eye. So, UMAP enables the interpretation and visualization of high-dimensional data.

5.1. A Random Dataset Drawn From a Gaussian Distribution

Let’s consider a random dataset drawn from a Gaussian distribution with 250 features and 250 data points.

We applied UMAP with the hyperparameters: number of components=2, number of neighbors=25, minimum distance=0.5, and the Euclidean distance. The result is:

UMAP0

The UMAP plot shows the data points spread across a 2D space. We detect a single global structure to which all the data points belong. This suggests all the data points are related, as there are no separate clusters.

5.2. A Random Dataset Drawn From Two Gaussian Distributions

Let’s consider another dataset with 250 features and 250 data points from two Gaussian distributions having different means.

We apply UMAP with the same hyperparameters as in the previous example:

UMAP1

Now, we see two main structures depicting the two classes present in the data: the two distributions from which the data was drawn.

6. Conclusion

In this article, we covered the basics of UMAP and how to interpret UMAP plots. It’s a popular algorithm for data analysis and dimensionality reduction.

UMAP combines manifold approximation and simplicial sets to find topological structures to map high-dimensional data into low-dimensional and more easily interpretable spaces.


原始标题:How Does UMAP Dimensionality Reduction Work?