UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
Latest version infos (V1)
Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data:
- The data is uniformly distributed on a Riemannian manifold;
- The Riemannian metric is locally constant (or can be approximated as such);
- The manifold is locally connected.
From these assumptions it is possible to model the manifold with a fuzzy topological structure. The embedding is found by searching for a low dimensional projection of the data that has the closest possible equivalent fuzzy topological structure.
The details of the underlying mathematics can be found in our paper on arXiv:
McInnes, L., Healy, J., UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, arXiv e-prints 1802.03426, 2018
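As one concrete piece of that construction, the sketch below computes fuzzy membership strengths for a single point's nearest neighbors. It assumes the normalisation described in the paper: the nearest neighbor gets strength 1 (local connectivity), and a per-point scale sigma is chosen so that the strengths sum to log2(k). The function names `membership_strengths` and `fuzzy_union` are illustrative only, and the reference implementation includes further options that are not reproduced here.

```python
import math


def membership_strengths(distances, n_iter=64):
    """Fuzzy membership strengths of one point's k nearest neighbors.

    distances: positive distances from the point to its k nearest neighbors.
    Returns a list of strengths in (0, 1], one per neighbor.
    """
    k = len(distances)
    target = math.log2(k)
    rho = min(distances)  # distance to the nearest neighbor

    def total(sigma):
        # Sum of strengths for a given scale; increases with sigma.
        return sum(math.exp(-max(0.0, d - rho) / sigma) for d in distances)

    # Bracket sigma, then bisect until total(sigma) matches the target.
    lo, hi = 1e-12, 1.0
    while total(hi) < target:
        hi *= 2.0
    for _ in range(n_iter):
        mid = (lo + hi) / 2.0
        if total(mid) > target:
            hi = mid
        else:
            lo = mid
    sigma = (lo + hi) / 2.0
    return [math.exp(-max(0.0, d - rho) / sigma) for d in distances]


def fuzzy_union(a, b):
    """Combine the two directed strengths of an edge into one symmetric value."""
    return a + b - a * b


strengths = membership_strengths([1.0, 2.0, 3.0, 4.0])
print([round(s, 3) for s in strengths])
```

The directed strengths generally differ depending on which endpoint is taken as the centre, which is why a symmetrising step such as `fuzzy_union` is needed before the low dimensional layout is optimised.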
UMAP has several hyperparameters that can have a significant impact on the resulting embedding. In this notebook we will be covering the four major ones:
- n_neighbors
  - This parameter controls how UMAP balances local versus global structure in the data. It does this by constraining the size of the local neighborhood UMAP looks at when learning the manifold structure of the data. Low values of n_neighbors force UMAP to concentrate on very local structure (potentially to the detriment of the big picture), while large values push UMAP to look at larger neighborhoods of each point when estimating the manifold structure, losing fine detail for the sake of capturing the broader structure of the data.
- min_dist
  - The min_dist parameter controls how tightly UMAP is allowed to pack points together. It, quite literally, provides the minimum distance apart that points are allowed to be in the low dimensional representation. Low values of min_dist result in clumpier embeddings, which can be useful if you are interested in clustering or in finer topological structure. Larger values of min_dist prevent UMAP from packing points together and focus instead on preserving the broad topological structure.
- n_components
  - As is standard for many scikit-learn dimension reduction algorithms, UMAP provides an n_components parameter that lets the user set the dimensionality of the reduced space the data is embedded into. Unlike some other visualisation algorithms such as t-SNE, UMAP scales well in the embedding dimension, so you can use it for more than just visualisation in 2 or 3 dimensions.
- metric
  - This parameter determines how distance is computed in the input data space. The supported metrics include:
Minkowski style metrics
- euclidean
- manhattan
- chebyshev
- minkowski
Miscellaneous spatial metrics
- canberra
- braycurtis
- haversine
Normalized spatial metrics
- mahalanobis
- wminkowski
- seuclidean
Angular and correlation metrics
- cosine
- correlation
Metrics for binary data
- hamming
- jaccard
- dice
- russellrao
- kulsinski
- rogerstanimoto
- sokalmichener
- sokalsneath
- yule
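To give a feel for what a few of these names mean, here is a small pure-Python sketch with one or two metrics from several of the families above. These are the textbook definitions, written for illustration; the library's own implementations may treat edge cases (for example zero vectors) differently.

```python
import math


def euclidean(x, y):
    """Straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def chebyshev(x, y):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def cosine(x, y):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def hamming(x, y):
    """Fraction of positions at which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def jaccard(x, y):
    """For binary vectors: share of 'on' positions that are not on in both."""
    either = sum(1 for a, b in zip(x, y) if a or b)
    both = sum(1 for a, b in zip(x, y) if a and b)
    return 0.0 if either == 0 else (either - both) / either


print(euclidean((0, 0), (3, 4)), manhattan((0, 0), (3, 4)), chebyshev((0, 0), (3, 4)))
print(hamming((1, 1, 0, 0), (1, 0, 1, 0)), jaccard((1, 1, 0, 0), (1, 0, 1, 0)))
```

The choice matters because the neighborhoods UMAP builds are defined entirely by these distances: the same data can have quite different nearest neighbors under, say, euclidean and cosine.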
For more information, see https://umap-learn.readthedocs.io/en/latest/parameters.html
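The four hyperparameters described above are passed as keyword arguments to the `umap.UMAP` estimator from the umap-learn package. The sketch below is a minimal illustration: `make_umap_kwargs` and `embed` are hypothetical helper names, the range checks are this sketch's own rather than the library's, and the default values shown are the library defaults as far as we know. The import of `umap` is deferred into `embed` so the parameter handling can be tried without the package installed.

```python
def make_umap_kwargs(n_neighbors=15, min_dist=0.1, n_components=2,
                     metric="euclidean"):
    """Collect the four major UMAP hyperparameters into a keyword dict."""
    if n_neighbors < 2:
        raise ValueError("n_neighbors should be at least 2")
    if min_dist < 0:
        raise ValueError("min_dist should be non-negative")
    if n_components < 1:
        raise ValueError("n_components should be at least 1")
    return {
        "n_neighbors": n_neighbors,    # local versus global structure
        "min_dist": min_dist,          # how tightly points may be packed
        "n_components": n_components,  # dimension of the embedding
        "metric": metric,              # distance in the input space
    }


def embed(data, **overrides):
    """Fit UMAP on `data` (n_samples x n_features) and return the embedding."""
    import umap  # provided by the umap-learn package

    reducer = umap.UMAP(**make_umap_kwargs(**overrides))
    return reducer.fit_transform(data)


# Example settings: broader neighborhoods, tighter packing, cosine distance.
print(make_umap_kwargs(n_neighbors=50, min_dist=0.0, metric="cosine"))
```

With umap-learn installed, a call such as `embed(data, n_neighbors=50, metric="cosine")` would return an array with one row per sample and `n_components` columns.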