API Reference

class multi_mst.KMST(num_neighbors: int = 3, min_samples: int = 1, epsilon: float | None = None)

A scikit-learn-style estimator for computing a k-MST of a dataset. Adapts the Borůvka algorithm to look for k candidate edges per point, of which the k best per connected component are retained (up to epsilon times the shortest distance).

See MultiMSTMixin for inherited methods.

Parameters:
num_neighbors: int, optional

The number of edges to connect between each fragment. Default is 3.

min_samples: int, optional

The number of neighbors to use for computing core distances. Default is 1.

epsilon: float, optional

A fraction of the initial MST edge distance to act as an upper distance bound. Default is None.
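
The min_samples parameter above refers to core distances, which feed into the mutual reachability weights listed under Attributes. The following is a minimal sketch of the usual HDBSCAN-style definitions, not the library's implementation. It assumes a point is not counted as its own neighbor, which this reference does not specify, and the helper names are hypothetical.

```python
import math


def core_distances(points, min_samples):
    """Distance from each point to its min_samples-th nearest other point.

    Assumption: the point itself is excluded from its neighbor list.
    """
    cores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        cores.append(dists[min_samples - 1])
    return cores


def mutual_reachability(points, cores, a, b):
    """Largest of both core distances and the raw distance between a and b."""
    return max(cores[a], cores[b], math.dist(points[a], points[b]))


points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (7.0, 0.0)]
cores = core_distances(points, min_samples=1)
print(cores)                                     # [1.0, 1.0, 2.0, 4.0]
print(mutual_reachability(points, cores, 1, 2))  # 2.0
```

Larger min_samples values raise the core distances, so sparse points end up further apart in mutual reachability terms than in raw distance.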

Attributes:
graph_: scipy.sparse.csr_array

The computed k-minimum spanning tree as a sparse matrix with raw distance edge weights. Rows are sorted in ascending distance.

mutual_reachability_graph_: scipy.sparse.csr_array

The computed k-minimum spanning tree as a sparse matrix with mutual reachability edge weights. Rows are sorted in ascending distance.

minimum_spanning_tree_: numpy.ndarray, shape (n_points - 1, 3)

A minimum spanning tree edgelist with raw distances (unsorted).

mutual_reachability_tree_: numpy.ndarray, shape (n_points - 1, 3)

A minimum spanning tree edgelist with mutual reachability distances (unsorted).

knn_neighbors_: numpy.ndarray, shape (n_samples, num_neighbors)

The kNN indices of the input data.

knn_distances_: numpy.ndarray, shape (n_samples, num_neighbors)

The kNN (raw) distances of the input data.

kmst_neighbors_: numpy.ndarray, shape (n_samples, num_found_neighbors)

The kMST edges in kNN format, -1 marks invalid indices.

kmst_distances_: numpy.ndarray, shape (n_samples, num_found_neighbors)

The kMST (raw) distances in kNN format.
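
The "kNN format" used by kmst_neighbors_ and kmst_distances_ stores one row per point, with -1 marking unused slots. As a rough illustration, such arrays can be unpacked into an edge list as sketched below. Plain nested lists stand in for the numpy arrays, the helper name is hypothetical, and the infinite placeholder in the unused distance slots is an assumption.

```python
def knn_format_to_edges(neighbors, distances):
    """Turn row-per-point neighbor lists into (source, target, distance) edges."""
    edges = []
    for source, (row_idx, row_dist) in enumerate(zip(neighbors, distances)):
        for target, dist in zip(row_idx, row_dist):
            if target == -1:  # -1 marks an invalid (unused) slot
                continue
            edges.append((source, target, dist))
    return edges


# Three points; points 1 and 2 each found only one valid neighbor.
neighbors = [[1, 2], [0, -1], [0, -1]]
distances = [[1.0, 2.5], [1.0, float("inf")], [2.5, float("inf")]]
edges = knn_format_to_edges(neighbors, distances)
print(edges)  # [(0, 1, 1.0), (0, 2, 2.5), (1, 0, 1.0), (2, 0, 2.5)]
```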

fit(X, y=None, **fit_params)

Computes the k-MST of the given data.

Parameters:
X: array-like

The data to construct the MST for.

y: array-like, optional

Ignored.

**fit_params: dict

Ignored.

Returns:
self: KMST

The fitted estimator.

class multi_mst.KMSTDescent(metric: Callable | str = 'euclidean', metric_kwds: dict | None = None, num_neighbors: int = 3, min_samples: int = 1, epsilon: float | None = None, min_descent_neighbors: int = 12, nn_kwargs: dict | None = None)

A scikit-learn-style estimator for computing approximate k-MSTs of a dataset. Adapts the Borůvka algorithm to look for k candidate edges per point, of which the k best per connected component are retained (up to epsilon times the shortest distance).

See MultiMSTMixin for inherited methods.

Parameters:
metric: string or callable (optional, default=’euclidean’)

The metric to use for computing nearest neighbors. If a callable is used it must be a numba njit compiled function. See the pynndescent docs for supported metrics. Metrics that take arguments (such as minkowski, mahalanobis etc.) can have arguments passed via the metric_kwds dictionary.

metric_kwds: dict (optional, default=None)

Arguments to pass on to the metric, such as the p value for Minkowski distance. At this time care must be taken and dictionary elements must be ordered appropriately; this will hopefully be fixed in the future.

num_neighbors: int, optional

The number of edges to connect between each fragment. Default is 3.

min_samples: int, optional

The number of neighbors to use for computing core distances. Default is 1.

epsilon: float, optional

A fraction of the initial MST edge distance to act as an upper distance bound. Default is None.

min_descent_neighbors: int, optional

Runs the descent algorithm with more neighbors than we retain in the MST to improve convergence when num_neighbors is low. Default is 12.

nn_kwargs: dict, optional

Additional keyword arguments to pass to NNDescent. Default is None.
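
To make the role of metric_kwds concrete, here is a small sketch of a Minkowski distance whose p value comes from such a dictionary. It is written in plain Python for illustration only. As noted above, a callable passed to the estimator must be a numba njit compiled function, and this sketch unpacks the dictionary by keyword, which may differ from how the library forwards the values.

```python
def minkowski(a, b, p=2.0):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


metric_kwds = {"p": 1.0}  # p=1 gives the Manhattan distance
d_manhattan = minkowski((0.0, 0.0), (3.0, 4.0), **metric_kwds)
d_euclidean = minkowski((0.0, 0.0), (3.0, 4.0))  # default p=2
print(d_manhattan, d_euclidean)  # 7.0 5.0
```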

Attributes:
graph_: scipy.sparse.csr_array

The computed k-minimum spanning tree as a sparse matrix with raw distance edge weights. Rows are sorted in ascending distance.

mutual_reachability_graph_: scipy.sparse.csr_array

The computed k-minimum spanning tree as a sparse matrix with mutual reachability edge weights. Rows are sorted in ascending distance.

minimum_spanning_tree_: numpy.ndarray, shape (n_points - 1, 3)

A minimum spanning tree edgelist with raw distances (unsorted).

mutual_reachability_tree_: numpy.ndarray, shape (n_points - 1, 3)

A minimum spanning tree edgelist with mutual reachability distances (unsorted).

knn_neighbors_: numpy.ndarray, shape (n_samples, num_neighbors)

The kNN indices of the input data.

knn_distances_: numpy.ndarray, shape (n_samples, num_neighbors)

The kNN (raw) distances of the input data.

kmst_neighbors_: numpy.ndarray, shape (n_samples, num_found_neighbors)

The kMST edges in kNN format, -1 marks invalid indices.

kmst_distances_: numpy.ndarray, shape (n_samples, num_found_neighbors)

The kMST (raw) distances in kNN format.
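
The spanning tree edge lists above are documented as unsorted, with one row per edge and three columns. The sketch below orders such an edge list by weight and sums it. It assumes the columns are (source, target, distance), which this reference does not state, and it uses tuples in place of a numpy array.

```python
def sort_edgelist(edges):
    """Return the edges ordered by ascending weight (third column)."""
    return sorted(edges, key=lambda edge: edge[2])


def total_weight(edges):
    """Sum of all edge weights in the tree."""
    return sum(edge[2] for edge in edges)


# Four points, so a spanning tree has n_points - 1 = 3 edges.
mst_edges = [(0, 1, 2.0), (2, 3, 0.5), (1, 2, 1.5)]
sorted_edges = sort_edgelist(mst_edges)
print(sorted_edges)             # [(2, 3, 0.5), (1, 2, 1.5), (0, 1, 2.0)]
print(total_weight(mst_edges))  # 4.0
```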

fit(X, y=None, **fit_params)

Computes the k-MST of the given data.

Parameters:
X: array-like

The data to construct the MST for.

y: array-like, optional

Ignored.

**fit_params: dict

Ignored.

Returns:
self: KMSTDescent

The fitted estimator.

class multi_mst.NoisyMST(num_neighbors: int = 3, min_samples: int = 1, noise_fraction: float = 0.1)

A scikit-learn-style estimator for computing a union of k noisy MSTs for the given data. Adapts the Borůvka algorithm to construct multiple noisy minimum spanning trees.

See MultiMSTMixin for inherited methods.

Parameters:
num_neighbors: int, optional

The number of noisy MSTs to create. Default is 3.

min_samples: int, optional

The number of neighbors to use for computing core distances. Default is 1.

noise_fraction: float, optional

Adds Gaussian noise with scale=noise_fraction * max core distance to every computed distance value. Default is 0.1.
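
As a rough illustration of the noise_fraction description, the sketch below perturbs a list of distances with zero-mean Gaussian noise whose standard deviation is noise_fraction times the maximum core distance. The function name and the seed argument are hypothetical, and how the library seeds its generator or treats negative results is not covered by this reference.

```python
import random


def add_noise(distances, max_core_distance, noise_fraction=0.1, seed=None):
    """Add zero-mean Gaussian noise with scale noise_fraction * max core distance."""
    rng = random.Random(seed)
    scale = noise_fraction * max_core_distance
    return [d + rng.gauss(0.0, scale) for d in distances]


distances = [1.0, 2.0, 3.5]
noisy = add_noise(distances, max_core_distance=4.0, noise_fraction=0.1, seed=0)
unchanged = add_noise(distances, max_core_distance=4.0, noise_fraction=0.0, seed=0)
print(unchanged)  # [1.0, 2.0, 3.5]
```

Because each noisy tree sees slightly different distances, near-ties between candidate edges can resolve differently from tree to tree, which is what makes the union of the trees richer than a single MST.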

Attributes:
graph_: scipy.sparse.csr_array

The computed k-minimum spanning tree as a sparse matrix with raw distance edge weights. Rows are sorted in ascending distance.

mutual_reachability_graph_: scipy.sparse.csr_array

The computed k-minimum spanning tree as a sparse matrix with mutual reachability edge weights. Rows are sorted in ascending distance.

minimum_spanning_tree_: numpy.ndarray, shape (n_points - 1, 3)

A minimum spanning tree edgelist with raw distances (unsorted).

mutual_reachability_tree_: numpy.ndarray, shape (n_points - 1, 3)

A minimum spanning tree edgelist with mutual reachability distances (unsorted).

knn_neighbors_: numpy.ndarray, shape (n_samples, num_neighbors)

The kNN indices of the input data.

knn_distances_: numpy.ndarray, shape (n_samples, num_neighbors)

The kNN (raw) distances of the input data.

kmst_neighbors_: numpy.ndarray, shape (n_samples, num_found_neighbors)

The kMST edges in kNN format, -1 marks invalid indices.

kmst_distances_: numpy.ndarray, shape (n_samples, num_found_neighbors)

The kMST (raw) distances in kNN format.

fit(X, y=None, **fit_params)

Computes the k-MST of the given data.

Parameters:
X: array-like

The data to construct the MST for.

y: array-like, optional

Ignored.

**fit_params: dict

Ignored.

Returns:
self: NoisyMST

The fitted estimator.

class multi_mst.base.MultiMSTMixin(metric='euclidean', metric_kwds=None)

A base class implementing shared functionality for multi spanning tree classes.

Attributes:
graph_: scipy.sparse.csr_array

The computed k-minimum spanning tree as a sparse matrix with raw distance edge weights. Rows are sorted in ascending distance.

mutual_reachability_graph_: scipy.sparse.csr_array

The computed k-minimum spanning tree as a sparse matrix with mutual reachability edge weights. Rows are sorted in ascending distance.

minimum_spanning_tree_: numpy.ndarray, shape (n_points - 1, 3)

A minimum spanning tree edgelist with raw distances (unsorted).

mutual_reachability_tree_: numpy.ndarray, shape (n_points - 1, 3)

A minimum spanning tree edgelist with mutual reachability distances (unsorted).

knn_neighbors_: numpy.ndarray, shape (n_samples, num_neighbors)

The kNN indices of the input data.

knn_distances_: numpy.ndarray, shape (n_samples, num_neighbors)

The kNN (raw) distances of the input data.

graph_neighbors_: numpy.ndarray, shape (n_samples, num_found_neighbors)

The kMST edges in kNN format, -1 marks invalid indices.

graph_distances_: numpy.ndarray, shape (n_samples, num_found_neighbors)

The kMST (raw) distances in kNN format.

boundary_cluster_detector(clusterer, cluster_labels=None, cluster_probabilities=None, sample_weights=None, *, num_hops: int = 2, hop_type: Literal['manifold', 'metric'] = 'manifold', boundary_connectivity: Literal['knn', 'core'] = 'knn', boundary_use_reachability: bool = True, min_cluster_size: int | None = None, max_cluster_size: int | None = None, allow_single_cluster: bool | None = None, cluster_selection_method: Literal['eom', 'leaf'] | None = None, cluster_selection_epsilon: float = 0.0, cluster_selection_persistence: float = 0.0)

Constructs and fits a BoundaryClusterDetector, ensuring valid parameter–metric combinations.

Parameters:
clusterer: HDBSCAN | HBCC

The fitted HDBSCAN or HBCC model to use for branch detection.

cluster_labels: np.ndarray, shape (n_samples,), optional (default=None)

Override cluster labels for each point in the data set. If not provided, the clusterer’s labels will be used. Clusters must be connected in the minimum spanning tree. Otherwise, the branch detector will return connected component labels for that cluster.

cluster_probabilities: np.ndarray, shape (n_samples,), optional (default=None)

Override cluster probabilities for each point in the data set. If not provided, the clusterer’s probabilities will be used, or all points will be given 1.0 probability if cluster_labels are overridden.

sample_weights: np.ndarray, shape (n_samples,), optional (default=None)

Data point weights used to adapt cluster size.

num_hops: int, default=2

The number of hops used to expand the boundary coefficient connectivity graph.

hop_type: ‘manifold’ or ‘metric’, default=’manifold’

The type of hop expansion to use. Manifold adds edge distances on traversal, metric computes distance between visited points.

boundary_connectivity: ‘knn’ or ‘core’, default=’knn’

Which graph to compute the boundary coefficient on. ‘knn’ uses the k-nearest neighbors graph, ‘core’ uses the knn–mst union graph.

boundary_use_reachability: boolean, default=True

Whether to use mutual reachability or raw distances for the boundary coefficient computation.

min_cluster_size: int, optional (default=None)

The minimum size of clusters; single linkage splits that contain fewer points than this will be considered points “falling out” of a cluster rather than a cluster splitting into two new clusters.

allow_single_cluster: bool, optional (default=None)

By default HDBSCAN* will not produce a single cluster; setting this to True overrides that and allows single-cluster results when you consider this a valid result for your dataset.

cluster_selection_method: string, optional (default=None)

The method used to select clusters from the condensed tree. The standard approach for HDBSCAN* is to use an Excess of Mass algorithm to find the most persistent clusters. Alternatively you can instead select the clusters at the leaves of the tree – this provides the most fine grained and homogeneous clusters. Options are:

  • eom

  • leaf

cluster_selection_epsilon: float, optional (default=0.0)

A distance threshold. Clusters below this value will be merged. This is the minimum epsilon allowed.

cluster_selection_persistence: float, optional (default=0.0)

A persistence threshold. Clusters with a persistence lower than this value will be merged.

Returns:
clusterer: BoundaryClusterDetector

A fitted BoundaryClusterDetector.
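
The difference between the two hop_type options described above can be sketched on a tiny graph. This is one reading of the one-line description, not the library's implementation: with 'manifold', a two-hop connection gets the sum of the traversed edge distances, and with 'metric' it gets the direct distance between the two end points. The function name and the dictionary-based graph layout are hypothetical, and num_hops is fixed at 2 for brevity.

```python
import math


def two_hop_distances(points, graph, source, hop_type="manifold"):
    """Expand a node's neighborhood by one extra hop.

    graph maps node -> {neighbor: edge distance}.
    """
    reached = dict(graph[source])  # direct neighbors keep their edge distance
    for mid, d_first in graph[source].items():
        for target, d_second in graph[mid].items():
            if target == source:
                continue
            if hop_type == "manifold":
                candidate = d_first + d_second  # add distances along the path
            else:
                candidate = math.dist(points[source], points[target])
            if candidate < reached.get(target, math.inf):
                reached[target] = candidate
    return reached


points = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
graph = {0: {1: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0}}
manifold = two_hop_distances(points, graph, 0, hop_type="manifold")
metric = two_hop_distances(points, graph, 0, hop_type="metric")
print(manifold)  # {1: 1.0, 2: 2.0}
```

In this example the metric variant assigns point 2 the straight-line distance of roughly 1.414 instead of the path length 2.0.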

branch_detector(clusterer, cluster_labels=None, cluster_probabilities=None, sample_weights=None, *, label_sides_as_branches: bool = False, min_cluster_size: int | None = None, max_cluster_size: int | None = None, allow_single_cluster: bool | None = None, cluster_selection_method: Literal['eom', 'leaf'] | None = None, cluster_selection_epsilon: float = 0.0, cluster_selection_persistence: float = 0.0, propagate_labels: bool = False)

Constructs and fits a metric-aware BranchDetector, ensuring valid parameter–metric combinations.

Parameters:
clusterer: HDBSCAN | HBCC

The fitted HDBSCAN or HBCC model to use for branch detection.

cluster_labels: np.ndarray, shape (n_samples,), optional (default=None)

Override cluster labels for each point in the data set. If not provided, the clusterer’s labels will be used. Clusters must be connected in the minimum spanning tree. Otherwise, the branch detector will return connected component labels for that cluster.

cluster_probabilities: np.ndarray, shape (n_samples,), optional (default=None)

Override cluster probabilities for each point in the data set. If not provided, the clusterer’s probabilities will be used, or all points will be given 1.0 probability if cluster_labels are overridden.

sample_weights: np.ndarray, shape (n_samples,), optional (default=None)

Data point weights used to adapt cluster size.

label_sides_as_branches: bool, default=False

Controls the minimum number of branches in a cluster for the branches to be labelled. When True, the branches are labelled if there is more than one branch in a cluster. When False, the branches are labelled if there are more than two branches in a cluster.

min_cluster_size: int, optional (default=None)

The minimum size of clusters; single linkage splits that contain fewer points than this will be considered points “falling out” of a cluster rather than a cluster splitting into two new clusters.

allow_single_cluster: bool, optional (default=None)

By default HDBSCAN* will not produce a single cluster; setting this to True overrides that and allows single-cluster results when you consider this a valid result for your dataset.

cluster_selection_method: string, optional (default=None)

The method used to select clusters from the condensed tree. The standard approach for HDBSCAN* is to use an Excess of Mass algorithm to find the most persistent clusters. Alternatively you can instead select the clusters at the leaves of the tree – this provides the most fine grained and homogeneous clusters. Options are:

  • eom

  • leaf

cluster_selection_epsilon: float, optional (default=0.0)

A distance threshold. Clusters below this value will be merged. This is the minimum epsilon allowed.

cluster_selection_persistence: float, optional (default=0.0)

A persistence threshold. Clusters with a persistence lower than this value will be merged.

propagate_labels: bool, optional (default=False)

Whether to fill in noise labels with (repeated) majority vote branch labels.

Returns:
clusterer: BranchDetector

A fitted BranchDetector.

fit(X, y=None, **fit_params)

Manages the infinite data handling.

Parameters:
X: array-like

The data to construct the MST for.

y: array-like, optional

Ignored.

**fit_params: dict

Ignored.

Returns:
self: MultiMSTMixin

The fitted estimator.

hbcc(data_labels=None, sample_weights=None, *, num_hops: int = 2, min_cluster_size: int = 25, max_cluster_size: float = inf, hop_type: Literal['manifold', 'metric'] = 'manifold', boundary_connectivity: Literal['knn', 'core'] = 'knn', boundary_use_reachability: bool = True, cluster_selection_method: Literal['eom', 'leaf'] = 'eom', allow_single_cluster: bool = False, cluster_selection_epsilon: float = 0.0, cluster_selection_persistence: float = 0.0, ss_algorithm: Literal['bc', 'bc_simple'] = 'bc')

Constructs and fits an HBCC model to the kMST graph.

Parameters:
data_labels: array-like, shape (n_samples,), optional (default=None)

Labels for semi-supervised clustering. If provided, the model will be semi-supervised and will use the provided labels to guide the clustering process.

sample_weights: array-like, shape (n_samples,), optional (default=None)

Data point weights used to adapt cluster size.

num_hops: int, default=2

The number of hops used to expand the boundary coefficient connectivity graph.

hop_type: ‘manifold’ or ‘metric’, default=’manifold’

The type of hop expansion to use. Manifold adds edge distances on traversal, metric computes distance between visited points.

boundary_connectivity: ‘knn’ or ‘core’, default=’knn’

Which graph to compute the boundary coefficient on. ‘knn’ uses the k-nearest neighbors graph, ‘core’ uses the knn–mst union graph.

boundary_use_reachability: boolean, default=True

Whether to use mutual reachability or raw distances for the boundary coefficient computation.

min_cluster_size: int, optional (default=25)

The minimum size of clusters; single linkage splits that contain fewer points than this will be considered points “falling out” of a cluster rather than a cluster splitting into two new clusters.

cluster_selection_method: string, optional (default=’eom’)

The method used to select clusters from the condensed tree. The standard approach for HDBSCAN* is to use an Excess of Mass algorithm to find the most persistent clusters. Alternatively you can instead select the clusters at the leaves of the tree – this provides the most fine grained and homogeneous clusters. Options are:

  • eom

  • leaf

allow_single_cluster: bool, optional (default=False)

By default HDBSCAN* will not produce a single cluster; setting this to True overrides that and allows single-cluster results when you consider this a valid result for your dataset.

cluster_selection_epsilon: float, optional (default=0.0)

A distance threshold. Clusters below this value will be merged. This is the minimum epsilon allowed.

cluster_selection_persistence: float, optional (default=0.0)

A persistence threshold. Clusters with a persistence lower than this value will be merged.

ss_algorithm: string, optional (default=’bc’)

The semi-supervised clustering algorithm to use. Valid options are:

  • bc

  • bc_simple

Returns:
clusterer: HBCC

The fitted HBCC model.

hdbscan(data_labels=None, sample_weights=None, *, min_cluster_size: int = 25, max_cluster_size: float = inf, allow_single_cluster: bool = False, cluster_selection_method: Literal['eom', 'leaf'] = 'eom', cluster_selection_epsilon: float = 0.0, cluster_selection_persistence: float = 0.0, ss_algorithm: Literal['bc', 'bc_simple'] = 'bc')

Constructs and fits an HDBSCAN model to the kMST graph.

Parameters:
data_labels: array-like, shape (n_samples,), optional (default=None)

Labels for semi-supervised clustering. If provided, the model will be semi-supervised and will use the provided labels to guide the clustering process.

sample_weights: array-like, shape (n_samples,), optional (default=None)

Data point weights used to adapt cluster size.

min_cluster_size: int, optional (default=25)

The minimum size of clusters; single linkage splits that contain fewer points than this will be considered points “falling out” of a cluster rather than a cluster splitting into two new clusters.

cluster_selection_method: string, optional (default=’eom’)

The method used to select clusters from the condensed tree. The standard approach for HDBSCAN* is to use an Excess of Mass algorithm to find the most persistent clusters. Alternatively you can instead select the clusters at the leaves of the tree – this provides the most fine grained and homogeneous clusters. Options are:

  • eom

  • leaf

allow_single_cluster: bool, optional (default=False)

By default HDBSCAN* will not produce a single cluster; setting this to True overrides that and allows single-cluster results when you consider this a valid result for your dataset.

cluster_selection_epsilon: float, optional (default=0.0)

A distance threshold. Clusters below this value will be merged. This is the minimum epsilon allowed.

cluster_selection_persistence: float, optional (default=0.0)

A persistence threshold. Clusters with a persistence lower than this value will be merged.

ss_algorithm: string, optional (default=’bc’)

The semi-supervised clustering algorithm to use. Valid options are:

  • bc

  • bc_simple

Returns:
clusterer: HDBSCAN

The fitted HDBSCAN model.

remap_indices()

Remaps the indices of the kNN and kMST graphs to the original raw data.

umap(*, n_components: int = 2, output_metric: Callable | str = 'euclidean', output_metric_kwds: dict | None = None, n_epochs: int | None = None, learning_rate: float = 1.0, init: str | Any = 'spectral', min_dist: float = 0.1, spread: float = 1.0, set_op_mix_ratio: float = 1.0, local_connectivity: float = 1.0, repulsion_strength: float = 1.0, negative_sample_rate: int = 5, a: float | None = None, b: float | None = None, random_state: int | Any | None = None, target_n_neighbors: int = -1, target_metric: Callable | str = 'categorical', target_metric_kwds: dict | None = None, target_weight: float = 0.5, transform_seed: int = 42, transform_mode: Literal['embedding', 'graph'] = 'embedding', verbose: bool = False, tqdm_kwds: dict | None = None, densmap: bool = False, dens_lambda: float = 2.0, dens_frac: float = 0.3, dens_var_shift: float = 0.1, output_dens: bool = False, disconnection_distance: float | None = None)

Constructs and fits a UMAP model to the kMST graph.

Unlike HDBSCAN and HBCC, UMAP does not support infinite data. To ensure all of UMAP’s member functions work as expected, the UMAP model is NOT remapped to the infinite data after fitting. As a result, combining UMAP and HDBSCAN results requires taking the finite index into account:

```
plt.scatter(*umap.embedding_.T, c=hdbscan.labels_[multi_mst.finite_index])
```
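
The indexing in the snippet above can be illustrated without any fitted models. The sketch below assumes that finite_index holds the positions of the input rows that contain only finite values, which is inferred from the snippet rather than stated in this reference. The helper name and the label values are hypothetical.

```python
import math


def split_finite(rows):
    """Return the positions of all-finite rows together with those rows."""
    finite_index = [
        i for i, row in enumerate(rows) if all(math.isfinite(v) for v in row)
    ]
    finite_rows = [rows[i] for i in finite_index]
    return finite_index, finite_rows


data = [(0.0, 1.0), (math.inf, 2.0), (3.0, 4.0), (math.nan, 0.0)]
finite_index, finite_rows = split_finite(data)

# Hypothetical cluster labels with one entry per original row.
labels = [0, -1, 1, -1]
# An embedding fitted on the finite rows only has len(finite_index) rows,
# so the labels must be subset the same way before plotting them together.
labels_for_embedding = [labels[i] for i in finite_index]
print(finite_index, labels_for_embedding)  # [0, 2] [0, 1]
```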

Parameters:
n_components: int (optional, default 2)

The dimension of the space to embed into. This defaults to 2 to provide easy visualization, but can reasonably be set to any integer value in the range 2 to 100.

output_metric: string or function (optional, default ‘euclidean’)

The metric to use to compute distances in output dimensional space. If a string is passed it must match a valid predefined metric, see UMAP’s documentation for available options. If a general metric is required a function that takes two 1d arrays and returns a float can be provided. For performance purposes it is required that this be a numba jit’d function.

output_metric_kwds: dict (optional, default None)

Keyword arguments to pass on to the metric, such as the p value of Minkowski distance. If None then no arguments are passed on.

n_epochs: int (optional, default None)

The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. If None is specified a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small).

learning_rate: float (optional, default 1.0)

The initial learning rate for the embedding optimization.

init: string (optional, default ‘spectral’)

How to initialize the low dimensional embedding. Options are:

  • ‘spectral’: use a spectral embedding of the fuzzy 1-skeleton

  • ‘random’: assign initial embedding positions at random.

  • ‘pca’: use the first n_components from PCA applied to the input data.

  • ‘tswspectral’: use a spectral embedding of the fuzzy 1-skeleton, using a truncated singular value decomposition to “warm” up the eigensolver. This is intended as an alternative to the ‘spectral’ method, if that takes an excessively long time to complete initialization (or fails to complete).

  • A numpy array of initial embedding positions.

min_dist: float (optional, default 0.1)

The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out.

spread: float (optional, default 1.0)

The effective scale of embedded points. In combination with min_dist this determines how clustered/clumped the embedded points are.

set_op_mix_ratio: float (optional, default 1.0)

Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial set. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.

local_connectivity: float (optional, default 1.0)

The local connectivity required – i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally. In practice this should be not more than the local intrinsic dimension of the manifold.

repulsion_strength: float (optional, default 1.0)

Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples.

negative_sample_rate: int (optional, default 5)

The number of negative samples to select per positive sample in the optimization process. Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.

a: float (optional, default None)

More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.

b: float (optional, default None)

More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.

random_state: int, RandomState instance or None, optional (default: None)

If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

target_n_neighbors: int (optional, default -1)

The number of nearest neighbors to use to construct the target simplicial set. If set to -1 use the n_neighbors value.

target_metric: string or callable (optional, default ‘categorical’)

The metric used to measure distance for a target array when using supervised dimension reduction. By default this is ‘categorical’, which measures distance in terms of whether categories match or differ. Furthermore, if semi-supervised reduction is required, target values of -1 will be treated as unlabelled under the ‘categorical’ metric. If the target array takes continuous values (e.g. for a regression problem), then a metric of ‘l1’ or ‘l2’ is probably more appropriate.

target_metric_kwds: dict (optional, default None)

Keyword argument to pass to the target metric when performing supervised dimension reduction. If None then no arguments are passed on.

target_weight: float (optional, default 0.5)

Weighting factor between data topology and target topology. A value of 0.0 weights predominantly on data, a value of 1.0 places a strong emphasis on target. The default of 0.5 balances the weighting equally between data and target.

transform_seed: int (optional, default 42)

Random seed used for the stochastic aspects of the transform operation. This ensures consistency in transform operations.

verbose: bool (optional, default False)

Controls verbosity of logging.

tqdm_kwds: dict (optional, default None)

Keyword arguments to be used by the tqdm progress bar.

densmap: bool (optional, default False)

Specifies whether the density-augmented objective of densMAP should be used for optimization. Turning on this option generates an embedding where the local densities are encouraged to be correlated with those in the original space. Parameters below with the prefix ‘dens’ further control the behavior of this extension.

dens_lambda: float (optional, default 2.0)

Controls the regularization weight of the density correlation term in densMAP. Higher values prioritize density preservation over the UMAP objective, and vice versa for values closer to zero. Setting this parameter to zero is equivalent to running the original UMAP algorithm.

dens_frac: float (optional, default 0.3)

Controls the fraction of epochs (between 0 and 1) where the density-augmented objective is used in densMAP. The first (1 - dens_frac) fraction of epochs optimize the original UMAP objective before introducing the density correlation term.

dens_var_shift: float (optional, default 0.1)

A small constant added to the variance of local radii in the embedding when calculating the density correlation objective to prevent numerical instability from dividing by a small number.

output_dens: bool (optional, default False)

Determines whether the local radii of the final embedding (an inverse measure of local density) are computed and returned in addition to the embedding. If set to True, local radii of the original data are also included in the output for comparison; the output is a tuple (embedding, original local radii, embedding local radii). This option can also be used when densmap=False to calculate the densities for UMAP embeddings.

disconnection_distance: float (optional, default np.inf or maximal value for bounded distances)

Disconnect any vertices of distance greater than or equal to disconnection_distance when approximating the manifold via our k-nn graph. This is particularly useful in the case that you have a bounded metric. The UMAP assumption that we have a connected manifold can be problematic when you have points that are maximally different from all the rest of your data. The connected manifold assumption will make such points have perfect similarity to a random set of other points. Too many such points will artificially connect your space.

Returns:
umap: UMAP

The fitted UMAP model.
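
The set_op_mix_ratio description above can be made concrete for a single pair of directed membership strengths. The sketch below applies the product t-norm, where the fuzzy intersection of a and b is a * b and the fuzzy union is a + b - a * b, and then interpolates between the two. The function name is hypothetical and scalar values stand in for the sparse matrices a real implementation would operate on.

```python
def mix_memberships(a, b, set_op_mix_ratio=1.0):
    """Blend fuzzy union and intersection of two membership strengths."""
    product = a * b
    fuzzy_union = a + b - product
    fuzzy_intersection = product
    return (
        set_op_mix_ratio * fuzzy_union
        + (1.0 - set_op_mix_ratio) * fuzzy_intersection
    )


# Membership strength of the same edge seen from each of its two end points.
a, b = 0.8, 0.5
pure_union = mix_memberships(a, b, set_op_mix_ratio=1.0)         # about 0.9
pure_intersection = mix_memberships(a, b, set_op_mix_ratio=0.0)  # about 0.4
halfway = mix_memberships(a, b, set_op_mix_ratio=0.5)            # about 0.65
```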