API Reference
- class multi_mst.KMST(num_neighbors: int = 3, min_samples: int = 1, epsilon: float | None = None)
A scikit-learn-style estimator for computing a k-MST of a dataset. Adapts the Borůvka algorithm to look for k candidate edges per point, of which the k best per connected component are retained (up to epsilon times the shortest distance).
See MultiMSTMixin for inherited methods.
- Parameters:
- num_neighbors: int, optional
The number of edges to connect between each fragment. Default is 3.
- min_samples: int, optional
The number of neighbors to use for computing core distances. Default is 1.
- epsilon: float, optional
A fraction of the initial MST edge distance to act as upper distance bound.
- Attributes:
- graph_: scipy.sparse.csr_array
The computed k-minimum spanning tree as a sparse matrix with raw distance edge weights. Rows are sorted in ascending distance.
- mutual_reachability_graph_: scipy.sparse.csr_array
The computed k-minimum spanning tree as a sparse matrix with mutual reachability edge weights. Rows are sorted in ascending distance.
- minimum_spanning_tree_: numpy.ndarray, shape (n_points - 1, 3)
A minimum spanning tree edgelist with raw distances (unsorted).
- mutual_reachability_tree_: numpy.ndarray, shape (n_points - 1, 3)
A minimum spanning tree edgelist with mutual reachability distances (unsorted).
- knn_neighbors_: numpy.ndarray, shape (n_samples, num_neighbors)
The kNN indices of the input data.
- knn_distances_: numpy.ndarray, shape (n_samples, num_neighbors)
The kNN (raw) distances of the input data.
- kmst_neighbors_: numpy.ndarray, shape (n_samples, num_found_neighbors)
The kMST edges in kNN format; -1 marks invalid indices.
- kmst_distances_: numpy.ndarray, shape (n_samples, num_found_neighbors)
The kMST (raw) distances in kNN format.
- fit(X, y=None, **fit_params)
Computes the k-MST of the given data.
- Parameters:
- X: array-like
The data to construct the MST for.
- y: array-like, optional
Ignored.
- **fit_params: dict
Ignored.
- Returns:
- self: KMST
The fitted estimator.
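Example (a minimal usage sketch; the make_blobs toy data is illustrative and not part of this API):
```
from sklearn.datasets import make_blobs
from multi_mst import KMST

# Any (n_samples, n_features) array works; toy blobs for illustration.
X, _ = make_blobs(n_samples=200, random_state=0)

model = KMST(num_neighbors=3).fit(X)
print(model.graph_.shape)                  # sparse k-MST, raw distance weights
print(model.minimum_spanning_tree_.shape)  # (n_points - 1, 3) edgelist
```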
- class multi_mst.KMSTDescent(metric: Callable | str = 'euclidean', metric_kwds: dict | None = None, num_neighbors: int = 3, min_samples: int = 1, epsilon: float | None = None, min_descent_neighbors: int = 12, nn_kwargs: dict | None = None)
A scikit-learn-style estimator for computing approximate k-MSTs of a dataset. Adapts the Borůvka algorithm to look for k candidate edges per point, of which the k best per connected component are retained (up to epsilon times the shortest distance).
See MultiMSTMixin for inherited methods.
- Parameters:
- metric: string or callable (optional, default=’euclidean’)
The metric to use for computing nearest neighbors. If a callable is used it must be a numba njit-compiled function. See the pynndescent docs for supported metrics. Metrics that take arguments (such as minkowski, mahalanobis, etc.) can have arguments passed via the metric_kwds dictionary.
- metric_kwds: dict (optional, default {})
Arguments to pass on to the metric, such as the p value for Minkowski distance. At this time care must be taken and dictionary elements must be ordered appropriately; this will hopefully be fixed in the future.
- num_neighbors: int, optional
The number of edges to connect between each fragment. Default is 3.
- min_samples: int, optional
The number of neighbors to use for computing core distances. Default is 1.
- epsilon: float, optional
A fraction of the initial MST edge distance to act as upper distance bound.
- min_descent_neighbors: int, optional
Runs the descent algorithm with more neighbors than we retain in the MST to improve convergence when num_neighbors is low. Default is 12.
- nn_kwargs: dict, optional
Additional keyword arguments to pass to NNDescent.
- Attributes:
- graph_: scipy.sparse.csr_array
The computed k-minimum spanning tree as a sparse matrix with raw distance edge weights. Rows are sorted in ascending distance.
- mutual_reachability_graph_: scipy.sparse.csr_array
The computed k-minimum spanning tree as a sparse matrix with mutual reachability edge weights. Rows are sorted in ascending distance.
- minimum_spanning_tree_: numpy.ndarray, shape (n_points - 1, 3)
A minimum spanning tree edgelist with raw distances (unsorted).
- mutual_reachability_tree_: numpy.ndarray, shape (n_points - 1, 3)
A minimum spanning tree edgelist with mutual reachability distances (unsorted).
- knn_neighbors_: numpy.ndarray, shape (n_samples, num_neighbors)
The kNN indices of the input data.
- knn_distances_: numpy.ndarray, shape (n_samples, num_neighbors)
The kNN (raw) distances of the input data.
- kmst_neighbors_: numpy.ndarray, shape (n_samples, num_found_neighbors)
The kMST edges in kNN format; -1 marks invalid indices.
- kmst_distances_: numpy.ndarray, shape (n_samples, num_found_neighbors)
The kMST (raw) distances in kNN format.
- fit(X, y=None, **fit_params)
Computes the k-MST of the given data.
- Parameters:
- X: array-like
The data to construct the MST for.
- y: array-like, optional
Ignored.
- **fit_params: dict
Ignored.
- Returns:
- self: KMSTDescent
The fitted estimator.
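Example (a minimal sketch of the approximate variant; the metric choice and toy data are illustrative):
```
from sklearn.datasets import make_blobs
from multi_mst import KMSTDescent

X, _ = make_blobs(n_samples=2000, random_state=0)

# Approximate k-MST via NN-descent with a pynndescent-supported metric.
model = KMSTDescent(metric="cosine", num_neighbors=3).fit(X)
print(model.kmst_neighbors_.shape)  # (n_samples, num_found_neighbors)
```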
- class multi_mst.NoisyMST(num_neighbors: int = 3, min_samples: int = 1, noise_fraction: float = 0.1)
A scikit-learn-style estimator for computing a union of k noisy MSTs for the given data. Adapts the Borůvka algorithm to construct multiple noisy minimum spanning trees.
See MultiMSTMixin for inherited methods.
- Parameters:
- num_neighbors: int, optional
The number of noisy MSTs to create. Default is 3.
- min_samples: int, optional
The number of neighbors to use for computing core distances. Default is 1.
- noise_fraction: float, optional
Adds Gaussian noise with scale = noise_fraction * max core distance to every computed distance value. Default is 0.1.
- Attributes:
- graph_: scipy.sparse.csr_array
The computed k-minimum spanning tree as a sparse matrix with raw distance edge weights. Rows are sorted in ascending distance.
- mutual_reachability_graph_: scipy.sparse.csr_array
The computed k-minimum spanning tree as a sparse matrix with mutual reachability edge weights. Rows are sorted in ascending distance.
- minimum_spanning_tree_: numpy.ndarray, shape (n_points - 1, 3)
A minimum spanning tree edgelist with raw distances (unsorted).
- mutual_reachability_tree_: numpy.ndarray, shape (n_points - 1, 3)
A minimum spanning tree edgelist with mutual reachability distances (unsorted).
- knn_neighbors_: numpy.ndarray, shape (n_samples, num_neighbors)
The kNN indices of the input data.
- knn_distances_: numpy.ndarray, shape (n_samples, num_neighbors)
The kNN (raw) distances of the input data.
- kmst_neighbors_: numpy.ndarray, shape (n_samples, num_found_neighbors)
The kMST edges in kNN format; -1 marks invalid indices.
- kmst_distances_: numpy.ndarray, shape (n_samples, num_found_neighbors)
The kMST (raw) distances in kNN format.
- fit(X, y=None, **fit_params)
Computes the k-MST of the given data.
- Parameters:
- X: array-like
The data to construct the MST for.
- y: array-like, optional
Ignored.
- **fit_params: dict
Ignored.
- Returns:
- self: NoisyMST
The fitted estimator.
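Example (a minimal sketch with illustrative parameter values):
```
from sklearn.datasets import make_blobs
from multi_mst import NoisyMST

X, _ = make_blobs(n_samples=500, random_state=0)

# Union of 5 noisy MSTs; distances are perturbed with Gaussian noise.
model = NoisyMST(num_neighbors=5, noise_fraction=0.1).fit(X)
print(model.graph_.nnz)  # number of stored edges in the union graph
```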
- class multi_mst.base.MultiMSTMixin(metric='euclidean', metric_kwds=None)
A base class implementing shared functionality for multi spanning tree classes.
- Attributes:
- graph_: scipy.sparse.csr_array
The computed k-minimum spanning tree as a sparse matrix with raw distance edge weights. Rows are sorted in ascending distance.
- mutual_reachability_graph_: scipy.sparse.csr_array
The computed k-minimum spanning tree as a sparse matrix with mutual reachability edge weights. Rows are sorted in ascending distance.
- minimum_spanning_tree_: numpy.ndarray, shape (n_points - 1, 3)
A minimum spanning tree edgelist with raw distances (unsorted).
- mutual_reachability_tree_: numpy.ndarray, shape (n_points - 1, 3)
A minimum spanning tree edgelist with mutual reachability distances (unsorted).
- knn_neighbors_: numpy.ndarray, shape (n_samples, num_neighbors)
The kNN indices of the input data.
- knn_distances_: numpy.ndarray, shape (n_samples, num_neighbors)
The kNN (raw) distances of the input data.
- graph_neighbors_: numpy.ndarray, shape (n_samples, num_found_neighbors)
The kMST edges in kNN format; -1 marks invalid indices.
- graph_distances_: numpy.ndarray, shape (n_samples, num_found_neighbors)
The kMST (raw) distances in kNN format.
- boundary_cluster_detector(clusterer, cluster_labels=None, cluster_probabilities=None, sample_weights=None, *, num_hops: int = 2, hop_type: Literal['manifold', 'metric'] = 'manifold', boundary_connectivity: Literal['knn', 'core'] = 'knn', boundary_use_reachability: bool = True, min_cluster_size: int | None = None, max_cluster_size: int | None = None, allow_single_cluster: bool | None = None, cluster_selection_method: Literal['eom', 'leaf'] | None = None, cluster_selection_epsilon: float = 0.0, cluster_selection_persistence: float = 0.0)
Constructs a BoundaryClusterDetector, ensuring valid parameter–metric combinations.
- Parameters:
- clusterer: HDBSCAN | HBCC
The fitted HDBSCAN or HBCC model to use for branch detection.
- cluster_labels: np.ndarray, shape (n_samples,), optional (default=None)
Override cluster labels for each point in the data set. If not provided, the clusterer’s labels will be used. Clusters must be connected in the minimum spanning tree. Otherwise, the branch detector will return connected component labels for that cluster.
- cluster_probabilities: np.ndarray, shape (n_samples,), optional (default=None)
Override cluster probabilities for each point in the data set. If not provided, the clusterer’s probabilities will be used, or all points will be given 1.0 probability if cluster_labels are overridden.
- sample_weights: np.ndarray, shape (n_samples,), optional (default=None)
Data point weights used to adapt cluster size.
- num_hops: int, default=2
The number of hops used to expand the boundary coefficient connectivity graph.
- hop_type: ‘manifold’ or ‘metric’, default=’manifold’
The type of hop expansion to use. Manifold adds edge distances on traversal, metric computes distance between visited points.
- boundary_connectivity: ‘knn’ or ‘core’, default=’knn’
Which graph to compute the boundary coefficient on. ‘knn’ uses the k-nearest neighbors graph, ‘core’ uses the knn–mst union graph.
- boundary_use_reachability: boolean, default=True
Whether to use mutual reachability or raw distances for the boundary coefficient computation.
- min_cluster_size: int, optional (default=None)
The minimum size of clusters; single linkage splits that contain fewer points than this will be considered points “falling out” of a cluster rather than a cluster splitting into two new clusters.
- allow_single_cluster: bool, optional (default=None)
By default HDBSCAN* will not produce a single cluster; setting this to True will override this and allow single-cluster results in the case that you feel this is a valid result for your dataset.
- cluster_selection_method: string, optional (default=None)
The method used to select clusters from the condensed tree. The standard approach for HDBSCAN* is to use an Excess of Mass algorithm to find the most persistent clusters. Alternatively you can instead select the clusters at the leaves of the tree – this provides the most fine-grained and homogeneous clusters. Options are ‘eom’ and ‘leaf’.
- cluster_selection_epsilon: float, optional (default=0.0)
A distance threshold. Clusters below this value will be merged. This is the minimum epsilon allowed.
- cluster_selection_persistence: float, optional (default=0.0)
A persistence threshold. Clusters with a persistence lower than this value will be merged.
- Returns:
- clusterer: BoundaryClusterDetector
A fitted BoundaryClusterDetector.
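Example (a sketch of the intended call pattern; the labels_ attribute on the returned detector is assumed to follow the sklearn convention):
```
from sklearn.datasets import make_blobs
from multi_mst import KMST

X, _ = make_blobs(n_samples=500, random_state=0)
model = KMST().fit(X)

clusterer = model.hdbscan(min_cluster_size=25)  # fitted model to refine
detector = model.boundary_cluster_detector(clusterer, num_hops=2)
print(detector.labels_)  # assumed sklearn-style attribute
```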
- branch_detector(clusterer, cluster_labels=None, cluster_probabilities=None, sample_weights=None, *, label_sides_as_branches: bool = False, min_cluster_size: int | None = None, max_cluster_size: int | None = None, allow_single_cluster: bool | None = None, cluster_selection_method: Literal['eom', 'leaf'] | None = None, cluster_selection_epsilon: float = 0.0, cluster_selection_persistence: float = 0.0, propagate_labels: bool = False)
Constructs and fits a metric-aware BranchDetector [1], ensuring valid parameter–metric combinations.
- Parameters:
- clusterer: HDBSCAN | HBCC
The fitted HDBSCAN or HBCC model to use for branch detection.
- cluster_labels: np.ndarray, shape (n_samples,), optional (default=None)
Override cluster labels for each point in the data set. If not provided, the clusterer’s labels will be used. Clusters must be connected in the minimum spanning tree. Otherwise, the branch detector will return connected component labels for that cluster.
- cluster_probabilities: np.ndarray, shape (n_samples,), optional (default=None)
Override cluster probabilities for each point in the data set. If not provided, the clusterer’s probabilities will be used, or all points will be given 1.0 probability if cluster_labels are overridden.
- sample_weights: np.ndarray, shape (n_samples,), optional (default=None)
Data point weights used to adapt cluster size.
- label_sides_as_branches: bool, default=False
Controls the minimum number of branches in a cluster for the branches to be labelled. When True, the branches are labelled if there are more than one branch in a cluster. When False, the branches are labelled if there are more than two branches in a cluster.
- min_cluster_size: int, optional (default=None)
The minimum size of clusters; single linkage splits that contain fewer points than this will be considered points “falling out” of a cluster rather than a cluster splitting into two new clusters.
- allow_single_cluster: bool, optional (default=None)
By default HDBSCAN* will not produce a single cluster; setting this to True will override this and allow single-cluster results in the case that you feel this is a valid result for your dataset.
- cluster_selection_method: string, optional (default=None)
The method used to select clusters from the condensed tree. The standard approach for HDBSCAN* is to use an Excess of Mass algorithm to find the most persistent clusters. Alternatively you can instead select the clusters at the leaves of the tree – this provides the most fine-grained and homogeneous clusters. Options are ‘eom’ and ‘leaf’.
- cluster_selection_epsilon: float, optional (default=0.0)
A distance threshold. Clusters below this value will be merged. This is the minimum epsilon allowed.
- cluster_selection_persistence: float, optional (default=0.0)
A persistence threshold. Clusters with a persistence lower than this value will be merged.
- propagate_labels: bool, optional (default=False)
Whether to fill in noise labels with (repeated) majority vote branch labels.
- Returns:
- clusterer: BranchDetector
A fitted BranchDetector.
References
[1] Bot D.M., Peeters J., Liesenborgs J., Aerts J. 2025. FLASC: a flare-sensitive clustering algorithm. PeerJ Computer Science 11:e2792. https://doi.org/10.7717/peerj-cs.2792
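Example (a sketch of detecting branches within clusters; labels_ on the fitted BranchDetector is assumed to follow the sklearn convention):
```
from sklearn.datasets import make_blobs
from multi_mst import KMST

X, _ = make_blobs(n_samples=500, random_state=0)
model = KMST().fit(X)

clusterer = model.hdbscan(min_cluster_size=25)
branches = model.branch_detector(clusterer, label_sides_as_branches=True)
print(branches.labels_)  # assumed sklearn-style attribute
```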
- fit(X, y=None, **fit_params)
Manages the infinite data handling.
- Parameters:
- X: array-like
The data to construct the MST for.
- y: array-like, optional
Ignored.
- **fit_params: dict
Ignored.
- Returns:
- self: MultiMSTMixin
The fitted estimator.
- graphviz_layout(prog='sfdp', **kwargs)
Computes a layout for the graph using Graphviz.
Requires networkx and (py)graphviz to be installed and accessible on the system path.
- Parameters:
- prog: str
The graphviz program to run.
- **kwargs
Additional arguments to networkx.nx_agraph.graphviz_layout.
- Returns:
- coords: ndarray of shape (num_points, 2)
The coordinates of the nodes in the graph.
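Example (a sketch; requires networkx and pygraphviz on the system path, and uses matplotlib, an assumption here, for display):
```
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from multi_mst import KMST

X, _ = make_blobs(n_samples=500, random_state=0)
model = KMST().fit(X)

coords = model.graphviz_layout(prog="sfdp")  # (num_points, 2) coordinates
plt.scatter(*coords.T, s=2)
plt.show()
```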
- hbcc(data_labels=None, sample_weights=None, *, num_hops: int = 2, min_cluster_size: int = 25, max_cluster_size: float = inf, hop_type: Literal['manifold', 'metric'] = 'manifold', boundary_connectivity: Literal['knn', 'core'] = 'knn', boundary_use_reachability: bool = True, cluster_selection_method: Literal['eom', 'leaf'] = 'eom', allow_single_cluster: bool = False, cluster_selection_epsilon: float = 0.0, cluster_selection_persistence: float = 0.0, ss_algorithm: Literal['bc', 'bc_simple'] = 'bc')
Constructs and fits an HBCC model to the kMST graph.
- Parameters:
- data_labels: array-like, shape (n_samples,), optional (default=None)
Labels for semi-supervised clustering. If provided, the model will be semi-supervised and will use the provided labels to guide the clustering process.
- sample_weights: array-like, shape (n_samples,), optional (default=None)
Data point weights used to adapt cluster size.
- num_hops: int, default=2
The number of hops used to expand the boundary coefficient connectivity graph.
- hop_type: ‘manifold’ or ‘metric’, default=’manifold’
The type of hop expansion to use. Manifold adds edge distances on traversal, metric computes distance between visited points.
- boundary_connectivity: ‘knn’ or ‘core’, default=’knn’
Which graph to compute the boundary coefficient on. ‘knn’ uses the k-nearest neighbors graph, ‘core’ uses the knn–mst union graph.
- boundary_use_reachability: boolean, default=True
Whether to use mutual reachability or raw distances for the boundary coefficient computation.
- min_cluster_size: int, optional (default=25)
The minimum size of clusters; single linkage splits that contain fewer points than this will be considered points “falling out” of a cluster rather than a cluster splitting into two new clusters.
- cluster_selection_method: string, optional (default=’eom’)
The method used to select clusters from the condensed tree. The standard approach for HDBSCAN* is to use an Excess of Mass algorithm to find the most persistent clusters. Alternatively you can instead select the clusters at the leaves of the tree – this provides the most fine-grained and homogeneous clusters. Options are ‘eom’ and ‘leaf’.
- allow_single_cluster: bool, optional (default=False)
By default HDBSCAN* will not produce a single cluster; setting this to True will override this and allow single-cluster results in the case that you feel this is a valid result for your dataset.
- cluster_selection_epsilon: float, optional (default=0.0)
A distance threshold. Clusters below this value will be merged. This is the minimum epsilon allowed.
- cluster_selection_persistence: float, optional (default=0.0)
A persistence threshold. Clusters with a persistence lower than this value will be merged.
- ss_algorithm: string, optional (default=’bc’)
The semi-supervised clustering algorithm to use. Valid options are ‘bc’ and ‘bc_simple’.
- Returns:
- clusterer: HBCC
The fitted HBCC model.
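Example (a sketch with illustrative parameter values; labels_ on the returned model is assumed to follow the sklearn convention):
```
from sklearn.datasets import make_blobs
from multi_mst import KMST

X, _ = make_blobs(n_samples=500, random_state=0)
model = KMST().fit(X)

# Boundary coefficient clustering on the kMST graph.
clusterer = model.hbcc(min_cluster_size=25, boundary_connectivity="knn")
print(clusterer.labels_)
```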
- hdbscan(data_labels=None, sample_weights=None, *, min_cluster_size: int = 25, max_cluster_size: float = inf, allow_single_cluster: bool = False, cluster_selection_method: Literal['eom', 'leaf'] = 'eom', cluster_selection_epsilon: float = 0.0, cluster_selection_persistence: float = 0.0, ss_algorithm: Literal['bc', 'bc_simple'] = 'bc')
Constructs and fits an HDBSCAN [1] model to the kMST graph.
- Parameters:
- data_labels: array-like, shape (n_samples,), optional (default=None)
Labels for semi-supervised clustering. If provided, the model will be semi-supervised and will use the provided labels to guide the clustering process.
- sample_weights: array-like, shape (n_samples,), optional (default=None)
Data point weights used to adapt cluster size.
- min_cluster_size: int, optional (default=25)
The minimum size of clusters; single linkage splits that contain fewer points than this will be considered points “falling out” of a cluster rather than a cluster splitting into two new clusters.
- cluster_selection_method: string, optional (default=’eom’)
The method used to select clusters from the condensed tree. The standard approach for HDBSCAN* is to use an Excess of Mass algorithm to find the most persistent clusters. Alternatively you can instead select the clusters at the leaves of the tree – this provides the most fine-grained and homogeneous clusters. Options are ‘eom’ and ‘leaf’.
- allow_single_cluster: bool, optional (default=False)
By default HDBSCAN* will not produce a single cluster; setting this to True will override this and allow single-cluster results in the case that you feel this is a valid result for your dataset.
- cluster_selection_epsilon: float, optional (default=0.0)
A distance threshold. Clusters below this value will be merged. This is the minimum epsilon allowed.
- cluster_selection_persistence: float, optional (default=0.0)
A persistence threshold. Clusters with a persistence lower than this value will be merged.
- ss_algorithm: string, optional (default=’bc’)
The semi-supervised clustering algorithm to use. Valid options are ‘bc’ and ‘bc_simple’.
- Returns:
- clusterer: HDBSCAN
The fitted HDBSCAN model.
References
[1] McInnes L., Healy J. 2017. Accelerated Hierarchical Density Based Clustering. IEEE International Conference on Data Mining Workshops (ICDMW), pp. 33-42. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8215642
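Example (a sketch; labels_ is the usual sklearn-style attribute, as used in the tsne()/umap() snippets below):
```
from sklearn.datasets import make_blobs
from multi_mst import KMST

X, _ = make_blobs(n_samples=500, random_state=0)
model = KMST().fit(X)

clusterer = model.hdbscan(min_cluster_size=25, cluster_selection_method="leaf")
print(clusterer.labels_)  # per-point cluster labels
```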
- remap_indices()
Remaps the indices of the kNN and kMST graphs to the original raw data.
- tsne(*, n_components: int = 2, init: ndarray | spmatrix | Literal['random', 'pca'] = 'pca', learning_rate: float | Literal['auto'] = 'auto', early_exaggeration: float = 12.0, min_grad_norm: float = 1e-07, max_iter: int = 1000, n_iter_without_progress: int = 300, method: Literal['barnes_hut', 'exact'] = 'barnes_hut', angle: float = 0.5, random_state: RandomState | int | None = None, verbose: int = 0)
Constructs and fits a TSNE model to the kMST graph.
Unlike HDBSCAN and HBCC, TSNE does not support infinite data. To ensure all TSNE’s member functions work as expected, the TSNE model is NOT remapped to the infinite data after fitting. As a result, code combining TSNE and HDBSCAN results needs to consider the finite index:
```
plt.scatter(*tsne.embedding_.T, c=hdbscan.labels_[multi_mst.finite_index])
```
- Parameters:
- n_components: int, default=2
Dimension of the embedded space.
- init: {“random”, “pca”} or array of shape (n_samples, n_components), default=”pca”
Initialization of embedding.
- learning_rate: float or “auto”, default=”auto”
The learning rate for t-SNE is usually in the range [10.0, 1000.0]. If the learning rate is too high, the data may look like a ‘ball’ with any point approximately equidistant from its nearest neighbors. If the learning rate is too low, most points may look compressed in a dense cloud with few outliers. If the cost function gets stuck in a bad local minimum increasing the learning rate may help.
- early_exaggeration: float, default=12.0
Controls how tight natural clusters in the original space are in the embedded space and how much space will be between them. For larger values, the space between natural clusters will be larger in the embedded space. The choice of this parameter is not very critical. If the cost function increases during initial optimization, the early exaggeration factor or the learning rate might be too high.
- min_grad_norm: float, default=1e-7
If the gradient norm is below this threshold, the optimization will be stopped.
- max_iter: int, default=1000
Maximum number of iterations for the optimization. Should be at least 250.
- n_iter_without_progress: int, default=300
Maximum number of iterations without progress before we abort the optimization, used after 250 initial iterations with early exaggeration. Note that progress is only checked every 50 iterations so this value is rounded to the next multiple of 50.
- method: {‘barnes_hut’, ‘exact’}, default=’barnes_hut’
By default the gradient calculation algorithm uses Barnes-Hut approximation running in O(NlogN) time. method=’exact’ will run on the slower, but exact, algorithm in O(N^2) time. The exact algorithm should be used when nearest-neighbor errors need to be better than 3%. However, the exact method cannot scale to millions of examples.
- angle: float, default=0.5
This is the trade-off between speed and accuracy for Barnes-Hut T-SNE. ‘angle’ is the angular size of a distant node as measured from a point. If this size is below ‘angle’ then it is used as a summary node of all points contained within it. This method is not very sensitive to changes in this parameter in the range of 0.2 - 0.8. Angle less than 0.2 has quickly increasing computation time and angle greater 0.8 has quickly increasing error.
- random_state: int, RandomState instance or None, default=None
Determines the random number generator. Pass an int for reproducible results across multiple function calls. Note that different initializations might result in different local minima of the cost function.
- verbose: int, default=0
Verbosity level.
- Returns:
- tsne: TSNE
The fitted TSNE model.
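Example (a sketch combining tsne() and hdbscan() that mirrors the finite-index snippet above; matplotlib is assumed for plotting):
```
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from multi_mst import KMST

X, _ = make_blobs(n_samples=500, random_state=0)
model = KMST().fit(X)

tsne = model.tsne(n_components=2, random_state=0)
clusterer = model.hdbscan(min_cluster_size=25)
# TSNE is not remapped to infinite data; index labels with finite_index.
plt.scatter(*tsne.embedding_.T, c=clusterer.labels_[model.finite_index], s=2)
plt.show()
```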
- umap(*, n_components: int = 2, output_metric: Callable | str = 'euclidean', output_metric_kwds: dict | None = None, n_epochs: int | None = None, learning_rate: float = 1.0, init: str | Any = 'spectral', min_dist: float = 0.1, spread: float = 1.0, set_op_mix_ratio: float = 1.0, local_connectivity: float = 1.0, repulsion_strength: float = 1.0, negative_sample_rate: int = 5, a: float | None = None, b: float | None = None, random_state: int | Any | None = None, target_n_neighbors: int = -1, target_metric: Callable | str = 'categorical', target_metric_kwds: dict | None = None, target_weight: float = 0.5, transform_seed: int = 42, transform_mode: Literal['embedding', 'graph'] = 'embedding', verbose: bool = False, tqdm_kwds: dict | None = None, densmap: bool = False, dens_lambda: float = 2.0, dens_frac: float = 0.3, dens_var_shift: float = 0.1, output_dens: bool = False, disconnection_distance: float | None = None)
Constructs and fits a UMAP model [1] to the kMST graph.
Unlike HDBSCAN and HBCC, UMAP does not support infinite data. To ensure all UMAP’s member functions work as expected, the UMAP model is NOT remapped to the infinite data after fitting. As a result, code combining UMAP and HDBSCAN results needs to consider the finite index:
```
plt.scatter(*umap.embedding_.T, c=hdbscan.labels_[multi_mst.finite_index])
```
- Parameters:
- n_components: int (optional, default 2)
The dimension of the space to embed into. This defaults to 2 to provide easy visualization, but can reasonably be set to any integer value in the range 2 to 100.
- n_epochs: int (optional, default None)
The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. If None is specified a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small).
- learning_rate: float (optional, default 1.0)
The initial learning rate for the embedding optimization.
- init: string (optional, default ‘spectral’)
How to initialize the low dimensional embedding. Options are:
‘spectral’: use a spectral embedding of the fuzzy 1-skeleton.
‘random’: assign initial embedding positions at random.
‘pca’: use the first n_components from PCA applied to the input data.
‘tswspectral’: use a spectral embedding of the fuzzy 1-skeleton, using a truncated singular value decomposition to “warm” up the eigensolver. This is intended as an alternative to the ‘spectral’ method, if that takes an excessively long time to complete initialization (or fails to complete).
Alternatively, a numpy array of initial embedding positions can be given.
- min_dist: float (optional, default 0.1)
The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result in a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out.
- spread: float (optional, default 1.0)
The effective scale of embedded points. In combination with min_dist this determines how clustered/clumped the embedded points are.
- set_op_mix_ratio: float (optional, default 1.0)
Interpolate between (fuzzy) union and intersection as the set operation used to combine local fuzzy simplicial sets to obtain a global fuzzy simplicial sets. Both fuzzy set operations use the product t-norm. The value of this parameter should be between 0.0 and 1.0; a value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.
- local_connectivity: int (optional, default 1)
The local connectivity required – i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally. In practice this should be not more than the local intrinsic dimension of the manifold.
- repulsion_strength: float (optional, default 1.0)
Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples.
- negative_sample_rate: int (optional, default 5)
The number of negative samples to select per positive sample in the optimization process. Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.
- a: float (optional, default None)
More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.
- b: float (optional, default None)
More specific parameters controlling the embedding. If None these values are set automatically as determined by min_dist and spread.
- random_state: int, RandomState instance or None (optional, default None)
If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
- target_n_neighbors: int (optional, default -1)
The number of nearest neighbors to use to construct the target simplicial set. If set to -1, use the n_neighbors value.
- target_metric: string or callable (optional, default ‘categorical’)
The metric used to measure distance for a target array if using supervised dimension reduction. By default this is ‘categorical’ which will measure distance in terms of whether categories match or are different. Furthermore, if semi-supervised is required, target values of -1 will be treated as unlabelled under the ‘categorical’ metric. If the target array takes continuous values (e.g. for a regression problem) then a metric of ‘l1’ or ‘l2’ is probably more appropriate.
- target_metric_kwds: dict (optional, default None)
Keyword argument to pass to the target metric when performing supervised dimension reduction. If None then no arguments are passed on.
- target_weight: float (optional, default 0.5)
Weighting factor between data topology and target topology. A value of 0.0 weights predominantly on data, a value of 1.0 places a strong emphasis on target. The default of 0.5 balances the weighting equally between data and target.
- transform_seed: int (optional, default 42)
Random seed used for the stochastic aspects of the transform operation. This ensures consistency in transform operations.
- verbose: bool (optional, default False)
Controls verbosity of logging.
- tqdm_kwds: dict (optional, default None)
Keyword arguments to be used by the tqdm progress bar.
- densmap: bool (optional, default False)
Specifies whether the density-augmented objective of densMAP should be used for optimization. Turning on this option generates an embedding where the local densities are encouraged to be correlated with those in the original space. Parameters below with the prefix ‘dens’ further control the behavior of this extension.
- dens_lambda: float (optional, default 2.0)
Controls the regularization weight of the density correlation term in densMAP. Higher values prioritize density preservation over the UMAP objective, and vice versa for values closer to zero. Setting this parameter to zero is equivalent to running the original UMAP algorithm.
- dens_frac: float (optional, default 0.3)
Controls the fraction of epochs (between 0 and 1) where the density-augmented objective is used in densMAP. The first (1 - dens_frac) fraction of epochs optimize the original UMAP objective before introducing the density correlation term.
- dens_var_shift: float (optional, default 0.1)
A small constant added to the variance of local radii in the embedding when calculating the density correlation objective to prevent numerical instability from dividing by a small number.
- output_dens: bool (optional, default False)
Determines whether the local radii of the final embedding (an inverse measure of local density) are computed and returned in addition to the embedding. If set to True, local radii of the original data are also included in the output for comparison; the output is a tuple (embedding, original local radii, embedding local radii). This option can also be used when densmap=False to calculate the densities for UMAP embeddings.
- disconnection_distance: float (optional, default np.inf or maximal value for bounded distances)
Disconnect any vertices of distance greater than or equal to disconnection_distance when approximating the manifold via our k-nn graph. This is particularly useful in the case that you have a bounded metric. The UMAP assumption that we have a connected manifold can be problematic when you have points that are maximally different from all the rest of your data. The connected manifold assumption will make such points have perfect similarity to a random set of other points. Too many such points will artificially connect your space.
- Returns:
- umap: UMAP
The fitted UMAP model.
References
[1] McInnes, L., Healy, J. and Melville, J. 2018. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426.
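Example (a sketch mirroring the finite-index snippet above; matplotlib is assumed for plotting):
```
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from multi_mst import KMST

X, _ = make_blobs(n_samples=500, random_state=0)
model = KMST().fit(X)

umap = model.umap(min_dist=0.1, random_state=0)
clusterer = model.hdbscan(min_cluster_size=25)
plt.scatter(*umap.embedding_.T, c=clusterer.labels_[model.finite_index], s=2)
plt.show()
```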