API Reference
Noisy MST
- class multi_mst.noisy_mst.NoisyMST(*, num_trees=3, noise_fraction=0.1, min_samples=1, umap_kwargs=None)
An SKLEARN-style estimator for computing a union of k noisy MSTs for the given data. Adapts the boruvka algorithm construct multiple noisy miminum spanning trees.
The algorithm operates on HDBSCAN’s mutual reachability Euclidean distance. The resulting graph is embedded with UMAP as if it contains normal k nearest neighbors.
- Parameters:
- num_trees: int, optional
The number of minimum spanning trees created. Default is 3.
- noise_fraction:
Adds Gaussian noise with scale=noise_fraction * distance to every computed distance value.
- min_samples: int, optional
The number of neighbors for computing the mutual reachability distance. Value must be lower or equal to the number of neighbors. epsilon operates on the mutual reachability distance, so always allows the nearest min_samples points. Acts as UMAP’s local connnectivity parameter. Default is 1.
- umap_kwargs: dict
Additional keyword arguments to pass to UMAP.
- Attributes:
- graph_: scipy.sparse matrix
The computed k-minimum spanning tree as sparse matrix with edge weights as UMAP edge probability.
- embedding_: numpy.ndarray, shape (n_samples, num_components)
The low dimensional embedding of the data. None if transform_mode is ‘graph’.
- mst_indices_: numpy.ndarray, shape (n_samples, num_found_neighbors)
The kMST edges in kNN format.
- mst_distances_: numpy.ndarray, shape (n_samples, num_found_neighbors)
The kMST edges in kNN format.
- _umap: umap.UMAP
A fitted UMAP object used to compute the graph and embedding. Observations with infinite or missing values are not passed to the UMAP algorithm. The object’s attributes are not adjusted to account for these missing values. Use the graph_ and embedding_ attributes instead!
- fit(X, y=None, **fit_params)
Computes the k-MST of the given data.
- Parameters:
- X: array-like
The data to construct the MST for.
- y: array-like, optional
Ignored.
- **fit_params: dict
Ignored.
- Returns:
- self: KMST
The fitted estimator.
- fit_transform(X, y=None, **fit_params)
Computes the k-MST of the given data.
- Parameters:
- X: array-like
The data to construct the MST for.
- y: array-like, optional
Ignored.
- **fit_params: dict
Ignored.
- Returns:
- embedding: numpy.ndarray, shape (n_samples, num_components)
The computed low dimensional embedding. None if transform_mode is ‘graph’.
- multi_mst.noisy_mst.noisyMST(data, num_trees=3, noise_fraction=0.1, min_samples=1, umap_kwargs=None)
Computes a union of k noisy MSTs for the given data. Adapts the boruvka algorithm construct multiple noisy miminum spanning trees.
The algorithm operates on HDBSCAN’s mutual reachability Euclidean distance. The resulting graph is embedded with UMAP as if it contains normal k nearest neighbors.
- Parameters:
- data: array-like
The data to construct a MST for.
- noise_fraction:
Adds Gaussian noise with scale=noise_fraction * distance to every computed distance value.
- num_trees: int, optional
The number of noisy MSTS to create. Default is 3.
- min_samples: int, optional
The number of neighbors for computing the mutual reachability distance. Value must be lower or equal to the number of neighbors. epsilon operates on the mutual reachability distance, so always allows the nearest min_samples points. Acts as UMAP’s local connnectivity parameter. Default is 1.
- umap_kwargs: dict
Additional keyword arguments to pass to UMAP.
- Returns:
- mst_indices_: numpy.ndarray, shape (n_samples, num_found_neighbors)
The noisyMST edges in kNN format.
- mst_distances_: numpy.ndarray, shape (n_samples, num_found_neighbors)
The noisyMST edges in kNN format.
- umap: umap.UMAP
A fitted UMAP object.
k-MST
- class multi_mst.k_mst.KMST(*, num_neighbors=3, min_samples=1, epsilon=None, umap_kwargs=None)
An SKLEARN-style estimator for computing a k-MST of a dataset. Adapts the boruvka algorithm to look for k candidate edges per point, of which the k best per connected component are retained (up to epsilon times the shortest distance).
The algorithm operates on HDBSCAN’s mutual reachability Euclidean distance. The resulting graph is embedded with UMAP as if it contains normal k nearest neighbors.
- Parameters:
- num_neighbors: int, optional
The number of edges to connect between each fragement. Default is 35.
- min_samples: int, optional
The number of neighbors for computing the mutual reachability distance. Value must be lower or equal to the number of neighbors. epsilon operates on the mutual reachability distance, so always allows the nearest min_samples points. Acts as UMAP’s local connnectivity parameter. Default is 1.
- epsilon: float, optional
A fraction of the initial MST edge distance to act as upper distance bound.
- umap_kwargs: dict
Additional keyword arguments to pass to UMAP.
- Attributes:
- graph_: scipy.sparse matrix
The computed k-minimum spanning tree as sparse matrix with edge weights as UMAP edge probability.
- embedding_: numpy.ndarray, shape (n_samples, num_components)
The low dimensional embedding of the data. None if transform_mode is ‘graph’.
- mst_indices_: numpy.ndarray, shape (n_samples, num_found_neighbors)
The kMST edges in kNN format.
- mst_distances_: numpy.ndarray, shape (n_samples, num_found_neighbors)
The kMST edges in kNN format.
- _umap: umap.UMAP
A fitted UMAP object used to compute the graph and embedding. Observations with infinite or missing values are not passed to the UMAP algorithm. The object’s attributes are not adjusted to account for these missing values. Use the graph_ and embedding_ attributes instead!
- fit(X, y=None, **fit_params)
Computes the k-MST of the given data.
- Parameters:
- X: array-like
The data to construct the MST for.
- y: array-like, optional
Ignored.
- **fit_params: dict
Ignored.
- Returns:
- self: KMST
The fitted estimator.
- fit_transform(X, y=None, **fit_params)
Computes the k-MST of the given data.
- Parameters:
- X: array-like
The data to construct the MST for.
- y: array-like, optional
Ignored.
- **fit_params: dict
Ignored.
- Returns:
- embedding: numpy.ndarray, shape (n_samples, num_components)
The computed low dimensional embedding. None if transform_mode is ‘graph’.
- multi_mst.k_mst.kMST(data, num_neighbors=3, min_samples=1, epsilon=None, umap_kwargs=None)
Computes a k-MST of a dataset. Adapts the boruvka algorithm to look for k candidate edges per point, of which the k best per connected component are retained (up to epsilon times the shortest distance).
The algorithm operates on HDBSCAN’s mutual reachability Euclidean distance. The resulting graph is embedded with UMAP as if it contains normal k nearest neighbors.
- Parameters:
- data: array-like
The data to construct a MST for.
- num_neighbors: int, optional
The number of edges to connect between each fragement. Default is 3.
- min_samples: int, optional
The number of neighbors for computing the mutual reachability distance. Value must be lower or equal to the number of neighbors. epsilon operates on the mutual reachability distance, so always allows the nearest min_samples points. Acts as UMAP’s local connnectivity parameter. Default is 1.
- epsilon: float, optional
A fraction of the initial MST edge distance to act as upper distance bound.
- umap_kwargs: dict
Additional keyword arguments to pass to UMAP.
- Returns:
- mst_indices_: numpy.ndarray, shape (n_samples, num_found_neighbors)
The kMST edges in kNN format.
- mst_distances_: numpy.ndarray, shape (n_samples, num_found_neighbors)
The kMST edges in kNN format.
- umap: umap.UMAP
A fitted UMAP object.
k-MST Descent
- class multi_mst.k_mst_descent.KMSTDescent(*, metric='euclidean', metric_kwds=None, num_neighbors=3, min_samples=1, epsilon=None, min_descent_neighbors=12, umap_kwargs=None, nn_kwargs=None, n_jobs=-1)
An SKLEARN-style estimator for computing approximate k-MSTs using NN-Descent. Adapts the boruvka algorithm to look for k candidate edges per point, of which the k best per connected component are retained (up to epsilon times the shortest distance). Adapts NN-Descent to find MST edges, i.e., neighbours that are not already connected in the MST build up so far.
The algorithm operates on HDBSCAN’s mutual reachability distance, using a configurable distance metric. The result graph is embedded with UMAP as if it contains normal k-nearest neighbors.
- Parameters:
- data: array-like
The data to construct a MST for.
- metric: string or callable (optional, default=’euclidean’)
The metric to use for computing nearest neighbors. If a callable is used it must be a numba njit compiled function. Supported metrics include:
euclidean
manhattan
chebyshev
minkowski
canberra
braycurtis
mahalanobis
wminkowski
seuclidean
cosine
correlation
haversine
hamming
jaccard
dice
russelrao
kulsinski
rogerstanimoto
sokalmichener
sokalsneath
yule
hellinger
wasserstein-1d
Metrics that take arguments (such as minkowski, mahalanobis etc.) can have arguments passed via the metric_kwds dictionary. At this time care must be taken and dictionary elements must be ordered appropriately; this will hopefully be fixed in the future.
- metric_kwds: dict (optional, default {})
Arguments to pass on to the metric, such as the
p
value for Minkowski distance.- num_neighbors: int, optional
The number of edges to connect between each fragement. Default is 3.
- min_samples: int, optional
The number of neighbors for computing the mutual reachability distance. Value must be lower or equal to the number of neighbors. epsilon operates on the mutual reachability distance, so always allows the nearest min_samples points. Acts as UMAP’s local connnectivity parameter. Default is 1.
- epsilon: float, optional
A fraction of the initial MST edge distance to act as upper distance bound.
- min_descent_neighbors: int, optional
Runs the descent algorithm with more neighbors than we retain in the MST to improve convergence when num_neighbors is low. Default is 6.
- umap_kwargs: dict
Additional keyword arguments passed to UMAP.
- nn_kwargs: dict
Additional keyword arguments passsed to NNDescent.
- n_jobsint, optional
The number of threads to use for the computation. -1 means using all threads.
- Attributes:
- graph_: scipy.sparse matrix
The computed k-minimum spanning tree as sparse matrix with edge weights as UMAP edge probability.
- embedding_: numpy.ndarray, shape (n_samples, num_components)
The low dimensional embedding of the data. None if transform_mode is ‘graph’.
- mst_indices_: numpy.ndarray, shape (n_samples, num_found_neighbors)
The kMST edges in kNN format.
- mst_distances_: numpy.ndarray, shape (n_samples, num_found_neighbors)
The kMST edges in kNN format.
- _umap: umap.UMAP
A fitted UMAP object used to compute the graph and embedding. Observations with infinite or missing values are not passed to the UMAP algorithm. The object’s attributes are not adjusted to account for these missing values. Use the graph_ and embedding_ attributes instead!
- fit(X, y=None, **fit_params)
Computes the k-MST of the given data.
- Parameters:
- X: array-like
The data to construct the MST for.
- y: array-like, optional
Ignored.
- **fit_params: dict
Ignored.
- Returns:
- self: DescentMST
The fitted estimator.
- fit_transform(X, y=None, **fit_params)
Computes the k-MST of the given data.
- Parameters:
- X: array-like
The data to construct the MST for.
- y: array-like, optional
Ignored.
- **fit_params: dict
Ignored.
- Returns:
- embedding: numpy.ndarray, shape (n_samples, num_components)
The computed low dimensional embedding. None if transform_mode is ‘graph’.
- multi_mst.k_mst_descent.kMSTDescent(data, metric='euclidean', metric_kwds=None, num_neighbors=3, min_samples=1, epsilon=None, min_descent_neighbors=12, umap_kwargs=None, nn_kwargs=None, n_jobs=-1)
Computes an approximate k-MST using NN-Descent. Adapts the boruvka algorithm to look for k candidate edges per point, of which the k best per connected component are retained (up to epsilon times the shortest distance). Adapts NN-Descent to find MST edges, i.e., neighbours that are not already connected in the MST build up so far.
The algorithm operates on HDBSCAN’s mutual reachability distance, using a configurable distance metric. The result graph is embedded with UMAP as if it contains normal k-nearest neighbors.
- Parameters:
- data: array-like
The data to construct a MST for.
- metric: string or callable (optional, default=’euclidean’)
The metric to use for computing nearest neighbors. If a callable is used it must be a numba njit compiled function. Supported metrics include:
euclidean
manhattan
chebyshev
minkowski
canberra
braycurtis
mahalanobis
wminkowski
seuclidean
cosine
correlation
haversine
hamming
jaccard
dice
russelrao
kulsinski
rogerstanimoto
sokalmichener
sokalsneath
yule
hellinger
wasserstein-1d
Metrics that take arguments (such as minkowski, mahalanobis etc.) can have arguments passed via the metric_kwds dictionary. At this time care must be taken and dictionary elements must be ordered appropriately; this will hopefully be fixed in the future.
- metric_kwds: dict (optional, default {})
Arguments to pass on to the metric, such as the
p
value for Minkowski distance.- num_neighbors: int, optional
The number of edges to connect between each fragement. Default is 3.
- min_samples: int, optional
The number of neighbors for computing the mutual reachability distance. Value must be lower or equal to the number of neighbors. epsilon operates on the mutual reachability distance, so always allows the nearest min_samples points. Acts as UMAP’s local connnectivity parameter. Default is 1.
- epsilon: float, optional
A fraction of the initial MST edge distance to act as upper distance bound.
- min_descent_neighbors: int, optional
Runs the descent algorithm with more neighbors than we retain in the MST to improve convergence when num_neighbors is low. Default is 6.
- umap_kwargs: dict
Additional keyword arguments passed to UMAP.
- nn_kwargs: dict
Additional keyword arguments passsed to NNDescent.
- n_jobsint, optional
The number of threads to use for the computation. -1 means using all threads.
- Returns:
- mst_indices_: numpy.ndarray, shape (n_samples, num_found_neighbors)
The kMST edges in kNN format.
- mst_distances_: numpy.ndarray, shape (n_samples, num_found_neighbors)
The kMST edges in kNN format.
- umap: umap.UMAP
A fitted UMAP object.
Recall logging k-MST Descent
- multi_mst.k_mst_descent_recall._KMSTDescentLogRecall
alias of
KMSTDescentLogRecall
- multi_mst.k_mst_descent_recall._kMSTDescentLogRecall(data, num_neighbors=3, min_samples=1, epsilon=None, min_descent_neighbors=12, umap_kwargs=None, nn_kwargs=None, n_jobs=-1)
Computes approximate k-MSTs using NN-Descent. Adapts the boruvka algorithm to look for k candidate edges per point, of which the k best per connected component are retained (up to epsilon times the shortest distance). Adapts NN-Descent to find MST edges, i.e., neighbours that are not already connected in the MST build up so far.
This version keeps track of the recall between ground-truth and the approximate MST edges. Recall is computed in every descent iteration and in every Boruvka iteration.
The algorithm operates on HDBSCAN’s mutual reachability distance, using a configurable distance metric. The result graph is embedded with UMAP as if it contains normal k-nearest neighbors.
- Parameters:
- data: array-like
The data to construct a MST for.
- num_neighbors: int, optional
The number of edges to connect between each fragement. Default is 3.
- min_samples: int, optional
The number of neighbors for computing the mutual reachability distance. Value must be lower or equal to the number of neighbors. epsilon operates on the mutual reachability distance, so always allows the nearest min_samples points. Acts as UMAP’s local connnectivity parameter. Default is 1.
- epsilon: float, optional
A fraction of the initial MST edge distance to act as upper distance bound.
- min_descent_neighbors: int, optional
Runs the descent algorithm with more neighbors than we retain in the MST to improve convergence when num_neighbors is low. Default is 6.
- umap_kwargs: dict
Additional keyword arguments passed to UMAP.
- nn_kwargs: dict
Additional keyword arguments passsed to NNDescent.
- n_jobsint, optional
The number of threads to use for the computation. -1 means using all threads.
- Returns:
- mst_indices_: numpy.ndarray, shape (n_samples, num_found_neighbors)
The kMST edges in kNN format.
- mst_distances_: numpy.ndarray, shape (n_samples, num_found_neighbors)
The kMST edges in kNN format.
- umap: umap.UMAP
A fitted UMAP object.
- trace: list[dict]
A trace of the recall across Boruvka iterations (outer) and nn descent iterations (inner).