API Reference

Noisy MST

class multi_mst.noisy_mst.NoisyMST(*, num_trees=3, noise_fraction=0.1, min_samples=1, umap_kwargs=None)

An SKLEARN-style estimator for computing a union of k noisy MSTs for the given data. Adapts the boruvka algorithm construct multiple noisy miminum spanning trees.

The algorithm operates on HDBSCAN’s mutual reachability Euclidean distance. The resulting graph is embedded with UMAP as if it contains normal k nearest neighbors.

Parameters:
num_trees: int, optional

The number of minimum spanning trees created. Default is 3.

noise_fraction:

Adds Gaussian noise with scale=noise_fraction * distance to every computed distance value.

min_samples: int, optional

The number of neighbors for computing the mutual reachability distance. Value must be lower or equal to the number of neighbors. epsilon operates on the mutual reachability distance, so always allows the nearest min_samples points. Acts as UMAP’s local connnectivity parameter. Default is 1.

umap_kwargs: dict

Additional keyword arguments to pass to UMAP.

Attributes:
graph_: scipy.sparse matrix

The computed k-minimum spanning tree as sparse matrix with edge weights as UMAP edge probability.

embedding_: numpy.ndarray, shape (n_samples, num_components)

The low dimensional embedding of the data. None if transform_mode is ‘graph’.

mst_indices_: numpy.ndarray, shape (n_samples, num_found_neighbors)

The kMST edges in kNN format.

mst_distances_: numpy.ndarray, shape (n_samples, num_found_neighbors)

The kMST edges in kNN format.

_umap: umap.UMAP

A fitted UMAP object used to compute the graph and embedding. Observations with infinite or missing values are not passed to the UMAP algorithm. The object’s attributes are not adjusted to account for these missing values. Use the graph_ and embedding_ attributes instead!

fit(X, y=None, **fit_params)

Computes the k-MST of the given data.

Parameters:
X: array-like

The data to construct the MST for.

y: array-like, optional

Ignored.

**fit_params: dict

Ignored.

Returns:
self: KMST

The fitted estimator.

fit_transform(X, y=None, **fit_params)

Computes the k-MST of the given data.

Parameters:
X: array-like

The data to construct the MST for.

y: array-like, optional

Ignored.

**fit_params: dict

Ignored.

Returns:
embedding: numpy.ndarray, shape (n_samples, num_components)

The computed low dimensional embedding. None if transform_mode is ‘graph’.

multi_mst.noisy_mst.noisyMST(data, num_trees=3, noise_fraction=0.1, min_samples=1, umap_kwargs=None)

Computes a union of k noisy MSTs for the given data. Adapts the boruvka algorithm construct multiple noisy miminum spanning trees.

The algorithm operates on HDBSCAN’s mutual reachability Euclidean distance. The resulting graph is embedded with UMAP as if it contains normal k nearest neighbors.

Parameters:
data: array-like

The data to construct a MST for.

noise_fraction:

Adds Gaussian noise with scale=noise_fraction * distance to every computed distance value.

num_trees: int, optional

The number of noisy MSTS to create. Default is 3.

min_samples: int, optional

The number of neighbors for computing the mutual reachability distance. Value must be lower or equal to the number of neighbors. epsilon operates on the mutual reachability distance, so always allows the nearest min_samples points. Acts as UMAP’s local connnectivity parameter. Default is 1.

umap_kwargs: dict

Additional keyword arguments to pass to UMAP.

Returns:
mst_indices_: numpy.ndarray, shape (n_samples, num_found_neighbors)

The noisyMST edges in kNN format.

mst_distances_: numpy.ndarray, shape (n_samples, num_found_neighbors)

The noisyMST edges in kNN format.

umap: umap.UMAP

A fitted UMAP object.

k-MST

class multi_mst.k_mst.KMST(*, num_neighbors=3, min_samples=1, epsilon=None, umap_kwargs=None)

An SKLEARN-style estimator for computing a k-MST of a dataset. Adapts the boruvka algorithm to look for k candidate edges per point, of which the k best per connected component are retained (up to epsilon times the shortest distance).

The algorithm operates on HDBSCAN’s mutual reachability Euclidean distance. The resulting graph is embedded with UMAP as if it contains normal k nearest neighbors.

Parameters:
num_neighbors: int, optional

The number of edges to connect between each fragement. Default is 35.

min_samples: int, optional

The number of neighbors for computing the mutual reachability distance. Value must be lower or equal to the number of neighbors. epsilon operates on the mutual reachability distance, so always allows the nearest min_samples points. Acts as UMAP’s local connnectivity parameter. Default is 1.

epsilon: float, optional

A fraction of the initial MST edge distance to act as upper distance bound.

umap_kwargs: dict

Additional keyword arguments to pass to UMAP.

Attributes:
graph_: scipy.sparse matrix

The computed k-minimum spanning tree as sparse matrix with edge weights as UMAP edge probability.

embedding_: numpy.ndarray, shape (n_samples, num_components)

The low dimensional embedding of the data. None if transform_mode is ‘graph’.

mst_indices_: numpy.ndarray, shape (n_samples, num_found_neighbors)

The kMST edges in kNN format.

mst_distances_: numpy.ndarray, shape (n_samples, num_found_neighbors)

The kMST edges in kNN format.

_umap: umap.UMAP

A fitted UMAP object used to compute the graph and embedding. Observations with infinite or missing values are not passed to the UMAP algorithm. The object’s attributes are not adjusted to account for these missing values. Use the graph_ and embedding_ attributes instead!

fit(X, y=None, **fit_params)

Computes the k-MST of the given data.

Parameters:
X: array-like

The data to construct the MST for.

y: array-like, optional

Ignored.

**fit_params: dict

Ignored.

Returns:
self: KMST

The fitted estimator.

fit_transform(X, y=None, **fit_params)

Computes the k-MST of the given data.

Parameters:
X: array-like

The data to construct the MST for.

y: array-like, optional

Ignored.

**fit_params: dict

Ignored.

Returns:
embedding: numpy.ndarray, shape (n_samples, num_components)

The computed low dimensional embedding. None if transform_mode is ‘graph’.

multi_mst.k_mst.kMST(data, num_neighbors=3, min_samples=1, epsilon=None, umap_kwargs=None)

Computes a k-MST of a dataset. Adapts the boruvka algorithm to look for k candidate edges per point, of which the k best per connected component are retained (up to epsilon times the shortest distance).

The algorithm operates on HDBSCAN’s mutual reachability Euclidean distance. The resulting graph is embedded with UMAP as if it contains normal k nearest neighbors.

Parameters:
data: array-like

The data to construct a MST for.

num_neighbors: int, optional

The number of edges to connect between each fragement. Default is 3.

min_samples: int, optional

The number of neighbors for computing the mutual reachability distance. Value must be lower or equal to the number of neighbors. epsilon operates on the mutual reachability distance, so always allows the nearest min_samples points. Acts as UMAP’s local connnectivity parameter. Default is 1.

epsilon: float, optional

A fraction of the initial MST edge distance to act as upper distance bound.

umap_kwargs: dict

Additional keyword arguments to pass to UMAP.

Returns:
mst_indices_: numpy.ndarray, shape (n_samples, num_found_neighbors)

The kMST edges in kNN format.

mst_distances_: numpy.ndarray, shape (n_samples, num_found_neighbors)

The kMST edges in kNN format.

umap: umap.UMAP

A fitted UMAP object.

k-MST Descent

class multi_mst.k_mst_descent.KMSTDescent(*, metric='euclidean', metric_kwds=None, num_neighbors=3, min_samples=1, epsilon=None, min_descent_neighbors=12, umap_kwargs=None, nn_kwargs=None, n_jobs=-1)

An SKLEARN-style estimator for computing approximate k-MSTs using NN-Descent. Adapts the boruvka algorithm to look for k candidate edges per point, of which the k best per connected component are retained (up to epsilon times the shortest distance). Adapts NN-Descent to find MST edges, i.e., neighbours that are not already connected in the MST build up so far.

The algorithm operates on HDBSCAN’s mutual reachability distance, using a configurable distance metric. The result graph is embedded with UMAP as if it contains normal k-nearest neighbors.

Parameters:
data: array-like

The data to construct a MST for.

metric: string or callable (optional, default=’euclidean’)

The metric to use for computing nearest neighbors. If a callable is used it must be a numba njit compiled function. Supported metrics include:

  • euclidean

  • manhattan

  • chebyshev

  • minkowski

  • canberra

  • braycurtis

  • mahalanobis

  • wminkowski

  • seuclidean

  • cosine

  • correlation

  • haversine

  • hamming

  • jaccard

  • dice

  • russelrao

  • kulsinski

  • rogerstanimoto

  • sokalmichener

  • sokalsneath

  • yule

  • hellinger

  • wasserstein-1d

Metrics that take arguments (such as minkowski, mahalanobis etc.) can have arguments passed via the metric_kwds dictionary. At this time care must be taken and dictionary elements must be ordered appropriately; this will hopefully be fixed in the future.

metric_kwds: dict (optional, default {})

Arguments to pass on to the metric, such as the p value for Minkowski distance.

num_neighbors: int, optional

The number of edges to connect between each fragement. Default is 3.

min_samples: int, optional

The number of neighbors for computing the mutual reachability distance. Value must be lower or equal to the number of neighbors. epsilon operates on the mutual reachability distance, so always allows the nearest min_samples points. Acts as UMAP’s local connnectivity parameter. Default is 1.

epsilon: float, optional

A fraction of the initial MST edge distance to act as upper distance bound.

min_descent_neighbors: int, optional

Runs the descent algorithm with more neighbors than we retain in the MST to improve convergence when num_neighbors is low. Default is 6.

umap_kwargs: dict

Additional keyword arguments passed to UMAP.

nn_kwargs: dict

Additional keyword arguments passsed to NNDescent.

n_jobsint, optional

The number of threads to use for the computation. -1 means using all threads.

Attributes:
graph_: scipy.sparse matrix

The computed k-minimum spanning tree as sparse matrix with edge weights as UMAP edge probability.

embedding_: numpy.ndarray, shape (n_samples, num_components)

The low dimensional embedding of the data. None if transform_mode is ‘graph’.

mst_indices_: numpy.ndarray, shape (n_samples, num_found_neighbors)

The kMST edges in kNN format.

mst_distances_: numpy.ndarray, shape (n_samples, num_found_neighbors)

The kMST edges in kNN format.

_umap: umap.UMAP

A fitted UMAP object used to compute the graph and embedding. Observations with infinite or missing values are not passed to the UMAP algorithm. The object’s attributes are not adjusted to account for these missing values. Use the graph_ and embedding_ attributes instead!

fit(X, y=None, **fit_params)

Computes the k-MST of the given data.

Parameters:
X: array-like

The data to construct the MST for.

y: array-like, optional

Ignored.

**fit_params: dict

Ignored.

Returns:
self: DescentMST

The fitted estimator.

fit_transform(X, y=None, **fit_params)

Computes the k-MST of the given data.

Parameters:
X: array-like

The data to construct the MST for.

y: array-like, optional

Ignored.

**fit_params: dict

Ignored.

Returns:
embedding: numpy.ndarray, shape (n_samples, num_components)

The computed low dimensional embedding. None if transform_mode is ‘graph’.

multi_mst.k_mst_descent.kMSTDescent(data, metric='euclidean', metric_kwds=None, num_neighbors=3, min_samples=1, epsilon=None, min_descent_neighbors=12, umap_kwargs=None, nn_kwargs=None, n_jobs=-1)

Computes an approximate k-MST using NN-Descent. Adapts the boruvka algorithm to look for k candidate edges per point, of which the k best per connected component are retained (up to epsilon times the shortest distance). Adapts NN-Descent to find MST edges, i.e., neighbours that are not already connected in the MST build up so far.

The algorithm operates on HDBSCAN’s mutual reachability distance, using a configurable distance metric. The result graph is embedded with UMAP as if it contains normal k-nearest neighbors.

Parameters:
data: array-like

The data to construct a MST for.

metric: string or callable (optional, default=’euclidean’)

The metric to use for computing nearest neighbors. If a callable is used it must be a numba njit compiled function. Supported metrics include:

  • euclidean

  • manhattan

  • chebyshev

  • minkowski

  • canberra

  • braycurtis

  • mahalanobis

  • wminkowski

  • seuclidean

  • cosine

  • correlation

  • haversine

  • hamming

  • jaccard

  • dice

  • russelrao

  • kulsinski

  • rogerstanimoto

  • sokalmichener

  • sokalsneath

  • yule

  • hellinger

  • wasserstein-1d

Metrics that take arguments (such as minkowski, mahalanobis etc.) can have arguments passed via the metric_kwds dictionary. At this time care must be taken and dictionary elements must be ordered appropriately; this will hopefully be fixed in the future.

metric_kwds: dict (optional, default {})

Arguments to pass on to the metric, such as the p value for Minkowski distance.

num_neighbors: int, optional

The number of edges to connect between each fragement. Default is 3.

min_samples: int, optional

The number of neighbors for computing the mutual reachability distance. Value must be lower or equal to the number of neighbors. epsilon operates on the mutual reachability distance, so always allows the nearest min_samples points. Acts as UMAP’s local connnectivity parameter. Default is 1.

epsilon: float, optional

A fraction of the initial MST edge distance to act as upper distance bound.

min_descent_neighbors: int, optional

Runs the descent algorithm with more neighbors than we retain in the MST to improve convergence when num_neighbors is low. Default is 6.

umap_kwargs: dict

Additional keyword arguments passed to UMAP.

nn_kwargs: dict

Additional keyword arguments passsed to NNDescent.

n_jobsint, optional

The number of threads to use for the computation. -1 means using all threads.

Returns:
mst_indices_: numpy.ndarray, shape (n_samples, num_found_neighbors)

The kMST edges in kNN format.

mst_distances_: numpy.ndarray, shape (n_samples, num_found_neighbors)

The kMST edges in kNN format.

umap: umap.UMAP

A fitted UMAP object.

Recall logging k-MST Descent

multi_mst.k_mst_descent_recall._KMSTDescentLogRecall

alias of KMSTDescentLogRecall

multi_mst.k_mst_descent_recall._kMSTDescentLogRecall(data, num_neighbors=3, min_samples=1, epsilon=None, min_descent_neighbors=12, umap_kwargs=None, nn_kwargs=None, n_jobs=-1)

Computes approximate k-MSTs using NN-Descent. Adapts the boruvka algorithm to look for k candidate edges per point, of which the k best per connected component are retained (up to epsilon times the shortest distance). Adapts NN-Descent to find MST edges, i.e., neighbours that are not already connected in the MST build up so far.

This version keeps track of the recall between ground-truth and the approximate MST edges. Recall is computed in every descent iteration and in every Boruvka iteration.

The algorithm operates on HDBSCAN’s mutual reachability distance, using a configurable distance metric. The result graph is embedded with UMAP as if it contains normal k-nearest neighbors.

Parameters:
data: array-like

The data to construct a MST for.

num_neighbors: int, optional

The number of edges to connect between each fragement. Default is 3.

min_samples: int, optional

The number of neighbors for computing the mutual reachability distance. Value must be lower or equal to the number of neighbors. epsilon operates on the mutual reachability distance, so always allows the nearest min_samples points. Acts as UMAP’s local connnectivity parameter. Default is 1.

epsilon: float, optional

A fraction of the initial MST edge distance to act as upper distance bound.

min_descent_neighbors: int, optional

Runs the descent algorithm with more neighbors than we retain in the MST to improve convergence when num_neighbors is low. Default is 6.

umap_kwargs: dict

Additional keyword arguments passed to UMAP.

nn_kwargs: dict

Additional keyword arguments passsed to NNDescent.

n_jobsint, optional

The number of threads to use for the computation. -1 means using all threads.

Returns:
mst_indices_: numpy.ndarray, shape (n_samples, num_found_neighbors)

The kMST edges in kNN format.

mst_distances_: numpy.ndarray, shape (n_samples, num_found_neighbors)

The kMST edges in kNN format.

umap: umap.UMAP

A fitted UMAP object.

trace: list[dict]

A trace of the recall across Boruvka iterations (outer) and nn descent iterations (inner).