FLASC* is a branch-aware clustering algorithm that builds upon
hdbscan to detect branching structures within clusters. The
algorithm returns a labelling that separates noise points from clusters and
branches from each other. In addition, the single-linkage and
condensed-linkage hierarchies are provided for both the clustering and the
branch detection stages.
Performs hdbscan clustering with a flare-detection post-processing step.
FLASC - Flare-Sensitive Clustering. Performs hdbscan clustering
[1]_ with a post-processing step to detect branches within individual
clusters. For each cluster, a graph is constructed connecting the data
points based on their mutual reachability distances. Each edge is given a
centrality value based on how far it lies from the cluster’s center. Then,
the edges are clustered as if that centrality was a distance, progressively
removing the ‘center’ of each cluster and seeing how many branches remain.
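The mutual reachability distance underlying these graphs can be sketched in a few lines of numpy. This is an illustrative reimplementation, not the library's internal code; the function name and the tiny dataset are made up for the example:

```python
# Sketch of the mutual reachability distance that FLASC's cluster graphs
# are built on (illustrative only; the library computes this internally).
import numpy as np

def mutual_reachability(X, min_samples=2):
    """Pairwise mutual reachability distances for a small feature array."""
    # Euclidean distance matrix.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    # Core distance: distance to the min_samples-th neighbour (self included).
    core = np.sort(dist, axis=1)[:, min_samples - 1]
    # Mutual reachability: max(core_i, core_j, d_ij).
    return np.maximum(dist, np.maximum(core[:, None], core[None, :]))

X = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 2.0], [5.0, 0.0]])
mreach = mutual_reachability(X, min_samples=2)
```

Note how the isolated point at (5, 0) inflates its own core distance, which in turn inflates every mutual reachability distance involving it; this is what makes the construction robust to noise.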
Parameters:
X (array of shape (n_samples, n_features), or array of shape (n_samples, n_samples)) – A feature array, or array of distances between samples if
metric='precomputed'.
min_cluster_size (int, optional (default=5)) – The minimum number of samples in a group for that group to be
considered a cluster; groupings smaller than this size will be left
as noise.
min_branch_size (int, optional (default=None)) – The minimum number of samples in a group for that group to be
considered a branch; groupings smaller than this size will be seen as
points falling out of a branch. Defaults to the min_cluster_size.
min_samples (int, optional (default=None)) – The number of samples in a neighborhood for a point
to be considered as a core point. This includes the point itself.
Defaults to the min_cluster_size.
metric (str or callable, optional (default='minkowski')) – The metric to use when calculating distance between instances in a
feature array. If metric is a string or callable, it must be one of
the options allowed by metrics.pairwise.pairwise_distances for its
metric parameter.
If metric is “precomputed”, X is assumed to be a distance matrix and
must be square.
p (int, optional (default=2)) – p value to use if using the minkowski metric.
alpha (float, optional (default=1.0)) – A distance scaling parameter as used in robust single linkage.
See [2]_ for more information.
algorithm (str, optional (default='best')) – Exactly which algorithm to use; hdbscan has variants specialised
for different characteristics of the data. By default this is set
to best which chooses the “best” algorithm given the nature of
the data. You can force other options if you believe you know
better. Options are:
best
generic
prims_kdtree
prims_balltree
boruvka_kdtree
boruvka_balltree
leaf_size (int, optional (default=40)) – Leaf size for trees responsible for fast nearest
neighbour queries.
approx_min_span_tree (bool, optional (default=True)) – Whether to accept an only approximate minimum spanning tree.
For some algorithms this can provide a significant speedup, but
the resulting clustering may be of marginally lower quality.
If you are willing to sacrifice speed for correctness you may want
to explore this; in general this should be left at the default True.
cluster_selection_method (str, optional (default='eom')) – The method used to select clusters from the condensed tree. The
standard approach for FLASC is to use an Excess of Mass algorithm
to find the most persistent clusters. Alternatively you can instead
select the clusters at the leaves of the tree – this provides the
most fine grained and homogeneous clusters. Options are:
eom
leaf
allow_single_cluster (bool, optional (default=False)) – By default HDBSCAN* will not produce a single cluster; setting this
to True will override that behavior and allow single-cluster results in
the case that you feel this is a valid result for your dataset.
cluster_selection_epsilon (float, optional (default=0.0)) – A distance threshold. Clusters below this value will be merged.
See [3]_ for more information. Note that this should not be used
if we want to predict the cluster labels for new points in future
(e.g. using approximate_predict()), as that function is not aware
of this argument.
max_cluster_size (int, optional (default=0)) – A limit to the size of clusters returned by the eom algorithm.
Has no effect when using leaf clustering (where clusters are
usually small regardless) and can also be overridden in rare
cases by a high value for cluster_selection_epsilon. Note that
this should not be used if we want to predict the cluster labels
for new points in future
(e.g. using approximate_predict()), as
the approximate_predict function is not aware of this argument.
allow_single_branch (bool, optional (default=False)) – Analogous to allow_single_cluster. Note that depending on
label_sides_as_branches FLASC* requires at least 3 branches to
exist in a cluster before they are incorporated in the final labelling.
branch_detection_method (str, optional) – Determines which graph is constructed to detect branches with. Valid
values are, ordered by increasing computation cost and decreasing
sensitivity to noise:
- core: Contains the edges that connect each point to all other
points within a mutual reachability distance lower than or equal to
the point’s core distance. This is the cluster’s subgraph of the
k-NN graph over the entire data set (with k = min_samples).
- full: Contains all edges between points in each cluster with a
mutual reachability distance lower than or equal to the distance of
the most-distant point in each cluster. These graphs represent the
0-dimensional simplicial complex of each cluster at the first point in
the filtration where they contain all their points.
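The contrast between the two graphs can be illustrated on a small precomputed mutual-reachability matrix. The matrix values and core distances below are hypothetical, chosen only to show how many edges each method keeps:

```python
# Illustrative contrast between the two branch_detection_method graphs,
# given a small mutual-reachability matrix for one cluster (made-up values).
import numpy as np

mreach = np.array([
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 0.0, 1.0, 2.0],
    [2.0, 1.0, 0.0, 1.0],
    [3.0, 2.0, 1.0, 0.0],
])
core = np.array([1.0, 1.0, 1.0, 1.0])  # assumed core distances

i, j = np.triu_indices(4, k=1)
# 'core': keep edge (a, b) when it lies within either endpoint's core distance.
core_edges = [(a, b) for a, b in zip(i, j)
              if mreach[a, b] <= max(core[a], core[b])]
# 'full': keep every edge up to the cluster's largest pairwise distance.
threshold = mreach[i, j].max()
full_edges = [(a, b) for a, b in zip(i, j) if mreach[a, b] <= threshold]
```

Here the core method keeps only the chain of short edges, while the full method keeps every pair, which is costlier but less sensitive to noise.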
branch_selection_persistence (float, optional (default=0.0)) – A centrality persistence threshold. Branches with a persistence below
this value will be merged. See [3]_ for more information. Note that this
should not be used if we want to predict the cluster labels for new
points in future (e.g. using approximate_predict), as the
approximate_predict() function
is not aware of this argument.
branch_selection_method (str, optional (default='eom')) – The method used to select branches from the cluster’s condensed tree.
The standard approach for FLASC* is to use the eom approach.
Options are:
eom
leaf
max_branch_size (int, optional (default=0)) – A limit to the size of clusters returned by the eom algorithm.
Has no effect when using leaf clustering (where clusters are
usually small regardless). Note that this should not be used if we
want to predict the cluster labels for new points in future (e.g. using
approximate_predict()), as that function is
not aware of this argument.
label_sides_as_branches (bool, optional (default=False),) – When this flag is False, branches are only labelled for clusters with at
least three branches (i.e., at least y-shapes). Clusters with only two
branches represent l-shapes. The two branches describe the cluster’s
outsides growing towards each other. Enabling this flag separates these
branches from each other in the produced labelling.
override_cluster_labels (np.ndarray, optional (default=None)) – Override the HDBSCAN* clustering to specify your own grouping with a
numpy array containing a cluster label for each data point. Negative
values will be interpreted as noise points. When the parameter is not set
to None, core distances are computed over all data points,
minimum spanning trees and the branches are computed per cluster.
Consequently, the manually specified clusters do not have to form
neatly separable connected components in the minimum spanning tree
over all the data points.
override_cluster_probabilities (np.ndarray, optional (default=None)) – Specifying a non-None value for this parameter is only valid when
override_cluster_labels is used. In that case, this parameter
controls the data point cluster membership probabilities. When this
parameter is None, a default 1.0 probability is used for all points.
memory (instance of joblib.Memory or str, optional) – Used to cache the output of the computation of the tree.
By default, no caching is done. If a string is given, it is the
path to the caching directory.
num_jobs (int, optional (default=None)) – Number of parallel jobs to run in core distance computations and branch
detection step. For num_jobs below -1, (n_cpus + 1 + num_jobs) are
used. By default, the algorithm tries to estimate whether the given input
is large enough for multi-processing to have a benefit. If so, 4 processes
are started, otherwise only the main process is used. When a num_jobs
value is given, that number of jobs is used regardless of the input size.
**kwargs (optional) – Additional arguments passed to hdbscan() or the
distance metric.
A score of how persistent each cluster is. A score of 1.0 represents
a perfectly stable cluster that persists over all distance scales,
while a score of 0.0 represents a perfectly ephemeral cluster. These
scores gauge the relative coherence of the clusters output by the
algorithm. Not available when override_cluster_labels is used.
The graphs used to detect branches in each cluster as an
ApproximationGraph. Can be converted to
a networkx graph, pandas data frame, or a list with numpy array-edgelists.
Points are labelled by their row-index into the input data. The edges
contained in the graph depend on the branch_detection_method:
- core: Contains the edges that connect each point to all other
points in a cluster within a mutual reachability distance lower than
or equal to the point’s core distance. This is an extension of the
minimum spanning tree introducing only edges with equal distances. The
reachability distance introduces num_points * min_samples such edges.
- full: Contains all edges between points in each cluster with a
mutual reachability distance lower than or equal to the distance of
the most-distant point in each cluster. These graphs represent the
0-dimensional simplicial complex of each cluster at the first point in
the filtration where they contain all their points.
Centrality values for each point in a cluster. Overemphasizes points’
eccentricity within the cluster as the values are based on minimum
spanning trees that do not contain the equally distanced edges resulting
from the mutual reachability distance.
A list of exemplar points for clusters. Since HDBSCAN supports
arbitrary shapes for clusters we cannot provide a single cluster
exemplar per cluster. Instead a list is returned with each element
of the list being a numpy array of exemplar points for a cluster –
these points are the “most representative” points of the cluster.
Not available when override_cluster_labels is used or a precomputed
distance matrix is given as input.
A list with exemplar points for the branches in the clusters. A cluster’s
item is empty if it does not have selected branches. For clusters with
selected branches, a list with a numpy array of exemplar points for each
selected branch is given.
HDBSCAN’s fast approximation of the Density Based Cluster Validity
(DBCV) score [4]_ on FLASC’s labelling. It may only be used to compare
results across different choices of hyper-parameters, as it is only a
relative score.
An HDBSCAN clusterer object fitted to the data. Can be used to compute
outlier scores and cluster exemplars. Not available when
override_cluster_labels is used.
X (array of shape (n_samples, n_features), or array of shape (n_samples, n_samples)) – A feature array, or array of distances between samples if
metric='precomputed'.
Performs clustering on X and returns cluster labels.
Parameters:
X (array of shape (n_samples, n_features), or array of shape (n_samples, n_samples)) – A feature array, or array of distances between samples if
metric='precomputed'.
Provides an approximate representative point for a given branch.
Note that this technique assumes a euclidean metric for speed of
computation. For more general metrics use the
weighted_medoid()
method which is slower, but can work with the metric the model trained
with.
Parameters:
label_id (int) – The id of the cluster to compute a centroid for.
data (np.ndarray (n_samples, n_features), optional (default=None)) – A dataset to use instead of the raw data that was clustered on.
Returns:
centroid – A representative centroid for cluster label_id.
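A weighted centroid of this kind is typically a membership-probability-weighted mean of the member points. The sketch below shows that computation; the function name and weights are illustrative, not the library's exact implementation:

```python
# A probability-weighted centroid: mean of the member points weighted by
# their membership probabilities (illustrative sketch, not library code).
import numpy as np

def weighted_centroid(points, probabilities):
    """Mean of points, weighted by membership probability."""
    w = np.asarray(probabilities, dtype=float)
    return (points * w[:, None]).sum(axis=0) / w.sum()

pts = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
probs = np.array([1.0, 1.0, 2.0])
centroid = weighted_centroid(pts, probs)
```

Because the mean is taken coordinate-wise, this only makes sense under a euclidean metric, which is exactly why the docs point to the medoid variant for other metrics.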
Provides an approximate representative point for a given cluster. Note
that this technique assumes a euclidean metric for speed of computation.
For more general metrics use the
weighted_cluster_medoid()
method which is slower, but can work with the metric the model trained
with.
Parameters:
cluster_id (int) – The id of the cluster to compute a centroid for.
Returns:
centroid – A representative centroid for cluster cluster_id.
Provides an approximate representative point for a given cluster.
Note that this technique can be very slow and memory intensive for
large clusters. For faster results use the
weighted_cluster_centroid()
method which is faster, but assumes a euclidean metric.
Parameters:
cluster_id (int) – The id of the cluster to compute a medoid for.
Returns:
centroid – A representative medoid for cluster cluster_id.
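A weighted medoid, by contrast, is an actual data point: the one minimising the probability-weighted sum of distances to the other members. Since it only needs pairwise distances, it works with any metric, at the cost of computing the full distance matrix. A small illustrative sketch (not the library's implementation):

```python
# A weighted medoid: the member point minimising the probability-weighted
# sum of distances to all other members (illustrative sketch).
import numpy as np

def weighted_medoid(points, probabilities):
    """Return the point with the smallest weighted distance sum."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))          # any metric would do here
    cost = (dist * probabilities[None, :]).sum(axis=1)
    return points[np.argmin(cost)]

pts = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
probs = np.array([1.0, 1.0, 1.0])
medoid = weighted_medoid(pts, probs)
```

The quadratic distance matrix is what makes this slow and memory-intensive for large clusters, as the docs warn.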
Provides an approximate representative point for a given branch.
Note that this technique can be very slow and memory intensive for
large clusters. For faster results use the
weighted_centroid()
method which is faster, but assumes a euclidean metric.
Parameters:
label_id (int) – The id of the cluster to compute a medoid for.
data (np.ndarray (n_samples, n_features), optional (default=None)) – A dataset to use instead of the raw data that was clustered on.
Returns:
centroid – A representative medoid for cluster label_id.
Performs FLASC clustering with a flare-detection post-processing step.
FLASC - Flare-Sensitive Clustering.
Performs hdbscan clustering [1]_ with a post-processing step to
detect branches within individual clusters. For each cluster, a graph is
constructed connecting the data points based on their mutual reachability
distances. Each edge is given a centrality value based on how far it lies
from the cluster’s center. Then, the edges are clustered as if that
centrality was a distance, progressively removing the ‘center’ of each
cluster and seeing how many branches remain.
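The "remove the centre, count what remains" idea can be made concrete with a toy example: strip away a cluster's most central points and count the connected components left over. This pure-python sketch uses a union-find over an edge list; it illustrates the intuition only, not the library's actual persistence-based procedure:

```python
# Toy illustration of branch detection: remove a cluster's most central
# points and count the connected components that remain (sketch only).

def count_branches(edges, eccentricity, keep_above):
    """Connected components among points farther than keep_above from the centre."""
    nodes = [p for p in eccentricity if eccentricity[p] > keep_above]
    parent = {p: p for p in nodes}

    def find(p):
        # Union-find root lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in edges:
        if a in parent and b in parent:
            parent[find(a)] = find(b)
    return len({find(p) for p in nodes})

# A y-shaped cluster: node 0 is the centre, three arms hang off it.
edges = [(0, 1), (1, 2), (0, 3), (3, 4), (0, 5), (5, 6)]
eccentricity = {0: 0.0, 1: 1.0, 2: 2.0, 3: 1.0, 4: 2.0, 5: 1.0, 6: 2.0}

whole = count_branches(edges, eccentricity, keep_above=-1.0)  # full cluster
arms = count_branches(edges, eccentricity, keep_above=0.5)    # centre removed
```

With the centre intact the cluster is one component; with it removed, the three arms fall apart, which is how branches become visible.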
Parameters:
X (array of shape (n_samples, n_features), or array of shape (n_samples, n_samples)) – A feature array, or array of distances between samples if
metric='precomputed'.
min_cluster_size (int, optional (default=5)) – The minimum number of samples in a group for that group to be
considered a cluster; groupings smaller than this size will be left
as noise.
min_branch_size (int, optional (default=None)) – The minimum number of samples in a group for that group to be
considered a branch; groupings smaller than this size will be seen as
points falling out of a branch. Defaults to the min_cluster_size.
min_samples (int, optional (default=None)) – The number of samples in a neighborhood for a point
to be considered as a core point. This includes the point itself.
Defaults to the min_cluster_size.
metric (str or callable, optional (default='minkowski')) – The metric to use when calculating distance between instances in a
feature array. If metric is a string or callable, it must be one of
the options allowed by metrics.pairwise.pairwise_distances for its
metric parameter.
If metric is “precomputed”, X is assumed to be a distance matrix and
must be square.
p (int, optional (default=2)) – p value to use if using the minkowski metric.
alpha (float, optional (default=1.0)) – A distance scaling parameter as used in robust single linkage.
See [2]_ for more information.
algorithm (str, optional (default='best')) – Exactly which algorithm to use; hdbscan has variants specialised
for different characteristics of the data. By default this is set
to best which chooses the “best” algorithm given the nature of
the data. You can force other options if you believe you know
better. Options are:
best
generic
prims_kdtree
prims_balltree
boruvka_kdtree
boruvka_balltree
leaf_size (int, optional (default=40)) – Leaf size for trees responsible for fast nearest
neighbour queries.
approx_min_span_tree (bool, optional (default=True)) – Whether to accept an only approximate minimum spanning tree.
For some algorithms this can provide a significant speedup, but
the resulting clustering may be of marginally lower quality.
If you are willing to sacrifice speed for correctness you may want
to explore this; in general this should be left at the default True.
cluster_selection_method (str, optional (default='eom')) – The method used to select clusters from the condensed tree. The
standard approach for FLASC is to use an Excess of Mass algorithm
to find the most persistent clusters. Alternatively you can instead
select the clusters at the leaves of the tree – this provides the
most fine grained and homogeneous clusters. Options are:
eom
leaf
allow_single_cluster (bool, optional (default=False)) – By default FLASC will not produce a single cluster; setting this
to True will override that behavior and allow single-cluster results in
the case that you feel this is a valid result for your dataset.
cluster_selection_epsilon (float, optional (default=0.0)) – A distance threshold. Clusters below this value will be merged.
See [3]_ for more information. Note that this should not be used
if we want to predict the cluster labels for new points in future
(e.g. using approximate_predict), as the approximate_predict function
is not aware of this argument.
max_cluster_size (int, optional (default=0)) – A limit to the size of clusters returned by the eom algorithm.
Has no effect when using leaf clustering (where clusters are
usually small regardless) and can also be overridden in rare
cases by a high value for cluster_selection_epsilon. Note that
this should not be used if we want to predict the cluster labels
for new points in future (e.g. using approximate_predict), as
the approximate_predict function is not aware of this argument.
allow_single_branch (bool, optional (default=False)) – Analogous to allow_single_cluster. Note that depending on
label_sides_as_branches FLASC requires at least 3 branches to
exist in a cluster before they are incorporated in the final labelling.
branch_detection_method (str, optional) – Determines which graph is constructed to detect branches with. Valid
values are, ordered by increasing computation cost and decreasing
sensitivity to noise:
- core: Contains the edges that connect each point to all other
points within a mutual reachability distance lower than or equal to
the point’s core distance. This is the cluster’s subgraph of the
k-NN graph over the entire data set (with k = min_samples).
- full: Contains all edges between points in each cluster with a
mutual reachability distance lower than or equal to the distance of
the most-distant point in each cluster. These graphs represent the
0-dimensional simplicial complex of each cluster at the first point in
the filtration where they contain all their points.
branch_selection_method (str, optional (default='eom')) – The method used to select branches from the cluster’s condensed tree.
The standard approach for FLASC is to use the eom approach.
Options are:
eom
leaf
branch_selection_persistence (float, optional (default=0.0)) – An eccentricity persistence threshold. Branches with a persistence below
this value will be merged. See [3]_ for more information. Note that this
should not be used if we want to predict the cluster labels for new
points in future (e.g. using approximate_predict), as the
approximate_predict() function is not aware of
this argument.
max_branch_size (int, optional (default=0)) – A limit to the size of clusters returned by the eom algorithm.
Has no effect when using leaf clustering (where clusters are
usually small regardless). Note that this should not be used if we
want to predict the cluster labels for new points in future (e.g. using
approximate_predict()), as that function is
not aware of this argument.
label_sides_as_branches (bool, optional (default=False),) – When this flag is False, branches are only labelled for clusters with at
least three branches (i.e., at least y-shapes). Clusters with only two
branches represent l-shapes. The two branches describe the cluster’s
outsides growing towards each other. Enabling this flag separates these
branches from each other in the produced labelling.
override_cluster_labels (np.ndarray, optional (default=None)) – Override the FLASC clustering to specify your own grouping with a
numpy array containing a cluster label for each data point. Negative
values will be interpreted as noise points. When the parameter is not set
to None, core distances are computed over all data points,
minimum spanning trees and the branches are computed per cluster.
Consequently, the manually specified clusters do not have to form
neatly separable connected components in the minimum spanning tree
over all the data points.
Because the clustering step is skipped, some of the output variables
and the approximate_predict() function will
be unavailable:
- cluster_persistence
- condensed_tree
- single_linkage_tree
- min_spanning_tree
override_cluster_probabilities (np.ndarray, optional (default=None)) – Specifying a non-None value for this parameter is only valid when
override_cluster_labels is used. In that case, this parameter
controls the data point cluster membership probabilities. When this
parameter is None, a default 1.0 probability is used for all points.
memory (instance of joblib.Memory or str, optional) – Used to cache the output of the computation of the tree.
By default, no caching is done. If a string is given, it is the
path to the caching directory.
num_jobs (int, optional (default=None)) – Number of parallel jobs to run in core distance computations and branch
detection step. For num_jobs below -1, (n_cpus + 1 + num_jobs) are
used. By default, the algorithm tries to estimate whether the given input
is large enough for multi-processing to have a benefit. If so, 4 processes
are started, otherwise only the main process is used. When a num_jobs
value is given, that number of jobs is used regardless of the input size.
**kwargs (optional) – Additional arguments passed to hdbscan() or the
distance metric.
Returns:
labels (np.ndarray, shape (n_samples, )) – Cluster+branch labels for each point. Noisy samples are given the
label -1.
probabilities (np.ndarray, shape (n_samples, )) – Cluster+branch membership strengths for each point. Noisy samples are
assigned 0.
cluster_labels (np.ndarray, shape (n_samples, )) – Cluster labels for each point. Noisy samples are given the label -1.
cluster_probabilities (np.ndarray, shape (n_samples, )) – Cluster membership strengths for each point. Noisy samples are
assigned 0.
cluster_persistence (array, shape (n_clusters, )) – A score of how persistent each cluster is. A score of 1.0 represents
a perfectly stable cluster that persists over all distance scales,
while a score of 0.0 represents a perfectly ephemeral cluster. These
scores gauge the relative coherence of the clusters output by the
algorithm. Not available when override_cluster_labels is used.
branch_labels (np.ndarray, shape (n_samples, )) – Branch labels for each point. Noisy samples are given the label -1.
branch_probabilities (np.ndarray, shape (n_samples, )) – Branch membership strengths for each point. Noisy samples are
assigned 0.
branch_persistences (tuple (n_clusters)) – A branch persistence (eccentricity range) for each detected branch.
condensed_tree (record array) – The condensed cluster hierarchy used to generate clusters.
Not available when override_cluster_labels is used.
single_linkage_tree (np.ndarray, shape (n_samples - 1, 4)) – The single linkage tree produced during clustering in scipy
hierarchical clustering format
(see http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html).
Not available when override_cluster_labels is used.
min_spanning_tree (np.ndarray, shape (n_samples - 1, 3)) – The minimum spanning tree as an edgelist. Not available when
override_cluster_labels is used.
cluster_approximation_graphs (tuple (n_clusters)) – The graphs used to detect branches in each cluster stored as a numpy
array with four columns: source, target, centrality, mutual reachability
distance. Points are labelled by their row-index into the input data.
The edges contained in the graphs depend on the branch_detection_method:
- core: Contains the edges that connect each point to all other
points in a cluster within a mutual reachability distance lower than
or equal to the point’s core distance. This is an extension of the
minimum spanning tree introducing only edges with equal distances. The
reachability distance introduces num_points * min_samples such
edges.
- full: Contains all edges between points in each cluster with a
mutual reachability distance lower than or equal to the distance of
the most-distant point in each cluster. These graphs represent the
0-dimensional simplicial complex of each cluster at the first point in
the filtration where they contain all their points.
cluster_condensed_trees (tuple (n_clusters)) – A condensed branch hierarchy for each cluster produced during the
branch detection step. Data points are numbered with in-cluster ids.
cluster_linkage_trees (tuple (n_clusters)) – A single linkage tree for each cluster produced during the branch
detection step, in the scipy hierarchical clustering format.
(see http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html).
Data points are numbered with in-cluster ids.
cluster_centralities (np.ndarray, shape (n_samples, )) – Centrality values for each point in a cluster. Overemphasizes points’
eccentricity within the cluster as the values are based on minimum
spanning trees that do not contain the equally distanced edges resulting
from the mutual reachability distance.
cluster_points (list (n_clusters)) – The data point row indices for each cluster.
Predict the cluster label of new points. The returned labels
will be those of the original clustering found by clusterer,
and therefore are not (necessarily) the cluster labels that would
be found by clustering the original data combined with
points_to_predict, hence the ‘approximate’ label.
If you simply wish to assign new points to an existing clustering
in the ‘best’ way possible, this is the function to use. If you
want to predict how points_to_predict would cluster with
the original data under FLASC the most efficient existing approach
is to simply recluster with the new point(s) added to the original dataset.
Parameters:
clusterer (FLASC) – A clustering object that has been fit to vector input data.
points_to_predict (array, or array-like (n_samples, n_features)) – The new data points to predict cluster labels for. They should
have the same dimensionality as the original dataset over which
clusterer was fit.
Returns:
labels (array (n_samples,)) – The predicted labels of the points_to_predict
probabilities (array (n_samples,)) – The soft cluster scores for each of the points_to_predict
cluster_labels (array (n_samples,)) – The predicted cluster labels of the points_to_predict
cluster_probabilities (array (n_samples,)) – The soft cluster scores for each of the points_to_predict
branch_labels (array (n_samples,)) – The predicted branch labels of the points_to_predict
branch_probabilities (array (n_samples,)) – The soft branch scores for each of the points_to_predict
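The intuition behind this kind of approximate prediction is label transfer from the fitted data: each new point inherits the label of the training point it is closest to. The real approximate_predict() works through the fitted trees and core distances; the numpy sketch below (with made-up data) shows only the nearest-neighbour intuition:

```python
# Conceptual sketch of approximate prediction: give each new point the
# label of its nearest already-clustered point. This is the intuition only,
# not what approximate_predict() actually computes.
import numpy as np

train = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
labels = np.array([0, 0, 1, 1])  # labels from a previous (hypothetical) fit

def predict_nearest(points):
    """Label new points by their nearest training point."""
    diff = points[:, None, :] - train[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    return labels[dist.argmin(axis=1)]

new_labels = predict_nearest(np.array([[0.1, 0.1], [5.1, 4.9]]))
```

This also makes clear why the result is only approximate: the new points never get a chance to change the clustering itself, which re-running FLASC on the combined data would allow.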
Predict soft branch-membership vectors for all points in the clusters
with more than two detected branches. Computes geodesic traversal depth
from branch-roots to all points in the cluster.
Parameters:
clusterer (FLASC) – A clusterer object that has been fit to vector input data.
Returns:
centrality_vectors – The centrality value of point i in cluster c of the original
dataset from the root of branch j is in
membership_vectors[c][i,j].
where \(\mathbf{m}\) is the scaled membership vector and
\(\mathbf{c}\) is the branch centrality vector.
Parameters:
branch_centrality_vectors (list[array (n_samples, n_branches)]) – The centrality values from the centroids of a cluster’s branches.
None if the cluster has two or fewer branches.
temperature (float, optional (default=1.0)) – A scaling factor for the softmax function. A higher temperature
makes the distribution more uniform, a lower temperature makes
the distribution more peaked.
Returns:
scaled_branch_memberships – The probabilities of a point belonging to the cluster’s branches.
None if the cluster has two or fewer branches.
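The temperature's effect on the resulting membership distribution is easy to see in isolation. This is a generic temperature-scaled softmax in numpy, not the library's internal routine; the membership values are made up:

```python
# Temperature-scaled softmax: higher temperature flattens the branch
# membership distribution, lower temperature sharpens it (generic sketch).
import numpy as np

def softmax(values, temperature=1.0):
    z = np.asarray(values, dtype=float) / temperature
    z -= z.max()               # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

memberships = np.array([2.0, 1.0, 0.0])   # hypothetical branch scores
sharp = softmax(memberships, temperature=0.5)
flat = softmax(memberships, temperature=5.0)
```

Both outputs are valid probability vectors; the low-temperature one concentrates mass on the strongest branch, the high-temperature one spreads it nearly uniformly.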
Updates the clusterer’s labels and branch labels by assigning
central points with a noise branch label to the geodesically closest
branch root. Only changing points with a noise branch label ensures that
points cannot move to a sibling branch in the branch hierarchy.
Parameters:
clusterer (FLASC) – A clustering object that has been fit to data.
branch_centrality_vectors (list[array (n_samples, n_branches)]) – The centrality values from the centroids of a cluster’s branches.
None if the cluster has two or fewer branches.
Returns:
labels – Updated cluster+branch labels for each point. Noisy samples are
given the label -1.
Return type:
np.ndarray, shape (n_samples, )
cluster_labels (np.ndarray, shape (n_samples, )) – Updated cluster labels for each point. Noisy samples are given
the label -1.
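The relabelling step above reduces to an argmax over per-branch values for the noise-labelled points only. The sketch below uses a hypothetical per-branch centrality matrix in place of the library's geodesic depths:

```python
# Sketch of the relabelling idea: points with a noise branch label (-1) get
# the branch whose root they are closest to. The centrality matrix here is
# made up; the library derives it from geodesic traversal depths.
import numpy as np

branch_labels = np.array([0, 1, -1, -1])
# Row i: centrality of point i as seen from each branch root (illustrative).
centrality_from_roots = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.7, 0.3],   # closest to branch 0's root
    [0.1, 0.6],   # closest to branch 1's root
])

updated = branch_labels.copy()
noise = branch_labels == -1
updated[noise] = centrality_from_roots[noise].argmax(axis=1)
```

Because only the -1 entries are touched, points that already carry a branch label can never jump to a sibling branch, matching the guarantee described above.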