API Reference

flasc - a Python implementation of FLASC.

FLASC* is a branch-aware clustering algorithm that builds upon hdbscan to detect branching structures within clusters. The algorithm returns a labelling that separates noise points from clusters and branches from each other. In addition, the single-linkage and condensed linkage hierarchies are provided for both the clustering and the branch detection stages.

class flasc.FLASC(min_cluster_size=5, min_branch_size=None, min_samples=None, metric='minkowski', p=2, alpha=1.0, algorithm='best', leaf_size=40, approx_min_span_tree=True, cluster_selection_method='eom', allow_single_cluster=False, cluster_selection_epsilon=0.0, max_cluster_size=0, allow_single_branch=False, branch_detection_method='full', branch_selection_method='eom', branch_selection_persistence=0.0, max_branch_size=0, label_sides_as_branches=False, override_cluster_labels=None, override_cluster_probabilities=None, memory=Memory(location=None), num_jobs=None, **kwargs)

Performs hdbscan clustering with a flare-detection post-processing step.

FLASC - Flare-Sensitive Clustering. Performs hdbscan clustering [1]_ with a post-processing step to detect branches within individual clusters. For each cluster, a graph is constructed connecting the data points based on their mutual reachability distances. Each edge is given a centrality value based on how far it lies from the cluster’s center. Then, the edges are clustered as if that centrality was a distance, progressively removing the ‘center’ of each cluster and seeing how many branches remain.

Parameters:
  • X (array of shape (n_samples, n_features), or array of shape (n_samples, n_samples)) – A feature array, or array of distances between samples if metric='precomputed'.

  • min_cluster_size (int, optional (default=5)) – The minimum number of samples in a group for that group to be considered a cluster; groupings smaller than this size will be left as noise.

  • min_branch_size (int, optional (default=None)) – The minimum number of samples in a group for that group to be considered a branch; groupings smaller than this size will be seen as points falling out of a branch. Defaults to min_cluster_size.

  • min_samples (int, optional (default=None)) – The number of samples in a neighborhood for a point to be considered as a core point. This includes the point itself. Defaults to the min_cluster_size.

  • metric (str or callable, optional (default='minkowski')) – The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.pairwise_distances for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square.

  • p (int, optional (default=2)) – p value to use if using the minkowski metric.

  • alpha (float, optional (default=1.0)) – A distance scaling parameter as used in robust single linkage. See [2]_ for more information.

  • algorithm (str, optional (default='best')) –

    Exactly which algorithm to use; hdbscan has variants specialised for different characteristics of the data. By default this is set to best which chooses the “best” algorithm given the nature of the data. You can force other options if you believe you know better. Options are:

    • best

    • generic

    • prims_kdtree

    • prims_balltree

    • boruvka_kdtree

    • boruvka_balltree

  • leaf_size (int, optional (default=40)) – Leaf size for trees responsible for fast nearest neighbour queries.

  • approx_min_span_tree (bool, optional (default=True)) – Whether to accept an only approximate minimum spanning tree. For some algorithms this can provide a significant speedup, but the resulting clustering may be of marginally lower quality. If you are willing to sacrifice speed for correctness you may want to explore this; in general this should be left at the default True.

  • cluster_selection_method (str, optional (default='eom')) –

    The method used to select clusters from the condensed tree. The standard approach for FLASC is to use an Excess of Mass algorithm to find the most persistent clusters. Alternatively you can instead select the clusters at the leaves of the tree – this provides the most fine grained and homogeneous clusters. Options are:

    • eom

    • leaf

  • allow_single_cluster (bool, optional (default=False)) – By default HDBSCAN* will not produce a single cluster; setting this to True will override this and allow single-cluster results in the case that you feel this is a valid result for your dataset.

  • cluster_selection_epsilon (float, optional (default=0.0)) – A distance threshold. Clusters below this value will be merged. See [3]_ for more information. Note that this should not be used if we want to predict the cluster labels for new points in future (e.g. using approximate_predict()), as that function is not aware of this argument.

  • max_cluster_size (int, optional (default=0)) – A limit to the size of clusters returned by the eom algorithm. Has no effect when using leaf clustering (where clusters are usually small regardless) and can also be overridden in rare cases by a high value for cluster_selection_epsilon. Note that this should not be used if we want to predict the cluster labels for new points in future (e.g. using approximate_predict()), as the approximate_predict function is not aware of this argument.

  • allow_single_branch (bool, optional (default=False)) – Analogous to allow_single_cluster. Note that, depending on label_sides_as_branches, FLASC* requires at least 3 branches to exist in a cluster before they are incorporated in the final labelling.

  • branch_detection_method (str, optional (default='full')) –

    Determines which graph is constructed to detect branches with. Valid values are, ordered by increasing computation cost and decreasing sensitivity to noise:

    • core: Contains the edges that connect each point to all other points within a mutual reachability distance lower than or equal to the point’s core distance. This is the cluster’s subgraph of the k-NN graph over the entire data set (with k = min_samples).

    • full: Contains all edges between points in each cluster with a mutual reachability distance lower than or equal to the distance of the most-distant point in each cluster. These graphs represent the 0-dimensional simplicial complex of each cluster at the first point in the filtration where they contain all their points.

  • branch_selection_persistence (float, optional (default=0.0)) – A centrality persistence threshold. Branches with a persistence below this value will be merged. See [3]_ for more information. Note that this should not be used if we want to predict the cluster labels for new points in future (e.g. using approximate_predict()), as that function is not aware of this argument.

  • branch_selection_method (str, optional (default='eom')) –

    The method used to select branches from the cluster’s condensed tree. The standard approach for FLASC* is to use the eom approach. Options are:

    • eom

    • leaf

  • max_branch_size (int, optional (default=0)) – A limit to the size of clusters returned by the eom algorithm. Has no effect when using leaf clustering (where clusters are usually small regardless). Note that this should not be used if we want to predict the cluster labels for new points in future (e.g. using approximate_predict()), as that function is not aware of this argument.

  • label_sides_as_branches (bool, optional (default=False)) – When this flag is False, branches are only labelled for clusters with at least three branches (i.e., at least y-shapes). Clusters with only two branches represent l-shapes. The two branches describe the cluster’s outsides growing towards each other. Enabling this flag separates these branches from each other in the produced labelling.

  • override_cluster_labels (np.ndarray, optional (default=None)) –

    Override the HDBSCAN* clustering to specify your own grouping with a numpy array containing a cluster label for each data point. Negative values will be interpreted as noise points. When the parameter is not set to None, core distances are computed over all data points, minimum spanning trees and the branches are computed per cluster. Consequently, the manually specified clusters do not have to form neatly separable connected components in the minimum spanning tree over all the data points.

    Because the clustering step is skipped, some of the attributes and the approximate_predict() function will be unavailable: cluster_persistence_, condensed_tree_, single_linkage_tree_, min_spanning_tree_, and cluster_exemplars_.

  • override_cluster_probabilities (np.ndarray, optional (default=None)) – Specifying a non-None value for this parameter is only valid when override_cluster_labels is used. In that case, this parameter controls the data point cluster membership probabilities. When this parameter is None, a default 1.0 probability is used for all points.

  • memory (instance of joblib.Memory or str, optional) – Used to cache the output of the computation of the tree. By default, no caching is done. If a string is given, it is the path to the caching directory.

  • num_jobs (int, optional (default=None)) – Number of parallel jobs to run in core distance computations and branch detection step. For num_jobs below -1, (n_cpus + 1 + num_jobs) are used. By default, the algorithm tries to estimate whether the given input is large enough for multi-processing to have a benefit. If so, 4 processes are started, otherwise only the main process is used. When a num_jobs value is given, that number of jobs is used regardless of the input size.

  • **kwargs (optional) – Additional arguments passed to hdbscan() or the distance metric.

labels_

Cluster+branch labels for each point. Noisy samples are given the label -1.

Type:

np.ndarray, shape (n_samples, )

probabilities_

Cluster+branch membership strengths for each point. Noisy samples are assigned 0.

Type:

np.ndarray, shape (n_samples, )

cluster_labels_

Cluster labels for each point. Noisy samples are given the label -1.

Type:

np.ndarray, shape (n_samples, )

cluster_probabilities_

Cluster membership strengths for each point. Noisy samples are assigned 0.

Type:

np.ndarray, shape (n_samples, )

cluster_persistences_

A score of how persistent each cluster is. A score of 1.0 represents a perfectly stable cluster that persists over all distance scales, while a score of 0.0 represents a perfectly ephemeral cluster. These scores gauge the relative coherence of the clusters output by the algorithm. Not available when override_cluster_labels is used.

Type:

array, shape (n_clusters, )

branch_labels_

Branch labels for each point. Noisy samples are given the label -1.

Type:

np.ndarray, shape (n_samples, )

branch_probabilities_

Branch membership strengths for each point. Noisy samples are assigned 0.

Type:

np.ndarray, shape (n_samples, )

branch_persistences_

A branch persistence for each cluster produced during the branch detection step.

Type:

tuple (n_clusters)

condensed_tree_

The condensed tree hierarchy used to generate clusters. Not available when override_cluster_labels is used.

Type:

CondensedTree

single_linkage_tree_

The single linkage tree produced during clustering in scipy hierarchical clustering format (see http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html). Not available when override_cluster_labels is used.

Type:

SingleLinkageTree

min_spanning_tree_

The minimum spanning tree as an edgelist. Not available when override_cluster_labels is used.

Type:

MinimumSpanningTree

cluster_approximation_graph_

The graphs used to detect branches in each cluster as an ApproximationGraph. Can be converted to a networkx graph, pandas data frame, or a list of numpy array edgelists. Points are labelled by their row-index into the input data. The edges contained in the graph depend on the branch_detection_method:

  • core: Contains the edges that connect each point to all other points in a cluster within a mutual reachability distance lower than or equal to the point’s core distance. This is an extension of the minimum spanning tree introducing only edges with equal distances. The reachability distance introduces num_points * min_samples of such edges.

  • full: Contains all edges between points in each cluster with a mutual reachability distance lower than or equal to the distance of the most-distant point in each cluster. These graphs represent the 0-dimensional simplicial complex of each cluster at the first point in the filtration where they contain all their points.

Type:

ApproximationGraph

cluster_condensed_trees

Condensed hierarchies for each cluster produced during the branch detection step. Data points are numbered with in-cluster ids.

Type:

list[CondensedTree]

cluster_linkage_trees_

Single linkage trees for each cluster produced during the branch detection step, in the scipy hierarchical clustering format. (see http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html). Data points are numbered with in-cluster ids.

Type:

list[SingleLinkageTree]

cluster_centralities_

Centrality values for each point in a cluster. Overemphasizes points’ eccentricity within the cluster as the values are based on minimum spanning trees that do not contain the equally distanced edges resulting from the mutual reachability distance.

Type:

np.ndarray, shape (n_samples, )

cluster_points_

The data point row indices for each detected cluster: cluster_points_[l] = np.where(cluster_labels == l)[0]

Type:

list (n_clusters)
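The relation above can be checked with plain numpy (the labels below are hypothetical, just to show the indexing):

```python
import numpy as np

# Hypothetical cluster labels for six points; -1 marks noise.
cluster_labels = np.array([0, 0, 1, -1, 1, 0])

# cluster_points_[l] holds the row indices of the points in cluster l.
cluster_points = [
    np.where(cluster_labels == l)[0]
    for l in range(cluster_labels.max() + 1)
]
```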

cluster_exemplars_

A list of exemplar points for clusters. Since HDBSCAN supports arbitrary shapes for clusters we cannot provide a single cluster exemplar per cluster. Instead a list is returned with each element of the list being a numpy array of exemplar points for a cluster – these points are the “most representative” points of the cluster. Not available when override_cluster_labels is used or a precomputed distance matrix is given as input.

Type:

list

branch_exemplars_

A list with exemplar points for the branches in the clusters. A cluster’s item is empty if it does not have selected branches. For clusters with selected branches, a list with a numpy array of exemplar points for each selected branch is given.

Type:

list (n_clusters)

relative_validity_

HDBSCAN’s fast approximation of the Density Based Cluster Validity (DBCV) score [4] on FLASC’s labelling. It may only be used to compare results across different choices of hyper-parameters and is therefore only a relative score.

Type:

float

hdbscan_

An HDBSCAN clusterer object fitted to the data. Can be used to compute outlier scores and cluster exemplars. Not available when override_cluster_labels is used.

Type:

HDBSCAN

References

property branch_exemplars_

See FLASC for documentation.

property cluster_approximation_graph_

See FLASC for documentation.

property cluster_condensed_trees_

See FLASC for documentation.

property cluster_exemplars_

See FLASC for documentation.

property cluster_linkage_trees_

See FLASC for documentation.

property condensed_tree_

See FLASC for documentation.

fit(X: ndarray, y=None)

Performs the branch-aware clustering.

Parameters:
  • X (array of shape (n_samples, n_features), or array of shape (n_samples, n_samples)) – A feature array, or array of distances between samples if metric='precomputed'.

  • y (Not used)

Returns:

self – Returns self

Return type:

object

fit_predict(X, y=None)

Performs clustering on X and returns cluster labels.

Parameters:
  • X (array of shape (n_samples, n_features), or array of shape (n_samples, n_samples)) – A feature array, or array of distances between samples if metric='precomputed'.

  • y (not used)

Returns:

y – cluster labels.

Return type:

np.ndarray, shape (n_samples, )

property hdbscan_

See FLASC for documentation.

property minimum_spanning_tree_

See FLASC for documentation.

property relative_validity_

See FLASC for documentation.

property single_linkage_tree_

See FLASC for documentation.

weighted_centroid(label_id, data=None)

Provides an approximate representative point for a given branch. Note that this technique assumes a euclidean metric for speed of computation. For more general metrics use the weighted_medoid() method which is slower, but can work with the metric the model trained with.

Parameters:
  • label_id (int) – The id of the cluster to compute a centroid for.

  • data (np.ndarray (n_samples, n_features), optional (default=None)) – A dataset to use instead of the raw data that was clustered on.

Returns:

centroid – A representative centroid for cluster label_id.

Return type:

array of shape (n_features,)

weighted_cluster_centroid(cluster_id)

Wraps hdbscan.HDBSCAN.weighted_cluster_centroid().

Provides an approximate representative point for a given cluster. Note that this technique assumes a euclidean metric for speed of computation. For more general metrics use the weighted_cluster_medoid() method which is slower, but can work with the metric the model trained with.

Parameters:

cluster_id (int) – The id of the cluster to compute a centroid for.

Returns:

centroid – A representative centroid for cluster cluster_id.

Return type:

array of shape (n_features,)

weighted_cluster_medoid(cluster_id)

Wraps hdbscan.HDBSCAN.weighted_cluster_medoid().

Provides an approximate representative point for a given cluster. Note that this technique can be very slow and memory intensive for large clusters. For faster results use the weighted_cluster_centroid() method which is faster, but assumes a euclidean metric.

Parameters:

cluster_id (int) – The id of the cluster to compute a medoid for.

Returns:

centroid – A representative medoid for cluster cluster_id.

Return type:

array of shape (n_features,)

weighted_medoid(label_id, data=None)

Provides an approximate representative point for a given branch.

Note that this technique can be very slow and memory intensive for large clusters. For faster results use the weighted_centroid() method which is faster, but assumes a euclidean metric.

Parameters:
  • label_id (int) – The id of the cluster to compute a medoid for.

  • data (np.ndarray (n_samples, n_features), optional (default=None)) – A dataset to use instead of the raw data that was clustered on.

Returns:

centroid – A representative medoid for cluster label_id.

Return type:

array of shape (n_features,)
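To illustrate the difference between the two notions of a representative point, here is a numpy-only sketch of a probability-weighted centroid versus a weighted medoid (toy points and weights; this is conceptual, not the library’s implementation):

```python
import numpy as np

# Toy points belonging to one branch, with membership probabilities.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
weights = np.array([1.0, 0.5, 0.5])

# A weighted centroid is the probability-weighted mean
# (assumes a euclidean metric, as noted above).
centroid = np.average(points, axis=0, weights=weights)

# A weighted medoid instead picks the member point minimising the
# weighted sum of distances to all other members, so it works with
# any metric but needs a pairwise distance matrix.
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
medoid = points[np.argmin(dists @ weights)]
```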

flasc.flasc(X, min_cluster_size=5, min_branch_size=None, min_samples=None, metric='minkowski', p=2, alpha=1.0, algorithm='best', leaf_size=40, approx_min_span_tree=True, cluster_selection_method='eom', allow_single_cluster=False, cluster_selection_epsilon=0.0, max_cluster_size=0, allow_single_branch=False, branch_detection_method='full', branch_selection_method='eom', branch_selection_persistence=0.0, max_branch_size=0, label_sides_as_branches=False, override_cluster_labels=None, override_cluster_probabilities=None, memory=Memory(location=None), num_jobs=None, **kwargs)

Performs FLASC clustering with a flare-detection post-processing step.

FLASC - Flare-Sensitive Clustering. Performs hdbscan clustering [1]_ with a post-processing step to detect branches within individual clusters. For each cluster, a graph is constructed connecting the data points based on their mutual reachability distances. Each edge is given a centrality value based on how far it lies from the cluster’s center. Then, the edges are clustered as if that centrality was a distance, progressively removing the ‘center’ of each cluster and seeing how many branches remain.

Parameters:
  • X (array of shape (n_samples, n_features), or array of shape (n_samples, n_samples)) – A feature array, or array of distances between samples if metric='precomputed'.

  • min_cluster_size (int, optional (default=5)) – The minimum number of samples in a group for that group to be considered a cluster; groupings smaller than this size will be left as noise.

  • min_branch_size (int, optional (default=None)) – The minimum number of samples in a group for that group to be considered a branch; groupings smaller than this size will be seen as points falling out of a branch. Defaults to min_cluster_size.

  • min_samples (int, optional (default=None)) – The number of samples in a neighborhood for a point to be considered as a core point. This includes the point itself. Defaults to the min_cluster_size.

  • metric (str or callable, optional (default='minkowski')) – The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.pairwise_distances for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square.

  • p (int, optional (default=2)) – p value to use if using the minkowski metric.

  • alpha (float, optional (default=1.0)) – A distance scaling parameter as used in robust single linkage. See [2]_ for more information.

  • algorithm (str, optional (default='best')) –

    Exactly which algorithm to use; hdbscan has variants specialised for different characteristics of the data. By default this is set to best which chooses the “best” algorithm given the nature of the data. You can force other options if you believe you know better. Options are:

    • best

    • generic

    • prims_kdtree

    • prims_balltree

    • boruvka_kdtree

    • boruvka_balltree

  • leaf_size (int, optional (default=40)) – Leaf size for trees responsible for fast nearest neighbour queries.

  • approx_min_span_tree (bool, optional (default=True)) – Whether to accept an only approximate minimum spanning tree. For some algorithms this can provide a significant speedup, but the resulting clustering may be of marginally lower quality. If you are willing to sacrifice speed for correctness you may want to explore this; in general this should be left at the default True.

  • cluster_selection_method (str, optional (default='eom')) –

    The method used to select clusters from the condensed tree. The standard approach for FLASC is to use an Excess of Mass algorithm to find the most persistent clusters. Alternatively you can instead select the clusters at the leaves of the tree – this provides the most fine grained and homogeneous clusters. Options are:

    • eom

    • leaf

  • allow_single_cluster (bool, optional (default=False)) – By default FLASC will not produce a single cluster; setting this to True will override this and allow single-cluster results in the case that you feel this is a valid result for your dataset.

  • cluster_selection_epsilon (float, optional (default=0.0)) – A distance threshold. Clusters below this value will be merged. See [3]_ for more information. Note that this should not be used if we want to predict the cluster labels for new points in future (e.g. using approximate_predict), as the approximate_predict function is not aware of this argument.

  • max_cluster_size (int, optional (default=0)) – A limit to the size of clusters returned by the eom algorithm. Has no effect when using leaf clustering (where clusters are usually small regardless) and can also be overridden in rare cases by a high value for cluster_selection_epsilon. Note that this should not be used if we want to predict the cluster labels for new points in future (e.g. using approximate_predict), as the approximate_predict function is not aware of this argument.

  • allow_single_branch (bool, optional (default=False)) – Analogous to allow_single_cluster. Note that, depending on label_sides_as_branches, FLASC requires at least 3 branches to exist in a cluster before they are incorporated in the final labelling.

  • branch_detection_method (str, optional (default='full')) –

    Determines which graph is constructed to detect branches with. Valid values are, ordered by increasing computation cost and decreasing sensitivity to noise:

    • core: Contains the edges that connect each point to all other points within a mutual reachability distance lower than or equal to the point’s core distance. This is the cluster’s subgraph of the k-NN graph over the entire data set (with k = min_samples).

    • full: Contains all edges between points in each cluster with a mutual reachability distance lower than or equal to the distance of the most-distant point in each cluster. These graphs represent the 0-dimensional simplicial complex of each cluster at the first point in the filtration where they contain all their points.

  • branch_selection_method (str, optional (default='eom')) –

    The method used to select branches from the cluster’s condensed tree. The standard approach for FLASC is to use the eom approach. Options are:

    • eom

    • leaf

  • branch_selection_persistence (float, optional (default=0.0)) – An eccentricity persistence threshold. Branches with a persistence below this value will be merged. See [3]_ for more information. Note that this should not be used if we want to predict the cluster labels for new points in future (e.g. using approximate_predict()), as that function is not aware of this argument.

  • max_branch_size (int, optional (default=0)) – A limit to the size of clusters returned by the eom algorithm. Has no effect when using leaf clustering (where clusters are usually small regardless). Note that this should not be used if we want to predict the cluster labels for new points in future (e.g. using approximate_predict()), as that function is not aware of this argument.

  • label_sides_as_branches (bool, optional (default=False)) – When this flag is False, branches are only labelled for clusters with at least three branches (i.e., at least y-shapes). Clusters with only two branches represent l-shapes. The two branches describe the cluster’s outsides growing towards each other. Enabling this flag separates these branches from each other in the produced labelling.

  • override_cluster_labels (np.ndarray, optional (default=None)) –

    Override the FLASC clustering to specify your own grouping with a numpy array containing a cluster label for each data point. Negative values will be interpreted as noise points. When the parameter is not set to None, core distances are computed over all data points, minimum spanning trees and the branches are computed per cluster. Consequently, the manually specified clusters do not have to form neatly separable connected components in the minimum spanning tree over all the data points.

    Because the clustering step is skipped, some of the output variables and the approximate_predict() function will be unavailable: cluster_persistence, condensed_tree, single_linkage_tree, and min_spanning_tree.

  • override_cluster_probabilities (np.ndarray, optional (default=None)) – Specifying a non-None value for this parameter is only valid when override_cluster_labels is used. In that case, this parameter controls the data point cluster membership probabilities. When this parameter is None, a default 1.0 probability is used for all points.

  • memory (instance of joblib.Memory or str, optional) – Used to cache the output of the computation of the tree. By default, no caching is done. If a string is given, it is the path to the caching directory.

  • num_jobs (int, optional (default=None)) – Number of parallel jobs to run in core distance computations and branch detection step. For num_jobs below -1, (n_cpus + 1 + num_jobs) are used. By default, the algorithm tries to estimate whether the given input is large enough for multi-processing to have a benefit. If so, 4 processes are started, otherwise only the main process is used. When a num_jobs value is given, that number of jobs is used regardless of the input size.

  • **kwargs (optional) – Additional arguments passed to hdbscan() or the distance metric.

Returns:

  • labels (np.ndarray, shape (n_samples, )) – Cluster+branch labels for each point. Noisy samples are given the label -1.

  • probabilities (np.ndarray, shape (n_samples, )) – Cluster+branch membership strengths for each point. Noisy samples are assigned 0.

  • cluster_labels (np.ndarray, shape (n_samples, )) – Cluster labels for each point. Noisy samples are given the label -1.

  • cluster_probabilities (np.ndarray, shape (n_samples, )) – Cluster membership strengths for each point. Noisy samples are assigned 0.

  • cluster_persistence (array, shape (n_clusters, )) – A score of how persistent each cluster is. A score of 1.0 represents a perfectly stable cluster that persists over all distance scales, while a score of 0.0 represents a perfectly ephemeral cluster. These scores gauge the relative coherence of the clusters output by the algorithm. Not available when override_cluster_labels is used.

  • branch_labels (np.ndarray, shape (n_samples, )) – Branch labels for each point. Noisy samples are given the label -1.

  • branch_probabilities (np.ndarray, shape (n_samples, )) – Branch membership strengths for each point. Noisy samples are assigned 0.

  • branch_persistences (tuple (n_clusters)) – A branch persistence (eccentricity range) for each detected branch.

  • condensed_tree (record array) – The condensed cluster hierarchy used to generate clusters. Not available when override_cluster_labels is used.

  • single_linkage_tree (np.ndarray, shape (n_samples - 1, 4)) – The single linkage tree produced during clustering in scipy hierarchical clustering format (see http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html). Not available when override_cluster_labels is used.

  • min_spanning_tree (np.ndarray, shape (n_samples - 1, 3)) – The minimum spanning tree as an edgelist. Not available when override_cluster_labels is used.

  • cluster_approximation_graphs (tuple (n_clusters)) – The graphs used to detect branches in each cluster stored as a numpy array with four columns: source, target, centrality, mutual reachability distance. Points are labelled by their row-index into the input data. The edges contained in the graphs depend on the branch_detection_method:

    • core: Contains the edges that connect each point to all other points in a cluster within a mutual reachability distance lower than or equal to the point’s core distance. This is an extension of the minimum spanning tree introducing only edges with equal distances. The reachability distance introduces num_points * min_samples of such edges.

    • full: Contains all edges between points in each cluster with a mutual reachability distance lower than or equal to the distance of the most-distant point in each cluster. These graphs represent the 0-dimensional simplicial complex of each cluster at the first point in the filtration where they contain all their points.

  • cluster_condensed_trees (tuple (n_clusters)) – A condensed branch hierarchy for each cluster produced during the branch detection step. Data points are numbered with in-cluster ids.

  • cluster_linkage_trees (tuple (n_clusters)) – A single linkage tree for each cluster produced during the branch detection step, in the scipy hierarchical clustering format. (see http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html). Data points are numbered with in-cluster ids.

  • cluster_centralities (np.ndarray, shape (n_samples, )) – Centrality values for each point in a cluster. Overemphasizes points’ eccentricity within the cluster as the values are based on minimum spanning trees that do not contain the equally distanced edges resulting from the mutual reachability distance.

  • cluster_points (list (n_clusters)) – The data point row indices for each cluster.

References

Predictions

Predictions functions for FLASC.

See also hdbscan.prediction.

flasc.prediction.approximate_predict(clusterer, points_to_predict)

Predict the cluster label of new points. The returned labels will be those of the original clustering found by clusterer, and therefore are not (necessarily) the cluster labels that would be found by clustering the original data combined with points_to_predict, hence the ‘approximate’ label.

If you simply wish to assign new points to an existing clustering in the ‘best’ way possible, this is the function to use. If you want to predict how points_to_predict would cluster with the original data under FLASC the most efficient existing approach is to simply recluster with the new point(s) added to the original dataset.

Parameters:
  • clusterer (FLASC) – A clustering object that has been fit to vector input data.

  • points_to_predict (array, or array-like (n_samples, n_features)) – The new data points to predict cluster labels for. They should have the same dimensionality as the original dataset over which clusterer was fit.

Returns:

  • labels (array (n_samples,)) – The predicted labels of the points_to_predict

  • probabilities (array (n_samples,)) – The soft cluster scores for each of the points_to_predict

  • cluster_labels (array (n_samples,)) – The predicted cluster labels of the points_to_predict

  • cluster_probabilities (array (n_samples,)) – The soft cluster scores for each of the points_to_predict

  • branch_labels (array (n_samples,)) – The predicted branch labels of the points_to_predict

  • branch_probabilities (array (n_samples,)) – The soft branch scores for each of the points_to_predict

flasc.prediction.branch_centrality_vectors(clusterer)

Predict soft branch-membership vectors for all points in the clusters with more than two detected branches. Computes geodesic traversal depth from branch-roots to all points in the cluster.

Parameters:

clusterer (FLASC) – A clusterer object that has been fit to vector input data.

Returns:

centrality_vectors – The centrality value of point i in cluster c of the original dataset from the root of branch j is in membership_vectors[c][i, j].

Return type:

list[array (n_samples, n_branches)]

flasc.prediction.branch_membership_from_centrality(branch_centrality_vectors, temperature=1.0)

Scales the branch centrality vectors to act as probabilities using a softmax.

\[\mathbf{m} = \frac{ e^{\mathbf{c} / t} }{ \sum_{i}{e^{c_i / t}} }\]

where \(\mathbf{m}\) is the scaled membership vector and \(\mathbf{c}\) is the branch centrality vector.

Parameters:

branch_centrality_vectors (list[array (n_samples, n_branches)]) – The centrality values from the centroids of a cluster’s branches. None if the cluster has two or fewer branches.

temperature (float, optional (default=1.0)) – A scaling factor for the softmax function. A higher temperature makes the distribution more uniform, a lower temperature makes the distribution more peaky.

Returns:

scaled_branch_memberships – The probabilities of a point belonging to the cluster’s branches. None if the cluster has two or fewer branches.

Return type:

list[array (n_samples, n_branches)]
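The softmax scaling above can be reproduced with plain numpy (hypothetical centrality values; this mirrors the formula, not the library internals):

```python
import numpy as np

def softmax_membership(centrality, temperature=1.0):
    # m = exp(c / t) / sum_i exp(c_i / t), applied per point across branches.
    scaled = np.exp(centrality / temperature)
    return scaled / scaled.sum(axis=1, keepdims=True)

# Two points in one cluster, two branches.
centrality = np.array([[1.0, 1.0], [2.0, 0.0]])
membership = softmax_membership(centrality)
```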

flasc.prediction.update_labels_with_branch_centrality(clusterer, branch_centrality_vectors)

Updates the clusterer’s labels and branch labels by assigning central points that have a noise branch label to the geodesically closest branch root. Only changing points with a noise branch label makes sure that points cannot move to a sibling branch in the branch hierarchy.

Parameters:
  • clusterer (FLASC) – A clustering object that has been fit to data.

  • branch_centrality_vectors (list[array (n_samples, n_branches)]) – The centrality values from the centroids of a cluster’s branches. None if the cluster has two or fewer branches.

Returns:

labels – Updated cluster+branch labels for each point. Noisy samples are given the label -1.

Return type:

np.ndarray, shape (n_samples, )

cluster_labels (np.ndarray, shape (n_samples, )) – Updated cluster labels for each point. Noisy samples are given the label -1.