FLASC* is a branch-aware clustering algorithm that builds upon
hdbscan to detect branching structures within clusters. The
algorithm returns a labelling that separates noise points from clusters and
branches from each other. In addition, the single-linkage and
condensed-linkage hierarchies are provided for both the clustering and the
branch detection stages.
Performs hdbscan clustering with a flare-detection post-processing step.
FLASC - Flare-Sensitive Clustering. Performs hdbscan clustering
[1]_ with a post-processing step to detect branches within individual
clusters. For each cluster, a graph is constructed connecting the data
points based on their mutual reachability distances. Each edge is given a
centrality value based on how far it lies from the cluster’s center. Then,
the edges are clustered as if that centrality was a distance, progressively
removing the ‘center’ of each cluster and seeing how many branches remain.
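The mutual reachability distance underlying these graphs can be sketched in a few lines of numpy. This is an illustrative reimplementation, not the library's internal code; the function name and the tiny dataset are made up for the example:

```python
# Sketch of the mutual reachability distance that FLASC's cluster graphs
# are built on (illustrative only; the library computes this internally).
import numpy as np

def mutual_reachability(X, min_samples=2):
    """Pairwise mutual reachability distances for a small feature array."""
    # Euclidean distance matrix.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    # Core distance: distance to the min_samples-th neighbour (self included).
    core = np.sort(dist, axis=1)[:, min_samples - 1]
    # Mutual reachability: max(core_i, core_j, d_ij).
    return np.maximum(dist, np.maximum(core[:, None], core[None, :]))

X = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 2.0], [5.0, 0.0]])
mreach = mutual_reachability(X, min_samples=2)
```

Note how the isolated point at (5, 0) inflates its own core distance, which in turn inflates every mutual reachability distance involving it; this is what makes the construction robust to noise.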
Parameters:
X (array of shape (n_samples, n_features), or array of shape (n_samples, n_samples)) – A feature array, or array of distances between samples if
metric='precomputed'.
min_cluster_size (int, optional (default=5)) – The minimum number of samples in a group for that group to be
considered a cluster; groupings smaller than this size will be left
as noise.
min_branch_size (int, optional (default=None)) – The minimum number of samples in a group for that group to be
considered a branch; groupings smaller than this size will be seen as
points falling out of a branch. Defaults to the min_cluster_size.
min_samples (int, optional (default=None)) – The number of samples in a neighborhood for a point
to be considered as a core point. This includes the point itself.
Defaults to the min_cluster_size.
metric (str or callable, optional (default='minkowski')) – The metric to use when calculating distance between instances in a
feature array. If metric is a string or callable, it must be one of
the options allowed by metrics.pairwise.pairwise_distances for its
metric parameter.
If metric is “precomputed”, X is assumed to be a distance matrix and
must be square.
p (int, optional (default=2)) – p value to use if using the minkowski metric.
alpha (float, optional (default=1.0)) – A distance scaling parameter as used in robust single linkage.
See [2]_ for more information.
algorithm (str, optional (default='best')) – Exactly which algorithm to use; hdbscan has variants specialised
for different characteristics of the data. By default this is set
to best which chooses the “best” algorithm given the nature of
the data. You can force other options if you believe you know
better. Options are:
best
generic
prims_kdtree
prims_balltree
boruvka_kdtree
boruvka_balltree
leaf_size (int, optional (default=40)) – Leaf size for trees responsible for fast nearest
neighbour queries.
approx_min_span_tree (bool, optional (default=True)) – Whether to accept an only approximate minimum spanning tree.
For some algorithms this can provide a significant speedup, but
the resulting clustering may be of marginally lower quality.
If you are willing to sacrifice speed for correctness you may want
to explore this; in general this should be left at the default True.
cluster_selection_method (str, optional (default='eom')) – The method used to select clusters from the condensed tree. The
standard approach for FLASC is to use an Excess of Mass algorithm
to find the most persistent clusters. Alternatively you can instead
select the clusters at the leaves of the tree – this provides the
most fine grained and homogeneous clusters. Options are:
eom
leaf
allow_single_cluster (bool, optional (default=False)) – By default HDBSCAN* will not produce a single cluster; setting this
to True will override that behavior and allow single-cluster results in
the case that you feel this is a valid result for your dataset.
cluster_selection_epsilon (float, optional (default=0.0)) – A distance threshold. Clusters below this value will be merged.
See [3]_ for more information. Note that this should not be used
if we want to predict the cluster labels for new points in future
(e.g. using approximate_predict()), as that function is not aware
of this argument.
max_cluster_size (int, optional (default=0)) – A limit to the size of clusters returned by the eom algorithm.
Has no effect when using leaf clustering (where clusters are
usually small regardless) and can also be overridden in rare
cases by a high value for cluster_selection_epsilon. Note that
this should not be used if we want to predict the cluster labels
for new points in future
(e.g. using approximate_predict()), as
the approximate_predict function is not aware of this argument.
allow_single_branch (bool, optional (default=False)) – Analogous to allow_single_cluster. Note that depending on
label_sides_as_branches FLASC* requires at least 3 branches to
exist in a cluster before they are incorporated in the final labelling.
branch_detection_method (str, optional) – Determines which graph is constructed to detect branches with. Valid
values are, ordered by increasing computation cost and decreasing
sensitivity to noise:
- core: Contains the edges that connect each point to all other
points within a mutual reachability distance lower than or equal to
the point’s core distance. This is the cluster’s subgraph of the
k-NN graph over the entire data set (with k = min_samples).
- full: Contains all edges between points in each cluster with a
mutual reachability distance lower than or equal to the distance of
the most-distant point in each cluster. These graphs represent the
0-dimensional simplicial complex of each cluster at the first point in
the filtration where they contain all their points.
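The contrast between the two graphs can be illustrated on a small precomputed mutual-reachability matrix. The matrix values and core distances below are hypothetical, chosen only to show how many edges each method keeps:

```python
# Illustrative contrast between the two branch_detection_method graphs,
# given a small mutual-reachability matrix for one cluster (made-up values).
import numpy as np

mreach = np.array([
    [0.0, 1.0, 2.0, 3.0],
    [1.0, 0.0, 1.0, 2.0],
    [2.0, 1.0, 0.0, 1.0],
    [3.0, 2.0, 1.0, 0.0],
])
core = np.array([1.0, 1.0, 1.0, 1.0])  # assumed core distances

i, j = np.triu_indices(4, k=1)
# 'core': keep edge (a, b) when it lies within either endpoint's core distance.
core_edges = [(a, b) for a, b in zip(i, j)
              if mreach[a, b] <= max(core[a], core[b])]
# 'full': keep every edge up to the cluster's largest pairwise distance.
threshold = mreach[i, j].max()
full_edges = [(a, b) for a, b in zip(i, j) if mreach[a, b] <= threshold]
```

Here the core method keeps only the chain of short edges, while the full method keeps every pair, which is costlier but less sensitive to noise.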
branch_selection_persistence (float, optional (default=0.0)) – A centrality persistence threshold. Branches with a persistence below
this value will be merged. See [3]_ for more information. Note that this
should not be used if we want to predict the cluster labels for new
points in future (e.g. using approximate_predict), as the
approximate_predict() function
is not aware of this argument.
branch_selection_method (str, optional (default='eom')) – The method used to select branches from the cluster’s condensed tree.
The standard approach for FLASC* is to use the eom approach.
Options are:
eom
leaf
max_branch_size (int, optional (default=0)) – A limit to the size of clusters returned by the eom algorithm.
Has no effect when using leaf clustering (where clusters are
usually small regardless). Note that this should not be used if we
want to predict the cluster labels for new points in future (e.g. using
approximate_predict()), as that function is
not aware of this argument.
label_sides_as_branches (bool, optional (default=False),) – When this flag is False, branches are only labelled for clusters with at
least three branches (i.e., at least y-shapes). Clusters with only two
branches represent l-shapes. The two branches describe the cluster’s
outsides growing towards each other. Enabling this flag separates these
branches from each other in the produced labelling.
override_cluster_labels (np.ndarray, optional (default=None)) – Override the HDBSCAN* clustering to specify your own grouping with a
numpy array containing a cluster label for each data point. Negative
values will be interpreted as noise points. When the parameter is not set
to None, core distances are computed over all data points,
minimum spanning trees and the branches are computed per cluster.
Consequently, the manually specified clusters do not have to form
neatly separable connected components in the minimum spanning tree
over all the data points.
override_cluster_probabilities (np.ndarray, optional (default=None)) – Specifying a non-None value for this parameter is only valid when
override_cluster_labels is used. In that case, this parameter
controls the data point cluster membership probabilities. When this
parameter is None, a default 1.0 probability is used for all points.
memory (instance of joblib.Memory or str, optional) – Used to cache the output of the computation of the tree.
By default, no caching is done. If a string is given, it is the
path to the caching directory.
num_jobs (int, optional (default=None)) – Number of parallel jobs to run in core distance computations and branch
detection step. For num_jobs below -1, (n_cpus + 1 + num_jobs) are
used. By default, the algorithm tries to estimate whether the given input
is large enough for multi-processing to have a benefit. If so, 4 processes
are started, otherwise only the main process is used. When a num_jobs
value is given, that number of jobs is used regardless of the input size.
**kwargs (optional) – Additional arguments passed to hdbscan() or the
distance metric.
A score of how persistent each cluster is. A score of 1.0 represents
a perfectly stable cluster that persists over all distance scales,
while a score of 0.0 represents a perfectly ephemeral cluster. These
scores gauge the relative coherence of the clusters output by the
algorithm. Not available when override_cluster_labels is used.
The graphs used to detect branches in each cluster as an
ApproximationGraph. Can be converted to
a networkx graph, pandas data frame, or a list with numpy array-edgelists.
Points are labelled by their row-index into the input data. The edges
contained in the graph depend on the branch_detection_method:
- core: Contains the edges that connect each point to all other
points in a cluster within a mutual reachability distance lower than
or equal to the point’s core distance. This is an extension of the
minimum spanning tree introducing only edges with equal distances. The
reachability distance introduces num_points * min_samples such edges.
- full: Contains all edges between points in each cluster with a
mutual reachability distance lower than or equal to the distance of
the most-distant point in each cluster. These graphs represent the
0-dimensional simplicial complex of each cluster at the first point in
the filtration where they contain all their points.
Centrality values for each point in a cluster. Overemphasizes points’
eccentricity within the cluster as the values are based on minimum
spanning trees that do not contain the equally distanced edges resulting
from the mutual reachability distance.
A list of exemplar points for clusters. Since HDBSCAN supports
arbitrary shapes for clusters we cannot provide a single cluster
exemplar per cluster. Instead a list is returned with each element
of the list being a numpy array of exemplar points for a cluster –
these points are the “most representative” points of the cluster.
Not available when override_cluster_labels is used or a precomputed
distance matrix is given as input.
A list with exemplar points for the branches in the clusters. A cluster’s
item is empty if it does not have selected branches. For clusters with
selected branches, a list with a numpy array of exemplar points for each
selected branch is given.
HDBSCAN’s fast approximation of the Density Based Cluster Validity
(DBCV) score [4]_ on FLASC’s labelling. It may only be used to compare
results across different choices of hyper-parameters, as it is only a
relative score.
An HDBSCAN clusterer object fitted to the data. Can be used to compute
outlier scores and cluster exemplars. Not available when
override_cluster_labels is used.
X (array of shape (n_samples, n_features), or array of shape (n_samples, n_samples)) – A feature array, or array of distances between samples if
metric='precomputed'.
Performs clustering on X and returns cluster labels.
Parameters:
X (array of shape (n_samples, n_features), or array of shape (n_samples, n_samples)) – A feature array, or array of distances between samples if
metric='precomputed'.
Provides an approximate representative point for a given branch.
Note that this technique assumes a euclidean metric for speed of
computation. For more general metrics use the
weighted_medoid()
method which is slower, but can work with the metric the model trained
with.
Parameters:
label_id (int) – The id of the cluster to compute a centroid for.
data (np.ndarray (n_samples, n_features), optional (default=None)) – A dataset to use instead of the raw data that was clustered on.
Returns:
centroid – A representative centroid for cluster label_id.
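A weighted centroid of this kind is typically a membership-probability-weighted mean of the member points. The sketch below shows that computation; the function name and weights are illustrative, not the library's exact implementation:

```python
# A probability-weighted centroid: mean of the member points weighted by
# their membership probabilities (illustrative sketch, not library code).
import numpy as np

def weighted_centroid(points, probabilities):
    """Mean of points, weighted by membership probability."""
    w = np.asarray(probabilities, dtype=float)
    return (points * w[:, None]).sum(axis=0) / w.sum()

pts = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
probs = np.array([1.0, 1.0, 2.0])
centroid = weighted_centroid(pts, probs)
```

Because the mean is taken coordinate-wise, this only makes sense under a euclidean metric, which is exactly why the docs point to the medoid variant for other metrics.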
Provides an approximate representative point for a given cluster. Note
that this technique assumes a euclidean metric for speed of computation.
For more general metrics use the
weighted_cluster_medoid()
method which is slower, but can work with the metric the model trained
with.
Parameters:
cluster_id (int) – The id of the cluster to compute a centroid for.
Returns:
centroid – A representative centroid for cluster cluster_id.
Provides an approximate representative point for a given cluster.
Note that this technique can be very slow and memory intensive for
large clusters. For faster results use the
weighted_cluster_centroid()
method which is faster, but assumes a euclidean metric.
Parameters:
cluster_id (int) – The id of the cluster to compute a medoid for.
Returns:
centroid – A representative medoid for cluster cluster_id.
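A weighted medoid, by contrast, is an actual data point: the one minimising the probability-weighted sum of distances to the other members. Since it only needs pairwise distances, it works with any metric, at the cost of computing the full distance matrix. A small illustrative sketch (not the library's implementation):

```python
# A weighted medoid: the member point minimising the probability-weighted
# sum of distances to all other members (illustrative sketch).
import numpy as np

def weighted_medoid(points, probabilities):
    """Return the point with the smallest weighted distance sum."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))          # any metric would do here
    cost = (dist * probabilities[None, :]).sum(axis=1)
    return points[np.argmin(cost)]

pts = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
probs = np.array([1.0, 1.0, 1.0])
medoid = weighted_medoid(pts, probs)
```

The quadratic distance matrix is what makes this slow and memory-intensive for large clusters, as the docs warn.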
Provides an approximate representative point for a given branch.
Note that this technique can be very slow and memory intensive for
large clusters. For faster results use the
weighted_centroid()
method which is faster, but assumes a euclidean metric.
Parameters:
label_id (int) – The id of the cluster to compute a medoid for.
data (np.ndarray (n_samples, n_features), optional (default=None)) – A dataset to use instead of the raw data that was clustered on.
Returns:
centroid – A representative medoid for cluster label_id.
Performs FLASC clustering with a flare-detection post-processing step.
FLASC - Flare-Sensitive Clustering.
Performs hdbscan clustering [1]_ with a post-processing step to
detect branches within individual clusters. For each cluster, a graph is
constructed connecting the data points based on their mutual reachability
distances. Each edge is given a centrality value based on how far it lies
from the cluster’s center. Then, the edges are clustered as if that
centrality was a distance, progressively removing the ‘center’ of each
cluster and seeing how many branches remain.
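The "remove the centre, count what remains" idea can be made concrete with a toy example: strip away a cluster's most central points and count the connected components left over. This pure-python sketch uses a union-find over an edge list; it illustrates the intuition only, not the library's actual persistence-based procedure:

```python
# Toy illustration of branch detection: remove a cluster's most central
# points and count the connected components that remain (sketch only).

def count_branches(edges, eccentricity, keep_above):
    """Connected components among points farther than keep_above from the centre."""
    nodes = [p for p in eccentricity if eccentricity[p] > keep_above]
    parent = {p: p for p in nodes}

    def find(p):
        # Union-find root lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in edges:
        if a in parent and b in parent:
            parent[find(a)] = find(b)
    return len({find(p) for p in nodes})

# A y-shaped cluster: node 0 is the centre, three arms hang off it.
edges = [(0, 1), (1, 2), (0, 3), (3, 4), (0, 5), (5, 6)]
eccentricity = {0: 0.0, 1: 1.0, 2: 2.0, 3: 1.0, 4: 2.0, 5: 1.0, 6: 2.0}

whole = count_branches(edges, eccentricity, keep_above=-1.0)  # full cluster
arms = count_branches(edges, eccentricity, keep_above=0.5)    # centre removed
```

With the centre intact the cluster is one component; with it removed, the three arms fall apart, which is how branches become visible.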
Parameters:
X (array of shape (n_samples, n_features), or array of shape (n_samples, n_samples)) – A feature array, or array of distances between samples if
metric='precomputed'.
min_cluster_size (int, optional (default=5)) – The minimum number of samples in a group for that group to be
considered a cluster; groupings smaller than this size will be left
as noise.
min_branch_size (int, optional (default=None)) – The minimum number of samples in a group for that group to be
considered a branch; groupings smaller than this size will be seen as
points falling out of a branch. Defaults to the min_cluster_size.
min_samples (int, optional (default=None)) – The number of samples in a neighborhood for a point
to be considered as a core point. This includes the point itself.
Defaults to the min_cluster_size.
metric (str or callable, optional (default='minkowski')) – The metric to use when calculating distance between instances in a
feature array. If metric is a string or callable, it must be one of
the options allowed by metrics.pairwise.pairwise_distances for its
metric parameter.
If metric is “precomputed”, X is assumed to be a distance matrix and
must be square.
p (int, optional (default=2)) – p value to use if using the minkowski metric.
alpha (float, optional (default=1.0)) – A distance scaling parameter as used in robust single linkage.
See [2]_ for more information.
algorithm (str, optional (default='best')) – Exactly which algorithm to use; hdbscan has variants specialised
for different characteristics of the data. By default this is set
to best which chooses the “best” algorithm given the nature of
the data. You can force other options if you believe you know
better. Options are:
best
generic
prims_kdtree
prims_balltree
boruvka_kdtree
boruvka_balltree
leaf_size (int, optional (default=40)) – Leaf size for trees responsible for fast nearest
neighbour queries.
approx_min_span_tree (bool, optional (default=True)) – Whether to accept an only approximate minimum spanning tree.
For some algorithms this can provide a significant speedup, but
the resulting clustering may be of marginally lower quality.
If you are willing to sacrifice speed for correctness you may want
to explore this; in general this should be left at the default True.
cluster_selection_method (str, optional (default='eom')) – The method used to select clusters from the condensed tree. The
standard approach for FLASC is to use an Excess of Mass algorithm
to find the most persistent clusters. Alternatively you can instead
select the clusters at the leaves of the tree – this provides the
most fine grained and homogeneous clusters. Options are:
eom
leaf
allow_single_cluster (bool, optional (default=False)) – By default FLASC will not produce a single cluster; setting this
to True will override that behavior and allow single-cluster results in
the case that you feel this is a valid result for your dataset.
cluster_selection_epsilon (float, optional (default=0.0)) – A distance threshold. Clusters below this value will be merged.
See [3]_ for more information. Note that this should not be used
if we want to predict the cluster labels for new points in future
(e.g. using approximate_predict), as the approximate_predict function
is not aware of this argument.
max_cluster_size (int, optional (default=0)) – A limit to the size of clusters returned by the eom algorithm.
Has no effect when using leaf clustering (where clusters are
usually small regardless) and can also be overridden in rare
cases by a high value for cluster_selection_epsilon. Note that
this should not be used if we want to predict the cluster labels
for new points in future (e.g. using approximate_predict), as
the approximate_predict function is not aware of this argument.
allow_single_branch (bool, optional (default=False)) – Analogous to allow_single_cluster. Note that depending on
label_sides_as_branches FLASC requires at least 3 branches to
exist in a cluster before they are incorporated in the final labelling.
branch_detection_method (str, optional) – Determines which graph is constructed to detect branches with. Valid
values are, ordered by increasing computation cost and decreasing
sensitivity to noise:
- core: Contains the edges that connect each point to all other
points within a mutual reachability distance lower than or equal to
the point’s core distance. This is the cluster’s subgraph of the
k-NN graph over the entire data set (with k = min_samples).
- full: Contains all edges between points in each cluster with a
mutual reachability distance lower than or equal to the distance of
the most-distant point in each cluster. These graphs represent the
0-dimensional simplicial complex of each cluster at the first point in
the filtration where they contain all their points.
branch_selection_method (str, optional (default='eom')) – The method used to select branches from the cluster’s condensed tree.
The standard approach for FLASC is to use the eom approach.
Options are:
eom
leaf
branch_selection_persistence (float, optional (default=0.0)) – An eccentricity persistence threshold. Branches with a persistence below
this value will be merged. See [3]_ for more information. Note that this
should not be used if we want to predict the cluster labels for new
points in future (e.g. using approximate_predict), as the
approximate_predict() function is not aware of
this argument.
max_branch_size (int, optional (default=0)) – A limit to the size of clusters returned by the eom algorithm.
Has no effect when using leaf clustering (where clusters are
usually small regardless). Note that this should not be used if we
want to predict the cluster labels for new points in future (e.g. using
approximate_predict()), as that function is
not aware of this argument.
label_sides_as_branches (bool, optional (default=False),) – When this flag is False, branches are only labelled for clusters with at
least three branches (i.e., at least y-shapes). Clusters with only two
branches represent l-shapes. The two branches describe the cluster’s
outsides growing towards each other. Enabling this flag separates these
branches from each other in the produced labelling.
override_cluster_labels (np.ndarray, optional (default=None)) – Override the FLASC clustering to specify your own grouping with a
numpy array containing a cluster label for each data point. Negative
values will be interpreted as noise points. When the parameter is not set
to None, core distances are computed over all data points,
minimum spanning trees and the branches are computed per cluster.
Consequently, the manually specified clusters do not have to form
neatly separable connected components in the minimum spanning tree
over all the data points.
Because the clustering step is skipped, some of the output variables
and the approximate_predict() function will
be unavailable:
- cluster_persistence
- condensed_tree
- single_linkage_tree
- min_spanning_tree
override_cluster_probabilities (np.ndarray, optional (default=None)) – Specifying a non-None value for this parameter is only valid when
override_cluster_labels is used. In that case, this parameter
controls the data point cluster membership probabilities. When this
parameter is None, a default 1.0 probability is used for all points.
memory (instance of joblib.Memory or str, optional) – Used to cache the output of the computation of the tree.
By default, no caching is done. If a string is given, it is the
path to the caching directory.
num_jobs (int, optional (default=None)) – Number of parallel jobs to run in core distance computations and branch
detection step. For num_jobs below -1, (n_cpus + 1 + num_jobs) are
used. By default, the algorithm tries to estimate whether the given input
is large enough for multi-processing to have a benefit. If so, 4 processes
are started, otherwise only the main process is used. When a num_jobs
value is given, that number of jobs is used regardless of the input size.
**kwargs (optional) – Additional arguments passed to hdbscan() or the
distance metric.
Returns:
labels (np.ndarray, shape (n_samples, )) – Cluster+branch labels for each point. Noisy samples are given the
label -1.
probabilities (np.ndarray, shape (n_samples, )) – Cluster+branch membership strengths for each point. Noisy samples are
assigned 0.
cluster_labels (np.ndarray, shape (n_samples, )) – Cluster labels for each point. Noisy samples are given the label -1.
cluster_probabilities (np.ndarray, shape (n_samples, )) – Cluster membership strengths for each point. Noisy samples are
assigned 0.
cluster_persistence (array, shape (n_clusters, )) – A score of how persistent each cluster is. A score of 1.0 represents
a perfectly stable cluster that persists over all distance scales,
while a score of 0.0 represents a perfectly ephemeral cluster. These
scores gauge the relative coherence of the clusters output by the
algorithm. Not available when override_cluster_labels is used.
branch_labels (np.ndarray, shape (n_samples, )) – Branch labels for each point. Noisy samples are given the label -1.
branch_probabilities (np.ndarray, shape (n_samples, )) – Branch membership strengths for each point. Noisy samples are
assigned 0.
branch_persistences (tuple (n_clusters)) – A branch persistence (eccentricity range) for each detected branch.
condensed_tree (record array) – The condensed cluster hierarchy used to generate clusters.
Not available when override_cluster_labels is used.
single_linkage_tree (np.ndarray, shape (n_samples - 1, 4)) – The single linkage tree produced during clustering in scipy
hierarchical clustering format
(see http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html).
Not available when override_cluster_labels is used.
min_spanning_tree (np.ndarray, shape (n_samples - 1, 3)) – The minimum spanning tree as an edgelist. Not available when
override_cluster_labels is used.
cluster_approximation_graphs (tuple (n_clusters)) – The graphs used to detect branches in each cluster stored as a numpy
array with four columns: source, target, centrality, mutual reachability
distance. Points are labelled by their row-index into the input data.
The edges contained in the graphs depend on the branch_detection_method:
- core: Contains the edges that connect each point to all other
points in a cluster within a mutual reachability distance lower than
or equal to the point’s core distance. This is an extension of the
minimum spanning tree introducing only edges with equal distances. The
reachability distance introduces num_points * min_samples such
edges.
- full: Contains all edges between points in each cluster with a
mutual reachability distance lower than or equal to the distance of
the most-distant point in each cluster. These graphs represent the
0-dimensional simplicial complex of each cluster at the first point in
the filtration where they contain all their points.
cluster_condensed_trees (tuple (n_clusters)) – A condensed branch hierarchy for each cluster produced during the
branch detection step. Data points are numbered with in-cluster ids.
cluster_linkage_trees (tuple (n_clusters)) – A single linkage tree for each cluster produced during the branch
detection step, in the scipy hierarchical clustering format.
(see http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html).
Data points are numbered with in-cluster ids.
cluster_centralities (np.ndarray, shape (n_samples, )) – Centrality values for each point in a cluster. Overemphasizes points’
eccentricity within the cluster as the values are based on minimum
spanning trees that do not contain the equally distanced edges resulting
from the mutual reachability distance.
cluster_points (list (n_clusters)) – The data point row indices for each cluster.
Predict the cluster label of new points. The returned labels
will be those of the original clustering found by clusterer,
and therefore are not (necessarily) the cluster labels that would
be found by clustering the original data combined with
points_to_predict, hence the ‘approximate’ label.
If you simply wish to assign new points to an existing clustering
in the ‘best’ way possible, this is the function to use. If you
want to predict how points_to_predict would cluster with
the original data under FLASC the most efficient existing approach
is to simply recluster with the new point(s) added to the original dataset.
Parameters:
clusterer (FLASC) – A clustering object that has been fit to vector input data.
points_to_predict (array, or array-like (n_samples, n_features)) – The new data points to predict cluster labels for. They should
have the same dimensionality as the original dataset over which
clusterer was fit.
Returns:
labels (array (n_samples,)) – The predicted labels of the points_to_predict
probabilities (array (n_samples,)) – The soft cluster scores for each of the points_to_predict
cluster_labels (array (n_samples,)) – The predicted cluster labels of the points_to_predict
cluster_probabilities (array (n_samples,)) – The soft cluster scores for each of the points_to_predict
branch_labels (array (n_samples,)) – The predicted branch labels of the points_to_predict
branch_probabilities (array (n_samples,)) – The soft branch scores for each of the points_to_predict
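The intuition behind this kind of approximate prediction is label transfer from the fitted data: each new point inherits the label of the training point it is closest to. The real approximate_predict() works through the fitted trees and core distances; the numpy sketch below (with made-up data) shows only the nearest-neighbour intuition:

```python
# Conceptual sketch of approximate prediction: give each new point the
# label of its nearest already-clustered point. This is the intuition only,
# not what approximate_predict() actually computes.
import numpy as np

train = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
labels = np.array([0, 0, 1, 1])  # labels from a previous (hypothetical) fit

def predict_nearest(points):
    """Label new points by their nearest training point."""
    diff = points[:, None, :] - train[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    return labels[dist.argmin(axis=1)]

new_labels = predict_nearest(np.array([[0.1, 0.1], [5.1, 4.9]]))
```

This also makes clear why the result is only approximate: the new points never get a chance to change the clustering itself, which re-running FLASC on the combined data would allow.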
Predict soft branch-membership vectors for all points in the clusters
with more than two detected branches. Computes geodesic traversal depth
from branch-roots to all points in the cluster.
Parameters:
clusterer (FLASC) – A clusterer object that has been fit to vector input data.
Returns:
centrality_vectors – The centrality value of point i in cluster c of the original
dataset from the root of branch j is in
membership_vectors[c][i,j].
where \(\mathbf{m}\) is the scaled membership vector and
\(\mathbf{c}\) is the branch centrality vector.
Parameters:
branch_centrality_vectors (list[array (n_samples, n_branches)]) – The centrality values from the centroids of a cluster’s branches.
None if the cluster has two or fewer branches.
temperature (float, optional (default=1.0)) – A scaling factor for the softmax function. A higher temperature
makes the distribution more uniform, a lower temperature makes
the distribution more peaked.
Returns:
scaled_branch_memberships – The probabilities of a point belonging to the cluster’s branches.
None if the cluster has two or fewer branches.
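The temperature's effect on the resulting membership distribution is easy to see in isolation. This is a generic temperature-scaled softmax in numpy, not the library's internal routine; the membership values are made up:

```python
# Temperature-scaled softmax: higher temperature flattens the branch
# membership distribution, lower temperature sharpens it (generic sketch).
import numpy as np

def softmax(values, temperature=1.0):
    z = np.asarray(values, dtype=float) / temperature
    z -= z.max()               # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

memberships = np.array([2.0, 1.0, 0.0])   # hypothetical branch scores
sharp = softmax(memberships, temperature=0.5)
flat = softmax(memberships, temperature=5.0)
```

Both outputs are valid probability vectors; the low-temperature one concentrates mass on the strongest branch, the high-temperature one spreads it nearly uniformly.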
Updates the clusterer’s labels and branch labels by assigning
central points with a noise branch label to the geodesically closest
branch root. Only changing points with a noise branch label ensures that
points cannot move to a sibling branch in the branch hierarchy.
Parameters:
clusterer (FLASC) – A clustering object that has been fit to data.
branch_centrality_vectors (list[array (n_samples, n_branches)]) – The centrality values from the centroids of a cluster’s branches.
None if the cluster has two or fewer branches.
Returns:
labels – Updated cluster+branch labels for each point. Noisy samples are
given the label -1.
Return type:
np.ndarray, shape (n_samples, )
cluster_labels (np.ndarray, shape (n_samples, )) – Updated cluster labels for each point. Noisy samples are given
the label -1.
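The relabelling step above reduces to an argmax over per-branch values for the noise-labelled points only. The sketch below uses a hypothetical per-branch centrality matrix in place of the library's geodesic depths:

```python
# Sketch of the relabelling idea: points with a noise branch label (-1) get
# the branch whose root they are closest to. The centrality matrix here is
# made up; the library derives it from geodesic traversal depths.
import numpy as np

branch_labels = np.array([0, 1, -1, -1])
# Row i: centrality of point i as seen from each branch root (illustrative).
centrality_from_roots = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.7, 0.3],   # closest to branch 0's root
    [0.1, 0.6],   # closest to branch 1's root
])

updated = branch_labels.copy()
noise = branch_labels == -1
updated[noise] = centrality_from_roots[noise].argmax(axis=1)
```

Because only the -1 entries are touched, points that already carry a branch label can never jump to a sibling branch, matching the guarantee described above.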