FLASC* is a branch-aware clustering algorithm that builds upon
hdbscan to detect branching structures within clusters. The
algorithm returns a labelling that separates noise points from clusters and
branches from each other. In addition, the single-linkage and
condensed-linkage hierarchies are provided for both the clustering and the
branch detection stages.
Performs hdbscan clustering with a flare detection post-processing step.
FLASC - Flare-Sensitive Clustering. Performs hdbscan clustering
[1]_ with a post-processing step to detect branches within individual
clusters. For each cluster, a graph is constructed connecting the data
points based on their mutual reachability distances. Each edge is given a
centrality value based on how far it lies from the cluster’s center. Then,
the edges are clustered as if that centrality was a distance, progressively
removing the ‘center’ of each cluster and seeing how many branches remain.
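The mutual reachability distances underlying both stages can be illustrated with a short numpy sketch (a minimal, unoptimized illustration; the helper name is ours, not part of the FLASC API):

```python
import numpy as np

def mutual_reachability(X, min_samples=5):
    """Pairwise mutual reachability distances for a small dataset.

    d_mreach(a, b) = max(core(a), core(b), d(a, b)), where core(p) is the
    distance from p to its min_samples-th nearest neighbor (counting p itself).
    """
    # Pairwise Euclidean distances.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    # Core distance: distance to the min_samples-th neighbor, self included.
    core = np.sort(dist, axis=1)[:, min_samples - 1]
    return np.maximum(dist, np.maximum(core[:, None], core[None, :]))

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0]])
d = mutual_reachability(X, min_samples=2)
```

Between nearby points the mutual reachability distance equals the larger core distance; for far-apart points it falls back to the raw distance.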
Parameters:
X (array of shape (n_samples, n_features), or array of shape (n_samples, n_samples)) – A feature array, or array of distances between samples if
metric='precomputed'.
min_cluster_size (int, optional (default=5)) – The minimum number of samples in a group for that group to be
considered a cluster; groupings smaller than this size will be left
as noise.
min_branch_size (int, optional (default=None)) – The minimum number of samples in a group for that group to be
considered a branch; groupings smaller than this size will be seen as
points falling out of a branch. Defaults to the min_cluster_size.
min_samples (int, optional (default=None)) – The number of samples in a neighborhood for a point
to be considered as a core point. This includes the point itself.
Defaults to the min_cluster_size.
metric (str or callable, optional (default='minkowski')) – The metric to use when calculating distance between instances in a
feature array. If metric is a string or callable, it must be one of
the options allowed by metrics.pairwise.pairwise_distances for its
metric parameter.
If metric is “precomputed”, X is assumed to be a distance matrix and
must be square.
p (int, optional (default=2)) – p value to use if using the minkowski metric.
alpha (float, optional (default=1.0)) – A distance scaling parameter as used in robust single linkage.
See [2]_ for more information.
algorithm (str, optional (default='best')) – Exactly which algorithm to use; hdbscan has variants specialized
for different characteristics of the data. By default this is set
to best which chooses the “best” algorithm given the nature of
the data. You can force other options if you believe you know
better. Options are:
best
generic
prims_kdtree
prims_balltree
boruvka_kdtree
boruvka_balltree
leaf_size (int, optional (default=40)) – Leaf size for trees responsible for fast nearest
neighbor queries.
approx_min_span_tree (bool, optional (default=True)) – Whether to accept an only approximate minimum spanning tree.
For some algorithms this can provide a significant speedup, but
the resulting clustering may be of marginally lower quality.
If you are willing to sacrifice speed for correctness you may want
to explore this; in general this should be left at the default True.
cluster_selection_method (str, optional (default='eom')) – The method used to select clusters from the condensed tree. The
standard approach for FLASC is to use an Excess of Mass algorithm
to find the most persistent clusters. Alternatively you can instead
select the clusters at the leaves of the tree – this provides the
most fine grained and homogeneous clusters. Options are:
eom
leaf
allow_single_cluster (bool, optional (default=False)) – By default HDBSCAN* will not produce a single cluster, setting this
to True will override this and allow single cluster results in
the case that you feel this is a valid result for your dataset.
(default False)
cluster_selection_epsilon (float, optional (default=0.0)) – A distance threshold. Clusters below this value will be merged.
See [3]_ for more information. Note that this should not be used
if we want to predict the cluster labels for new points in future
(e.g. using approximate_predict()), as that function is not aware of
this argument.
max_cluster_size (int, optional (default=0)) – A limit to the size of clusters returned by the eom algorithm.
Has no effect when using leaf clustering (where clusters are
usually small regardless) and can also be overridden in rare
cases by a high value for cluster_selection_epsilon. Note that
this should not be used if we want to predict the cluster labels
for new points in future
(e.g. using approximate_predict()), as
the approximate_predict function is not aware of this argument.
allow_single_branch (bool, optional (default=False)) – Analogous to allow_single_cluster. Note that depending on
label_sides_as_branches FLASC* requires at least 3 branches to
exist in a cluster before they are incorporated in the final labelling.
branch_detection_method (str, optional (default='full')) – Determines which graph is constructed to detect branches with. Valid
values are, ordered by increasing computation cost and decreasing
sensitivity to noise:
core: Contains the edges that connect each point to all other
points within a mutual reachability distance lower than or equal to
the point’s core distance. This is the cluster’s subgraph of the
k-NN graph over the entire data set (with k = min_samples).
full: Contains all edges between points in each cluster with a
mutual reachability distance lower than or equal to the distance of
the most-distant point in each cluster. These graphs represent the
0-dimensional simplicial complex of each cluster at the first point in
the filtration where they contain all their points.
branch_selection_persistence (float, optional (default=0.0)) – A centrality persistence threshold. Branches with a persistence below
this value will be merged. See [3]_ for more information. Note that this
should not be used if we want to predict the cluster labels for new
points in future (e.g. using approximate_predict), as the
approximate_predict() function
is not aware of this argument.
branch_selection_method (str, optional (default='eom')) – The method used to select branches from the cluster’s condensed tree.
The standard approach for FLASC* is to use the eom approach.
Options are:
eom
leaf
max_branch_size (int, optional (default=0)) – A limit to the size of clusters returned by the eom algorithm.
Has no effect when using leaf clustering (where clusters are
usually small regardless). Note that this should not be used if we
want to predict the cluster labels for new points in future (e.g. using
approximate_predict()), as that function is
not aware of this argument.
label_sides_as_branches (bool, optional (default=False)) – When this flag is False, branches are only labelled for clusters with at
least three branches (i.e., at least y-shapes). Clusters with only two
branches represent l-shapes. The two branches describe the cluster’s
outsides growing towards each other. Enabling this flag separates these
branches from each other in the produced labelling.
memory (instance of joblib.Memory or str, optional) – Used to cache the output of the computation of the tree.
By default, no caching is done. If a string is given, it is the
path to the caching directory.
num_jobs (int, optional (default=None)) – Number of parallel jobs to run in core distance computations and branch
detection step. For num_jobs below -1, (n_cpus + 1 + num_jobs) are
used. By default, the algorithm tries to estimate whether the given input
is large enough for multi-processing to have a benefit. If so, 4 processes
are started, otherwise only the main process is used. When a num_jobs
value is given, that number of jobs is used regardless of the input size.
**kwargs (optional) – Additional arguments passed to hdbscan() or the
distance metric.
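The ‘core’ branch_detection_method described in the parameters above can be sketched with plain numpy (our own simplification of the idea, not FLASC’s implementation): keep only the edges that fall within either endpoint’s core-distance neighborhood.

```python
import numpy as np

# Toy cluster: three close points and one slightly apart.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 0.0]])
diff = X[:, None, :] - X[None, :, :]
dist = np.sqrt((diff ** 2).sum(-1))

min_samples = 2
core = np.sort(dist, axis=1)[:, min_samples - 1]   # self counts as neighbor 1
mreach = np.maximum(dist, np.maximum(core[:, None], core[None, :]))

# 'core' graph: keep an edge when it lies within either endpoint's
# core-distance neighborhood (the k-NN subgraph under mutual reachability).
edges = [(i, j) for i in range(len(X)) for j in range(i + 1, len(X))
         if mreach[i, j] <= max(core[i], core[j])]
```

The ‘full’ method would instead keep every edge up to the distance at which the cluster becomes fully connected, which is costlier but less noise-sensitive.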
A score of how persistent each cluster is. A score of 1.0 represents
a perfectly stable cluster that persists over all distance scales,
while a score of 0.0 represents a perfectly ephemeral cluster. These
scores gauge the relative coherence of the clusters output by the
algorithm. Not available when override_cluster_labels is used.
The graphs used to detect branches in each cluster as an
ApproximationGraph. Can be converted to
a networkx graph, pandas data frame, or a list with numpy array-edgelists.
Points are labelled by their row-index into the input data. The edges
contained in the graph depend on the branch_detection_method:
- core: Contains the edges that connect each point to all other
points in a cluster within a mutual reachability distance lower than
or equal to the point’s core distance. This is an extension of the
minimum spanning tree introducing only edges with equal distances. The
reachability distance introduces num_points * min_samples of
such edges.
- full: Contains all edges between points in each cluster with a
mutual reachability distance lower than or equal to the distance of
the most-distant point in each cluster. These graphs represent the
0-dimensional simplicial complex of each cluster at the first point in
the filtration where they contain all their points.
Centrality values for each point in a cluster. Overemphasizes points’
eccentricity within the cluster as the values are based on minimum
spanning trees that do not contain the equally distanced edges resulting
from the mutual reachability distance.
A list of exemplar points for clusters. Since HDBSCAN supports
arbitrary shapes for clusters we cannot provide a single cluster
exemplar per cluster. Instead a list is returned with each element
of the list being a numpy array of exemplar points for a cluster –
these points are the “most representative” points of the cluster.
Not available when override_cluster_labels is used or a precomputed
distance matrix is given as input.
A list with exemplar points for the branches in the clusters. A cluster’s
item is empty if it does not have selected branches. For clusters with
selected branches, a list with a numpy array of exemplar points for each
selected branch is given.
HDBSCAN’s fast approximation of the Density Based Cluster Validity (
DBCV) score [4]_ on FLASC’s labelling. It may only be used to compare
results across different choices of hyper-parameters; it is therefore
only a relative score.
An HDBSCAN clusterer object fitted to the data. Can be used to compute
outlier scores and cluster exemplars. Not available when
override_cluster_labels is used.
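The cluster centrality values described above can be approximated with a deliberately simplified sketch: score each point by the reciprocal of its distance to the cluster centroid, so central points get high values and branch tips get low ones (FLASC’s actual values come from the cluster approximation graphs, not centroids):

```python
import numpy as np

# An l/y-shaped toy cluster: one arm along x, one arm along y.
cluster = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 2.0]])
centroid = cluster.mean(axis=0)

# Simplified centrality: reciprocal distance to the centroid.
centrality = 1.0 / np.linalg.norm(cluster - centroid, axis=1)

tip = int(np.argmin(centrality))   # least central point = a branch tip
```

Clustering these values as if they were a distance, and peeling away the most central points first, is what separates the arms into branches.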
X (array of shape (n_samples, n_features), or array of shape (n_samples, n_samples)) – A feature array, or array of distances between samples if
metric='precomputed'.
labels (array of shape (n_samples, 1), or None) –
Labels that override the FLASC clustering: a numpy array containing
a cluster label for each data point. Negative values
will be interpreted as noise points. When the parameter is not set to
None, core distances are computed over all data points, minimum spanning
trees and the branches are computed per cluster. Consequently, the
manually specified clusters do not have to form neatly separable
connected components in the minimum spanning tree over all the data
points.
Because the clustering step is skipped, some of the output variables and
the approximate_predict() function will be
unavailable: cluster_persistence, condensed_tree,
single_linkage_tree, and min_spanning_tree.
probabilities (array of shape (n_samples, 1), or None) – Specifying a non-None value for this parameter is only valid when
override_cluster_labels is used. In that case, this parameter
controls the data point cluster membership probabilities. When this
parameter is None, a default 1.0 probability is used for all points.
Performs clustering on X and returns cluster labels.
Parameters:
X (array of shape (n_samples, n_features), or array of shape (n_samples, n_samples)) – A feature array, or array of distances between samples if
metric='precomputed'.
labels (array of shape (n_samples, 1), or None) –
Labels that override the FLASC clustering: a numpy array containing
a cluster label for each data point. Negative values
will be interpreted as noise points. When the parameter is not set to
None, core distances are computed over all data points, minimum spanning
trees and the branches are computed per cluster. Consequently, the
manually specified clusters do not have to form neatly separable
connected components in the minimum spanning tree over all the data
points.
Because the clustering step is skipped, some of the output variables and
the approximate_predict() function will be
unavailable: cluster_persistence, condensed_tree,
single_linkage_tree, and min_spanning_tree.
probabilities (array of shape (n_samples, 1), or None) – Specifying a non-None value for this parameter is only valid when
override_cluster_labels is used. In that case, this parameter
controls the data point cluster membership probabilities. When this
parameter is None, a default 1.0 probability is used for all points.
Note that this method is only relevant if
enable_metadata_routing=True (see sklearn.set_config()).
Please see the User Guide on how the routing
mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the
existing request. This allows you to change the request for some
parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a
sub-estimator of a meta-estimator, e.g. used inside a
Pipeline. Otherwise it has no effect.
Parameters:
labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for labels parameter in fit.
probabilities (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for probabilities parameter in fit.
Provides an approximate representative point for a given branch.
Note that this technique assumes a euclidean metric for speed of
computation. For more general metrics use the
weighted_medoid()
method, which is slower but can work with the metric the model was
trained with.
Parameters:
label_id (int) – The id of the cluster to compute a centroid for.
data (np.ndarray (n_samples, n_features), optional (default=None)) – A dataset to use instead of the raw data that was clustered on.
Returns:
centroid – A representative centroid for cluster label_id.
Provides an approximate representative point for a given cluster. Note
that this technique assumes a euclidean metric for speed of computation.
For more general metrics use the
weighted_cluster_medoid()
method, which is slower but can work with the metric the model was
trained with.
Parameters:
cluster_id (int) – The id of the cluster to compute a centroid for.
Returns:
centroid – A representative centroid for cluster cluster_id.
Provides an approximate representative point for a given cluster.
Note that this technique can be very slow and memory intensive for
large clusters. For faster results use the
weighted_cluster_centroid()
method which is faster, but assumes a euclidean metric.
Parameters:
cluster_id (int) – The id of the cluster to compute a medoid for.
Returns:
medoid – A representative medoid for cluster cluster_id.
Provides an approximate representative point for a given branch.
Note that this technique can be very slow and memory intensive for
large clusters. For faster results use the
weighted_centroid()
method which is faster, but assumes a euclidean metric.
Parameters:
label_id (int) – The id of the cluster to compute a medoid for.
data (np.ndarray (n_samples, n_features), optional (default=None)) – A dataset to use instead of the raw data that was clustered on.
Returns:
medoid – A representative medoid for branch label_id.
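The centroid/medoid distinction drawn by these methods can be illustrated with a small numpy sketch (the helper names are ours; FLASC’s versions use the fitted model’s membership weights and, for medoids, its training metric):

```python
import numpy as np

def weighted_centroid(points, weights):
    """Probability-weighted Euclidean mean (fast, but Euclidean-only)."""
    w = np.asarray(weights, dtype=float)
    return (points * w[:, None]).sum(axis=0) / w.sum()

def weighted_medoid(points, weights):
    """Actual data point minimizing the weighted sum of distances to all
    others (slower, but valid for any metric; Euclidean used here)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    cost = (dist * np.asarray(weights, dtype=float)[None, :]).sum(axis=1)
    return points[int(np.argmin(cost))]

pts = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
w = np.array([1.0, 1.0, 1.0])
```

Note the difference: the centroid need not coincide with any data point, while the medoid is always one of the cluster’s own points.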
Performs FLASC clustering with a flare detection post-processing step.
FLASC - Flare-Sensitive Clustering. Performs hdbscan clustering
[1]_ with a post-processing step to detect branches within individual
clusters. For each cluster, a graph is constructed connecting the data
points based on their mutual reachability distances. Each edge is given a
centrality value based on how far it lies from the cluster’s center. Then,
the edges are clustered as if that centrality was a distance, progressively
removing the ‘center’ of each cluster and seeing how many branches remain.
Parameters:
X (array of shape (n_samples, n_features), or array of shape (n_samples, n_samples)) – A feature array, or array of
distances between samples if metric='precomputed'.
override_cluster_labels (np.ndarray, optional (default=None)) – Override the FLASC clustering to specify your own grouping with a numpy
array containing a cluster label for each data point. Negative values
will be interpreted as noise points. When the parameter is not set to
None, core distances are computed over all data points, minimum spanning
trees and the branches are computed per cluster. Consequently, the
manually specified clusters do not have to form neatly separable
connected components in the minimum spanning tree over all the data
points.
Because the clustering step is skipped, some of the output variables and
the approximate_predict() function will be
unavailable: cluster_persistence, condensed_tree,
single_linkage_tree, and min_spanning_tree.
override_cluster_probabilities (np.ndarray, optional (default=None)) – Specifying a non-None value for this parameter is only valid when
override_cluster_labels is used. In that case, this parameter
controls the data point cluster membership probabilities. When this
parameter is None, a default 1.0 probability is used for all points.
min_cluster_size (int, optional (default=5)) – The minimum number of samples in a group for that group to be considered
a cluster; groupings smaller than this size will be left as noise.
min_branch_size (int, optional (default=None)) – The minimum number of samples in a group for that group to be considered
a branch; groupings smaller than this size will be seen as points falling
out of a branch. Defaults to the min_cluster_size.
min_samples (int, optional (default=None)) – The number of samples in a neighborhood for a point to be considered as
a core point. This includes the point itself. Defaults to the
min_cluster_size.
metric (str or callable, optional (default='minkowski')) – The metric to use when calculating distance between instances in a
feature array. If metric is a string or callable, it must be one of the
options allowed by metrics.pairwise.pairwise_distances for its metric
parameter. If metric is “precomputed”, X is assumed to be a distance
matrix and must be square.
p (int, optional (default=2)) – p value to use if using the minkowski metric.
alpha (float, optional (default=1.0)) – A distance scaling parameter as used in robust single linkage. See [2]_
for more information.
algorithm (str, optional (default='best')) – Exactly which algorithm to use; hdbscan has variants specialized for
different characteristics of the data. By default this is set to
best which chooses the “best” algorithm given the nature of the
data. You can force other options if you believe you know better.
Options are:
best
generic
prims_kdtree
prims_balltree
boruvka_kdtree
boruvka_balltree
leaf_size (int, optional (default=40)) – Leaf size for trees responsible for fast nearest neighbor queries.
approx_min_span_tree (bool, optional (default=True)) – Whether to accept an only approximate minimum spanning tree. For some
algorithms this can provide a significant speedup, but the resulting
clustering may be of marginally lower quality. If you are willing to
sacrifice speed for correctness you may want to explore this; in general
this should be left at the default True.
max_cluster_size (int, optional (default=0)) – A limit to the size of clusters returned by the eom algorithm. Has no
effect when using leaf clustering (where clusters are usually small
regardless) and can also be overridden in rare cases by a high value for
cluster_selection_epsilon. Note that this should not be used if we want
to predict the cluster labels for new points in future (e.g. using
approximate_predict), as the approximate_predict function is not aware
of this argument.
cluster_selection_method (str, optional (default='eom')) – The method used to select clusters from the condensed tree. The standard
approach for FLASC is to use an Excess of Mass algorithm to find the
most persistent clusters. Alternatively you can instead select the
clusters at the leaves of the tree – this provides the most fine
grained and homogeneous clusters. Options are:
eom
leaf
allow_single_cluster (bool, optional (default=False)) – By default FLASC will not produce a single cluster, setting this to
True will override this and allow single cluster results in the case
that you feel this is a valid result for your dataset. (default False)
cluster_selection_epsilon (float, optional (default=0.0)) – A distance threshold. Clusters below this value will be merged. See [3]_
for more information. Note that this should not be used if we want to
predict the cluster labels for new points in future (e.g. using
approximate_predict), as the approximate_predict function is not aware
of this argument.
cluster_selection_persistence (float, optional (default=0.0)) – A persistence threshold. Clusters with a persistence below this value
will be merged. Note that this should not be used if we want to predict
the cluster labels for new points in future (e.g. using
approximate_predict), as the approximate_predict function is not aware
of this argument.
max_branch_size (int, optional (default=0)) – A limit to the size of clusters returned by the eom algorithm. Has
no effect when using leaf clustering (where clusters are usually
small regardless). Note that this should not be used if we want to
predict the cluster labels for new points in future (e.g. using
approximate_predict), as the approximate_predict function is not aware
of this argument.
allow_single_branch (bool, optional (default=False)) – Analogous to allow_single_cluster. Note that depending on
label_sides_as_branches FLASC requires at least 3 branches to exist
in a cluster before they are incorporated in the final labelling.
branch_detection_method (str, optional (default='full')) – Determines which graph is constructed to detect branches with. Valid
values are, ordered by increasing computation cost and decreasing
sensitivity to noise:
core: Contains the edges that connect each point to all other
points within a mutual reachability distance lower than or equal to
the point’s core distance. This is the cluster’s subgraph of the k-NN
graph over the entire data set (with k = min_samples).
full: Contains all edges between points in each cluster with a
mutual reachability distance lower than or equal to the distance of
the most-distant point in each cluster. These graphs represent the
0-dimensional simplicial complex of each cluster at the first point in
the filtration where they contain all their points.
branch_selection_method (str, optional (default='eom')) – The method used to select branches from the cluster’s condensed tree.
The standard approach for FLASC is to use the eom approach. Options
are:
eom
leaf
branch_selection_epsilon (float, optional (default=0.0)) – A distance threshold. Branches below this value will be merged. See [3]_
for more information. Note that this should not be used if we want to
predict the cluster labels for new points in future (e.g. using
approximate_predict), as the approximate_predict function is not aware
of this argument.
branch_selection_persistence (float, optional (default=0.0)) – A persistence threshold. Branches with a persistence below this value
will be merged. Note that this should not be used if we want to predict
the cluster labels for new points in future (e.g. using
approximate_predict), as the approximate_predict function is not aware
of this argument.
label_sides_as_branches (bool, optional (default=False)) – When this flag is False, branches are only labelled for clusters with at
least three branches (i.e., at least y-shapes). Clusters with only two
branches represent l-shapes. The two branches describe the cluster’s
outsides growing towards each other. Enabling this flag separates these
branches from each other in the produced labelling.
memory (instance of joblib.Memory or str, optional) – Used to cache the output of the computation of the tree. By default, no
caching is done. If a string is given, it is the path to the caching
directory.
num_jobs (int, optional (default=None)) – Number of parallel jobs to run in core distance computations and branch
detection step. For num_jobs below -1, (n_cpus + 1 + num_jobs) are
used. By default, the algorithm tries to estimate whether the given
input is large enough for multi-processing to have a benefit. If so, 4
processes are started, otherwise only the main process is used. When a
num_jobs value is given, that number of jobs is used regardless of
the input size.
**kwargs (optional) – Additional arguments passed to hdbscan() or the distance
metric.
Returns:
labels (np.ndarray, shape (n_samples, )) – Cluster+branch labels for each point. Noisy samples are given the label
-1.
probabilities (np.ndarray, shape (n_samples, )) – Cluster+branch membership strengths for each point. Noisy samples are
assigned 0.
cluster_labels (np.ndarray, shape (n_samples, )) – Cluster labels for each point. Noisy samples are given the label -1.
cluster_probabilities (np.ndarray, shape (n_samples, )) – Cluster membership strengths for each point. Noisy samples are assigned
0.
cluster_persistence (array, shape (n_clusters, )) – A score of how persistent each cluster is. A score of 1.0 represents a
perfectly stable cluster that persists over all distance scales, while a
score of 0.0 represents a perfectly ephemeral cluster. These scores
gauge the relative coherence of the clusters output by the algorithm.
Not available when override_cluster_labels is used.
branch_labels (np.ndarray, shape (n_samples, )) – Branch labels for each point. Noisy samples are given the label -1.
branch_probabilities (np.ndarray, shape (n_samples, )) – Branch membership strengths for each point. Noisy samples are assigned
0.
branch_persistences (tuple (n_clusters)) – A branch persistence (eccentricity range) for each detected branch.
condensed_tree (record array) – The condensed cluster hierarchy used to generate clusters. Not available
when override_cluster_labels is used.
single_linkage_tree (np.ndarray, shape (n_samples - 1, 4)) – The single linkage tree produced during clustering in scipy hierarchical
clustering format (see
http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html). Not
available when override_cluster_labels is used.
min_spanning_tree (np.ndarray, shape (n_samples - 1, 3)) – The minimum spanning tree as an edgelist. Not available when
override_cluster_labels is used.
cluster_approximation_graphs (tuple (n_clusters)) – The graphs used to detect branches in each cluster stored as a numpy
array with four columns: source, target, centrality, mutual reachability
distance. Points are labelled by their row-index into the input data.
The edges contained in the graphs depend on the
branch_detection_method:
- core: Contains the edges that connect each point to all other
points in a cluster within a mutual reachability distance lower than
or equal to the point’s core distance. This is an extension of the
minimum spanning tree introducing only edges with equal distances. The
reachability distance introduces num_points * min_samples of
such edges.
- full: Contains all edges between points in each cluster with a
mutual reachability distance lower than or equal to the distance of
the most-distant point in each cluster. These graphs represent the
0-dimensional simplicial complex of each cluster at the first point in
the filtration where they contain all their points.
branch_condensed_trees (tuple (n_clusters)) – A condensed branch hierarchy for each cluster produced during the branch
detection step. Data points are numbered with in-cluster ids.
branch_linkage_trees (tuple (n_clusters)) – A single linkage tree for each cluster produced during the branch
detection step, in the scipy hierarchical clustering format. (see
http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html). Data
points are numbered with in-cluster ids.
cluster_centralities (np.ndarray, shape (n_samples, )) – Centrality values for each point in a cluster. Overemphasizes points’
eccentricity within the cluster as the values are based on minimum
spanning trees that do not contain the equally distanced edges resulting
from the mutual reachability distance.
cluster_points (list (n_clusters)) – The data point row indices for each cluster.
Predict the cluster label of new points. The returned labels
will be those of the original clustering found by clusterer,
and therefore are not (necessarily) the cluster labels that would
be found by clustering the original data combined with
points_to_predict, hence the ‘approximate’ label.
If you simply wish to assign new points to an existing clustering
in the ‘best’ way possible, this is the function to use. If you
want to predict how points_to_predict would cluster with
the original data under FLASC, the most efficient existing approach
is to simply re-cluster with the new point(s) added to the original dataset.
Parameters:
clusterer (FLASC) – A clustering object that has been fit with data.
points_to_predict (array, or array-like (n_samples, n_features)) – The new data points to predict cluster labels for. They should
have the same dimensionality as the original dataset over which
clusterer was fit.
Returns:
labels (array (n_samples,)) – The predicted labels of the points_to_predict
probabilities (array (n_samples,)) – The soft cluster scores for each of the points_to_predict
cluster_labels (array (n_samples,)) – The predicted cluster labels of the points_to_predict
cluster_probabilities (array (n_samples,)) – The soft cluster scores for each of the points_to_predict
branch_labels (array (n_samples,)) – The predicted branch labels of the points_to_predict
branch_probabilities (array (n_samples,)) – The soft branch scores for each of the points_to_predict
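The intuition behind approximate_predict — new points inherit labels from the existing clustering rather than triggering a re-fit — can be sketched with a simple nearest-neighbor assignment (our own stand-in for illustration only; the real function also produces membership strengths and handles noise):

```python
import numpy as np

def nearest_neighbor_predict(X_train, labels, X_new):
    """Assign each new point the label of its nearest fitted point.

    A rough stand-in for the idea behind approximate_predict: new points
    inherit labels from the existing clustering without re-running FLASC.
    """
    diff = X_new[:, None, :] - X_train[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    return labels[dist.argmin(axis=1)]

X_train = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0]])
labels = np.array([0, 0, 1])
pred = nearest_neighbor_predict(X_train, labels,
                                np.array([[0.1, 0.1], [4.8, 5.2]]))
```

As the documentation notes, such assignments approximate the labelling; only re-clustering the combined data gives the labels FLASC itself would produce.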
Predict soft branch-membership vectors for all points in the clusters
with more than two detected branches. Computes geodesic traversal depth
from branch-roots to all points in the cluster.
Parameters:
clusterer (FLASC) – A clusterer object that has been fit to vector input data.
Returns:
centrality_vectors – The centrality value of point i in cluster c of the original
dataset from the root of branch j is in
centrality_vectors[c][i, j].
Scales the branch centrality values into membership probabilities using a
softmax with temperature \(t\): \(\mathbf{m} = \operatorname{softmax}(\mathbf{c} / t)\),
where \(\mathbf{m}\) is the scaled membership vector and
\(\mathbf{c}\) is the branch centrality vector.
Parameters:
branch_centrality_vectors (list[array (n_samples, n_branches)]) – The centrality values from the centroids of a cluster’s branches.
None if the cluster has two or fewer branches.
temperature (float, optional (default=1.0)) – A scaling factor for the softmax function. A higher temperature
makes the distribution more uniform, a lower temperature makes
the distribution more peaky.
Returns:
scaled_branch_memberships – The probabilities of a point belonging to the cluster’s branches.
None if the cluster has two or fewer branches.
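The temperature-scaled softmax described above can be sketched as follows (the helper name and sign convention are our assumptions; FLASC’s exact formula may differ):

```python
import numpy as np

def softmax_membership(centrality, temperature=1.0):
    """Scale a branch-centrality vector into membership probabilities.

    Higher temperature flattens the distribution toward uniform;
    lower temperature sharpens it toward the dominant branch.
    """
    z = np.asarray(centrality, dtype=float) / temperature
    z -= z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

m_hot = softmax_membership([2.0, 1.0, 0.5], temperature=0.1)
m_flat = softmax_membership([2.0, 1.0, 0.5], temperature=10.0)
```

The same centrality vector yields a near one-hot membership at low temperature and a near-uniform one at high temperature.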
Updates the clusterer’s labels and branch labels by assigning
central points with a noise branch label to the geodesically closest branch
root. Only changing points with a noise branch label makes sure that points
cannot move to a sibling branch in the branch hierarchy.
Parameters:
clusterer (FLASC) – A clustering object that has been fit to data.
branch_centrality_vectors (list[array (n_samples, n_branches)]) – The centrality values from the centroids of a cluster’s branches.
None if the cluster has two or fewer branches.
Returns:
labels – Updated cluster+branch labels for each point. Noisy samples are
given the label -1.
Return type:
np.ndarray, shape (n_samples, )
cluster_labels (np.ndarray, shape (n_samples, ))
Updated cluster labels for each point. Noisy samples are given
the label -1.