API Reference
- class fast_hbcc.HBCC(num_hops: int = 2, min_samples: int = 5, min_cluster_size: int = 25, hop_type: Literal['manifold', 'metric'] = 'manifold', boundary_connectivity: Literal['knn', 'core'] = 'knn', boundary_use_reachability: bool = True, cluster_selection_method: Literal['eom', 'leaf'] = 'eom', allow_single_cluster: bool = False, max_cluster_size: float = inf, cluster_selection_epsilon: float = 0.0, cluster_selection_persistence: float = 0.0, ss_algorithm: Literal['bc', 'bc_simple'] = 'bc', num_jobs: int = 0)
A scikit-learn-style estimator for computing Hierarchical Boundary Coefficient Clustering (HBCC). The algorithm is inspired by the work of Peng et al. [4], uses algorithms from Vandaele et al. [2], Campello et al. [1], and McInnes et al. [3], and relies on code from fast_hdbscan.
The algorithm contains the following steps:
1. Compute the k nearest neighbors and the minimum spanning tree.
2. Compute the boundary coefficient for each point.
3. Compute the minimum spanning tree of the boundary-coefficient-weighted knn–mst graph union.
4. Compute the HDBSCAN cluster hierarchy and cluster selection.
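Two of the steps above build a minimum spanning tree. As a rough illustration of that building block only, the sketch below runs Kruskal's algorithm on a small hand-made weighted graph. Kruskal's algorithm is a standard technique chosen here for brevity; it is not necessarily what fast_hbcc uses internally, and the helper name `kruskal_mst` and the edge weights are hypothetical.

```python
from typing import List, Tuple

Edge = Tuple[float, int, int]  # (weight, u, v)


def kruskal_mst(num_points: int, edges: List[Edge]) -> List[Edge]:
    """Return the edges of a minimum spanning forest (Kruskal's algorithm)."""
    parent = list(range(num_points))

    def find(i: int) -> int:
        # Path halving keeps the union-find trees shallow.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    tree: List[Edge] = []
    for weight, u, v in sorted(edges):  # cheapest edges first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two separate components
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree


# Hypothetical weighted graph on 4 points (not produced by fast_hbcc).
edges = [(1.0, 0, 1), (2.5, 1, 2), (0.5, 2, 3), (3.0, 0, 3), (2.0, 0, 2)]
mst = kruskal_mst(4, edges)
total_weight = sum(weight for weight, _, _ in mst)
```

In HBCC the edge weights of the second tree would come from the boundary coefficient rather than from raw distances; the tree construction itself is unchanged by that choice.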
- Parameters:
- num_hops: int, default=2
The number of hops used to expand the boundary coefficient graph.
- min_samples: int, default=5
Core distance is computed as the min_samples-nearest neighbor distance.
- min_cluster_size: int, default=25
The minimum number of samples in a cluster.
- hop_type: 'manifold' or 'metric', default='manifold'
The type of hop expansion to use. 'manifold' adds edge distances on traversal; 'metric' computes the distance between visited points.
- boundary_connectivity: 'knn' or 'core', default='knn'
Which graph to compute the boundary coefficient on. 'knn' uses the k-nearest neighbors graph; 'core' uses the knn–mst union graph.
- boundary_use_reachability: bool, default=True
Whether to use mutual reachability distances (True) or raw distances (False) for the boundary coefficient computation.
- cluster_selection_method: 'eom' or 'leaf', default='eom'
HDBSCAN cluster selection strategy.
- allow_single_cluster: bool, default=False
HDBSCAN cluster selection parameter controlling whether to allow a single cluster during selection.
- max_cluster_size: float, default=inf
HDBSCAN cluster selection parameter limiting the maximum cluster size.
- cluster_selection_epsilon: float, default=0.0
HDBSCAN cluster selection parameter controlling minimum cluster death distance.
- cluster_selection_persistence: float, default=0.0
HDBSCAN cluster selection parameter controlling minimum cluster persistence.
- ss_algorithm: 'bc' or 'bc_simple', default='bc'
HDBSCAN cluster selection parameter controlling the semi-supervised strategy.
- num_jobs: int, default=0
The number of threads to use for the computation. Zero means using all threads. Negative values indicate all but that number of threads.
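The `min_samples` and `boundary_use_reachability` descriptions above refer to core distances and mutual reachability. The brute-force sketch below shows the standard HDBSCAN definitions on four hand-picked points. It assumes Euclidean distance and that a point is not counted as its own neighbor; the library may differ on both, and the helper names are hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def core_distances(points: List[Point], min_samples: int) -> List[float]:
    """Distance from each point to its min_samples-th nearest neighbor.

    Assumption: the point itself is excluded from its neighbor list.
    """
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[min_samples - 1])
    return result


def mutual_reachability(points: List[Point], core: List[float], a: int, b: int) -> float:
    """Standard HDBSCAN mutual reachability: max(core(a), core(b), d(a, b))."""
    return max(core[a], core[b], math.dist(points[a], points[b]))


# Three nearby points and one far-away point on a line.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
core = core_distances(points, min_samples=2)
near = mutual_reachability(points, core, 0, 1)
far = mutual_reachability(points, core, 2, 3)
```

Points in sparse regions get large core distances, so mutual reachability pushes them away from everything else, while distances between points in dense regions stay close to their raw values.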
- Attributes:
- labels_: numpy.ndarray, shape (n_samples,)
The computed cluster labels.
- probabilities_: numpy.ndarray, shape (n_samples,)
The computed cluster probabilities.
- boundary_coefficient_: numpy.ndarray, shape (n_samples,)
The computed boundary coefficient for each point.
- condensed_tree_: hdbscan.plots.CondensedTree
The condensed cluster hierarchy used to generate clusters.
- single_linkage_tree: hdbscan.plots.SingleLinkageTree
The single linkage tree produced during clustering, in SciPy hierarchical clustering format (see http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html).
- min_spanning_tree_: hdbscan.plots.MinimumSpanningTree
The minimum spanning tree as an edge list. If gen_min_span_tree was False, this will be None.
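For readers unfamiliar with the SciPy hierarchical clustering format mentioned above: a linkage array has one row per merge, holding the ids of the two merged clusters, the merge distance, and the number of original samples in the new cluster. Merge number i creates cluster id n_samples + i. The sketch below decodes a hand-written linkage; it does not show how the SingleLinkageTree object exposes its data, and `clusters_from_linkage` is a hypothetical helper.

```python
from typing import Dict, List, Sequence, Tuple

LinkageRow = Tuple[float, float, float, float]  # (left id, right id, distance, size)


def clusters_from_linkage(
    linkage: Sequence[LinkageRow], n_samples: int
) -> Dict[int, List[int]]:
    """Map each merged cluster id to the original sample indices it contains."""
    members: Dict[int, List[int]] = {i: [i] for i in range(n_samples)}
    for step, (left, right, _distance, size) in enumerate(linkage):
        merged = sorted(members[int(left)] + members[int(right)])
        # The fourth column must equal the number of samples in the new cluster.
        if len(merged) != int(size):
            raise ValueError(f"row {step}: size column does not match members")
        members[n_samples + step] = merged
    return {cid: m for cid, m in members.items() if cid >= n_samples}


# Hand-written linkage for 4 samples (illustrative values only).
linkage = [
    (0, 1, 0.5, 2),  # samples 0 and 1 merge into cluster 4
    (2, 3, 0.8, 2),  # samples 2 and 3 merge into cluster 5
    (4, 5, 2.0, 4),  # clusters 4 and 5 merge into cluster 6
]
merged_clusters = clusters_from_linkage(linkage, n_samples=4)
```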
References
[1] Campello, R. J., Moulavi, D., & Sander, J. (2013, April). Density-based clustering based on hierarchical density estimates. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 160-172). Springer Berlin Heidelberg.
[2] Vandaele, R., Saeys, Y., & De Bie, T. (2019). The Boundary Coefficient: a Vertex Measure for Visualizing and Finding Structure in Weighted Graphs. 15th International Workshop on Mining and Learning with Graphs (MLG).
[3] McInnes, L., & Healy, J. (2017). Accelerated Hierarchical Density Based Clustering. 2017 IEEE International Conference on Data Mining Workshops (ICDMW), 2017-Novem, 33–42. https://doi.org/10.1109/ICDMW.2017.12.
[4] Peng, D., Gui, Z., Wang, D., Ma, Y., Huang, Z., Zhou, Y., & Wu, H. (2022). Clustering by measuring local direction centrality for data with heterogeneous density and weak connectivity. Nature Communications, 13(1), 1–14. https://doi.org/10.1038/s41467-022-33136-9.
- property approximation_graph_
See ApproximationGraph for documentation.
- fit(X, y=None, sample_weight=None, **fit_params)
Computes the Hierarchical Boundary Coefficient Clustering (HBCC).
- Parameters:
- X: float[:, ::1]
The data to cluster.
- y: int[::1], optional
Datapoint labels for semi-supervised clustering.
- **fit_params: dict
Ignored.
- Returns:
- self: HBCC
The fitted estimator.
- set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') -> HBCC
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
- sample_weight: str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for the sample_weight parameter in fit.
- Returns:
- self: object
The updated object.
- class fast_hbcc.BoundaryClusterDetector(num_hops: int = 2, hop_type: Literal['manifold', 'metric'] = 'manifold', boundary_connectivity: Literal['knn', 'core'] = 'knn', boundary_use_reachability: bool = True, min_cluster_size: int | None = None, max_cluster_size: int | None = None, allow_single_cluster: bool | None = None, cluster_selection_method: Literal['eom', 'leaf'] | None = None, cluster_selection_epsilon: float = 0.0, cluster_selection_persistence: float = 0.0)
Performs a post-processing step to detect boundary clusters within HDBSCAN clusters. The process follows [1] but uses the boundary coefficient as distance, rather than centrality.
References
[1] Bot, D. M., Peeters, J., Liesenborgs, J., & Aerts, J. (2023, November). FLASC: A Flare-Sensitive Clustering Algorithm: Extending HDBSCAN* for Detecting Branches in Clusters. arXiv:2311.15887.