API Reference

class fast_hbcc.HBCC(num_hops: int = 2, min_samples: int = 5, min_cluster_size: int = 25, hop_type: Literal['manifold', 'metric'] = 'manifold', boundary_connectivity: Literal['knn', 'core'] = 'knn', boundary_use_reachability: bool = True, cluster_selection_method: Literal['eom', 'leaf'] = 'eom', allow_single_cluster: bool = False, max_cluster_size: float = inf, cluster_selection_epsilon: float = 0.0, cluster_selection_persistence: float = 0.0, ss_algorithm: Literal['bc', 'bc_simple'] = 'bc', num_jobs: int = 0)

A scikit-learn-style estimator for computing Hierarchical Boundary Coefficient Clustering (HBCC). The algorithm is inspired by the work of Peng et al. [4], uses algorithms from Vandaele et al. [2], Campello et al. [1], and McInnes and Healy [3], and relies on code from fast_hdbscan.

The algorithm contains the following steps:

Compute the k nearest neighbors and a minimum spanning tree.
Compute the boundary coefficient for each point.
Compute a minimum spanning tree from the boundary-coefficient-weighted union of the kNN and MST graphs.
Compute the HDBSCAN cluster hierarchy and select clusters.
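The first step can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: it assumes, as in HDBSCAN, that the minimum spanning tree is built over mutual reachability distances, and it assumes the core distance excludes the point itself. The helper names (core_distances, mutual_reachability, minimum_spanning_tree) are hypothetical.

```python
import math

def core_distances(points, min_samples):
    """Distance from each point to its min_samples-th nearest other point.

    Assumption: the point itself is not counted as a neighbor.
    """
    cores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        cores.append(dists[min_samples - 1])
    return cores

def mutual_reachability(points, cores, i, j):
    """Standard HDBSCAN mutual reachability distance between points i and j."""
    return max(cores[i], cores[j], math.dist(points[i], points[j]))

def minimum_spanning_tree(points, cores):
    """Prim's algorithm over the complete mutual reachability graph."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge weight linking each point to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the point outside the tree that is cheapest to attach.
        u = min((k for k in range(n) if not in_tree[k]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        # Relax the edges from the newly added point.
        for v in range(n):
            if not in_tree[v]:
                w = mutual_reachability(points, cores, u, v)
                if w < best[v]:
                    best[v], parent[v] = w, u
    return edges

# Two well-separated groups of three points each.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
          (10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]
cores = core_distances(points, min_samples=2)
mst = minimum_spanning_tree(points, cores)
```

The resulting tree has one edge fewer than there are points, and a single long edge bridges the two groups. This brute-force version is quadratic in the number of points and is meant only to make the definitions concrete.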

Parameters:

num_hops: int, default=2: The number of hops used to expand the boundary coefficient graph.
min_samples: int, default=5: Core distance is computed as the min_samples-nearest neighbor distance.
min_cluster_size: int, default=25: The minimum number of samples in a cluster.
hop_type: 'manifold' or 'metric', default='manifold': The type of hop expansion to use. 'manifold' adds edge distances on traversal, 'metric' computes the distance between visited points.
boundary_connectivity: 'knn' or 'core', default='knn': Which graph to compute the boundary coefficient on. 'knn' uses the k-nearest neighbors graph, 'core' uses the union of the kNN and MST graphs.
boundary_use_reachability: bool, default=True: Whether to use mutual reachability distances (True) or raw distances (False) for the boundary coefficient computation.
cluster_selection_method: 'eom' or 'leaf', default='eom': HDBSCAN cluster selection strategy.
allow_single_cluster: bool, default=False: HDBSCAN cluster selection parameter controlling whether a single cluster may be selected.
max_cluster_size: float, default=np.inf: HDBSCAN cluster selection parameter limiting the maximum cluster size.
cluster_selection_epsilon: float, default=0.0: HDBSCAN cluster selection parameter controlling minimum cluster death distance.
cluster_selection_persistence: float, default=0.0: HDBSCAN cluster selection parameter controlling minimum cluster persistence.
ss_algorithm: 'bc' or 'bc_simple', default='bc': HDBSCAN cluster selection parameter controlling the semi-supervised strategy.
num_jobs: int, default=0: The number of threads to use for the computation. Zero means using all threads. Negative values indicate all but that number of threads.
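The difference between the two hop_type values can be shown with a small pure-Python sketch. It is an illustration under stated assumptions, not the library's code: the function expand_hops and its arguments are hypothetical, and the neighbor graph is given as a plain adjacency list.

```python
import math

def expand_hops(points, neighbors, source, num_hops, hop_type):
    """Return {reached point: distance from source} within num_hops hops.

    'manifold' accumulates edge lengths along the traversed path;
    'metric' measures the direct distance from the source to each reached point.
    """
    reached = {source: 0.0}
    frontier = {source: 0.0}
    for _ in range(num_hops):
        next_frontier = {}
        for u, dist_u in frontier.items():
            for v in neighbors[u]:
                if hop_type == "manifold":
                    d = dist_u + math.dist(points[u], points[v])
                else:  # "metric"
                    d = math.dist(points[source], points[v])
                # Keep a point only if this is the shortest distance seen so far.
                if d < reached.get(v, math.inf):
                    reached[v] = d
                    next_frontier[v] = d
        frontier = next_frontier
    return reached

# Three points on a right angle, connected as a path 0 - 1 - 2.
points = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
neighbors = {0: [1], 1: [0, 2], 2: [1]}

manifold = expand_hops(points, neighbors, source=0, num_hops=2, hop_type="manifold")
metric = expand_hops(points, neighbors, source=0, num_hops=2, hop_type="metric")
```

With two hops, point 2 is reached from point 0 in both modes. The manifold distance is the path length (1 + 1), while the metric distance is the straight-line distance between the two points (the square root of 2). With a single hop, point 2 is not reached at all.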

Attributes:

labels_: numpy.ndarray, shape (n_samples,): The computed cluster labels.
probabilities_: numpy.ndarray, shape (n_samples,): The computed cluster probabilities.
boundary_coefficient_: numpy.ndarray, shape (n_samples,): The computed boundary coefficient for each point.
condensed_tree_: hdbscan.plots.CondensedTree: The condensed cluster hierarchy used to generate clusters.
single_linkage_tree: hdbscan.plots.SingleLinkageTree: The single linkage tree produced during clustering in scipy hierarchical clustering format (see http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html).
min_spanning_tree_: hdbscan.plots.MinimumSpanningTree: The minimum spanning tree as an edge list. If gen_min_span_tree was False this will be None.

References

[1]

Campello, R. J., Moulavi, D., & Sander, J. (2013, April). Density-based clustering based on hierarchical density estimates. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 160-172). Springer Berlin Heidelberg.

[2]

Vandaele, R., Saeys, Y., & De Bie, T. (2019). The Boundary Coefficient: a Vertex Measure for Visualizing and Finding Structure in Weighted Graphs. 15th International Workshop on Mining and Learning with Graphs (MLG).

[3]

McInnes, L., & Healy, J. (2017). Accelerated Hierarchical Density Based Clustering. 2017 IEEE International Conference on Data Mining Workshops (ICDMW), 33–42. https://doi.org/10.1109/ICDMW.2017.12

[4]

Peng, D., Gui, Z., Wang, D., Ma, Y., Huang, Z., Zhou, Y., & Wu, H. (2022). Clustering by measuring local direction centrality for data with heterogeneous density and weak connectivity. Nature Communications, 13(1), 1–14. https://doi.org/10.1038/s41467-022-33136-9.

property approximation_graph_: See ApproximationGraph for documentation.

fit(X, y=None, sample_weight=None, **fit_params)

Computes the Hierarchical Boundary Coefficient Clustering (HBCC).

Parameters:

X: float[:, ::1]: The data to cluster.
y: int[::1], optional: Datapoint labels for semi-supervised clustering.
**fit_params: dict: Ignored.

Returns:

self: HBCC: The fitted estimator.

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → HBCC

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
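The four request options can be modelled with a short sketch. This is a simplified stand-in for the behaviour described above, not scikit-learn's routing code: the function resolve_fit_metadata is hypothetical, and it raises a plain ValueError where scikit-learn raises its own error type.

```python
def resolve_fit_metadata(request, provided):
    """Decide which keyword arguments a meta-estimator would forward to fit.

    request  -- the value given to set_fit_request(sample_weight=...)
    provided -- metadata the caller handed to the meta-estimator, as a dict
    """
    if request is True:
        # Requested: forward it if the caller supplied it, otherwise ignore.
        if "sample_weight" in provided:
            return {"sample_weight": provided["sample_weight"]}
        return {}
    if request is False:
        # Not requested: never forwarded.
        return {}
    if request is None:
        # Not requested, and supplying it anyway is an error.
        if "sample_weight" in provided:
            raise ValueError("sample_weight was provided but not requested")
        return {}
    if isinstance(request, str):
        # Alias: the caller supplies the metadata under a different name.
        if request in provided:
            return {"sample_weight": provided[request]}
        return {}
    raise TypeError("request must be True, False, None, or a string alias")

weights = [0.5, 1.0, 2.0]
forwarded = resolve_fit_metadata(True, {"sample_weight": weights})
aliased = resolve_fit_metadata("hbcc_weights", {"hbcc_weights": weights})
dropped = resolve_fit_metadata(False, {"sample_weight": weights})
```

In this model, forwarded and aliased both hand the weights to fit under the name sample_weight, while dropped is empty. The alias name hbcc_weights is made up for the example.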

Parameters:

sample_weight: str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED: Metadata routing for sample_weight parameter in fit.

Returns:

self: object: The updated object.

class fast_hbcc.BoundaryClusterDetector(num_hops: int = 2, hop_type: Literal['manifold', 'metric'] = 'manifold', boundary_connectivity: Literal['knn', 'core'] = 'knn', boundary_use_reachability: bool = True, min_cluster_size: int | None = None, max_cluster_size: int | None = None, allow_single_cluster: bool | None = None, cluster_selection_method: Literal['eom', 'leaf'] | None = None, cluster_selection_epsilon: float = 0.0, cluster_selection_persistence: float = 0.0)

Performs a post-processing step to detect boundary clusters within HDBSCAN clusters. The process follows [1] but uses the boundary coefficient as distance, rather than centrality.

References

[1]

Bot, D. M., Peeters, J., Liesenborgs, J., & Aerts, J. (2023, November). FLASC: A Flare-Sensitive Clustering Algorithm: Extending HDBSCAN* for Detecting Branches in Clusters. arXiv:2311.15887.