dynsight.data_processing.cleaning_cluster_population

dynsight.data_processing.cleaning_cluster_population(labels, threshold, assigned_env, excluded_env=None)

Replace labels of low-population clusters with a reference label.

This function identifies clusters whose relative population is at or below a given threshold and reassigns their labels to a specified environment. The population of each cluster is computed as the fraction of elements carrying that label. The input can be 2D, with shape (n_atoms, n_frames), or 3D, with shape (n_atoms, n_frames, n_dims), where n_dims can correspond to the different ∆t values from Onion clustering. Clusters with a population smaller than or equal to the threshold are considered negligible and are replaced by the assigned_env label, while all other labels are preserved. The optional excluded_env argument exempts specific clusters from the re-labeling.
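The re-labeling rule described above can be sketched in plain NumPy. This is an illustrative re-implementation, not dynsight's actual code: the helper name relabel_small_clusters is hypothetical, and the sketch assumes that populations are computed as fractions of the whole array, which may differ from how the library treats the n_dims axis of 3D inputs.

```python
import numpy as np


def relabel_small_clusters(labels, threshold, assigned_env, excluded_env=None):
    """Hypothetical sketch of the re-labeling rule; not dynsight's code."""
    labels = np.asarray(labels)
    if labels.ndim not in (2, 3):
        raise ValueError("labels must be a 2D or 3D array")

    # Normalise excluded_env (int, list of int, or None) to a set of labels.
    if excluded_env is None:
        excluded = set()
    elif isinstance(excluded_env, (int, np.integer)):
        excluded = {int(excluded_env)}
    else:
        excluded = {int(e) for e in excluded_env}

    # Assumption: the population of a label is its fraction of ALL elements.
    unique, counts = np.unique(labels, return_counts=True)
    fractions = counts / labels.size

    # Labels at or below the threshold are negligible, unless excluded.
    small = [
        int(lab)
        for lab, frac in zip(unique, fractions)
        if frac <= threshold and int(lab) not in excluded
    ]

    # Work on a copy so the input array is left untouched.
    cleaned = labels.copy()
    cleaned[np.isin(labels, small)] = assigned_env
    return cleaned


# Populations: label 0 -> 0.7, label 1 -> 0.1, label 2 -> 0.2
demo = np.array([
    [0, 0, 0, 0, 1],
    [0, 0, 0, 2, 2],
])

# Only label 1 is at or below 10%, so it becomes 99.
print(relabel_small_clusters(demo, threshold=0.1, assigned_env=99))

# Labels 1 and 2 are at or below 20%; label 2 is excluded, and label 1
# is merged into the existing label 0.
print(relabel_small_clusters(demo, threshold=0.2, assigned_env=0, excluded_env=2))
```

In the second call, assigned_env matches an existing label, so the reassigned elements simply join that cluster, as described for the assigned_env parameter.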

Parameters:
  • labels (NDArray[np.int64]) – NumPy array containing the label values. The array should have dimensions corresponding to either (n_atoms, n_frames) for 2D inputs, or (n_atoms, n_frames, n_dims) for 3D inputs.

  • threshold (float) – Population fraction, between 0 and 1, at or below which a cluster is considered negligible and re-labeled.

  • assigned_env (int) – The label to which small clusters are reassigned. If this label already exists in labels, the reassigned elements are merged into the existing cluster.

  • excluded_env (int | list[int] | None) – Label or labels of clusters to preserve even if their population is at or below the threshold.

Returns:

A NumPy array of the same shape as the input labels array, containing the updated labels: a 2D input (n_atoms, n_frames) yields a 2D output, and a 3D input (n_atoms, n_frames, n_dims) yields a 3D output. The labels of larger clusters are unaffected by the re-labeling.

Raises:

ValueError – If the input labels array does not have 2 or 3 dimensions.

Return type:

NDArray[np.int64]

Example

from dynsight.data_processing import cleaning_cluster_population
import numpy as np

original_labels = np.load('labels_array.npy')

cleaned_labels = cleaning_cluster_population(
    labels=original_labels,
    threshold=0.1,
    assigned_env=99,
)

In this example, the labels of clusters whose population is at or below 10% of original_labels are replaced with label 99. The result is stored in cleaned_labels, a NumPy array with the same shape as original_labels.