dynsight.data_processing.cleaning_cluster_population
- dynsight.data_processing.cleaning_cluster_population(labels, threshold, assigned_env, excluded_env=None)
Replace labels of low-population clusters with a reference label.
This function identifies clusters whose relative population is at or below a given threshold and reassigns their labels to a specified environment. The population of each cluster is the fraction of array elements carrying that label. The input can be 2D, with shape (n_atoms, n_frames), or 3D, with shape (n_atoms, n_frames, n_dims), where n_dims can correspond to the different ∆t values from Onion clustering. Clusters with a population smaller than or equal to threshold are considered negligible and are relabeled as assigned_env, while all other labels are preserved. The optional excluded_env argument lists clusters that are excluded from the relabeling even if their population is at or below the threshold.
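The relabeling rule described above can be sketched in plain NumPy for the 2D case. This is a minimal illustration, not dynsight's implementation: the function name relabel_small_clusters_2d and the toy array are hypothetical, and the 3D case is left out because how populations are counted across n_dims is not spelled out here.

```python
import numpy as np


def relabel_small_clusters_2d(labels, threshold, assigned_env, excluded_env=None):
    """Hypothetical 2D-only sketch of the rule; not dynsight's own code."""
    labels = np.asarray(labels)
    if labels.ndim != 2:
        raise ValueError("this sketch only handles (n_atoms, n_frames) arrays")

    # Accept None, a single int, or a list of ints for excluded_env.
    if excluded_env is None:
        excluded = np.array([], dtype=labels.dtype)
    else:
        excluded = np.atleast_1d(excluded_env)

    # Population of each label = fraction of all elements carrying it.
    unique, counts = np.unique(labels, return_counts=True)
    populations = counts / labels.size

    # Labels at or below the threshold, minus the explicitly excluded ones.
    small = unique[(populations <= threshold) & ~np.isin(unique, excluded)]

    # Work on a copy so the input array is left untouched.
    cleaned = labels.copy()
    cleaned[np.isin(labels, small)] = assigned_env
    return cleaned


# Toy input: label 0 fills 9/12 of the array, label 1 fills 1/12, label 2 fills 2/12.
toy = np.array([[0, 0, 0, 1], [0, 0, 2, 2], [0, 0, 0, 0]])
cleaned = relabel_small_clusters_2d(toy, threshold=0.1, assigned_env=99)
kept = relabel_small_clusters_2d(toy, threshold=0.1, assigned_env=99, excluded_env=1)
```

With threshold=0.1 only label 1 (about 8% of the elements) falls at or below the threshold, so it is the only label replaced in cleaned; passing excluded_env=1 leaves the array unchanged.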
- Parameters:
labels (NDArray[np.int64]) – NumPy array containing the label values. The array should have dimensions corresponding to either (n_atoms, n_frames) for 2D inputs, or (n_atoms, n_frames, n_dims) for 3D inputs.
threshold (float) – Population fraction, between 0 and 1, at or below which a cluster is considered negligible and relabeled.
assigned_env (int) – The label assigned to the low-population clusters. If this label already exists, the relabeled elements are merged into the existing cluster.
excluded_env (int | list[int] | None) – Label, or list of labels, of clusters to preserve even if their population is at or below the threshold.
- Returns:
A NumPy array with the same shape as the input labels array, containing the updated labels: 2D (n_atoms, n_frames) for 2D input, 3D (n_atoms, n_frames, n_dims) for 3D input. The labels of clusters above the threshold are unaffected by the relabeling.
- Raises:
ValueError – If the input labels array does not have 2 or 3 dimensions.
- Return type:
NDArray[np.int64]
Example
from dynsight.data_processing import cleaning_cluster_population
import numpy as np

original_labels = np.load('labels_array.npy')

cleaned_labels = cleaning_cluster_population(
    labels=original_labels,
    threshold=0.1,
    assigned_env=99,
)
In this example, the labels of the clusters whose population is at or below 10% of original_labels are replaced with label 99. The result is stored in cleaned_labels, a NumPy array.
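When choosing a threshold, it can help to look at the cluster populations first. The snippet below is a small sketch using only NumPy; the helper cluster_populations and the demo_labels array are hypothetical and are not part of dynsight. The fraction is taken over the whole array passed in, so for a 3D array a single slice such as labels[:, :, i] can be passed instead.

```python
import numpy as np


def cluster_populations(labels):
    """Hypothetical helper: map each label to its fraction of all elements."""
    labels = np.asarray(labels)
    unique, counts = np.unique(labels, return_counts=True)
    return {int(label): int(count) / labels.size for label, count in zip(unique, counts)}


# Synthetic (n_atoms=4, n_frames=5) array: 14 elements of label 0,
# 5 elements of label 1, and 1 element of label 2.
demo_labels = np.array([0] * 14 + [1] * 5 + [2] * 1).reshape(4, 5)

populations = cluster_populations(demo_labels)
for label, fraction in sorted(populations.items()):
    print(f"label {label}: {fraction:.2%}")
```

Running the same helper on the cleaned array afterwards shows which labels were merged into assigned_env.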