dynsight.analysis.info_gain
- dynsight.analysis.info_gain(data, labels, method, base=2.0, n_neigh=4)
Compute the information gained by the clustering.
- Parameters:
data (ndarray[Any, dtype[float64]]) – The dataset over which the clustering is performed. Has shape (n_samples, n_features).
labels (ndarray[Any, dtype[int64]]) – The clustering labels. Has shape (n_samples,).
method (Literal['histo', 'kl']) – How the Shannon entropy is computed. You should use “histo” for discrete variables, and “kl” for continuous variables. If “histo” is chosen, the “n_neigh” arg is irrelevant. See the documentation of the infomeasure package for more details (link in the notes below).
base (float) – The units of measure of the returned value. Use “2” for bits, “np.e” for nats.
n_neigh (int) – The number of neighbors considered in the KL estimator. The default value n_neigh = 4 is recommended in the literature.
- Returns:
The absolute information gain \(H_0 - H_{clust}\)
The relative information gain \((H_0 - H_{clust}) / H_0\)
The Shannon entropy of the initial data \(H_0\)
The Shannon entropy of the clustered data \(H_{clust}\)
- Return type:
Notes
This function uses the infomeasure.entropy() function; see https://infomeasure.readthedocs.io/en/latest/guide/entropy/.
Example
import numpy as np
from dynsight.analysis import info_gain

rng = np.random.default_rng(seed=42)

### Discrete case ###
int_data = rng.integers(low=0, high=4, size=100000)
labels = (int_data < 2).astype(int)
# Only the absolute information gain is kept; the other three
# return values are discarded.
delta_i, *_ = info_gain(
    data=int_data,
    labels=labels,
    method="histo",
)

### Continuous case ###
float_data = rng.random(200000)
labels = (float_data < 0.5).astype(int)
delta_i, *_ = info_gain(
    data=float_data,
    labels=labels,
    method="kl",
)
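
To make the relationship between the four returned quantities concrete, the following is a minimal NumPy-only sketch of the discrete case. It is not dynsight's implementation: the helper names `histo_entropy` and `info_gain_sketch` are hypothetical, the entropy is a simple plug-in (frequency-count) estimate instead of a call to infomeasure, and it assumes that \(H_{clust}\) is the average of the per-cluster entropies weighted by the fraction of samples in each cluster.

```python
import numpy as np


def histo_entropy(values, base=2.0):
    """Plug-in Shannon entropy of a discrete sample, in units set by `base`."""
    _, counts = np.unique(values, return_counts=True)
    probs = counts / counts.sum()
    # Natural-log entropy, converted to the requested base.
    return float(-np.sum(probs * np.log(probs)) / np.log(base))


def info_gain_sketch(data, labels, base=2.0):
    """Hypothetical illustration of the four return values (not dynsight code)."""
    h_0 = histo_entropy(data, base)
    # Assumption: H_clust is the mean of the per-cluster entropies,
    # weighted by the fraction of samples falling in each cluster.
    h_clust = 0.0
    for lab in np.unique(labels):
        mask = labels == lab
        h_clust += mask.mean() * histo_entropy(data[mask], base)
    gain = h_0 - h_clust
    return gain, gain / h_0, h_0, h_clust


rng = np.random.default_rng(seed=42)
int_data = rng.integers(low=0, high=4, size=100000)
labels = (int_data < 2).astype(int)

gain, rel_gain, h_0, h_clust = info_gain_sketch(int_data, labels)
print(gain, rel_gain, h_0, h_clust)
```

With four roughly equiprobable values split into two clusters of two values each, \(H_0\) is close to 2 bits and \(H_{clust}\) is close to 1 bit under this assumption, so the absolute gain is about 1 bit and the relative gain about 0.5. Passing `base=np.e` rescales every entropy by \(\ln 2\), which leaves the relative gain unchanged.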