dynsight.analysis.info_gain

dynsight.analysis.info_gain(data, labels, method, base=2.0, n_neigh=4)

Compute the information gained by the clustering.

Parameters:
  • data (ndarray[Any, dtype[float64]]) – The dataset over which the clustering is performed. Has shape (n_samples, n_features).

  • labels (ndarray[Any, dtype[int64]]) – The clustering labels. Has shape (n_samples,).

  • method (Literal['histo', 'kl']) – The estimator used to compute the Shannon entropy. Use “histo” for discrete variables and “kl” for continuous variables. If “histo” is chosen, the “n_neigh” argument has no effect. See the documentation of the infomeasure package for more details (link in the Notes below).

  • base (float) – The base of the logarithm, which sets the units of the returned entropy values. Use “2” for bits, “np.e” for nats.

  • n_neigh (int) – The number of nearest neighbors used by the Kozachenko-Leonenko (“kl”) estimator. The default value n_neigh = 4 is recommended in the literature.

Returns:

  • The absolute information gain \(H_0 - H_{clust}\)

  • The relative information gain \((H_0 - H_{clust}) / H_0\)

  • The Shannon entropy of the initial data \(H_0\)

  • The Shannon entropy of the clustered data \(H_{clust}\)

Return type:

tuple[float, float, float, float]
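To make the four returned quantities concrete, here is a minimal NumPy-only sketch of the discrete case. It is not the dynsight implementation. It assumes that \(H_{clust}\) is the average of the per-cluster entropies weighted by the fraction of samples in each cluster, and it uses a plain plug-in (histogram) entropy instead of infomeasure. The names shannon_entropy and info_gain_sketch are hypothetical helpers introduced only for this illustration.

```python
import numpy as np


def shannon_entropy(values, base=2.0):
    """Plug-in (histogram) Shannon entropy of a discrete 1-D sample."""
    _, counts = np.unique(values, return_counts=True)
    probs = counts / counts.sum()
    # Change of base: log_b(p) = ln(p) / ln(b).
    return float(-np.sum(probs * np.log(probs)) / np.log(base))


def info_gain_sketch(data, labels, base=2.0):
    """Illustrative only: assumes H_clust is the weighted mean of cluster entropies."""
    h_0 = shannon_entropy(data, base)
    h_clust = 0.0
    for label in np.unique(labels):
        mask = labels == label
        # mask.mean() is the fraction of samples assigned to this cluster.
        h_clust += mask.mean() * shannon_entropy(data[mask], base)
    gain = h_0 - h_clust
    return gain, gain / h_0, h_0, h_clust


# Balanced toy sample: four equally likely values, split into two clusters.
data = np.tile(np.arange(4), 25)
labels = (data < 2).astype(int)
gain, rel_gain, h_0, h_clust = info_gain_sketch(data, labels)
# Here h_0 = 2 bits, h_clust = 1 bit, gain = 1 bit, rel_gain = 0.5
# (up to floating-point rounding).
```

In this toy case each cluster removes one of the two bits needed to identify a value, so the clustering recovers half of the initial entropy. The sketch does not cover the continuous “kl” case, where the entropies are differential and can be zero or negative.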

Notes

This function relies on the infomeasure.entropy() function; see https://infomeasure.readthedocs.io/en/latest/guide/entropy/ for details.

Example

import numpy as np
from dynsight.analysis import info_gain
rng = np.random.default_rng(seed=42)

### Discrete case ###
int_data = rng.integers(low=0, high=4, size=100000)
labels = (int_data < 2).astype(int)
delta_i, *_ = info_gain(
    data=int_data,
    labels=labels,
    method="histo",
)

### Continuous case ###
float_data = rng.random(200000)
labels = (float_data < 0.5).astype(int)
delta_i, *_ = info_gain(
    data=float_data,
    labels=labels,
    method="kl",
)