# K-means refinement

## Introduction

The `kmeans` module implements k-means clustering to refine the results of a co-clustering or tri-clustering calculation. This refinement identifies similarity patterns between co- or tri-clusters. By default, the following features, computed over all elements belonging to the same co- or tri-cluster, are used for the k-means clustering:

1. Mean value;

2. Standard deviation;

3. Minimum value;

4. Maximum value;

5. 5th percentile;

6. 95th percentile.

However, the user can customize the set of statistics computed over the clusters. The implementation, which is based on the scikit-learn package, tests a range of k values and selects the optimal one based on the silhouette coefficient.
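To make the feature step concrete, the following is a minimal NumPy sketch of how the six default statistics could be gathered for each co-cluster. It is an illustration only, not the library's own code: the helper `cluster_features`, the toy matrix, and the row-major ordering of the co-clusters are all assumptions made for this example.

```python
import numpy as np

# Toy data matrix and co-cluster labels (illustrative values only).
Z = np.array([[0., 1., 3.],
              [1., 1., 3.],
              [2., 2., 3.],
              [4., 4., 5.]])
row_clusters = np.array([0, 0, 1, 2])  # one label per row of Z
col_clusters = np.array([0, 0, 1])     # one label per column of Z


def cluster_features(Z, row_clusters, col_clusters, n_row, n_col):
    """Hypothetical helper: one row of six statistics per co-cluster."""
    features = np.full((n_row * n_col, 6), np.nan)
    for r in range(n_row):
        for c in range(n_col):
            # Select all elements belonging to co-cluster (r, c).
            block = Z[np.ix_(row_clusters == r, col_clusters == c)]
            if block.size == 0:
                continue  # empty co-cluster: keep NaN features
            features[r * n_col + c] = [
                block.mean(), block.std(), block.min(), block.max(),
                np.percentile(block, 5), np.percentile(block, 95),
            ]
    return features


features = cluster_features(Z, row_clusters, col_clusters, n_row=3, n_col=2)
print(features.shape)
```

Each row of `features` then plays the role of one sample in the k-means refinement, so co-clusters with similar statistics end up in the same refined cluster.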

## Running the refinement

The k-means refinement should be based on existing co- or tri-clustering results:

```python
import numpy as np

# Data matrix with 4 rows and 3 columns
Z = np.array([[1., 1., 3.],
              [1., 1., 3.],
              [2., 2., 3.],
              [4., 4., 5.]])
row_clusters = np.array([0, 0, 1, 2])  # one label per row, 3 clusters
col_cluster = np.array([0, 0, 1])  # one label per column, 2 clusters
```

One can then set up `KMeans` in the following way:

```python
from cgc.kmeans import KMeans

km = KMeans(
    Z,
    clusters=(row_clusters, col_cluster),
    nclusters=(3, 2),  # number of row and column clusters
    k_range=range(2, 5),
    kmeans_kwargs={'init': 'random', 'n_init': 100},
    output_filename='results.json'  # JSON file where to write output
)
```

Here `k_range` is the range of `k` values to investigate. If it is not provided, a default range is set up, from 2 to a fraction of the number of co- or tri-clusters; the optional `max_k_ratio` argument controls that fraction (see API). `kmeans_kwargs` contains input arguments passed on to the scikit-learn `KMeans` object upon initialization (here we define the initialization procedure). With the optional argument `statistics`, the user can define a custom set of statistics employed for the k-means refinement (see the API).

The `compute` function is then called to run the k-means refinement:

```python
results = km.compute()
```

## Results

The optimal `k` value and the refined cluster averages, computed over all elements assigned to the co- or tri-clusters, are stored in the returned `KMeansResults` object:

```python
results.k_value
results.cluster_averages
```

## API

class cgc.kmeans.KMeans(Z, clusters, nclusters, k_range=None, max_k_ratio=0.8, kmeans_kwargs=None, statistics=None, output_filename='')

Perform a clustering refinement using k-means.

A set of statistics is computed for all co- or tri-clusters, then these clusters are in turn grouped using k-means. K-means clustering is performed for multiple k values, then the optimal value is selected on the basis of the silhouette coefficient.

Parameters
• Z (numpy.ndarray or dask.array.Array) – Data array (N dimensions).

• clusters (tuple, list, or numpy.ndarray) – Iterable with length N. It should contain the cluster labels for each dimension, following the same ordering as for Z.

• nclusters (tuple, list, or numpy.ndarray) – Iterable with length N. It should contain the number of clusters in each dimension, following the same ordering as for Z.

• k_range (tuple, list, or numpy.ndarray, optional) – Range of k values to test. Default from 2 to a fraction of the number of non-empty clusters (see max_k_ratio).

• max_k_ratio (float, optional) – If k_range is not provided, test all k values from 2 to max_k_ratio*max_k, where max_k is the number of non-empty co- or tri-clusters. It is ignored if k_range is given. Defaults to 0.8.

• kmeans_kwargs (dict, optional) – Arguments passed on when initializing scikit-learn’s KMeans object.

• statistics (tuple or list, optional) – Statistics to be computed over the clusters, which are then used to refine them. They are provided as an iterable of callable functions, with optional keyword arguments. For example: [(func1, {'kwarg1': val1, …}), (func2, {'kwarg2': val2, …}), …]. See cgc.kmeans.DEFAULT_STATISTICS for the default statistics, and cgc.utils.calculate_cluster_feature for input function requirements.

• output_filename (str, optional) – Name of the file where to write the results.

Example

```python
>>> import numpy as np
>>> from cgc.kmeans import KMeans
>>> Z = np.array([[4, 4, 1, 1], [4, 4, 1, 1], [2, 2, 3, 3], [2, 2, 3, 3],
...               [2, 2, 3, 3]])
>>> clusters = [np.array([0, 0, 1, 1, 1]), np.array([0, 0, 1, 1])]
>>> km = KMeans(Z=Z,
...             clusters=clusters,
...             nclusters=[2, 2],
...             k_range=range(2, 4),
...             kmeans_kwargs={"max_iter": 100})
```
compute(recalc_statistics=False)

Compute statistics for each clustering group. Then loop through the range of k values and compute the average silhouette measure for each k value. Finally, select the k with the maximum silhouette measure.

Parameters

recalc_statistics (bool, optional) – If True, always recompute statistics.

Returns

K-means results.

Type

cgc.kmeans.KMeansResults
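The selection loop described for `compute` can be sketched directly with scikit-learn. This is a simplified illustration under stated assumptions, not the library's implementation: the helper `select_k` and the toy feature matrix `X` are hypothetical, and `KMeans` here is scikit-learn's class, not `cgc.kmeans.KMeans`.

```python
import numpy as np
from sklearn.cluster import KMeans  # scikit-learn's KMeans, not cgc's
from sklearn.metrics import silhouette_score

# Toy feature matrix: one row per co-cluster, two well-separated groups.
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [10., 10.], [10., 11.], [11., 10.]])


def select_k(X, k_range, **kmeans_kwargs):
    """Hypothetical helper: fit k-means for each k, keep the best silhouette."""
    scores = {}
    for k in k_range:
        model = KMeans(n_clusters=k, **kmeans_kwargs).fit(X)
        # Mean silhouette coefficient over all samples for this k.
        scores[k] = silhouette_score(X, model.labels_)
    best_k = max(scores, key=scores.get)
    return best_k, scores


best_k, scores = select_k(X, range(2, 5), n_init=10, random_state=0)
print(best_k, scores)
```

Note that the silhouette coefficient is only defined when the number of clusters is at least 2 and smaller than the number of samples, which is consistent with the tested range starting at 2.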

class cgc.kmeans.KMeansResults(**input_parameters)

Contains results and metadata of a k-means refinement calculation.

Variables
• k_value – Optimal K value (value with maximum silhouette score).

• labels – Refined cluster labels. It is a 2D (for co-clustering) or 3D (for tri-clustering) array, with the shape of nclusters. The value at location (band, row, column) represents the refined cluster label of the corresponding band/row/column cluster combination.

• inertia – List of inertia values for all tested k values.

• measure_list – List of silhouette coefficients for all tested k values.

• cluster_averages – Refined cluster averages. They are computed as means over all elements of the co-/tri-clusters assigned to the refined clusters. Initially empty clusters are assigned NaN values.
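Since `labels` is indexed by cluster rather than by data element, a common follow-up step is to map the refined labels back onto the original data grid. The sketch below shows one way to do this with NumPy indexing for the co-clustering case. The `labels` array here is a made-up example with shape `nclusters = (3, 2)`, not actual output of the library.

```python
import numpy as np

# Original co-cluster assignments of the rows and columns of the data matrix.
row_clusters = np.array([0, 0, 1, 2])
col_clusters = np.array([0, 0, 1])

# Hypothetical refined labels, one per (row cluster, column cluster) pair.
labels = np.array([[0, 1],
                   [0, 1],
                   [2, 2]])

# Broadcast the two label vectors to look up the refined label of every
# element of the data matrix.
refined = labels[row_clusters[:, np.newaxis], col_clusters[np.newaxis, :]]
print(refined.shape)
```

The resulting array has the same shape as the data matrix, so it can be plotted or masked alongside it. For tri-clustering, the same idea would extend to a third index for the band clusters.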