Kmeans refinement
Introduction
The Kmeans module implements k-means clustering to refine the results of a coclustering or triclustering calculation. This k-means refinement identifies similarity patterns between co- or triclusters. The following predefined features, computed over all elements belonging to the same co- or tricluster, are employed for the k-means clustering:
 Mean value;
 Standard deviation;
 Minimum value;
 Maximum value;
 5th percentile;
 95th percentile;
The implementation, which is based on the scikit-learn package, tests a range of k values and selects the optimal one based on the Silhouette coefficient.
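The per-cluster features listed above can be sketched with plain NumPy. This is a minimal illustration, not the package's implementation: the helper name `cocluster_features`, the toy data, the feature ordering, and the choice to leave empty co-clusters as NaN are all assumptions made for the example.

```python
import numpy as np

# Toy data matrix and co-clustering labels (illustrative values only).
Z = np.array([[1., 1., 3.],
              [1., 1., 3.],
              [2., 2., 3.],
              [4., 4., 5.]])
row_clusters = np.array([0, 0, 1, 2])  # 3 row clusters
col_clusters = np.array([0, 0, 1])     # 2 column clusters


def cocluster_features(Z, row_clusters, col_clusters, n_row, n_col):
    """Hypothetical helper: (n_row * n_col, 6) array of per-co-cluster features."""
    features = np.full((n_row * n_col, 6), np.nan)
    for r in range(n_row):
        for c in range(n_col):
            # Select all elements belonging to co-cluster (r, c).
            block = Z[np.ix_(row_clusters == r, col_clusters == c)]
            if block.size == 0:
                continue  # assumed handling: empty co-clusters stay NaN
            features[r * n_col + c] = [
                block.mean(), block.std(), block.min(), block.max(),
                np.percentile(block, 5), np.percentile(block, 95),
            ]
    return features


features = cocluster_features(Z, row_clusters, col_clusters, 3, 2)
```

Each row of `features` then describes one co-cluster, and these rows are the points that a k-means step would group.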
Running the refinement
The kmeans refinement should be based on existing co or triclustering results:
import numpy as np

Z = np.array([[1., 1., 3.],
              [1., 1., 3.],
              [2., 2., 3.],
              [4., 4., 5.]])  # 4 rows, 3 columns
row_clusters = np.array([0, 0, 1, 2])  # one label per row, 3 clusters
col_cluster = np.array([0, 0, 1])  # one label per column, 2 clusters
One can then set up Kmeans in the following way:
from cgc.kmeans import Kmeans

km = Kmeans(
    Z,
    clusters=(row_clusters, col_cluster),
    nclusters=(3, 2),
    k_range=range(2, 5),
    kmean_max_iter=100,
    output_filename='results.json'  # JSON file where to write output
)
Here k_range is the range of k values to investigate. If not provided, a sensible range is set up (from 2 to a fraction of the number of co- or triclusters; the optional max_k_ratio argument allows for additional control, see API). kmean_max_iter is the maximum number of iterations employed for the k-means clustering.
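To make the default range concrete, the following sketch derives a k range from the number of non-empty clusters and max_k_ratio, following the description in the API section. The helper name `default_k_range` is hypothetical, and both the truncation to an integer and the inclusive upper bound are assumptions; the package may round differently.

```python
def default_k_range(n_nonempty_clusters, max_k_ratio=0.8):
    """Hypothetical helper: k values from 2 up to max_k_ratio * max_k."""
    # Assumption: the product is truncated to an integer.
    max_k = int(max_k_ratio * n_nonempty_clusters)
    # Assumption: the upper bound is included in the range.
    return range(2, max_k + 1)


print(list(default_k_range(6, max_k_ratio=0.5)))  # → [2, 3]
```

Passing k_range explicitly, as in the example above, bypasses this default altogether.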
The compute function is then called to run the kmeans refinement:
results = km.compute()
Results
The optimal k value and the refined cluster averages, computed over all elements assigned to the co and triclusters, are stored in the KmeansResults object:
results.k_value
results.cluster_averages
API

class cgc.kmeans.Kmeans(Z, clusters, nclusters, k_range=None, max_k_ratio=0.8, kmean_max_iter=100, statistics=None, output_filename='')
Perform a clustering refinement using kmeans.
A set of statistics is computed for all co or triclusters, then these clusters are in turn grouped using kmeans. Kmeans clustering is performed for multiple k values, then the optimal value is selected on the basis of the Silhouette coefficient.
Parameters:  Z (numpy.ndarray or dask.array.Array) – Data array (N dimensions).
 clusters (tuple, list, or numpy.ndarray) – Iterable with length N. It should contain the cluster labels for each dimension, following the same ordering as for Z.
 nclusters (tuple, list, or numpy.ndarray) – Iterable with length N. It should contain the number of clusters in each dimension, following the same ordering as for Z.
 k_range (tuple, list, or numpy.ndarray, optional) – Range of k values to test. Defaults to a range from 2 to a fraction of the number of nonempty clusters (see max_k_ratio).
 max_k_ratio (float, optional) – If k_range is not provided, test all k values from 2 to max_k_ratio*max_k, where max_k is the number of nonempty co or triclusters. It is ignored if k_range is given. Defaults to 0.8.
 kmean_max_iter (int, optional) – Maximum number of iterations of kmeans.
 statistics (tuple or list, optional) – Statistics to be computed over the clusters, which are then used to refine them. These are provided as an iterable of callable functions, with optional keyword arguments. For example: [(func1, {'kwarg1': val1, …}), (func2, {'kwarg2': val2, …}), …]. See cgc.kmeans.DEFAULT_STATISTICS for the default statistics, and cgc.utils.calculate_cluster_feature for input function requirements.
 output_filename (str, optional) – Name of the file where to write the results.
Example:
>>> import numpy as np
>>> from cgc.kmeans import Kmeans
>>> Z = np.array([[4, 4, 1, 1], [4, 4, 1, 1], [2, 2, 3, 3], [2, 2, 3, 3], [2, 2, 3, 3]])
>>> clusters = [np.array([0, 0, 1, 1, 1]), np.array([0, 0, 1, 1])]
>>> km = Kmeans(Z=Z, clusters=clusters, nclusters=[2, 2], k_range=range(2, 4), kmean_max_iter=2)
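The (function, keyword-arguments) layout described for the statistics parameter can be illustrated as follows. This sketch only shows how such pairs could be applied to the values of one cluster. It assumes each function receives the cluster values as a flat array, which may not match the actual requirements documented in cgc.utils.calculate_cluster_feature, and the values and the chosen statistics are made up.

```python
import numpy as np

# Same (function, keyword-arguments) layout as described for `statistics`.
statistics = [
    (np.mean, {}),
    (np.percentile, {"q": 25}),
    (np.percentile, {"q": 75}),
]

# Values of one hypothetical co-cluster, flattened (assumed input form).
cluster_values = np.array([1., 2., 3., 4., 5.])

# Apply every statistic in turn to build the feature vector of this cluster.
feature_vector = np.array(
    [func(cluster_values, **kwargs) for func, kwargs in statistics]
)
```

A list built this way has the shape expected by the statistics argument. Whether a given function is acceptable depends on the requirements referenced above.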

compute(recalc_statistics=False)
Compute statistics for each clustering group. Then loop through the range of k values, and compute the averaged Silhouette measure of each k value. Finally select the k with the maximum Silhouette measure.
Parameters: recalc_statistics (bool, optional) – If True, always recompute statistics.
Returns: Kmeans results.
Type: cgc.kmeans.KmeansResults
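The selection loop described for compute can be sketched directly with scikit-learn. This is an illustrative outline, not the package's code: the feature matrix is made up, and the settings `n_init=10` and `random_state=0` are assumptions chosen to make the example reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy feature matrix: one row per non-empty co-cluster (made-up values).
features = np.array([[0.0, 0.0], [0.1, 0.0],
                     [5.0, 5.0], [5.1, 5.0],
                     [10.0, 0.0], [10.1, 0.0]])

k_range = range(2, 5)
scores = []
for k in k_range:
    # Fit k-means for this k and score the resulting partition.
    model = KMeans(n_clusters=k, max_iter=100, n_init=10, random_state=0)
    labels = model.fit_predict(features)
    scores.append(silhouette_score(features, labels))

# Keep the k with the highest average Silhouette coefficient.
best_k = k_range[int(np.argmax(scores))]
```

With three well-separated pairs of points, k = 3 gives the highest score here. In this sketch, `scores` plays the role that measure_list has in the results object.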

class cgc.kmeans.KmeansResults(**input_parameters)
Contains results and metadata of a kmeans refinement calculation.
Variables:  k_value – Optimal k value (value with maximum Silhouette score).
 labels – Refined cluster labels. It is a 2D (for coclustering) or 3D (for triclustering) array, with the shape of nclusters. The value at location (band, row, column) represents the refined cluster label of the corresponding band/row/column cluster combination.
 measure_list – List of Silhouette coefficients for all tested k values.
 cluster_averages – Refined cluster averages. They are computed as means over all elements of the co/triclusters assigned to the refined clusters. Initially empty clusters are assigned NaN values.
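Since labels is indexed by cluster combination rather than by data element, a lookup step is needed to see which refined cluster each element belongs to. The sketch below does this for a coclustering case with NumPy indexing. The `labels` array is a hypothetical result, not the output of an actual run, and the final loop is only one possible reading of how averages over refined clusters are formed.

```python
import numpy as np

Z = np.array([[1., 1., 3.],
              [1., 1., 3.],
              [2., 2., 3.],
              [4., 4., 5.]])
row_clusters = np.array([0, 0, 1, 2])  # row-cluster label of each row
col_clusters = np.array([0, 0, 1])     # column-cluster label of each column

# Hypothetical refined labels, shaped like nclusters = (3, 2).
labels = np.array([[0, 1],
                   [0, 1],
                   [1, 1]])

# Refined label of every element of Z, via broadcasting of the two label arrays.
element_labels = labels[row_clusters[:, np.newaxis], col_clusters[np.newaxis, :]]

# Mean of all elements assigned to each refined cluster.
refined_means = {int(k): float(Z[element_labels == k].mean())
                 for k in np.unique(labels)}
```

For triclustering, the same indexing extends to three label arrays, one per dimension of Z.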