The triclustering module provides a generalization of the co-clustering algorithm to three-dimensional arrays (see
Ref. ). For geospatial data, tri-clustering analyses allow extending the search for similarity patterns in
data cubes, thus accounting for an extra dimension (the ‘band’ dimension) in addition to space and time.
Setup the Analysis
The tri-clustering analysis of a three-dimensional array Z:
import numpy as np

Z = np.array([[[1., 1., 2., 4.], [1., 1., 2., 4.]],
              [[5., 5., 8., 8.], [5., 5., 8., 8.]],
              [[6., 7., 8., 9.], [6., 7., 9., 8.]]])
is set up by creating an instance of Triclustering:
from cgc.triclustering import Triclustering

tc = Triclustering(
    Z,  # data array (3D)
    nclusters_row=4,  # number of row clusters
    nclusters_col=3,  # number of column clusters
    nclusters_bnd=2,  # number of band clusters
    max_iterations=100,  # maximum number of iterations
    conv_threshold=1.e-5,  # error convergence threshold
    nruns=10,  # number of differently-initialized runs
    output_filename='results.json'  # JSON file where to write output
)
The input arguments of Triclustering are identical to the Coclustering ones (see Co-clustering); nclusters_bnd is the only additional argument, which sets the maximum number of clusters along the ‘band’ dimension.
Note that a lower number of clusters can be identified by the algorithm (some of the clusters may remain empty).
The first axis of Z is assumed to represent the ‘band’ dimension.
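Data cubes are sometimes stored with the band axis last, for instance with shape (row, column, band). As a minimal sketch, assuming such a layout (the array `cube` below is hypothetical), the band axis could be moved to the front with numpy.moveaxis before the array is passed to Triclustering:

```python
import numpy as np

# Hypothetical data cube stored as (row, column, band)
cube = np.arange(2 * 4 * 3, dtype="float64").reshape(2, 4, 3)

# Move the last axis (band) to the front: (band, row, column)
Z = np.moveaxis(cube, -1, 0)

print(Z.shape)  # (3, 2, 4)
```

numpy.moveaxis returns a view of the original data where possible, so this reordering does not by itself duplicate the cube in memory.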
As for the co-clustering algorithm (see Co-clustering), multiple runs of the tri-clustering algorithm can be efficiently computed in parallel using threads. In order to run the tri-clustering analysis using 4 threads:
results = tc.run_with_threads(nthreads=4)
As for co-clustering, only one thread is spawned if the nthreads argument is not provided.
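The general pattern of independent, differently-initialized runs distributed over a pool of threads can be sketched with the Python standard library alone. This is only an illustration and not the CGC implementation: `single_run` and `run_all` are hypothetical names, the "error" is a random stand-in for an actual tri-clustering run, and keeping the run with the lowest error is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
import random


def single_run(seed):
    """Stand-in for one randomly initialized run; returns (error, seed)."""
    rng = random.Random(seed)
    return rng.random(), seed


def run_all(nruns, nthreads=1):
    # The runs are independent, so they can be mapped over a thread pool
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        outcomes = list(pool.map(single_run, range(nruns)))
    # Assumption: the run with the lowest error is the one retained
    return min(outcomes)


best_error, best_seed = run_all(nruns=10, nthreads=4)
```

Because every run is seeded independently, the selected run in this sketch does not depend on the number of threads; only the wall-clock time does.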
As for co-clustering, tri-clustering analyses on distributed systems can be carried out using Dask (see also Co-clustering). Once the connection to a Dask cluster is set up:
from dask.distributed import Client

client = Client('tcp://daskscheduler:8786')  # connect to the Dask scheduler
the tri-clustering analysis is carried out as:
results = tc.run_with_dask(client)
If no client is provided as argument, a default LocalCluster is instantiated and used (see Dask docs).
This notebook presents a performance comparison of the two tri-clustering implementations for varying input data size and number of clusters. To test the Dask implementation, we have used a local thread-based cluster with four workers. As for co-clustering, we find the Numpy implementation to be much faster (~2 orders of magnitude) than the Dask implementation for small datasets, where the Dask overhead dominates. However, when the system size becomes sufficiently large and/or the number of clusters is increased, the Dask implementation leads to shorter timings. It is important to stress here as well that the Dask implementation was not designed for improved performance, but to handle large datasets that could not otherwise be tackled due to memory limitations.
The TriclusteringResults object returned by Triclustering.run_with_threads and Triclustering.run_with_dask contains the final row, column, and band cluster assignments (results.row_clusters, results.col_clusters, and results.bnd_clusters, respectively) as well as the approximation error of the tri-clustering (results.error). A few other metadata are also present, including the input parameters employed to set up the analysis.
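As an illustration of how the cluster assignments might be used, the following sketch computes the mean value of each tri-cluster for the example array Z. The assignment arrays are hypothetical stand-ins for results.bnd_clusters, results.row_clusters, and results.col_clusters. It is assumed here that each of them holds one integer label (starting from 0) per element along the corresponding axis.

```python
import numpy as np

Z = np.array([[[1., 1., 2., 4.], [1., 1., 2., 4.]],
              [[5., 5., 8., 8.], [5., 5., 8., 8.]],
              [[6., 7., 8., 9.], [6., 7., 9., 8.]]])

# Hypothetical assignments (not actual output of the algorithm)
bnd_clusters = np.array([0, 1, 1])
row_clusters = np.array([0, 0])
col_clusters = np.array([0, 0, 1, 1])

nb = bnd_clusters.max() + 1
nr = row_clusters.max() + 1
nc = col_clusters.max() + 1

# Mean of the elements of Z falling in each (band, row, column) cluster;
# tri-clusters without any element are left as NaN
means = np.full((nb, nr, nc), np.nan)
for b in range(nb):
    for r in range(nr):
        for c in range(nc):
            block = Z[np.ix_(bnd_clusters == b,
                             row_clusters == r,
                             col_clusters == c)]
            if block.size > 0:
                means[b, r, c] = block.mean()
```

Replacing every element of Z by the mean of its tri-cluster gives a block-wise approximation of the data cube, which is one way to inspect the identified patterns.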
Triclustering(Z, nclusters_row, nclusters_col, nclusters_bnd, conv_threshold=1e-05, max_iterations=1, nruns=1, output_filename='', row_clusters_init=None, col_clusters_init=None, bnd_clusters_init=None)
Perform a tri-clustering analysis for a three-dimensional array.
- Z (numpy.ndarray or dask.array.Array) – Data array for which to run the tri-clustering analysis, with shape (band, row, column).
- nclusters_row (int) – Number of row clusters.
- nclusters_col (int) – Number of column clusters.
- nclusters_bnd (int) – Number of band clusters.
- conv_threshold (float, optional) – Convergence threshold for the objective function.
- max_iterations (int, optional) – Maximum number of iterations.
- nruns (int, optional) – Number of differently-initialized runs.
- output_filename (string, optional) – Name of the JSON file where to write the results.
- row_clusters_init (numpy.ndarray or array_like, optional) – Initial row cluster assignment.
- col_clusters_init (numpy.ndarray or array_like, optional) – Initial column cluster assignment.
- bnd_clusters_init (numpy.ndarray or array_like, optional) – Initial band cluster assignment.
>>> import numpy as np
>>> from cgc.triclustering import Triclustering
>>> Z = np.random.randint(1, 100, size=(6, 10, 8)).astype('float64')
>>> tc = Triclustering(Z, nclusters_row=5, nclusters_col=4, nclusters_bnd=3,
...                    max_iterations=50, nruns=10)
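The three *_init arguments allow a run to start from given cluster assignments. A minimal sketch of how such arrays might be built follows. It assumes that each is a one-dimensional array of integer labels in the range [0, nclusters), with one entry per element along the corresponding axis; this format is an assumption and is not stated in the parameter list above. The value nclusters_bnd=3 is likewise chosen only for illustration.

```python
import numpy as np

nbands, nrows, ncols = 6, 10, 8  # shape of Z: (band, row, column)
nclusters_row, nclusters_col, nclusters_bnd = 5, 4, 3

rng = np.random.default_rng(seed=0)

# One integer label per row, column, and band (assumed to start at 0)
row_clusters_init = rng.integers(nclusters_row, size=nrows)
col_clusters_init = rng.integers(nclusters_col, size=ncols)
bnd_clusters_init = rng.integers(nclusters_bnd, size=nbands)
```

Under the same assumption, these arrays could then be passed to Triclustering through the row_clusters_init, col_clusters_init, and bnd_clusters_init arguments.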
Triclustering.run_with_dask
Run the tri-clustering analysis using Dask.
Parameters: client (dask.distributed.Client, optional) – Dask client. If not specified, the default LocalCluster is employed.
Returns: Tri-clustering results.
Type: cgc.triclustering.TriclusteringResults
Triclustering.run_with_threads
Run the tri-clustering using an algorithm based on Numpy plus threading (only suitable for local runs).
Parameters: nthreads (int, optional) – Number of threads employed to simultaneously run differently-initialized tri-clustering analyses.
Returns: Tri-clustering results.
Type: cgc.triclustering.TriclusteringResults
TriclusteringResults
Contains results and metadata of a tri-clustering calculation.
- row_clusters – Final row cluster assignment.
- col_clusters – Final column cluster assignment.
- bnd_clusters – Final band cluster assignment.
- error – Approximation error of the tri-clustering.
- nruns_completed – Number of successfully completed runs.
- nruns_converged – Number of converged runs.
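Since some clusters may remain empty, the number of clusters actually populated can be lower than the requested one. The sketch below counts the populated clusters from an assignment array. The array is a hypothetical stand-in for results.row_clusters, assumed to contain integer labels starting from 0.

```python
import numpy as np

nclusters_row = 4  # number of row clusters requested

# Hypothetical final row assignment (not actual output of the algorithm)
row_clusters = np.array([0, 2, 2, 0, 3, 2])

# Labels that occur at least once, and how many rows each of them holds
labels, counts = np.unique(row_clusters, return_counts=True)
n_populated = labels.size

# Labels in [0, nclusters_row) that were never assigned
empty = np.setdiff1d(np.arange(nclusters_row), labels)
```

The same check applies to the column and band assignments.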