Tri-clustering
Introduction
The triclustering module provides a generalization of the co-clustering algorithm to three-dimensional arrays (see Ref. 1). For geospatial data, tri-clustering extends the search for similarity patterns to data cubes, accounting for an extra dimension (the ‘band’ dimension) in addition to space and time.
Set up the Analysis
The tri-clustering analysis of a three-dimensional array Z:
import numpy as np
Z = np.array([[[1., 1., 2., 4.],
               [1., 1., 2., 4.]],
              [[5., 5., 8., 8.],
               [5., 5., 8., 8.]],
              [[6., 7., 8., 9.],
               [6., 7., 9., 8.]]])
is set up by creating an instance of Triclustering:
from cgc.triclustering import Triclustering
tc = Triclustering(
    Z,  # data array (3D)
    nclusters_row=4,  # number of row clusters
    nclusters_col=3,  # number of column clusters
    nclusters_bnd=2,  # number of band clusters
    max_iterations=100,  # maximum number of iterations
    conv_threshold=1.e-5,  # error convergence threshold
    nruns=10,  # number of differently-initialized runs
    output_filename='results.json'  # JSON file where to write output
)
The input arguments of Triclustering are identical to those of Coclustering (see Co-clustering). The only additional argument is nclusters_bnd, which sets the maximum number of clusters along the ‘band’ dimension.
Note that the algorithm may identify fewer clusters than requested (some of the clusters may remain empty).
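Since some clusters may remain empty, it can be useful to check how many clusters are actually populated after a run. The following is a minimal sketch using plain Numpy. It assumes that cluster assignments are arrays of integer labels in the range 0 to nclusters - 1; the label values shown are made up for illustration and are not output of the library.

```python
import numpy as np

# Hypothetical row cluster labels for 8 rows (illustrative values only),
# as might be returned when nclusters_row=4 was requested.
row_clusters = np.array([0, 0, 2, 2, 0, 2, 0, 2])
nclusters_row = 4

# Number of rows assigned to each cluster label.
counts = np.bincount(row_clusters, minlength=nclusters_row)

# Number of clusters with at least one member, and labels left empty.
n_populated = int(np.count_nonzero(counts))
empty_labels = np.flatnonzero(counts == 0)

print(counts)        # members per cluster label
print(n_populated)   # number of non-empty clusters
print(empty_labels)  # labels of the clusters that remained empty
```

The same check applies to the column and band assignments.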
Note
The first axis of Z is assumed to represent the ‘band’ dimension.
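If the data cube is stored with a different axis order, the band axis can be moved to the front before creating the Triclustering instance. The sketch below assumes a hypothetical array named cube stored as (row, column, band); only Numpy calls are used.

```python
import numpy as np

# Hypothetical cube stored as (row, column, band): 2 rows, 4 columns, 3 bands.
cube = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)

# Move the last axis (band) to the front, giving (band, row, column).
Z = np.moveaxis(cube, -1, 0)

print(cube.shape, "->", Z.shape)
```

np.moveaxis returns a view, so no data is copied by the reordering itself.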
Tri-clustering Implementations
Local (Numpy-based)
As for the co-clustering algorithm (see Co-clustering), multiple runs of the tri-clustering algorithm can be efficiently computed in parallel using threads. To run the tri-clustering analysis using 4 threads:
results = tc.run_with_threads(nthreads=4)
As for co-clustering, only one thread is spawned if the nthreads argument is not provided.
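The general pattern of distributing differently-initialized runs over a pool of threads can be sketched as follows. This is not the library's implementation: single_run is a hypothetical stand-in that returns a made-up error and random labels, and keeping the run with the lowest error is an assumption that this page does not state explicitly.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def single_run(seed):
    """Hypothetical stand-in for one randomly initialized run.

    Returns a tuple (error, labels); both values are made up here.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 3, size=10)  # fake cluster labels
    error = float(rng.random())           # fake approximation error
    return error, labels


nruns, nthreads = 10, 4

# Each run gets its own seed, so the runs are differently initialized.
with ThreadPoolExecutor(max_workers=nthreads) as pool:
    outcomes = list(pool.map(single_run, range(nruns)))

# Assumption: the run with the lowest error is the one retained.
best_error, best_labels = min(outcomes, key=lambda outcome: outcome[0])
print(best_error)
```

Threads pay off in this kind of setup when most of the time is spent inside Numpy routines that release the global interpreter lock.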
Distributed (Dask-based)
As for co-clustering, tri-clustering analyses on distributed systems can be carried out using Dask (see also Co-clustering). Once the connection to a Dask cluster is set up:
from dask.distributed import Client
client = Client('tcp://daskscheduler:8786') # connect to the Dask scheduler
the tri-clustering analysis is carried out as:
results = tc.run_with_dask(client)
If no client is provided as an argument, a default LocalCluster is instantiated and used (see Dask docs).
Performance Comparison
This notebook presents a performance comparison of the two tri-clustering implementations for varying input data size and number of clusters. To test the Dask implementation, we used a local thread-based cluster with four workers. As for co-clustering, the Numpy implementation is much faster (about two orders of magnitude) than the Dask implementation for small datasets, where the Dask overhead dominates. However, when the data size becomes sufficiently large and/or the number of clusters is increased, the Dask implementation leads to shorter timings. As for co-clustering, it is important to stress that the Dask implementation was not designed for improved performance, but to handle large datasets that could not otherwise be tackled due to memory limitations.
Results
The TriclusteringResults object returned by Triclustering.run_with_threads and Triclustering.run_with_dask contains the final row, column, and band cluster assignments (results.row_clusters, results.col_clusters, and results.bnd_clusters, respectively) as well as the approximation error of the tri-clustering (results.error).
A few other metadata are also present, including the input parameters employed to set up the analysis (results.input_parameters).
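One common way to use the three assignments is to compute the mean value of each tri-cluster. The sketch below does this with plain Numpy for the example array Z. The assignment arrays are hypothetical values chosen for illustration, not output of an actual run, and are assumed to hold integer labels starting at 0.

```python
import numpy as np

Z = np.array([[[1., 1., 2., 4.],
               [1., 1., 2., 4.]],
              [[5., 5., 8., 8.],
               [5., 5., 8., 8.]],
              [[6., 7., 8., 9.],
               [6., 7., 9., 8.]]])

# Hypothetical assignments (illustrative only), one label per band/row/column.
bnd_clusters = np.array([0, 1, 1])
row_clusters = np.array([0, 0])
col_clusters = np.array([0, 0, 1, 1])
n_bnd, n_row, n_col = 2, 1, 2

# Label grids with the same shape as Z, one per dimension.
b, r, c = np.meshgrid(bnd_clusters, row_clusters, col_clusters, indexing='ij')

# Accumulate sums and element counts per (band, row, column) cluster.
sums = np.zeros((n_bnd, n_row, n_col))
counts = np.zeros((n_bnd, n_row, n_col))
np.add.at(sums, (b, r, c), Z)
np.add.at(counts, (b, r, c), 1)

# Mean per tri-cluster; empty tri-clusters are left as NaN.
means = np.divide(sums, counts,
                  out=np.full_like(sums, np.nan), where=counts > 0)
print(means)
```

With real results, the three label arrays would be replaced by results.bnd_clusters, results.row_clusters, and results.col_clusters.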
API
- class cgc.triclustering.Triclustering(Z, nclusters_row, nclusters_col, nclusters_bnd, conv_threshold=1e-05, max_iterations=1, nruns=1, output_filename='', row_clusters_init=None, col_clusters_init=None, bnd_clusters_init=None)
Perform a tri-clustering analysis for a three-dimensional array.
- Parameters
Z (numpy.ndarray or dask.array.Array) – Data array for which to run the tri-clustering analysis, with shape (band, row, column).
nclusters_row (int) – Number of row clusters.
nclusters_col (int) – Number of column clusters.
nclusters_bnd (int) – Number of band clusters.
conv_threshold (float, optional) – Convergence threshold for the objective function.
max_iterations (int, optional) – Maximum number of iterations.
nruns (int, optional) – Number of differently-initialized runs.
output_filename (string, optional) – Name of the JSON file where to write the results.
row_clusters_init (numpy.ndarray or array_like, optional) – Initial row cluster assignment.
col_clusters_init (numpy.ndarray or array_like, optional) – Initial column cluster assignment.
bnd_clusters_init (numpy.ndarray or array_like, optional) – Initial band cluster assignment.
- Example
>>> import numpy as np
>>> from cgc.triclustering import Triclustering
>>> Z = np.random.randint(1, 100, size=(6, 10, 8)).astype('float64')
>>> tc = Triclustering(Z, nclusters_row=5, nclusters_col=4, nclusters_bnd=3, max_iterations=50, nruns=10)
- run_serial()
- run_with_dask(client=None)
Run the tri-clustering analysis using Dask.
- Parameters
client (dask.distributed.Client, optional) – Dask client. If not specified, the default LocalCluster is employed.
- Returns
Tri-clustering results.
- Type
TriclusteringResults
- run_with_threads(nthreads=1)
Run the tri-clustering using an algorithm based on Numpy plus threading (only suitable for local runs).
- Parameters
nthreads (int, optional) – Number of threads employed to simultaneously run differently-initialized tri-clustering analyses.
- Returns
Tri-clustering results.
- Type
TriclusteringResults
- class cgc.triclustering.TriclusteringResults(**input_parameters)
Contains results and metadata of a tri-clustering calculation.
- Variables
row_clusters – Final row cluster assignment.
col_clusters – Final column cluster assignment.
bnd_clusters – Final band cluster assignment.
error – Approximation error of the tri-clustering.
nruns_completed – Number of successfully completed runs.
nruns_converged – Number of converged runs.