Triclustering¶
Introduction¶
The triclustering module provides a generalization of the coclustering algorithm to three-dimensional arrays (see
Ref. [1]). For geospatial data, triclustering analyses allow extending the search for similarity patterns in
data cubes, thus accounting for an extra dimension (the ‘band’ dimension) in addition to space and time.
Set up the Analysis¶
The triclustering analysis of a three-dimensional array Z:
import numpy as np

Z = np.array([[[1., 1., 2., 4.],
               [1., 1., 2., 4.]],
              [[5., 5., 8., 8.],
               [5., 5., 8., 8.]],
              [[6., 7., 8., 9.],
               [6., 7., 9., 8.]]])
is set up by creating an instance of Triclustering:
from cgc.triclustering import Triclustering

tc = Triclustering(
    Z,                              # data array (3D)
    nclusters_row=4,                # number of row clusters
    nclusters_col=3,                # number of column clusters
    nclusters_bnd=2,                # number of band clusters
    max_iterations=100,             # maximum number of iterations
    conv_threshold=1.e-5,           # error convergence threshold
    nruns=10,                       # number of differently-initialized runs
    output_filename='results.json'  # JSON file where to write output
)
The input arguments of Triclustering are identical to the Coclustering ones (see Coclustering). nclusters_bnd is the only additional argument: it sets the maximum number of clusters along the ‘band’ dimension.
Note that the algorithm may identify a lower number of clusters (some of the clusters may remain empty).
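As an illustration of this point, one can count how many clusters are actually populated by inspecting the distinct labels in an assignment array (the assignment below is hypothetical, not produced by CGC):

```python
import numpy as np

# hypothetical band-cluster assignment obtained with nclusters_bnd=3:
# only the labels 0 and 1 appear, so one of the three clusters stayed empty
bnd_clusters = np.array([0, 0, 1, 1, 1, 0])

# number of clusters actually identified
n_populated = len(np.unique(bnd_clusters))
```

Here `n_populated` is 2, even though 3 band clusters were requested.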
Note
The first axis of Z is assumed to represent the ‘band’ dimension.
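A quick way to verify that a data cube is laid out as expected is to inspect its shape (the cube below is a hypothetical example with 6 bands, 10 rows, and 8 columns):

```python
import numpy as np

# hypothetical data cube: the 'band' axis comes first, then rows and columns
Z = np.zeros((6, 10, 8))

n_bands, n_rows, n_cols = Z.shape
assert n_bands == 6  # first axis = 'band' dimension
```

If the band dimension is not the first axis of your array, `np.moveaxis(Z, source, 0)` can rearrange the cube before running the analysis.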
Triclustering Implementations¶
Local (Numpy-based)¶
As for the coclustering algorithm (see Coclustering), multiple runs of the triclustering algorithm can be efficiently computed in parallel using threads. In order to run the triclustering analysis using 4 threads:
results = tc.run_with_threads(nthreads=4)
As for coclustering, only one thread is spawned if the nthreads argument is not provided.
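The general pattern behind running differently-initialized runs in parallel can be sketched with the standard library (a generic illustration of the idea, not CGC's internal code; `one_run` is a hypothetical stand-in for a single triclustering run):

```python
from concurrent.futures import ThreadPoolExecutor
import random

def one_run(seed):
    # hypothetical stand-in for one differently-initialized triclustering run:
    # it returns the final approximation error for this initialization
    random.seed(seed)
    return random.uniform(0.0, 1.0)

nruns, nthreads = 10, 4

# submit all runs to a pool of worker threads
with ThreadPoolExecutor(max_workers=nthreads) as pool:
    errors = list(pool.map(one_run, range(nruns)))

# keep the best (lowest-error) of the differently-initialized runs
best_error = min(errors)
```

The run with the lowest approximation error among the independent initializations is the one retained.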
Distributed (Dask-based)¶
Also for triclustering, the analysis can be carried out on distributed systems using Dask (see also Coclustering). Once the connection to a Dask cluster is set up:
from dask.distributed import Client
client = Client('tcp://dask-scheduler:8786')  # connect to the Dask scheduler
the triclustering analysis is carried out as:
results = tc.run_with_dask(client)
If no client is provided as argument, a default LocalCluster is instantiated and used (see the Dask docs).
Performance Comparison¶
This notebook presents a performance comparison of the two triclustering implementations for varying input data size and number of clusters. To test the Dask implementation, we have used a local thread-based cluster with four workers. As for coclustering, we find the Numpy implementation to be much faster (~2 orders of magnitude) than the Dask implementation for small datasets, where the Dask overhead dominates. However, when the system size becomes sufficiently large and/or the number of clusters is increased, the Dask implementation leads to shorter timings. It is important to stress here as well that the Dask implementation was not designed for improved performance, but to handle large datasets that could not otherwise be tackled due to memory limitations.
Results¶
The TriclusteringResults object returned by Triclustering.run_with_threads and Triclustering.run_with_dask contains the final row, column, and band cluster assignments (results.row_clusters, results.col_clusters, and results.bnd_clusters, respectively), as well as the approximation error of the triclustering (results.error).
A few other metadata fields are also present, including the input parameters employed to set up the analysis (results.input_parameters).
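To make the meaning of the cluster assignments and the approximation error concrete, the following sketch reconstructs a block-wise approximation of a small data cube from hypothetical row, column, and band assignments and computes a least-squares error. This is a simplified illustration, not CGC's internal implementation, and CGC's actual objective function may differ:

```python
import numpy as np

# small hypothetical data cube (2 bands, 4 rows, 3 columns)
Z = np.arange(24, dtype=float).reshape(2, 4, 3)

# hypothetical cluster assignments along each dimension
bnd_clusters = np.array([0, 1])        # 2 band clusters
row_clusters = np.array([0, 0, 1, 1])  # 2 row clusters
col_clusters = np.array([0, 1, 1])     # 2 column clusters

# block-wise approximation: each element is replaced by the mean of the
# tricluster (band/row/column cluster triple) it belongs to
approx = np.empty_like(Z)
for b in np.unique(bnd_clusters):
    for r in np.unique(row_clusters):
        for c in np.unique(col_clusters):
            block = np.ix_(bnd_clusters == b, row_clusters == r, col_clusters == c)
            approx[block] = Z[block].mean()

# sum-of-squared-residuals error of the block approximation
error = np.sum((Z - approx) ** 2)
```

The better the triclusters capture the structure of the cube, the smaller this residual error becomes.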
API¶

class cgc.triclustering.Triclustering(Z, nclusters_row, nclusters_col, nclusters_bnd, conv_threshold=1e-05, max_iterations=1, nruns=1, output_filename='', row_clusters_init=None, col_clusters_init=None, bnd_clusters_init=None)¶

Perform a triclustering analysis for a three-dimensional array.
Parameters:
- Z (numpy.ndarray or dask.array.Array) – Data array for which to run the triclustering analysis, with shape (band, row, column).
- nclusters_row (int) – Number of row clusters.
- nclusters_col (int) – Number of column clusters.
- nclusters_bnd (int) – Number of band clusters.
- conv_threshold (float, optional) – Convergence threshold for the objective function.
- max_iterations (int, optional) – Maximum number of iterations.
- nruns (int, optional) – Number of differently-initialized runs.
- output_filename (string, optional) – Name of the JSON file where to write the results.
- row_clusters_init (numpy.ndarray or array_like, optional) – Initial row cluster assignment.
- col_clusters_init (numpy.ndarray or array_like, optional) – Initial column cluster assignment.
- bnd_clusters_init (numpy.ndarray or array_like, optional) – Initial band cluster assignment.
Example:

>>> import numpy as np
>>> Z = np.random.randint(1, 100, size=(6, 10, 8)).astype('float64')
>>> tc = Triclustering(Z, nclusters_row=5, nclusters_col=4, max_iterations=50, nruns=10)

run_serial()¶

run_with_dask(client=None)¶

Run the triclustering analysis using Dask.

Parameters: client (dask.distributed.Client, optional) – Dask client. If not specified, the default LocalCluster is employed.
Returns: Triclustering results.
Type: cgc.triclustering.TriclusteringResults

run_with_threads(nthreads=1)¶

Run the triclustering using an algorithm based on Numpy plus threading (only suitable for local runs).

Parameters: nthreads (int, optional) – Number of threads employed to simultaneously run differently-initialized triclustering analyses.
Returns: Triclustering results.
Type: cgc.triclustering.TriclusteringResults

class cgc.triclustering.TriclusteringResults(**input_parameters)¶

Contains results and metadata of a triclustering calculation.
Variables:
- row_clusters – Final row cluster assignment.
- col_clusters – Final column cluster assignment.
- bnd_clusters – Final band cluster assignment.
- error – Approximation error of the triclustering.
- nruns_completed – Number of successfully completed runs.
- nruns_converged – Number of converged runs.