Tri-clustering¶

Introduction¶

The triclustering module provides a generalization of the co-clustering algorithm to three-dimensional arrays (see Ref. 1). For geospatial data, tri-clustering extends the search for similarity patterns to data cubes, accounting for an extra dimension (the ‘band’ dimension) in addition to space and time.

Set up the Analysis¶

The tri-clustering analysis of a three-dimensional array Z:

import numpy as np

Z = np.array([[[1., 1., 2., 4.],
               [1., 1., 2., 4.]],
              [[5., 5., 8., 8.],
               [5., 5., 8., 8.]],
              [[6., 7., 8., 9.],
               [6., 7., 9., 8.]]])

is set up by creating an instance of Triclustering:

from cgc.triclustering import Triclustering

tc = Triclustering(
    Z,  # data array (3D)
    nclusters_row=4,  # number of row clusters
    nclusters_col=3,  # number of column clusters
    nclusters_bnd=2,  # number of band clusters
    max_iterations=100,  # maximum number of iterations
    conv_threshold=1.e-5,  # error convergence threshold
    nruns=10,  # number of differently-initialized runs
    output_filename='results.json'  # JSON file where to write output
)

The input arguments of Triclustering are identical to those of Coclustering (see Co-clustering). The only additional argument is nclusters_bnd, which sets the maximum number of clusters along the ‘band’ dimension. Note that the algorithm may identify fewer clusters than requested, since some of the clusters may remain empty.
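Because some clusters may remain empty, it can be useful to check how many clusters were actually populated. The sketch below uses a hand-written label array as a stand-in for a band cluster assignment; it assumes (for illustration only) that an assignment is a one-dimensional array holding one integer label per band:

```python
import numpy as np

nclusters_bnd = 2  # maximum number of band clusters requested

# Hypothetical band cluster assignment: all three bands fall in cluster 0
bnd_clusters = np.array([0, 0, 0])

# Labels that occur at least once correspond to non-empty clusters
n_found = np.unique(bnd_clusters).size

print(f"{n_found} of {nclusters_bnd} band clusters are non-empty")
```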

Note

The first axis of Z is assumed to represent the ‘band’ dimension.
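If a data cube is stored with the band axis last, it can be rearranged before being passed to Triclustering. A minimal sketch, assuming a hypothetical array named cube with shape (row, column, band):

```python
import numpy as np

# Hypothetical data cube stored as (row, column, band)
cube = np.arange(24, dtype=float).reshape(2, 4, 3)

# Move the last axis (band) to the front: (band, row, column)
Z = np.moveaxis(cube, -1, 0)

print(Z.shape)
```

np.moveaxis returns a view of the original array, so no data are copied at this step.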

Tri-clustering Implementations¶

Local (Numpy-based)¶

As with the co-clustering algorithm (see Co-clustering), multiple runs of the tri-clustering algorithm can be computed efficiently in parallel using threads. To run the tri-clustering analysis using 4 threads:

results = tc.run_with_threads(nthreads=4)

As for co-clustering, only one thread is spawned if the nthreads argument is not provided.

Distributed (Dask-based)¶

As with co-clustering, tri-clustering analyses can also be carried out on distributed systems using Dask (see also Co-clustering). Once the connection to a Dask cluster is set up:

from dask.distributed import Client

client = Client('tcp://daskscheduler:8786')  # connect to the Dask scheduler

the tri-clustering analysis is carried out as:

results = tc.run_with_dask(client)

If no client is provided as an argument, a default LocalCluster is instantiated and used (see Dask docs).

Performance Comparison¶

This notebook presents a performance comparison of the two tri-clustering implementations for varying input data size and number of clusters. To test the Dask implementation, we used a local thread-based cluster with four workers. As for co-clustering, we find the Numpy implementation to be much faster (about two orders of magnitude) than the Dask implementation for small datasets, where the Dask overhead dominates. However, when the input data become sufficiently large and/or the number of clusters is increased, the Dask implementation leads to shorter timings. It is important to stress here as well that the Dask implementation was not designed for improved performance, but to handle large datasets that could not otherwise be tackled due to memory limitations.

Results¶

The TriclusteringResults object returned by Triclustering.run_with_threads and Triclustering.run_with_dask contains the final row, column, and band cluster assignments (results.row_clusters, results.col_clusters, and results.bnd_clusters, respectively) as well as the approximation error of the tri-clustering (results.error). A few other metadata fields are also present, including the input parameters used to set up the analysis (results.input_parameters).
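The cluster assignments can be used to summarize the input array, for example by averaging the elements that belong to each tri-cluster. The following is a minimal sketch that uses hand-written label arrays in place of results.bnd_clusters, results.row_clusters, and results.col_clusters; it assumes that each of them is a one-dimensional array of integer labels, with one entry per band, row, or column:

```python
import numpy as np

Z = np.array([[[1., 1., 2., 4.],
               [1., 1., 2., 4.]],
              [[5., 5., 8., 8.],
               [5., 5., 8., 8.]],
              [[6., 7., 8., 9.],
               [6., 7., 9., 8.]]])  # shape: (band, row, column)

# Hypothetical assignments standing in for the results attributes
bnd_clusters = np.array([0, 1, 1])
row_clusters = np.array([0, 0])
col_clusters = np.array([0, 0, 1, 1])

n_bnd = bnd_clusters.max() + 1
n_row = row_clusters.max() + 1
n_col = col_clusters.max() + 1

# Mean of Z over each (band, row, column) cluster combination;
# combinations without any element are left as NaN
means = np.full((n_bnd, n_row, n_col), np.nan)
for b in range(n_bnd):
    for r in range(n_row):
        for c in range(n_col):
            block = Z[np.ix_(bnd_clusters == b,
                             row_clusters == r,
                             col_clusters == c)]
            if block.size > 0:
                means[b, r, c] = block.mean()

print(means)
```

The explicit loops keep the sketch readable; for a large number of clusters a vectorized aggregation would likely be preferable.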

API¶

class cgc.triclustering.Triclustering(Z, nclusters_row, nclusters_col, nclusters_bnd, conv_threshold=1e-05, max_iterations=1, nruns=1, output_filename='', row_clusters_init=None, col_clusters_init=None, bnd_clusters_init=None)¶

Perform a tri-clustering analysis for a three-dimensional array.

Parameters

Z (numpy.ndarray or dask.array.Array) – Data array for which to run the tri-clustering analysis, with shape (band, row, column).
nclusters_row (int) – Number of row clusters.
nclusters_col (int) – Number of column clusters.
nclusters_bnd (int) – Number of band clusters.
conv_threshold (float, optional) – Convergence threshold for the objective function.
max_iterations (int, optional) – Maximum number of iterations.
nruns (int, optional) – Number of differently-initialized runs.
output_filename (string, optional) – Name of the JSON file where to write the results.
row_clusters_init (numpy.ndarray or array_like, optional) – Initial row cluster assignment.
col_clusters_init (numpy.ndarray or array_like, optional) – Initial column cluster assignment.
bnd_clusters_init (numpy.ndarray or array_like, optional) – Initial band cluster assignment.
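The initial cluster assignments can be prepared with Numpy. The sketch below assumes that an initial assignment is a one-dimensional array of integer labels between 0 and the corresponding number of clusters minus one, with one entry per row, column, or band; the parameter descriptions above do not spell this out, so treat it as an assumption. The array sizes and the seed are chosen for illustration only:

```python
import numpy as np

# Shape of the data array Z: (band, row, column)
n_bnd, n_row, n_col = 6, 10, 8
nclusters_row, nclusters_col, nclusters_bnd = 5, 4, 3

rng = np.random.default_rng(seed=0)  # fixed seed for reproducibility

# One integer label per row, column, and band (assumed format)
row_clusters_init = rng.integers(nclusters_row, size=n_row)
col_clusters_init = rng.integers(nclusters_col, size=n_col)
bnd_clusters_init = rng.integers(nclusters_bnd, size=n_bnd)

print(row_clusters_init.shape, col_clusters_init.shape, bnd_clusters_init.shape)
```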

Example

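>>> import numpy as np
>>> from cgc.triclustering import Triclustering
>>> Z = np.random.randint(1, 100, size=(6, 10, 8)).astype('float64')
>>> tc = Triclustering(Z,
...                    nclusters_row=5,
...                    nclusters_col=4,
...                    nclusters_bnd=3,  # required argument; value chosen for illustration
...                    max_iterations=50,
...                    nruns=10)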

run_serial()¶

run_with_dask(client=None)¶

Run the tri-clustering analysis using Dask.

Parameters: client (dask.distributed.Client, optional) – Dask client. If not specified, the default LocalCluster is employed.
Returns: Tri-clustering results.
Return type: cgc.triclustering.TriclusteringResults

run_with_threads(nthreads=1)¶

Run the tri-clustering using an algorithm based on Numpy plus threading (only suitable for local runs).

Parameters: nthreads (int, optional) – Number of threads used to run differently-initialized tri-clustering analyses simultaneously.
Returns: Tri-clustering results.
Return type: cgc.triclustering.TriclusteringResults

class cgc.triclustering.TriclusteringResults(**input_parameters)¶

Contains results and metadata of a tri-clustering calculation.

Variables

row_clusters – Final row cluster assignment.
col_clusters – Final column cluster assignment.
bnd_clusters – Final band cluster assignment.
error – Approximation error of the tri-clustering.
nruns_completed – Number of successfully completed runs.
nruns_converged – Number of converged runs.

References¶

1: Xiaojing Wu, Raul Zurita-Milla, Emma Izquierdo Verdiguier, Menno-Jan Kraak, Triclustering Georeferenced Time Series for Analyzing Patterns of Intra-Annual Variability in Temperature, Annals of the American Association of Geographers 108, 71 (2018)