# Tri-clustering

## Introduction

The `triclustering` module provides a generalization of the co-clustering algorithm to three-dimensional arrays (see Ref. 1). For geospatial data, tri-clustering extends the search for similarity patterns to data cubes, accounting for an extra dimension (the ‘band’ dimension) in addition to space and time.

## Setup the Analysis

The tri-clustering analysis of a three-dimensional array `Z`:

```python
import numpy as np

Z = np.array([[[1., 1., 2., 4.],
               [1., 1., 2., 4.]],
              [[5., 5., 8., 8.],
               [5., 5., 8., 8.]],
              [[6., 7., 8., 9.],
               [6., 7., 9., 8.]]])
```

is set up by creating an instance of `Triclustering`:

```python
from cgc.triclustering import Triclustering

tc = Triclustering(
    Z,  # data array (3D)
    nclusters_row=4,  # number of row clusters
    nclusters_col=3,  # number of column clusters
    nclusters_bnd=2,  # number of band clusters
    max_iterations=100,  # maximum number of iterations
    conv_threshold=1.e-5,  # error convergence threshold
    nruns=10,  # number of differently-initialized runs
    output_filename='results.json'  # JSON file where to write output
)
```

The input arguments of `Triclustering` are identical to those of `Coclustering` (see Co-clustering). The only additional argument is `nclusters_bnd`, which sets the maximum number of clusters along the ‘band’ dimension. Note that the algorithm may identify fewer clusters than requested, since some of the clusters may remain empty.
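Whether some clusters remained empty can be checked after a run by counting the distinct labels in a cluster assignment. The snippet below is a minimal sketch: the `row_clusters` array holds made-up values, and it assumes that assignments are stored as one integer label per row.

```python
import numpy as np

nclusters_row = 4  # number of row clusters requested

# Made-up row assignment (assumed format: one integer label per row)
row_clusters = np.array([0, 2, 2, 0, 2])

# Number of clusters that actually contain at least one row
n_nonempty = np.unique(row_clusters).size

print(f"{n_nonempty} of {nclusters_row} row clusters are populated")
```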

Note

The first axis of `Z` is assumed to represent the ‘band’ dimension.
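If a data cube is stored with a different axis order, for instance (row, column, band), the band axis can be moved to the front before the array is passed to `Triclustering`. A minimal sketch using `numpy.moveaxis`; the array name `cube` and its shape are made up for illustration:

```python
import numpy as np

# Hypothetical data cube stored as (row, column, band)
cube = np.arange(2 * 4 * 3, dtype='float64').reshape(2, 4, 3)

# Move the last axis (band) to the front: (band, row, column)
Z = np.moveaxis(cube, -1, 0)

print(Z.shape)
```

`numpy.moveaxis` returns a view where possible, so no copy of the data is made at this step.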

## Tri-clustering Implementations

### Local (Numpy-based)

As for the co-clustering algorithm (see Co-clustering), multiple runs of the tri-clustering algorithm can be efficiently computed in parallel using threads. In order to run the tri-clustering analysis using 4 threads:

```python
results = tc.run_with_threads(nthreads=4)
```

As for co-clustering, only one thread is spawned if the `nthreads` argument is not provided.
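The general pattern of executing several independently initialized runs concurrently can be illustrated with the standard library alone. The sketch below is not the CGC implementation: `single_run` is a hypothetical stand-in that returns a random number in place of the approximation error, and keeping the run with the lowest error is an assumption made for this example.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def single_run(seed):
    """Hypothetical stand-in for one differently-initialized run."""
    rng = random.Random(seed)
    error = rng.random()  # placeholder for the approximation error
    return seed, error


def best_of_runs(nruns, nthreads=1):
    """Execute `nruns` runs on `nthreads` threads and keep the lowest error."""
    with ThreadPoolExecutor(max_workers=nthreads) as executor:
        outcomes = list(executor.map(single_run, range(nruns)))
    # Assumption for this sketch: the run with the lowest error is retained
    return min(outcomes, key=lambda outcome: outcome[1])


best_seed, best_error = best_of_runs(nruns=10, nthreads=4)
print(best_seed, best_error)
```

Because each run is seeded independently, the selected run does not depend on the number of threads; only the wall-clock time does.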

### Distributed (Dask-based)

As for co-clustering, the tri-clustering analysis can be carried out on distributed systems using Dask (see also Co-clustering). Once the connection to a Dask cluster is set up:

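```python
from dask.distributed import Client

client = Client()  # no address given: starts a local cluster; pass a scheduler address to connect to an existing one
```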

the tri-clustering analysis is carried out as:

```python
results = tc.run_with_dask(client)
```

If no client is provided as an argument, a default `LocalCluster` is instantiated and used (see Dask docs).

### Performance Comparison

This notebook presents a performance comparison of the two tri-clustering implementations for varying input data size and number of clusters. To test the Dask implementation, we used a local thread-based cluster with four workers. As for co-clustering, we find the Numpy implementation to be much faster (~2 orders of magnitude) than the Dask implementation for small datasets, where the Dask overhead dominates. However, when the dataset becomes sufficiently large and/or the number of clusters is increased, the Dask implementation leads to shorter timings. As for co-clustering, it is important to stress that the Dask implementation was not designed for improved performance, but to handle large datasets that could not otherwise be tackled due to memory limitations.

## Results

The `TriclusteringResults` object returned by `Triclustering.run_with_threads` and `Triclustering.run_with_dask` contains the final row, column, and band cluster assignments (`results.row_clusters`, `results.col_clusters`, and `results.bnd_clusters`, respectively) as well as the approximation error of the tri-clustering (`results.error`). Some additional metadata are also present, including the input parameters used to set up the analysis (`results.input_parameters`).
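The cluster assignments can be used to summarize the input array, for example by averaging `Z` over each tri-cluster. The following is a minimal sketch with NumPy. The label arrays are made-up values standing in for `results.bnd_clusters`, `results.row_clusters`, and `results.col_clusters`, and they are assumed to be 1D integer arrays with one zero-based label per band, row, and column:

```python
import numpy as np

Z = np.array([[[1., 1., 2., 4.],
               [1., 1., 2., 4.]],
              [[5., 5., 8., 8.],
               [5., 5., 8., 8.]],
              [[6., 7., 8., 9.],
               [6., 7., 9., 8.]]])

# Made-up assignments (assumed format: one label per element along each axis)
bnd_clusters = np.array([0, 1, 1])
row_clusters = np.array([0, 0])
col_clusters = np.array([0, 0, 1, 1])

nbnd, nrow, ncol = 2, 1, 2  # number of clusters along each axis
means = np.full((nbnd, nrow, ncol), np.nan)  # NaN marks empty tri-clusters
for b in range(nbnd):
    for r in range(nrow):
        for c in range(ncol):
            # Select the sub-array belonging to tri-cluster (b, r, c)
            block = Z[np.ix_(bnd_clusters == b,
                             row_clusters == r,
                             col_clusters == c)]
            if block.size > 0:
                means[b, r, c] = block.mean()

print(means)
```

The resulting `means` array has shape (band clusters, row clusters, column clusters), matching the axis order of `Z`.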

## API

class cgc.triclustering.Triclustering(Z, nclusters_row, nclusters_col, nclusters_bnd, conv_threshold=1e-05, max_iterations=1, nruns=1, output_filename='', row_clusters_init=None, col_clusters_init=None, bnd_clusters_init=None)

Perform a tri-clustering analysis for a three-dimensional array.

Parameters
• Z (numpy.ndarray or dask.array.Array) – Data array for which to run the tri-clustering analysis, with shape (band, row, column).

• nclusters_row (int) – Number of row clusters.

• nclusters_col (int) – Number of column clusters.

• nclusters_bnd (int) – Number of band clusters.

• conv_threshold (float, optional) – Convergence threshold for the objective function.

• max_iterations (int, optional) – Maximum number of iterations.

• nruns (int, optional) – Number of differently-initialized runs.

• output_filename (string, optional) – Name of the JSON file where to write the results.

• row_clusters_init (numpy.ndarray or array_like, optional) – Initial row cluster assignment.

• col_clusters_init (numpy.ndarray or array_like, optional) – Initial column cluster assignment.

• bnd_clusters_init (numpy.ndarray or array_like, optional) – Initial band cluster assignment.

Example

```python
>>> import numpy as np
>>> from cgc.triclustering import Triclustering
>>> Z = np.random.randint(1, 100, size=(6, 10, 8)).astype('float64')
>>> tc = Triclustering(Z,
...                    nclusters_row=5,
...                    nclusters_col=4,
...                    nclusters_bnd=3,
...                    max_iterations=50,
...                    nruns=10)
```
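The optional `row_clusters_init`, `col_clusters_init`, and `bnd_clusters_init` arguments accept array-like initial assignments. The sketch below builds random ones with NumPy for an array of shape (6, 10, 8). The expected format (1D integer arrays with one zero-based label per row, column, or band) is an assumption made here and should be checked against the installed version of the package:

```python
import numpy as np

nbands, nrows, ncols = 6, 10, 8  # shape of Z: (band, row, column)
nclusters_bnd, nclusters_row, nclusters_col = 3, 5, 4

rng = np.random.default_rng(seed=0)

# Assumed format: 1D integer arrays, one zero-based cluster label per element
row_clusters_init = rng.integers(0, nclusters_row, size=nrows)
col_clusters_init = rng.integers(0, nclusters_col, size=ncols)
bnd_clusters_init = rng.integers(0, nclusters_bnd, size=nbands)

# These arrays would then be passed as the `row_clusters_init`,
# `col_clusters_init`, and `bnd_clusters_init` arguments of `Triclustering`.
print(row_clusters_init, col_clusters_init, bnd_clusters_init)
```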
run_with_dask(client=None)

Run the tri-clustering analysis using Dask.

Parameters

client (dask.distributed.Client, optional) – Dask client. If not specified, the default LocalCluster is employed.

Returns

Tri-clustering results.

Type

cgc.triclustering.TriclusteringResults

run_with_threads(nthreads=1)

Run the tri-clustering using an algorithm based on Numpy plus threading (only suitable for local runs).

Parameters

nthreads (int, optional) – Number of threads employed to simultaneously run differently-initialized tri-clustering analyses.

Returns

Tri-clustering results.

Type

cgc.triclustering.TriclusteringResults

class cgc.triclustering.TriclusteringResults(**input_parameters)

Contains results and metadata of a tri-clustering calculation.

Variables
• row_clusters – Final row cluster assignment.

• col_clusters – Final column cluster assignment.

• bnd_clusters – Final band cluster assignment.

• error – Approximation error of the tri-clustering.

• nruns_completed – Number of successfully completed runs.

• nruns_converged – Number of converged runs.