Utility Functions

API

cgc.utils.mem_estimate_coclustering_numpy(n_rows, n_cols, nclusters_row, nclusters_col, out_unit=None)

Estimate the maximum memory usage of cgc.coclustering_numpy, given the matrix size (n_rows, n_cols) and the number of row/column clusters (nclusters_row, nclusters_col).

The estimated memory usage is the sum of the sizes of all major arrays that are allocated simultaneously in cgc.coclustering_numpy.coclustering.

Depending on the shape of the data matrix, there are two possible memory peaks, corresponding to either the first or the second call to cgc.coclustering_numpy._distance().

Parameters
  • n_rows (int) – Number of rows in the data matrix.

  • n_cols (int) – Number of columns in the data matrix.

  • nclusters_row (int) – Number of row clusters.

  • nclusters_col (int) – Number of column clusters.

  • out_unit (str, optional) – Output unit; one of “B”, “KB”, “MB”, or “GB”.

Returns

Estimated memory usage, the unit in which it is expressed, and the memory peak to which the estimate corresponds.

Type

tuple
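To give a feeling for the orders of magnitude involved, the following minimal sketch computes the size of a single dense data matrix and converts it to the requested unit. It is not the library's estimate: the helper name `matrix_size_sketch`, the 8-byte item size (float64), and the 1024-based unit factors are all assumptions made for illustration. The real function sums several arrays, so its result will differ.

```python
# Hypothetical helper, not part of cgc: size of one dense matrix only.
# Assumes 1024-based units; the library's convention may differ.
UNIT_FACTORS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def matrix_size_sketch(n_rows, n_cols, itemsize=8, out_unit="MB"):
    """Return (size, unit) for an n_rows x n_cols array of `itemsize` bytes."""
    n_bytes = n_rows * n_cols * itemsize
    return n_bytes / UNIT_FACTORS[out_unit], out_unit


size, unit = matrix_size_sketch(10_000, 5_000, out_unit="MB")
print(f"{size:.1f} {unit}")
```

Because the data matrix is only one of the arrays held in memory during co-clustering, a figure like this is best read as a rough floor rather than as the expected peak.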

cgc.utils.calculate_cocluster_averages(Z, row_clusters, col_clusters, nclusters_row=None, nclusters_col=None)

Calculate the co-cluster averages from the data array and the row- and column-cluster assignments.

Parameters
  • Z (numpy.ndarray or dask.array.Array) – Data matrix.

  • row_clusters (numpy.ndarray or array_like) – Row cluster assignment.

  • col_clusters (numpy.ndarray or array_like) – Column cluster assignment.

  • nclusters_row (int, optional) – Number of row clusters. If not provided, it is set as the number of unique elements in row_clusters.

  • nclusters_col (int, optional) – Number of column clusters. If not provided, it is set as the number of unique elements in col_clusters.

Returns

Array with co-cluster averages, shape (nclusters_row, nclusters_col). Elements corresponding to empty co-clusters are set as NaN.

Type

numpy.ndarray
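The computation described above can be sketched in plain NumPy as follows. This is an illustration of the idea, not the library's implementation: the function name `cocluster_averages_sketch` is hypothetical, and the sketch assumes that cluster labels are integers in the range 0 to nclusters - 1.

```python
import numpy as np


def cocluster_averages_sketch(Z, row_clusters, col_clusters,
                              nclusters_row=None, nclusters_col=None):
    """Mean of Z over every (row cluster, column cluster) block."""
    Z = np.asarray(Z, dtype=float)
    row_clusters = np.asarray(row_clusters)
    col_clusters = np.asarray(col_clusters)
    # Fall back to the number of unique labels, as in the description above.
    if nclusters_row is None:
        nclusters_row = len(np.unique(row_clusters))
    if nclusters_col is None:
        nclusters_col = len(np.unique(col_clusters))
    # Start from NaN so that empty co-clusters stay NaN.
    averages = np.full((nclusters_row, nclusters_col), np.nan)
    for r in range(nclusters_row):
        for c in range(nclusters_col):
            block = Z[np.ix_(row_clusters == r, col_clusters == c)]
            if block.size > 0:
                averages[r, c] = block.mean()
    return averages


Z = np.array([[1., 2., 3.],
              [3., 4., 5.],
              [5., 6., 7.]])
# Row cluster 2 is requested but has no members, so its row is NaN.
averages = cocluster_averages_sketch(Z, [0, 0, 1], [0, 1, 0], nclusters_row=3)
print(averages)
```

Passing nclusters_row explicitly matters when some clusters end up empty: the output keeps a fixed shape and the empty co-clusters are marked with NaN instead of being silently dropped.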

cgc.utils.calculate_tricluster_averages(Z, row_clusters, col_clusters, bnd_clusters, nclusters_row=None, nclusters_col=None, nclusters_bnd=None)

Calculate the tri-cluster averages from the data array and the band-, row- and column-cluster assignments.

Parameters
  • Z (numpy.ndarray or dask.array.Array) – Data array, with shape (bands, rows, columns).

  • row_clusters (numpy.ndarray or array_like) – Row cluster assignment.

  • col_clusters (numpy.ndarray or array_like) – Column cluster assignment.

  • bnd_clusters (numpy.ndarray or array_like) – Band cluster assignment.

  • nclusters_row (int, optional) – Number of row clusters. If not provided, it is set as the number of unique elements in row_clusters.

  • nclusters_col (int, optional) – Number of column clusters. If not provided, it is set as the number of unique elements in col_clusters.

  • nclusters_bnd (int, optional) – Number of band clusters. If not provided, it is set as the number of unique elements in bnd_clusters.

Returns

Array with tri-cluster averages, shape (nclusters_bnd, nclusters_row, nclusters_col). Elements corresponding to empty tri-clusters are set as NaN.

Type

numpy.ndarray
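A minimal NumPy sketch of the same computation for a three-dimensional array is given below. It is an illustration only, not the library's code: the name `tricluster_averages_sketch` is hypothetical, and labels are assumed to be integers in the range 0 to nclusters - 1. Note the axis order (bands, rows, columns) of both the input and the output.

```python
import numpy as np


def tricluster_averages_sketch(Z, row_clusters, col_clusters, bnd_clusters,
                               nclusters_row=None, nclusters_col=None,
                               nclusters_bnd=None):
    """Mean of Z over every (band, row, column) cluster block."""
    Z = np.asarray(Z, dtype=float)
    row_clusters = np.asarray(row_clusters)
    col_clusters = np.asarray(col_clusters)
    bnd_clusters = np.asarray(bnd_clusters)
    if nclusters_row is None:
        nclusters_row = len(np.unique(row_clusters))
    if nclusters_col is None:
        nclusters_col = len(np.unique(col_clusters))
    if nclusters_bnd is None:
        nclusters_bnd = len(np.unique(bnd_clusters))
    # NaN marks tri-clusters that contain no elements.
    shape = (nclusters_bnd, nclusters_row, nclusters_col)
    averages = np.full(shape, np.nan)
    for b in range(nclusters_bnd):
        for r in range(nclusters_row):
            for c in range(nclusters_col):
                block = Z[np.ix_(bnd_clusters == b,
                                 row_clusters == r,
                                 col_clusters == c)]
                if block.size > 0:
                    averages[b, r, c] = block.mean()
    return averages


Z = np.arange(24, dtype=float).reshape(2, 3, 4)  # (bands, rows, columns)
averages = tricluster_averages_sketch(
    Z,
    row_clusters=[0, 0, 1],
    col_clusters=[0, 0, 1, 1],
    bnd_clusters=[0, 1],
    nclusters_bnd=3,  # band cluster 2 is empty on purpose
)
print(averages.shape)
```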

cgc.utils.calculate_cluster_feature(Z, function, clusters, nclusters=None, **kwargs)

Calculate features for clusters. This function works in N dimensions (N=2, 3, …) making it suitable to calculate features for both co-clusters and tri-clusters.

Parameters
  • Z (numpy.ndarray or dask.array.Array) – Data array (N dimensions).

  • function (Callable) – Function to run over the elements of each cluster to calculate the desired feature. It should take an N-dimensional array as input and return a scalar.

  • clusters (tuple, list, or numpy.ndarray) – Iterable with length N. It should contain the cluster labels for each dimension, following the same ordering as for Z.

  • nclusters (tuple, list, or numpy.ndarray, optional) – Iterable with length N. It should contain the number of clusters in each dimension, following the same ordering as for Z. If not provided, it is set as the number of unique elements in each dimension.

  • kwargs (dict, optional) – Keyword arguments to be passed to the input function, together with the data array of each cluster.

Returns

Array with the desired feature computed for each cluster. It has N dimensions and shape equal to nclusters.

Type

numpy.ndarray
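The N-dimensional behaviour described above can be sketched with NumPy and itertools.product. This is an illustrative sketch rather than the library's implementation: the name `cluster_feature_sketch` is hypothetical, labels are assumed to be integers in the range 0 to nclusters - 1, and leaving empty clusters as NaN is an assumption of the sketch.

```python
from itertools import product

import numpy as np


def cluster_feature_sketch(Z, function, clusters, nclusters=None, **kwargs):
    """Apply `function` to the elements of every N-dimensional cluster."""
    Z = np.asarray(Z)
    clusters = [np.asarray(labels) for labels in clusters]
    if nclusters is None:
        nclusters = [len(np.unique(labels)) for labels in clusters]
    features = np.full(tuple(nclusters), np.nan)
    # Visit every combination of cluster indices, one index per dimension.
    for idx in product(*(range(n) for n in nclusters)):
        masks = [labels == k for labels, k in zip(clusters, idx)]
        block = Z[np.ix_(*masks)]
        if block.size > 0:  # assumption: empty clusters stay NaN
            features[idx] = function(block, **kwargs)
    return features


Z = np.array([[1., 2., 3.],
              [3., 4., 5.],
              [5., 6., 7.]])
clusters = ([0, 0, 1], [0, 1, 0])  # row labels, then column labels

maxima = cluster_feature_sketch(Z, np.max, clusters)
# Extra keyword arguments are forwarded to the function.
medians = cluster_feature_sketch(Z, np.percentile, clusters, q=50)
print(maxima)
print(medians)
```

Because the cluster labels are passed as one iterable per dimension, the same call pattern covers co-clusters (N=2) and tri-clusters (N=3); only the length of clusters changes.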