Covariance and correlation matrices¶
We can use SciDB’s distributed parallel linear algebra operations to
compute covariance and correlation matrices. This demonstrates SciDB-Py’s
syntax for performing linear algebra and array arithmetic (with broadcasting).
You can download this script
The following example computes the covariance and correlation between the columns of the matrices X and Y. We break the example up into a few parts for clarity.
Part 1, set up some example matrices
# connect to the database from scidbpy import connect sdb = connect() # Two small arrays, each with 1000 rows, and with 5 and 3 columns x = np.random.random((1000, 5)) y = np.column_stack((x[:, 0] * 2, x[:, 1] + x[:, 0] / 2., x[:, 4])) # Transfer to the database. All future computation happens there. X = sdb.from_array(x) Y = sdb.from_array(y)
Note that, given how y is defined, we expect the correlation matrix to be approximately equal to:
array([[1, 1/2, 0], [0, 1, 0], [0, 0, 0], [0, 0, 0], [0, 0, 1]])
Part 2, center the example matrices:
# Subtract the column means from X using broadcasting: XC = X - X.mean(0) # Similarly subtract the column means from Y: YC = Y - Y.mean(0)
X.mean(0) computes the sum for each of the five columns of X.
These 5 numbers are subtracted from each row of
X to compute
Part 3, compute the covariance matrix:
COV = sdb.dot(XC.T, YC) / (X.shape - 1)
Part 4, compute the correlation matrix:
# Column vector with column standard deviations of X matrix: xsd = X.std(0).reshape((5, 1)) # Row vector with column standard deviations of Y matrix: ysd = Y.std(0).reshape((1, 3)) # Their outer product: outersd = sdb.dot(xsd, ysd) COR = COV / outersd print(COR.toarray())
[[ 1. 0.42008436 -0.01186421] [-0.02583465 0.89632943 0.00232589] [-0.02296872 -0.00927478 0.00570587] [-0.02791049 -0.00367841 0.06882341] [-0.01186421 -0.0031508 1. ]]
The overhead of working interactively with SciDB can make these examples run somewhat slowly for small problems. But the same code shown here can be applied to arbitrarily large matrices, and those computations can run in parallel across a cluster.