Covariance and correlation matrices

We can use SciDB’s distributed parallel linear algebra operations to compute covariance and correlation matrices. This demonstrates SciDB-Py’s syntax for performing linear algebra and array arithmetic (with broadcasting). You can download this script here.

The following example computes the covariance and correlation between the columns of the matrices X and Y. We break the example up into a few parts for clarity.

Part 1, set up some example matrices

#  connect to the database
from scidbpy import connect
sdb = connect()

# Two small arrays, each with 1000 rows, and with 5 and 3 columns
x = np.random.random((1000, 5))
y = np.column_stack((x[:, 0] * 2, x[:, 1] + x[:, 0] / 2., x[:, 4]))

# Transfer to the database. All future computation happens there.
X = sdb.from_array(x)
Y = sdb.from_array(y)

Note that, given how y is defined, we expect the correlation matrix to be approximately equal to:

array([[1, 1/2, 0],
       [0, 1,   0],
       [0, 0,   0],
       [0, 0,   0],
       [0, 0,   1]])

Part 2, center the example matrices:

# Subtract the column means from X using broadcasting:
XC = X - X.mean(0)
# Similarly subtract the column means from Y:
YC = Y - Y.mean(0)

Note that X.mean(0) computes the sum for each of the five columns of X. These 5 numbers are subtracted from each row of X to compute XC.

Part 3, compute the covariance matrix:

COV = sdb.dot(XC.T, YC) / (X.shape[0] - 1)

Part 4, compute the correlation matrix:

# Column vector with column standard deviations of X matrix:
xsd = X.std(0).reshape((5, 1))
# Row vector with column standard deviations of Y matrix:
ysd = Y.std(0).reshape((1, 3))
# Their outer product:
outersd = sdb.dot(xsd, ysd)
COR = COV / outersd
print(COR.toarray())

Which prints:

[[ 1.          0.42008436 -0.01186421]
 [-0.02583465  0.89632943  0.00232589]
 [-0.02296872 -0.00927478  0.00570587]
 [-0.02791049 -0.00367841  0.06882341]
 [-0.01186421 -0.0031508   1.        ]]

The overhead of working interactively with SciDB can make these examples run somewhat slowly for small problems. But the same code shown here can be applied to arbitrarily large matrices, and those computations can run in parallel across a cluster.