Basic Math on SciDB array objects

Operations on SciDBArray objects generally return new SciDBArray objects. The general idea is to promote function composition involving SciDBArray objects without moving data between SciDB and Python.

The scidbpy package provides quite a few common operations including subsetting, pointwise application of scalar functions, aggregations, and pointwise and matrix arithmetic.

Standard numpy attributes like shape, ndim and size are defined for SciDBArray objects:

>>> X = sdb.random((5, 10))
>>> X.shape
(5, 10)
>>> X.size
50
>>> X.ndim
2

Many SciDB-specific attributes are also defined, including chunk_size, chunk_overlap, and sdbtype:

>>> X.chunk_size
[1000, 1000]
>>> X.chunk_overlap
[0, 0]
>>> X.sdbtype
sdbtype('<f0:double>')

SciDBArrays also contain a datashape object, which encapsulates much of the interface between Python and SciDB data, including the full array schema:

>>> Xds = X.datashape
>>> Xds.schema
'<f0:double> [i0=0:4,1000,0,i1=0:9,1000,0]'
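The schema string carries the same information as the shape, chunk_size and chunk_overlap attributes shown above. The following is a minimal sketch of how those values can be read back out of it. The regular expression and variable names are illustrative assumptions and are not part of scidbpy; the sketch only handles bounded integer dimensions written as name=low:high,chunk_size,chunk_overlap, as in this example.

```python
import re

schema = "<f0:double> [i0=0:4,1000,0,i1=0:9,1000,0]"

# Each dimension is written as name=low:high,chunk_size,chunk_overlap.
dim_pattern = re.compile(r"(\w+)=(\d+):(\d+),(\d+),(\d+)")
dims = [
    (name, int(low), int(high), int(chunk), int(overlap))
    for name, low, high, chunk, overlap in dim_pattern.findall(schema)
]

# Bounds are inclusive, so the extent of a dimension is high - low + 1.
shape = tuple(d[2] - d[1] + 1 for d in dims)
chunk_size = [d[3] for d in dims]
chunk_overlap = [d[4] for d in dims]

print(shape, chunk_size, chunk_overlap)
```

The recovered values agree with X.shape, X.chunk_size and X.chunk_overlap for the array above.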

Scalar functions of SciDBArray objects (aggregations)

The package exposes the following aggregations:

Name        Description
min()       minimum value
max()       maximum value
sum()       sum of values
var()       variance of values
stdev()     standard deviation of values
std()       standard deviation of values
avg()       average/mean of values
mean()      average/mean of values
count()     count of nonempty cells
approxdc()  fast estimate of the number of distinct values

Examples: Minimum Aggregates

Each operation can be computed across the entire array, or across specified dimensions by passing the index or indices of the desired dimensions. For example:

>>> np.random.seed(0)
>>> X = sdb.from_array(np.random.random((5, 3)))
>>> X.toarray()
array([[ 0.5488135 ,  0.71518937,  0.60276338],
       [ 0.54488318,  0.4236548 ,  0.64589411],
       [ 0.43758721,  0.891773  ,  0.96366276],
       [ 0.38344152,  0.79172504,  0.52889492],
       [ 0.56804456,  0.92559664,  0.07103606]])

Here we’ll find the minimum of all values in the array. The returned result is a new SciDBArray, so we select the first element:

>>> X.min()[0]
0.071036058197886942

As in numpy, passing index 0 gives the minimum within every column:

>>> X.min(0).toarray()
array([ 0.38344152,  0.4236548 ,  0.07103606])

Passing index 1 gives us the minimum within every row:

>>> X.min(1).toarray()
array([ 0.5488135 ,  0.4236548 ,  0.43758721,  0.38344152,  0.07103606])

Note that the convention for specifying aggregate indices here is designed to match numpy, and is opposite the convention used within SciDB. To recover SciDB-style aggregates, you can use the scidb_syntax flag:

>>> X.min(1, scidb_syntax=True).toarray()
array([ 0.38344152,  0.4236548 ,  0.07103606])
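For comparison, here is the same computation done locally with plain numpy, which is the convention the default (non-scidb_syntax) indices are meant to match. This sketch does not touch SciDB at all; it only shows which axis is collapsed for each index.

```python
import numpy as np

np.random.seed(0)
X = np.random.random((5, 3))

overall_min = X.min()    # minimum over every cell
col_min = X.min(axis=0)  # collapses the rows: one value per column
row_min = X.min(axis=1)  # collapses the columns: one value per row

print(overall_min)
print(col_min.shape, row_min.shape)
```

The index names the dimension that is aggregated away, so index 0 on a 5 x 3 array leaves 3 values and index 1 leaves 5.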

Further Examples

These operations return new SciDBArray objects consisting of scalar values. Here are a few examples that materialize their results to Python:

>>> tridiag.count()[0]
28
>>> tridiag.sum()[0]
20.0
>>> tridiag.var()[0]
1.6190476190476193

Note that a count of nonempty cells is also directly available from the nonempty() function:

>>> tridiag.nonempty()
28

A related function is nonnull(), which counts the number of nonempty cells which do not contain a null value. In this case, the result is the same as nonempty():

>>> tridiag.nonnull()
28
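The tridiag array is defined in an earlier section and is not reproduced here. As a local cross-check, the sketch below assumes it is a 10 x 10 tridiagonal matrix with 2 on the main diagonal, 1 on the superdiagonal and -1 on the subdiagonal. That choice is an assumption, made because it is consistent with the count, sum and variance printed above; it also assumes var() is the sample variance (divisor n - 1).

```python
import numpy as np

n = 10
# Assumed layout: 2 on the diagonal, +1 above it, -1 below it.
M = 2 * np.eye(n) + np.eye(n, k=1) - np.eye(n, k=-1)

# Treat the zero entries as empty cells, as in a sparse array.
values = M[M != 0]

count = values.size
total = values.sum()
variance = values.var(ddof=1)  # sample variance, divisor n - 1

print(count, total, variance)
```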

Pointwise application of scalar functions

The package exposes SciDB scalar-valued scalar functions that can be applied element-wise to SciDB arrays:

Function  Description
sin()     Trigonometric sine
asin()    Trigonometric arc-sine / inverse sine
cos()     Trigonometric cosine
acos()    Trigonometric arc-cosine / inverse cosine
tan()     Trigonometric tangent
atan()    Trigonometric arc-tangent / inverse tangent
exp()     Natural exponent
log()     Natural logarithm
log10()   Base-10 logarithm
sqrt()    Square root
ceil()    Ceiling function
floor()   Floor function
is_nan()  Test for NaN values

All trigonometric functions assume arguments are given in radians. Here is a simple example that compares a computation in SciDB with a local one (using the `tridiag` array defined in the last examples):

>>> sin_tri = sdb.sin(tridiag)
>>> np.linalg.norm(sin_tri.toarray() - np.sin(tridiag.toarray()))
0.0
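Because the arguments are interpreted as radians, angles held in degrees need converting before a trigonometric function is applied. A minimal local sketch with numpy, which follows the same radians convention (the SciDB side is not exercised here):

```python
import numpy as np

degrees = np.array([0.0, 30.0, 90.0])

# Convert to radians first; passing degrees directly would give wrong results.
radians = np.deg2rad(degrees)
sines = np.sin(radians)

print(sines)
```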

Shape and layout functions

Arrays may be transposed and their data re-arranged into new shapes with the usual transpose() and reshape() functions:

>>> tri_reshape = tridiag.reshape((20,5))
>>> tri_reshape.shape
(20, 5)
>>> tri_reshape.transpose().shape
(5, 20)
>>> tri_reshape.T.shape  # shortcut for transpose
(5, 20)

Arithmetic

The package defines elementwise operations on all arrays and linear algebra operations on matrices and vectors. Scalar multiplication is supported.

Element-wise sums and products:

>>> np.random.seed(1)
>>> X = sdb.from_array(np.random.random((10, 10)))
>>> Y = sdb.from_array(np.random.random((10, 10)))
>>> S = X + Y
>>> D = X - Y
>>> M = 2 * X
>>> (S + D - M).sum()[0]
-1.1102230246251565e-16

We can combine operations as well:

>>> Z = 0.5 * (X + X.T)
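Algebraically, (X + Y) + (X - Y) - 2X is exactly zero, so the tiny nonzero sum above is floating-point rounding rather than an error. The numpy sketch below repeats the arithmetic locally to illustrate this, and also checks that 0.5 * (X + X.T) is symmetric. It is a local analogue only and does not call SciDB.

```python
import numpy as np

np.random.seed(1)
X = np.random.random((10, 10))
Y = np.random.random((10, 10))

S = X + Y
D = X - Y
M = 2 * X

# Zero in exact arithmetic; at most rounding noise in floating point.
residual = (S + D - M).sum()

# Averaging a matrix with its transpose gives its symmetric part.
Z = 0.5 * (X + X.T)

print(residual)
print(np.allclose(Z, Z.T))
```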

There are also linear algebra operations (matrix-matrix product, matrix-vector product) using the dot() function:

>>> XY = sdb.dot(X, Y)
>>> XY1 = sdb.dot(X, Y[:,1])
>>> XTX = sdb.dot(X.T, X)
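The result shapes of these three products follow the usual linear algebra rules. As a local illustration, the numpy sketch below builds the same three products and shows how they relate to each other; it assumes nothing about how SciDB computes them.

```python
import numpy as np

np.random.seed(1)
X = np.random.random((10, 10))
Y = np.random.random((10, 10))

XY = X.dot(Y)         # matrix-matrix product, shape (10, 10)
XY1 = X.dot(Y[:, 1])  # matrix-vector product with column 1 of Y, shape (10,)
XTX = X.T.dot(X)      # Gram matrix of X, symmetric by construction

print(XY.shape, XY1.shape, XTX.shape)
```

Multiplying X by column 1 of Y gives column 1 of the full product XY, which is a convenient consistency check.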

Broadcasting

Numpy broadcasting conventions are generally followed in operations involving differently sized SciDBArray objects. Consider the following example, which centers a matrix by subtracting each column’s mean from that column.

First we create a test array with 5 columns:

>>> np.random.seed(0)
>>> X = sdb.from_array(np.random.random((10, 5)))

Now create a vector of column means:

>>> xcolmean = X.mean(0)
>>> xcolmean.shape
(5,)

Subtract these means from the columns – this is a broadcasting operation:

>>> XC = X - xcolmean

To check that the columns are now centered, we compute the column mean of XC:

>>> XC.mean(0).toarray()
array([ -2.22044605e-17,   4.44089210e-17,  -1.11022302e-17,
         1.11022302e-16,  -3.33066907e-17])

The broadcasting operation which creates XC is implemented using a join operation along dimension 1.
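The same centering can be written in plain numpy, which makes the broadcasting rule explicit: a vector of shape (5,) is matched against the last dimension of the (10, 5) matrix and reused for every row. This is a local analogue of the steps above, not a description of how SciDB executes them.

```python
import numpy as np

np.random.seed(0)
X = np.random.random((10, 5))

xcolmean = X.mean(axis=0)  # one mean per column, shape (5,)
XC = X - xcolmean          # (10, 5) - (5,) broadcasts across the rows

# Every column of the centered matrix should now average to roughly zero.
print(XC.mean(axis=0))
```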

Lazy Evaluation

When possible, SciDB-Py defers actual database computation until data are needed. It does this by using lazy arrays, which are references to as-yet unevaluated SciDB queries. Many array methods actually return lazy arrays:

>>> x = sdb.random((3,4))
>>> x.name  # an array in the database
'py1102522658694_00001'
>>> y = x.mean(0)
>>> y.name  # not yet in the database
'aggregate(py1102522658694_00001,avg(f0),i1)'

Note that y’s name doesn’t refer to an array in the database, but rather a query on x. Lazy arrays can also be identified by their non-null query attribute:

>>> y.query
'aggregate(py1102522658694_00001,avg(f0),i1)'
>>> x.query is None
True

Calling eval() forces lazy arrays to be evaluated (it has no effect on non-lazy arrays):

>>> y.eval()
>>> y.name
'py1102522658694_00014'

In most cases you don’t need to worry about whether an array is lazy or not – lazy arrays have all the same methods as regular arrays, and normally the difference is transparent to the user. However, lazy arrays can be more efficient with regard to compound queries. Consider an equation like the law of cosines:

c2 = a ** 2 + b ** 2 - 2 * a * b * sdb.cos(C)

Evaluated eagerly, this expression creates 7 intermediate data products (t1 through t7) on the way to the final result c2:

  • t1 = a ** 2
  • t2 = b ** 2
  • t3 = 2 * a
  • t4 = t3 * b
  • t5 = sdb.cos(C)
  • t6 = t4 * t5
  • t7 = t1 + t2
  • c2 = t7 - t6

If a, b, and C are large SciDBArrays, this involves many round-trip communications to the database, several passes over the data, and the storage of 7 arrays. Lazy arrays reduce this overhead by representing some of these temporary arrays as unevaluated sub-queries. Passing larger queries to SciDB at once also gives the database more opportunity to optimize the final query, performing the computation in fewer passes over the data.

In some situations it’s necessary or more efficient to force evaluation of lazy arrays (often in places where an array appears several times in a complex query). Some SciDB-Py methods perform this evaluation internally. You should also consider calling eval() on lazy arrays if you think the unevaluated queries are becoming too cumbersome.
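To make the idea of a lazy array concrete, here is a toy sketch of deferred evaluation. The LazyExpr class, the cos helper and the evaluate method are hypothetical names invented for this illustration. They are not part of SciDB-Py, and the real implementation builds SciDB queries rather than Python expressions. The point is only that operators can assemble one compound query string and that nothing is computed until evaluation is requested.

```python
import numpy as np


class LazyExpr:
    """Hypothetical stand-in for a lazy array: stores a query string, not data."""

    def __init__(self, query):
        self.query = query

    def _binary(self, op, other, reflected=False):
        # Lazy operands contribute their sub-query; scalars are embedded by repr().
        rhs = other.query if isinstance(other, LazyExpr) else repr(other)
        left, right = (rhs, self.query) if reflected else (self.query, rhs)
        return LazyExpr(f"({left} {op} {right})")

    def __add__(self, other):
        return self._binary("+", other)

    def __sub__(self, other):
        return self._binary("-", other)

    def __mul__(self, other):
        return self._binary("*", other)

    def __rmul__(self, other):
        return self._binary("*", other, reflected=True)

    def __pow__(self, other):
        return self._binary("**", other)

    def evaluate(self, **values):
        # A single evaluation of the whole expression; numpy stands in for the database.
        return eval(self.query, {"cos": np.cos}, values)


def cos(x):
    # Wraps the operand's query instead of computing anything.
    return LazyExpr(f"cos({x.query})")


a, b, C = LazyExpr("a"), LazyExpr("b"), LazyExpr("C")
c2 = a ** 2 + b ** 2 - 2 * a * b * cos(C)

print(c2.query)  # one compound query, no temporaries stored
result = c2.evaluate(a=3.0, b=4.0, C=np.pi / 2)
print(result)
```

In this sketch the seven temporaries exist only as nested sub-expressions of one string, and the work is done once, when evaluate is called. This mirrors the role eval() plays for lazy arrays.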