# Basic Math on SciDB array objects

Operations on `SciDBArray` objects generally return new `SciDBArray` objects.
The general idea is to promote function composition involving `SciDBArray`
objects without moving data between SciDB and Python.

The `scidbpy` package provides quite a few common operations including
subsetting, pointwise application of scalar functions, aggregations, and
pointwise and matrix arithmetic.

Standard numpy attributes like `shape`, `ndim` and `size` are defined for
`SciDBArray` objects:

```
>>> X = sdb.random((5, 10))
>>> X.shape
(5, 10)
>>> X.size
50
>>> X.ndim
2
```

Many SciDB-specific attributes are also defined,
including `chunk_size`, `chunk_overlap`, and `sdbtype`:

```
>>> X.chunk_size
[1000, 1000]
>>> X.chunk_overlap
[0, 0]
>>> X.sdbtype
sdbtype('<f0:double>')
```

SciDBArrays also contain a `datashape` object, which encapsulates much of the
interface between Python and SciDB data, including the full array schema:

```
>>> Xds = X.datashape
>>> Xds.schema
'<f0:double> [i0=0:4,1000,0,i1=0:9,1000,0]'
```
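To make the pieces of that schema string concrete, the sketch below rebuilds it from the shape, chunk size, and chunk overlap shown above. `schema_string` is a hypothetical helper written for illustration, not part of `scidbpy`. It assumes zero-based dimensions named `i0`, `i1`, and so on, a single attribute, and the same chunk settings for every dimension, as in the example output.

```python
def schema_string(shape, chunk_size=1000, chunk_overlap=0,
                  attr="f0", dtype="double"):
    """Hypothetical helper: format a schema like the one printed above."""
    # Each dimension is written as name=low:high,chunk_size,chunk_overlap.
    dims = ",".join(
        f"i{k}=0:{n - 1},{chunk_size},{chunk_overlap}"
        for k, n in enumerate(shape)
    )
    return f"<{attr}:{dtype}> [{dims}]"


schema = schema_string((5, 10))
print(schema)
```

Read this way, the `0:4` and `0:9` ranges correspond to the `(5, 10)` shape, and the trailing `1000,0` pairs match `chunk_size` and `chunk_overlap`.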

## Scalar functions of SciDBArray objects (aggregations)

The package exposes the following aggregations:

Name | Description |
---|---|
`min()` | minimum value |
`max()` | maximum value |
`sum()` | sum of values |
`var()` | variance of values |
`stdev()` | standard deviation of values |
`std()` | standard deviation of values |
`avg()` | average/mean of values |
`mean()` | average/mean of values |
`count()` | count of nonempty cells |
`approxdc()` | fast estimate of the number of distinct values |

**Examples: Minimum Aggregates**

Each operation can be computed across the entire array, or across specified dimensions by passing the index or indices of the desired dimensions. For example:

```
>>> np.random.seed(0)
>>> X = sdb.from_array(np.random.random((5, 3)))
>>> X.toarray()
array([[ 0.5488135 , 0.71518937, 0.60276338],
[ 0.54488318, 0.4236548 , 0.64589411],
[ 0.43758721, 0.891773 , 0.96366276],
[ 0.38344152, 0.79172504, 0.52889492],
[ 0.56804456, 0.92559664, 0.07103606]])
```

Here we’ll find the minimum of all values in the array. The returned result is a new SciDBArray, so we select the first element:

```
>>> X.min()[0]
0.071036058197886942
```

Like numpy, passing index 0 gives us the minimum within every column:

```
>>> X.min(0).toarray()
array([ 0.38344152, 0.4236548 , 0.07103606])
```

Passing index 1 gives us the minimum within every row:

```
>>> X.min(1).toarray()
array([ 0.5488135 , 0.4236548 , 0.43758721, 0.38344152, 0.07103606])
```

Note that the convention for specifying aggregate indices here is designed
to match numpy, and is *opposite the convention used within SciDB*.
To recover SciDB-style aggregates, you can use the `scidb_syntax` flag:

```
>>> X.min(1, scidb_syntax=True).toarray()
array([ 0.38344152, 0.4236548 , 0.07103606])
```
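One way to read the two conventions: the numpy-style index names the dimension that is collapsed, while the SciDB-style index names the dimension that is kept. The sketch below illustrates that reading locally with numpy. `numpy_axes_to_scidb_dims` is a hypothetical helper for illustration only, not a `scidbpy` function.

```python
import numpy as np


def numpy_axes_to_scidb_dims(axes, ndim):
    """Hypothetical helper: dimensions kept when `axes` are collapsed."""
    if isinstance(axes, int):
        axes = (axes,)
    collapsed = {a % ndim for a in axes}
    return tuple(d for d in range(ndim) if d not in collapsed)


np.random.seed(0)
A = np.random.random((5, 3))

# numpy style: collapse dimension 0, leaving one minimum per column.
col_min = A.min(axis=0)

# The same aggregate phrased SciDB style: keep dimension 1.
kept = numpy_axes_to_scidb_dims(0, A.ndim)
print(col_min, kept)
```

Under this reading, `X.min(0)` and `X.min(1, scidb_syntax=True)` describe the same per-column minimum, which matches the outputs shown above.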

**Further Examples**

These operations return new `SciDBArray` objects consisting of
scalar values. Here are a few examples that materialize their results to
Python:

```
>>> tridiag.count()[0]
28
>>> tridiag.sum()[0]
20.0
>>> tridiag.var()[0]
1.6190476190476193
```

Note that a count of nonempty cells is also directly available from the
`nonempty()` function:

```
>>> tridiag.nonempty()
28
```

A related function is `nonnull()`, which counts the number of nonempty
cells which do not contain a null value. In this case, the result is the
same as `nonempty()`:

```
>>> tridiag.nonnull()
28
```
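The aggregate figures above only make sense if the aggregates run over the 28 nonempty cells rather than over every position in the array. The numpy sketch below checks this locally. The array values are an assumption: a 10 by 10 tridiagonal array with 2 on the diagonal, +1 above it, and -1 below it was chosen because it reproduces the count, sum, and variance shown, and the actual `tridiag` array may be built differently. Under that assumption the variance matches a sample variance (`ddof=1`).

```python
import numpy as np

n = 10
# Assumed values: 2 on the diagonal, +1 above it, -1 below it.
dense = 2 * np.eye(n) + np.eye(n, k=1) - np.eye(n, k=-1)

# Assumed emptiness mask: only the three central diagonals are stored.
i, j = np.indices((n, n))
nonempty = np.abs(i - j) <= 1

cells = dense[nonempty]              # values of the stored cells only
count = int(nonempty.sum())          # number of nonempty cells
total = float(cells.sum())           # sum over nonempty cells
variance = float(cells.var(ddof=1))  # sample variance over nonempty cells
print(count, total, variance)
```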

## Pointwise application of scalar functions

The package exposes SciDB scalar functions (one value in, one value out) that can be applied element-wise to SciDB arrays:

Function | Description |
---|---|
`sin()` | Trigonometric sine |
`asin()` | Trigonometric arc-sine / inverse sine |
`cos()` | Trigonometric cosine |
`acos()` | Trigonometric arc-cosine / inverse cosine |
`tan()` | Trigonometric tangent |
`atan()` | Trigonometric arc-tangent / inverse tangent |
`exp()` | Natural exponent |
`log()` | Natural logarithm |
`log10()` | Base-10 logarithm |
`sqrt()` | Square root |
`ceil()` | Ceiling function |
`floor()` | Floor function |
`is_nan()` | Test for NaN values |

All trigonometric functions assume arguments are given in radians. Here is a simple example that compares a computation in SciDB with a local one (using the `tridiag` array from the previous examples):

```
>>> sin_tri = sdb.sin(tridiag)
>>> np.linalg.norm(sin_tri.toarray() - np.sin(tridiag.toarray()))
0.0
```
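Because the functions expect radians, angles stored in degrees need converting first. The sketch below shows the conversion locally with numpy only. It is an assumption, not something shown in this guide, that the same scaling by pi/180 could be applied to a `SciDBArray` through scalar multiplication before calling `sdb.sin`.

```python
import numpy as np

degrees = np.array([0.0, 30.0, 90.0])

# Convert to radians before applying the trigonometric function.
radians = degrees * (np.pi / 180.0)
sines = np.sin(radians)
print(sines)
```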

## Shape and layout functions

Arrays may be transposed and their data re-arranged into new shapes
with the usual `transpose()` and `reshape()` functions:

```
>>> tri_reshape = tridiag.reshape((20,5))
>>> tri_reshape.shape
(20, 5)
>>> tri_reshape.transpose().shape
(5, 20)
>>> tri_reshape.T.shape # shortcut for transpose
(5, 20)
```

## Arithmetic

The package defines elementwise operations on all arrays and linear algebra operations on matrices and vectors. Scalar multiplication is supported.

Element-wise sums, differences, and scalar multiples:

```
>>> np.random.seed(1)
>>> X = sdb.from_array(np.random.random((10, 10)))
>>> Y = sdb.from_array(np.random.random((10, 10)))
>>> S = X + Y
>>> D = X - Y
>>> M = 2 * X
>>> (S + D - M).sum()[0]
-1.1102230246251565e-16
```

We can combine operations as well:

```
>>> Z = 0.5 * (X + X.T)
```
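For comparison, here is a local numpy version of the same arithmetic. It is only a sketch of the expected behavior: the identity `S + D - M` should vanish up to floating-point rounding, and `0.5 * (X + X.T)` should be symmetric. The exact rounding residual may differ from the SciDB result above.

```python
import numpy as np

np.random.seed(1)
X = np.random.random((10, 10))
Y = np.random.random((10, 10))

S = X + Y
D = X - Y
M = 2 * X

# (X + Y) + (X - Y) - 2X is zero in exact arithmetic.
residual = (S + D - M).sum()

# Averaging a matrix with its transpose gives its symmetric part.
Z = 0.5 * (X + X.T)
print(residual, np.allclose(Z, Z.T))
```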

There are also linear algebra operations (matrix-matrix product, matrix-vector
product) using the `dot()` function:

```
>>> XY = sdb.dot(X, Y)
>>> XY1 = sdb.dot(X, Y[:,1])
>>> XTX = sdb.dot(X.T, X)
```
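The shapes involved are easiest to see in a local numpy analogue. This sketch assumes `sdb.dot` follows the same shape rules as `numpy.dot` for these three cases, which this guide does not state explicitly.

```python
import numpy as np

np.random.seed(1)
X = np.random.random((10, 10))
Y = np.random.random((10, 10))

XY = np.dot(X, Y)          # matrix-matrix product
XY1 = np.dot(X, Y[:, 1])   # matrix-vector product with column 1 of Y
XTX = np.dot(X.T, X)       # Gram matrix, symmetric by construction

print(XY.shape, XY1.shape, XTX.shape)
```

The matrix-vector product is the same as column 1 of the full product `XY`, which gives a cheap consistency check.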

## Broadcasting

Numpy broadcasting conventions are generally followed in operations involving
differently-sized `SciDBArray` objects. Consider the following example
that centers a matrix by subtracting its column average from each column.

First we create a test array with 5 columns:

```
>>> np.random.seed(0)
>>> X = sdb.from_array(np.random.random((10, 5)))
```

Now create a vector of column means:

```
>>> xcolmean = X.mean(0)
>>> xcolmean.shape
(5,)
```

Subtract these means from the columns – this is a broadcasting operation:

```
>>> XC = X - xcolmean
```

To check that the columns are now centered,
we compute the column mean of `XC`:

```
>>> XC.mean(0).toarray()
array([ -2.22044605e-17, 4.44089210e-17, -1.11022302e-17,
1.11022302e-16, -3.33066907e-17])
```

The broadcasting operation which creates `XC` is implemented using a
join operation along dimension 1.
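The same centering can be reproduced locally to see how the shapes line up. This is a plain numpy sketch of the broadcasting rule, not of the SciDB join: a `(10, 5)` array minus a `(5,)` vector aligns the vector with the last dimension, so each column has its own mean subtracted.

```python
import numpy as np

np.random.seed(0)
X = np.random.random((10, 5))

# One mean per column: collapsing dimension 0 leaves shape (5,).
xcolmean = X.mean(axis=0)

# (10, 5) - (5,) broadcasts the vector across all 10 rows.
XC = X - xcolmean

# Column means of the centered array are zero up to rounding.
centered_means = XC.mean(axis=0)
print(xcolmean.shape, XC.shape, centered_means)
```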

## Lazy Evaluation

When possible, SciDB-Py defers actual database computation
until data are needed. It does this by using **lazy arrays**, which are
references to as-yet unevaluated SciDB queries. Many array methods
actually return lazy arrays:

```
>>> x = sdb.random((3,4))
>>> x.name # an array in the database
'py1102522658694_00001'
>>> y = x.mean(0)
>>> y.name # not yet in the database
'aggregate(py1102522658694_00001,avg(f0),i1)'
```

Note that y’s name doesn’t refer to an array in the database, but rather a query on x. Lazy arrays can also be identified by their non-null query attribute:

```
>>> y.query
'aggregate(py1102522658694_00001,avg(f0),i1)'
>>> x.query is None
True
```

Calling `eval()` forces lazy arrays to be evaluated (it has no effect
on non-lazy arrays):

```
>>> y.eval()
>>> y.name
'py1102522658694_00014'
```

In most cases you don’t need to worry about whether an array is lazy or not – lazy arrays have all the same methods as regular arrays, and normally the difference is transparent to the user. However, lazy arrays can be more efficient with regard to compound queries. Consider an equation like the law of cosines:

```
c2 = a ** 2 + b ** 2 - 2 * a * b * sdb.cos(C)
```

This equation involves creating 7 intermediate data products:

`t1 = a ** 2`

`t2 = b ** 2`

`t3 = 2 * a`

`t4 = t3 * b`

`t5 = sdb.cos(C)`

`t6 = t4 * t5`

`t7 = t1 + t2`

`c2 = t7 - t6`

If `a`, `b`, and `C` are large SciDBArrays, this involves many
round-trip communications to the database, several passes over the data,
and the storage of 7 arrays. Lazy arrays reduce this overhead by representing
some of these temporary arrays as unevaluated sub-queries. Passing larger
queries to SciDB at once also gives the database more opportunity to optimize
the final query, performing the computation in fewer passes over the data.
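The idea can be modeled in a few lines of plain Python. The sketch below is a toy, not SciDB-Py's implementation: `ToyArray`, `cos`, and the `executed` list are hypothetical names, the query text is ordinary infix notation rather than real SciDB query syntax, and "sending" a query just appends it to a list. It shows the law-of-cosines expression collapsing into one nested query that is submitted only when `eval()` is called.

```python
import itertools

_ids = itertools.count(1)
executed = []  # queries "sent" to the pretend database


def _ref(x):
    """Text used to refer to an operand inside a larger query."""
    return x.name if isinstance(x, ToyArray) else repr(x)


class ToyArray:
    """Toy lazy array: `query` is None once the array is stored."""

    def __init__(self, name, query=None):
        self.name = name
        self.query = query

    def _binary(self, op, other, swap=False):
        lhs, rhs = _ref(self), _ref(other)
        if swap:
            lhs, rhs = rhs, lhs
        text = f"({lhs} {op} {rhs})"
        # Mirror the behavior shown earlier: a lazy array's name is its query.
        return ToyArray(text, query=text)

    def __add__(self, other):
        return self._binary("+", other)

    def __sub__(self, other):
        return self._binary("-", other)

    def __mul__(self, other):
        return self._binary("*", other)

    def __rmul__(self, other):
        return self._binary("*", other, swap=True)

    def __pow__(self, other):
        return self._binary("**", other)

    def eval(self):
        """Submit the pending query once; no effect on stored arrays."""
        if self.query is not None:
            executed.append(self.query)
            self.name = f"tmp_{next(_ids):05d}"
            self.query = None
        return self


def cos(x):
    text = f"cos({_ref(x)})"
    return ToyArray(text, query=text)


a, b, C = ToyArray("a"), ToyArray("b"), ToyArray("C")
c2 = a ** 2 + b ** 2 - 2 * a * b * cos(C)

pending = c2.query            # one nested query, nothing sent yet
sent_before_eval = len(executed)
c2.eval()                     # a single submission for the whole expression
print(pending, sent_before_eval, len(executed), c2.name)
```

In this toy model none of the seven intermediate results is ever stored on its own; only the final array receives a stored name.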

In some situations it’s necessary or more efficient to
force evaluation of lazy arrays (often places where an array appears
several times in a complex query). Some SciDB-Py methods perform this
evaluation internally. You should also consider calling `eval()` on lazy
arrays if you think the unevaluated queries are becoming too cumbersome.