# Comparing and Filtering Arrays¶

SciDB-Py provides support for comparing and filtering SciDB arrays, using NumPy-like syntax.

The standard inequality operators perform element-wise inequality testing between SciDB arrays, NumPy arrays, and scalars:

In [1]: from scidbpy import connect

In [2]: sdb = connect()

In [3]: x = sdb.arange(5)

In [4]: x.toarray()
Out[4]: array([0, 1, 2, 3, 4])

In [5]: (x < 2).toarray()
Out[5]: array([ True,  True, False, False, False], dtype=bool)

In [6]: y = np.array([5, 0, 3, 2, 1])

In [7]: (x < y).toarray()
Out[7]: array([ True, False,  True, False, False], dtype=bool)

In [13]: z = sdb.from_array(y)

In [14]: (x < z).toarray()
Out[14]: array([ True, False,  True, False, False], dtype=bool)


Array broadcasting is not currently performed when comparing two arrays – they must have identical shapes.

As with NumPy arrays, boolean SciDB arrays can be used as masks:

In [4]: r = sdb.random((3,4))

In [5]: r.toarray()
Out[5]:
array([[ 0.72039148,  0.6497302 ,  0.84122248,  0.87304017],
[ 0.14896572,  0.71237498,  0.21999935,  0.14793879],
[ 0.69345283,  0.18611741,  0.43660223,  0.06478555]])

In [6]: r[r > 0.5].toarray()
Out[6]:
array([[ 0.72039148,  0.6497302 ,  0.84122248,  0.87304017],
[ 0.        ,  0.71237498,  0.        ,  0.        ],
[ 0.69345283,  0.        ,  0.        ,  0.        ]])


Note that this masking behavior is different than NumPy – NumPy collapses the input array when masking, returning a 1D result of unmasked items. To reproduce this behavior in SciDB-Py, use the collapse() method:

In [9]: r[r > 0.5].collapse().toarray()
Out[9]:
array([ 0.72039148,  0.6497302 ,  0.84122248,  0.87304017,  0.71237498,
0.69345283])


SciDB-Py behaves this way in order to retain the location of unmasked items, which is often useful information. For example, we can see these locations when using todataframe():

In [10]: r[r > 0.5].todataframe()
Out[10]:
f0
i0 i1
0  0   0.720391
1   0.649730
2   0.841222
3   0.873040
1  1   0.712375
2  0   0.693453


Note

SciDB-Py’s masking behavior was changed in version 12.10. Prior to this, SciDB-Py collapsed results like NumPy

## Extracting values along a particular axis¶

Use the SciDBArray.compress() method to extract row or column subsets of an array. For example, to extract all rows where the sum across all columns exceeds a threshold:

In [3]: x = sdb.random((3,4))

In [4]: x.toarray()
Out[4]:
array([[ 0.10111977,  0.5511177 ,  0.49532397,  0.4213646 ],
[ 0.3812068 ,  0.97679566,  0.20473656,  0.40256096],
[ 0.2387294 ,  0.88714084,  0.01064819,  0.48275173]])

In [5]: x.mean(1).toarray()
Out[5]: array([ 0.39223151,  0.491325  ,  0.40481754])

In [6]: x.compress(x.mean(1) > 0.4, axis=0).toarray()
Out[6]:
array([[ 0.3812068 ,  0.97679566,  0.20473656,  0.40256096],
[ 0.2387294 ,  0.88714084,  0.01064819,  0.48275173]])


## Aggregation based on masked values¶

A future version of SciDB-Py will provide a groupby operator, allowing comparisons to be used to compute group-wise aggregates:

sdb.groupby(x, x < 0.5).sum()


Until that method is added, you can perform the same computation with two aggregate calls:

mask = x < 0.5