
Isin #2031


Merged: 25 commits merged into pydata:master on Apr 4, 2018

Conversation

@max-sixty (Collaborator) commented Mar 30, 2018

  • Tests added (for all bug fixes or enhancements)
  • Tests passed (for all non-documentation changes)
  • Fully documented, including whats-new.rst for all changes and api.rst for new API (remove if this change should not be visible to users, e.g., if it is an internal clean-up, or if this is part of a larger project that will be documented later)

This is an initial implementation of isin. It works for DataArrays but isn't yet implemented for Datasets.

I've been away from these sorts of features for too long: Is there a canonical place to put these? What's the canonical approach to extending to Datasets? I can see a few approaches in the code.

@max-sixty (Collaborator, Author):

Fails on NumPy before 1.13. Is that too recent to become the minimum version? 1.14.2 is the current release, so requiring 1.13 would be aggressive.

@@ -2113,6 +2113,15 @@ def rank(self, dim, pct=False, keep_attrs=False):
ds = self._to_temp_dataset().rank(dim, pct=pct, keep_attrs=keep_attrs)
return self._from_temp_dataset(ds)

def isin(self, test_elements):
Member:

You could put this in xarray/core/common.py, maybe on DataWithCoords?

Collaborator (Author):

Great. Does anyone recall the newest way to apply a function across the Variables in a Dataset (i.e., something more standard than a loop)?

I'm sure I saw it in a recent PR but can't find it, either in the code or in recent PRs.

Otherwise a loop is fine.

Member:

apply_ufunc can handle an xarray.Dataset as an argument.

Are you thinking of using a Dataset for test_elements as well?

np.isin,
self,
kwargs=dict(test_elements=test_elements),
)
Member:

Consider adding `dask='parallelized', output_dtypes=[np.bool_]` to make this work with dask.
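
For context, here is a minimal standalone sketch of what the snippet above plus this dask suggestion might amount to, written as a free function rather than the eventual method (the function name and call shape are illustrative, not the merged implementation):

import numpy as np
import xarray as xr


def isin(obj, test_elements):
    # Elementwise membership test against test_elements (sketch only).
    return xr.apply_ufunc(
        np.isin,
        obj,
        kwargs=dict(test_elements=test_elements),
        dask='parallelized',        # process dask-backed arrays block-wise
        output_dtypes=[np.bool_],   # np.isin always returns booleans
    )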

@shoyer (Member) commented Mar 30, 2018

Let's just skip the tests if numpy is too old.
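
As a hedged sketch, one common way to write that skip (the marker name is invented here; 1.13 is the NumPy release where np.isin first appeared):

from distutils.version import LooseVersion

import numpy as np
import pytest
import xarray as xr

# Skip isin tests on NumPy older than 1.13, which lacks np.isin.
requires_np113 = pytest.mark.skipif(
    LooseVersion(np.__version__) < LooseVersion('1.13'),
    reason='requires numpy >= 1.13 for np.isin')


@requires_np113
def test_isin():
    da = xr.DataArray([1, 2, 3], dims='x')
    assert da.isin([2]).sum() == 1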

@@ -2,6 +2,7 @@

import functools
import warnings
from distutils.version import LooseVersion
Contributor:

F401 'distutils.version.LooseVersion' imported but unused

.sel(dim2=0, dim3='a')
.isel(dim1=[0, 1])
.drop(['time', 'dim3', 'dim2', 'numbers'])
.squeeze()
Collaborator (Author):

Open to feedback on this formatting...

Member:

I would rather just construct the desired dataset directly rather than jumping through hoops to reuse the fixture data, e.g.,

result = Dataset(
    data_vars={
        'var1': (('dim1',), [0, 1]),
        'var2': (('dim1',), [1, 2]),
        'var3': (('dim1',), [0, 1]),
    }
).isin([1, 2])

(In general, it's better if tests have less logic.)

@max-sixty (Collaborator, Author):

Any thoughts on this approach of writing out the result on a slice of a sample dataset / dataarray?

I've been thinking about expect tests, as described by @yminsky here.
That would be something like:

  • Have some example datasets (similar to what we do now, though with a well known seed)
  • Run our functions and save to a file, as a known good output
  • During tests, compare the result to the known good output
  • Where different, raise and show the diff

That's a bit harder with numerical data than with small lists of words (the example in the link), but it would also be helpful: we wouldn't have to manually construct the result in Python, just check it the first time and commit the result. It would also enable tests across moderately sized data, rather than only 'toy' examples.
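
A rough sketch of what such an expect-style test could look like, assuming a netCDF backend is available and a known-good file is committed (the path, seed, and int8 cast below are illustrative choices, not part of this PR):

import os

import numpy as np
import xarray as xr

GOLDEN = 'expected_isin.nc'  # hypothetical committed known-good output


def test_isin_expect():
    rng = np.random.RandomState(0)  # well-known seed
    da = xr.DataArray(rng.randint(0, 5, size=(3, 4)),
                      dims=('x', 'y'), name='sample')
    # Cast to int8 so the boolean result round-trips through netCDF cleanly.
    result = da.isin([1, 2]).astype('int8')

    if not os.path.exists(GOLDEN):
        result.to_netcdf(GOLDEN)  # first run: write the file, inspect, commit
    expected = xr.open_dataarray(GOLDEN)
    xr.testing.assert_equal(result, expected)  # fails loudly on any mismatch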

max-sixty and others added 2 commits March 30, 2018 19:33
Makes `np.asarray(dataset)` issue an informative error. Currently,
`np.asarray(xr.Dataset({'x': 0}))` raises `KeyError: 0`, which makes no sense.

).astype('bool')
result = da.isin([2, 3]).sel(y=list('de'), z=0)
assert_equal(result, expected)

Member:

Can you add another test for the dask path, e.g., one that calls .chunk() on the input and verifies it gives the same result after .compute()? That will need a skipif for dask.
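
One possible shape for that test, as a sketch: the `da` argument stands in for the sample DataArray fixture used above, and the 'x' dimension is an assumption; xarray's own test suite would more likely use its requires_dask decorator than importorskip:

import pytest

from xarray.testing import assert_equal

dask = pytest.importorskip('dask')  # skip entirely if dask is not installed


def test_isin_dask(da):
    # Chunk the input, compute the dask-backed result, and check it matches
    # the eager (NumPy) result.
    expected = da.isin([2, 3])
    result = da.chunk({'x': 1}).isin([2, 3]).compute()
    assert_equal(result, expected)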

Collaborator (Author):

(Done in test_dataset; let me know if it's better to put it in both.)

return apply_ufunc(
np.isin,
self,
kwargs=dict(test_elements=test_elements),
Member:

It's probably a better idea to explicitly unwrap .data from test_elements if it's an xarray object, and explicitly raise for xarray.Dataset. Otherwise numpy will probably give a really strange error message.

Member:

Indeed, see #2032 for what converting a Dataset to a numpy array does.

Collaborator (Author):

I merged your branch; is that better than an additional check?

Why extract .data? Won't the standard machinery take care of that? I added a test.

Member:

I suppose you're right, the standard machinery will work fine here for now. In the future when dask supports isin (dask/dask#3363) we'll want to use .data so we can keep it as a dask array.
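
Putting this thread together, a hedged sketch of what the suggested guard and unwrapping could look like (again as a free function; the error message and exact type checks are illustrative):

import numpy as np
import xarray as xr


def isin(obj, test_elements):
    if isinstance(test_elements, xr.Dataset):
        # A Dataset has no single array to test against, so raise a clear
        # error instead of letting np.asarray fail confusingly (see #2032).
        raise TypeError(
            'isin() argument must be convertible to an array: %r'
            % test_elements)
    elif isinstance(test_elements, (xr.DataArray, xr.Variable)):
        # Unwrap the underlying (possibly dask) array so a future
        # dask-aware isin could keep the computation lazy.
        test_elements = test_elements.data
    return xr.apply_ufunc(
        np.isin,
        obj,
        kwargs=dict(test_elements=test_elements),
        dask='parallelized',
        output_dtypes=[np.bool_],
    )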

@shoyer (Member) left a review comment:

This looks good to me now! I just merged #2032 separately, so after merging master this should be good to go in.

@max-sixty (Collaborator, Author):

Green! @shoyer

@max-sixty (Collaborator, Author):

I'll merge this later tonight given @shoyer's previous approval, unless there's any feedback.

@shoyer merged commit a5f7d6a into pydata:master on Apr 4, 2018
@shoyer (Member) commented Apr 4, 2018

Thanks @maxim-lian.

As a follow-up, it might be nice to include an example showing how to use this for indexing non-dimensions in the narrative docs somewhere -- maybe in the section on where?

@dcherian (Contributor) commented Apr 4, 2018

@shoyer the cookbook might be a good place for that

@shoyer (Member) commented Apr 4, 2018

Indeed, but combined where/isin is also basically equivalent to indexing, so I think it's appropriate on that doc page, too.
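
Such a docs example might look roughly like this (the dataset and coordinate names are invented for illustration): combining where and isin with drop=True selects by a non-dimension coordinate much like indexing would.

import xarray as xr

ds = xr.Dataset(
    {'temperature': (('station',), [10.2, 11.5, 9.8, 12.0])},
    coords={'label': (('station',), ['a', 'b', 'c', 'd'])})

# Keep only the stations whose non-dimension coordinate 'label' is in the
# given list; where + isin with drop=True behaves like indexing here.
subset = ds.where(ds['label'].isin(['b', 'd']), drop=True)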

@max-sixty deleted the isin branch on April 4, 2018 03:51
@max-sixty (Collaborator, Author):

Yes good idea. I'll add that to my (metaphorical) list.
