Isin #2031
Conversation
Fails on NumPy pre-1.13. Is that too recent to raise the minimum version? 1.14.2 is current, so that would be aggressive.
xarray/core/dataarray.py (outdated)

    @@ -2113,6 +2113,15 @@ def rank(self, dim, pct=False, keep_attrs=False):
            ds = self._to_temp_dataset().rank(dim, pct=pct, keep_attrs=keep_attrs)
            return self._from_temp_dataset(ds)

        def isin(self, test_elements):
You could put this in `xarray/core/common.py`, maybe on `DataWithCoords`?
Great. Does anyone recall the newest way to apply an operation across the Variables in a Dataset (i.e., something more standard than a loop)? I'm sure I saw it in a recent PR but can't find it, either in the code or in the recent PRs. Otherwise a loop is fine.
`apply_ufunc` can handle an `xarray.Dataset` as an argument. Are you thinking of using a Dataset for `test_elements` as well?
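For context, a minimal sketch of `apply_ufunc` being handed a Dataset directly (the toy variable names here are made up for illustration, not taken from the PR):

    import numpy as np
    import xarray as xr

    # apply_ufunc maps the element-wise function over every data variable
    # when given a Dataset, so no explicit loop over Variables is needed.
    ds = xr.Dataset({'var1': ('x', [0, 1, 2]), 'var2': ('x', [2, 3, 4])})
    result = xr.apply_ufunc(np.isin, ds, kwargs=dict(test_elements=[1, 2]))
    # result is a boolean Dataset with the same variables and dimensions as ds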
xarray/core/dataarray.py (outdated)

            np.isin,
            self,
            kwargs=dict(test_elements=test_elements),
        )
Consider adding `dask='parallelized', output_dtypes=[np.bool_]` to make this work with dask.
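A self-contained sketch of how those keywords might slot into the call quoted above (my paraphrase of the suggestion, not the merged code; the standalone `isin` helper is only for illustration):

    import numpy as np
    import xarray as xr

    def isin(obj, test_elements):
        # dask='parallelized' makes apply_ufunc wrap np.isin block-wise
        # instead of computing eagerly; output_dtypes declares the dtype
        # of the lazily constructed result.
        return xr.apply_ufunc(
            np.isin,
            obj,
            kwargs=dict(test_elements=test_elements),
            dask='parallelized',
            output_dtypes=[np.bool_],
        )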
Let's just skip the tests if NumPy is too old.
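One hypothetical way to express that skip is a pytest marker gating on NumPy 1.13, where `np.isin` first appeared (the `requires_np113` name is an assumption, not necessarily what the test suite uses):

    import numpy as np
    import pytest
    from distutils.version import LooseVersion

    # Skip tests that rely on np.isin when NumPy predates 1.13.
    requires_np113 = pytest.mark.skipif(
        LooseVersion(np.__version__) < LooseVersion('1.13'),
        reason='requires numpy >= 1.13 (np.isin)')

    @requires_np113
    def test_isin():
        ...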
xarray/core/dataarray.py (outdated)

    @@ -2,6 +2,7 @@

    import functools
    import warnings
    from distutils.version import LooseVersion
F401 'distutils.version.LooseVersion' imported but unused
xarray/tests/test_dataset.py (outdated)

        .sel(dim2=0, dim3='a')
        .isel(dim1=[0, 1])
        .drop(['time', 'dim3', 'dim2', 'numbers'])
        .squeeze()
Open to feedback on this formatting...
I would rather just construct the desired dataset directly rather than jumping through hoops to reuse the fixture data, e.g.,

    result = Dataset(
        data_vars={
            'var1': (('dim1',), [0, 1]),
            'var2': (('dim1',), [1, 2]),
            'var3': (('dim1',), [0, 1]),
        }
    ).isin([1, 2])

(In general, it's better if tests have less logic.)
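For reference, the boolean Dataset that construction should produce, written as a hypothetical assertion (assuming `Dataset` and `assert_equal` are already imported in the test module):

    expected = Dataset(
        data_vars={
            'var1': (('dim1',), [False, True]),   # [0, 1] isin [1, 2]
            'var2': (('dim1',), [True, True]),    # [1, 2] isin [1, 2]
            'var3': (('dim1',), [False, True]),   # [0, 1] isin [1, 2]
        }
    )
    assert_equal(result, expected)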
Any thoughts on this approach of writing out the result on a slice of a sample dataset / dataarray? I've been thinking about expect tests, as described by @yminsky here.
That's a bit harder with numerical data than with small lists of words (the example in the link), but it's also helpful: we don't have to manually construct the result in Python, just check it the first time and commit the result. And it would enable tests across moderately sized data, rather than only 'toy' examples.
Makes `np.asarray(dataset)` issue an informative error. Currently, `np.asarray(xr.Dataset({'x': 0}))` raises `KeyError: 0`, which makes no sense.
        ).astype('bool')
        result = da.isin([2, 3]).sel(y=list('de'), z=0)
        assert_equal(result, expected)
Can you add another test for the dask path, e.g., one that calls `.chunk()` on the input and verifies it gives the same result after `.compute()`? That will need a skipif for dask.
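A rough sketch of such a round-trip test; `pytest.importorskip` stands in here for whatever skip helper the test suite actually uses, and the details may differ from the PR's final test:

    import numpy as np
    import pytest
    import xarray as xr
    from xarray.testing import assert_equal

    def test_isin_dask():
        pytest.importorskip('dask')  # skip when dask is not installed
        da = xr.DataArray(np.arange(6).reshape(2, 3), dims=('x', 'y'))
        expected = da.isin([2, 3])                           # eager numpy path
        actual = da.chunk({'x': 1}).isin([2, 3]).compute()   # dask-backed path
        assert_equal(expected, actual)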
(Done in `test_dataset`, lmk if better to put in both.)
        return apply_ufunc(
            np.isin,
            self,
            kwargs=dict(test_elements=test_elements),
It's probably a better idea to explicitly unwrap `.data` from `test_elements` if it's an xarray object, and explicitly raise for `xarray.Dataset`. Otherwise numpy will probably give a really strange error message.
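A hedged sketch of the guard being described, written as if it lived on a method in `xarray/core/common.py` (assumed placement, not the merged implementation; it presumes `import numpy as np` at module level):

    def isin(self, test_elements):
        from .computation import apply_ufunc
        from .dataarray import DataArray
        from .dataset import Dataset
        from .variable import Variable

        if isinstance(test_elements, Dataset):
            # a Dataset cannot be coerced into a single array; fail loudly
            # rather than letting numpy raise something cryptic
            raise TypeError(
                'isin() argument must be convertible to an array: %r'
                % test_elements)
        elif isinstance(test_elements, (Variable, DataArray)):
            # unwrap the underlying (possibly dask) array so a future
            # dask-aware np.isin could stay lazy
            test_elements = test_elements.data

        return apply_ufunc(
            np.isin,
            self,
            kwargs=dict(test_elements=test_elements),
            dask='parallelized',
            output_dtypes=[np.bool_],
        )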
Indeed, see #2032 for what converting a Dataset to a numpy array does.
I merged your branch - is that better than an additional check?
Why extract `.data` - won't the standard machinery take care of that? I added a test.
I suppose you're right, the standard machinery will work fine here for now. In the future, when dask supports isin (dask/dask#3363), we'll want to use `.data` so we can keep it as a dask array.
This looks good to me now! I just merged #2032 separately, so after merging master this should be good to go in.
Green! @shoyer
I'll merge this later tonight given @shoyer's previous approval, unless there's any feedback.
Thanks @maxim-lian. As a follow-up, it might be nice to include an example showing how to use this for indexing non-dimensions in the narrative docs somewhere -- maybe in the section on
@shoyer the cookbook might be a good place for that
Indeed, but combined
Yes, good idea. I'll add that to my (metaphorical) list.
`whats-new.rst` for all changes and `api.rst` for new API (remove if this change should not be visible to users, e.g., if it is an internal clean-up, or if this is part of a larger project that will be documented later)

This is an initial implementation of `isin`. It works for DataArrays, but isn't yet implemented further.

I've been away from these sorts of features for too long: is there a canonical place to put these? What's the canonical approach to extending to Datasets? I can see a few approaches in the code.
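For illustration only, here is how usage might read once the method exists on both containers; the Dataset behaviour is exactly the open question above, so treat this as a sketch rather than the shipped API:

    import xarray as xr

    da = xr.DataArray([1, 2, 3], dims='x')
    da.isin([2, 4])   # -> boolean DataArray: [False, True, False]

    ds = xr.Dataset({'a': ('x', [1, 2, 3]), 'b': ('x', [4, 5, 6])})
    ds.isin([2, 4])   # -> boolean Dataset, applied to every data variable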