
Isin #2031


Merged: 25 commits merged into pydata:master on Apr 4, 2018

Conversation

@max-sixty (Collaborator) commented Mar 30, 2018

  • Tests added (for all bug fixes or enhancements)
  • Tests passed (for all non-documentation changes)
  • Fully documented, including whats-new.rst for all changes and api.rst for new API (remove if this change should not be visible to users, e.g., if it is an internal clean-up, or if this is part of a larger project that will be documented later)

This is an initial implementation of isin. It works for DataArrays but isn't yet implemented for Datasets.

I've been away from these sorts of features for too long: Is there a canonical place to put these? What's the canonical approach to extending to Datasets? I can see a few approaches in the code.

@max-sixty (Collaborator, Author):

Fails on NumPy before 1.13. Is that too recent to become the minimum version? 1.14.2 is the current release, so requiring 1.13 would be aggressive.

@@ -2113,6 +2113,15 @@ def rank(self, dim, pct=False, keep_attrs=False):
ds = self._to_temp_dataset().rank(dim, pct=pct, keep_attrs=keep_attrs)
return self._from_temp_dataset(ds)

def isin(self, test_elements):
Member:

You could put this in xarray/core/common.py, maybe on DataWithCoords?

Collaborator (Author):

Great. Does anyone recall the newest way to apply a function across the Variables in a Dataset (i.e., something more standard than a loop)?

I'm sure I saw it in a recent PR but can't find it, either in the code or in recent PRs.

Otherwise a loop is fine.

Member:

apply_ufunc can handle an xarray.Dataset as an argument.

Are you thinking of using a Dataset for test_elements as well?

np.isin,
self,
kwargs=dict(test_elements=test_elements),
)
Member:

Consider adding `dask='parallelized', output_dtypes=[np.bool_]` to make this work with dask.
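
For context, here is a minimal standalone sketch of what the snippet above plus this dask suggestion might amount to, written as a free function rather than the eventual method (the function name and call shape are illustrative, not the merged implementation):

import numpy as np
import xarray as xr


def isin(obj, test_elements):
    # Elementwise membership test against test_elements (sketch only).
    return xr.apply_ufunc(
        np.isin,
        obj,
        kwargs=dict(test_elements=test_elements),
        dask='parallelized',        # process dask-backed arrays block-wise
        output_dtypes=[np.bool_],   # np.isin always returns booleans
    )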

@shoyer (Member) commented Mar 30, 2018

Let's just skip the tests if numpy is too old.
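
As a hedged sketch, one common way to write that skip (the marker name is invented here; 1.13 is the NumPy release where np.isin first appeared):

from distutils.version import LooseVersion

import numpy as np
import pytest
import xarray as xr

# Skip isin tests on NumPy older than 1.13, which lacks np.isin.
requires_np113 = pytest.mark.skipif(
    LooseVersion(np.__version__) < LooseVersion('1.13'),
    reason='requires numpy >= 1.13 for np.isin')


@requires_np113
def test_isin():
    da = xr.DataArray([1, 2, 3], dims='x')
    assert da.isin([2]).sum() == 1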

@@ -2,6 +2,7 @@

import functools
import warnings
from distutils.version import LooseVersion
Contributor:

F401 'distutils.version.LooseVersion' imported but unused

.sel(dim2=0, dim3='a')
.isel(dim1=[0, 1])
.drop(['time', 'dim3', 'dim2', 'numbers'])
.squeeze()
Collaborator (Author):

Open to feedback on this formatting...

Member:

I would rather just construct the desired dataset directly rather than jumping through hoops to reuse the fixture data, e.g.,

result = Dataset(
    data_vars={
        'var1': (('dim1',), [0, 1]),
        'var2': (('dim1',), [1, 2]),
        'var3': (('dim1',), [0, 1]),
    }
).isin([1, 2])

(In general, it's better if tests have less logic.)

@max-sixty (Collaborator, Author):

Any thoughts on this approach of writing out the result on a slice of a sample dataset / dataarray?

I've been thinking about expect tests, as described by @yminsky here.
That would be something like:

  • Have some example datasets (similar to what we do now, though with a well known seed)
  • Run our functions and save to a file, as a known good output
  • During tests, compare the result to the known good output
  • Where different, raise and show the diff

That's a bit harder with numerical data than with small lists of words (the example in the link), but it would also be helpful: we wouldn't have to manually construct the result in Python, just check it the first time and commit the result. It would also enable tests across moderately sized data, rather than only 'toy' examples.
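
A rough sketch of what such an expect-style test could look like, assuming a netCDF backend is available and a known-good file is committed (the path, seed, and int8 cast below are illustrative choices, not part of this PR):

import os

import numpy as np
import xarray as xr

GOLDEN = 'expected_isin.nc'  # hypothetical committed known-good output


def test_isin_expect():
    rng = np.random.RandomState(0)  # well-known seed
    da = xr.DataArray(rng.randint(0, 5, size=(3, 4)),
                      dims=('x', 'y'), name='sample')
    # Cast to int8 so the boolean result round-trips through netCDF cleanly.
    result = da.isin([1, 2]).astype('int8')

    if not os.path.exists(GOLDEN):
        result.to_netcdf(GOLDEN)  # first run: write the file, inspect, commit
    expected = xr.open_dataarray(GOLDEN)
    xr.testing.assert_equal(result, expected)  # fails loudly on any mismatch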

max-sixty and others added 2 commits March 30, 2018 19:33
Makes `np.asarray(dataset)` issue an informative error. Currently,
`np.asarray(xr.Dataset({'x': 0}))` raises `KeyError: 0`, which makes no sense.

).astype('bool')
result = da.isin([2, 3]).sel(y=list('de'), z=0)
assert_equal(result, expected)

Member:

Can you add another test for the dask path, e.g., one that calls .chunk() on the input and verifies it gives the same result after .compute()? That will need a skipif for dask.
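
One possible shape for that test, as a sketch: the `da` argument stands in for the sample DataArray fixture used above, and the 'x' dimension is an assumption; xarray's own test suite would more likely use its requires_dask decorator than importorskip:

import pytest

from xarray.testing import assert_equal

dask = pytest.importorskip('dask')  # skip entirely if dask is not installed


def test_isin_dask(da):
    # Chunk the input, compute the dask-backed result, and check it matches
    # the eager (NumPy) result.
    expected = da.isin([2, 3])
    result = da.chunk({'x': 1}).isin([2, 3]).compute()
    assert_equal(result, expected)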

Collaborator (Author):

(Done in test_dataset; let me know if it's better to put it in both.)

return apply_ufunc(
np.isin,
self,
kwargs=dict(test_elements=test_elements),
Member:

It's probably a better idea to explicitly unwrap .data from test_elements if it's an xarray object, and explicitly raise for xarray.Dataset. Otherwise numpy will probably give a really strange error message.

Member:

Indeed, see #2032 for what converting a Dataset to a numpy array does.

Collaborator (Author):

I merged your branch; is that better than an additional check?

Why extract .data? Won't the standard machinery take care of that? I added a test.

Member:

I suppose you're right, the standard machinery will work fine here for now. In the future when dask supports isin (dask/dask#3363) we'll want to use .data so we can keep it as a dask array.
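
Putting this thread together, a hedged sketch of what the suggested guard and unwrapping could look like (again as a free function; the error message and exact type checks are illustrative):

import numpy as np
import xarray as xr


def isin(obj, test_elements):
    if isinstance(test_elements, xr.Dataset):
        # A Dataset has no single array to test against, so raise a clear
        # error instead of letting np.asarray fail confusingly (see #2032).
        raise TypeError(
            'isin() argument must be convertible to an array: %r'
            % test_elements)
    elif isinstance(test_elements, (xr.DataArray, xr.Variable)):
        # Unwrap the underlying (possibly dask) array so a future
        # dask-aware isin could keep the computation lazy.
        test_elements = test_elements.data
    return xr.apply_ufunc(
        np.isin,
        obj,
        kwargs=dict(test_elements=test_elements),
        dask='parallelized',
        output_dtypes=[np.bool_],
    )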

@shoyer (Member) left a review comment:

This looks good to me now! I just merged #2032 separately, so after merging master this should be good to go in.

@max-sixty (Collaborator, Author):

Green! @shoyer

@max-sixty (Collaborator, Author):

I'll merge this later tonight given @shoyer's previous approval, unless there's any feedback.

@shoyer merged commit a5f7d6a into pydata:master on Apr 4, 2018
@shoyer (Member) commented Apr 4, 2018

Thanks @maxim-lian.

As a follow-up, it might be nice to include an example showing how to use this for indexing non-dimensions in the narrative docs somewhere -- maybe in the section on where?

@dcherian (Contributor) commented Apr 4, 2018

@shoyer the cookbook might be a good place for that

@shoyer (Member) commented Apr 4, 2018

Indeed, but combined where/isin is also basically equivalent to indexing, so I think it's appropriate on that doc page, too.
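
Such a docs example might look roughly like this (the dataset and coordinate names are invented for illustration): combining where and isin with drop=True selects by a non-dimension coordinate much like indexing would.

import xarray as xr

ds = xr.Dataset(
    {'temperature': (('station',), [10.2, 11.5, 9.8, 12.0])},
    coords={'label': (('station',), ['a', 'b', 'c', 'd'])})

# Keep only the stations whose non-dimension coordinate 'label' is in the
# given list; where + isin with drop=True behaves like indexing here.
subset = ds.where(ds['label'].isin(['b', 'd']), drop=True)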

@max-sixty deleted the isin branch on April 4, 2018 03:51
@max-sixty (Collaborator, Author):

Yes good idea. I'll add that to my (metaphorical) list.
