-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Refactor nanops #2236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor nanops #2236
Conversation
xarray/core/nanops.py
Outdated
min_count = kwds.get('min_count', 1) | ||
|
||
if (not isinstance(values, dask_array_type) and _USE_BOTTLENECK | ||
and not isinstance(axis, tuple) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
W503 line break before binary operator
xarray/core/nanops.py
Outdated
|
||
if (not isinstance(values, dask_array_type) and _USE_BOTTLENECK | ||
and not isinstance(axis, tuple) | ||
and values.dtype.kind in 'uifc' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
W503 line break before binary operator
xarray/core/nanops.py
Outdated
if (not isinstance(values, dask_array_type) and _USE_BOTTLENECK | ||
and not isinstance(axis, tuple) | ||
and values.dtype.kind in 'uifc' | ||
and values.dtype.isnative |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
W503 line break before binary operator
xarray/core/nanops.py
Outdated
and not isinstance(axis, tuple) | ||
and values.dtype.kind in 'uifc' | ||
and values.dtype.isnative | ||
and (dtype is None or np.dtype(dtype) == values.dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
W503 line break before binary operator
xarray/core/nanops.py
Outdated
and values.dtype.kind in 'uifc' | ||
and values.dtype.isnative | ||
and (dtype is None or np.dtype(dtype) == values.dtype) | ||
and min_count != 1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
W503 line break before binary operator
xarray/tests/test_variable.py
Outdated
with pytest.raises(NotImplementedError): | ||
v.argmax(skipna=True) | ||
v.argmax(skipna=True) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
W293 blank line contains whitespace
@@ -1508,8 +1508,8 @@ def test_reduce_funcs(self): | |||
assert_identical(v.all(dim='x'), Variable([], False)) | |||
|
|||
v = Variable('t', pd.date_range('2000-01-01', periods=3)) | |||
with pytest.raises(NotImplementedError): | |||
v.argmax(skipna=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does anyone remember why we did not implement nanargmax
for pd.date_range
? It looks np.nanargmax
takes care of this dtype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope, no idea! :)
Very nice!
I think this is correct -- bottleneck does not speed up non-NaN skipping functions. |
xarray/tests/test_duck_array_ops.py
Outdated
have a sentinel missing value (int) or skipna=True has not been | ||
implemented (object, datetime64 or timedelta64). | ||
min_count : int, default None | ||
The required number of valid values to perform the operation. If fewer than |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
E501 line too long (87 > 79 characters)
xarray/tests/test_duck_array_ops.py
Outdated
# without min_count | ||
actual = DataArray.mean.__doc__ | ||
expected = dedent("""\ | ||
Reduce this DataArray's data by applying `mean` along some dimension(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
E501 line too long (80 > 79 characters)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would make sense to restructure this a little bit to have two well defined layers:
- A module of bottleneck/numpy functions that act on numpy arrays only.
- A module of functions that act on numpy or dask arrays (or these could be moved into
duck_array_ops
).
xarray/core/duck_array_ops.py
Outdated
if dtype is None: | ||
func = _dask_or_eager_func(name) | ||
else: | ||
func = getattr(np_module, name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks a little dangerous -- it could silently convert a dask array into a NumPy array. Can we raise an error instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks a kind of holdover.
Removed.
xarray/core/nanops.py
Outdated
|
||
if mask is not None: | ||
if isinstance(a, dask_array_type): | ||
return dask_array.where(mask, val, a), mask |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a little surprising to me. I would rather use duck_array_ops.where()
here than use explicit type checks.
Likewise, instead of logic above for mask
, use duck_array_ops.isnull()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Could you explain more detail about this idea? |
OK, let me try:
We could implement this with:
|
Hello from me hopefully contributing some needfull things. At first, I would like to comment that I checked out your code. I ran the following code example using a datafile uploaded under the following link:
I have recognized that the min_count option at recent state technically only works for DataArrays and not for DataSets. However, more interesting is the fact that the dimensions are destroyed:
no longitude and latitude survives your operation. If I would use the the sum-operator on the full dataset (where maybe the code was not modified?) I got
|
!!!Correction!!!. The resample-example above gives also a missing lon-lat dimensions in case of unmodified model code. The resulting numbers are the same.
Now I am a little bit puzzled. But ... !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!IT SEEMS TO BE A BUG!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! If I do the resample process using the old nomenclature: |
Okay. Using the old resample-nomenclature I tried to compare the results with your modified code (min_count = 2) and with the tagged version 0.10.7 (no min_count argument). But I am not sure, if this also works. Do you examine the keywords given from the old-resample method?. However, in the comparison I did not saw the nan's I expected over water. |
@shoyer , thanks for the details. I think I understood your idea. |
Thanks, @rpnaut . I will add the test also. Thanks again for the testing :) |
I think this is ready for another review. |
Thanks, @fujiisoup . I have good news and i have bad news. A) Your min_count argument still seems to work only if using the new syntax for resample, i.e. B) Your min_count argument is not allowed for type 'dataset' but only for type 'dataarray'. Thus, I have tested your implementation only on dataarrays. I take the netcdf - array 'TOT_PREC' and try to compute the monthly sum:
So, the xarray-package is still throwing away the dimensions in x- and y-direction. It has nothing to do with any min-count argument. THIS MUST BE A BUG OF XARRAY. The afore-mentioned dimensions only survive using the old-syntax:
Nevertheless, I have started to use your min_count argument only at one point (the x- and y-dimensions do not matter). In that case, your implementation works fine:
|
To wrap it up. Your implementation works for timeseries - data. There is something strange with time-space data, which should be fixed. If this is fixed, it is worth to test in my evaluation environment. And maybe, it would be interesting to have in the future the min_count argument also available for the old syntax and not only the new. The reason: The dimension name is not flexible anymore - it cannot be a variable like dim=${dim} |
Thanks for testing.
I think it is not true. |
xarray/tests/test_dataset.py
Outdated
|
||
actual = ds.resample(time='1D').sum(min_count=1) | ||
expected = xr.concat([ | ||
ds.isel(time=slice(i*4, (i+1)*4)).sum('time', min_count=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
E226 missing whitespace around arithmetic operator
I noticed |
Can anyone give further review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some minor comments -- generally this looks like a nice improvement, though!
xarray/core/nanops.py
Outdated
type """ | ||
valid_count = count(value, axis=axis) | ||
value = fillna(value, fill_value) | ||
data = getattr(np, func)(value, axis=axis, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is doing eager evaluation by calling the numpy function -- can we call the dask function instead on dask arrays?
Alternatively, we could make this function error on dask arrays? I just don't want to silently compute them.
xarray/core/nanops.py
Outdated
valid_count = count(value, axis=axis) | ||
value = fillna(value, fill_value) | ||
data = getattr(np, func)(value, axis=axis, **kwargs) | ||
# dask seems return non-integer type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to file an upstream bug report with dask about this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks a misunderstanding or already fixed in upstream.
Removed this line.
xarray/core/nanops.py
Outdated
if isinstance(value, dask_array_type): | ||
data = data.astype(int) | ||
|
||
if (valid_count == 0).any(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would also evaluate dask arrays, which can be costly. I don't know if there's much to do about this but it should be noted.
xarray/core/nanops.py
Outdated
|
||
if isinstance(a, dask_array_type): | ||
return dask_array.nanmax(a, axis=axis) | ||
return nputils.nanmax(a, axis=axis) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this could be more cleanly written as:
module = nptuils if isinstance(a, dask_array_type) else dask_array
return module.nanmin(a, axis=axis)
xarray/core/nanops.py
Outdated
|
||
if mask is not None: | ||
mask = np.all(mask, axis=axis) | ||
if np.any(mask): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's more dask friendly to use the.all()
and .any()
methods -- that way at least most of the check can be done in parallel, and only a single value (the boolean scalar) gets evaluates.
xarray/tests/test_duck_array_ops.py
Outdated
if func == 'var': | ||
expected = series_reduce(da, func, skipna=skipna, dim=aggdim, | ||
ddof=0) | ||
assert_allclose(actual, expected, rtol=rtol) | ||
# also check ddof!=0 case | ||
actual = getattr(da, func)(skipna=skipna, dim=aggdim, ddof=5) | ||
if dask: | ||
isinstance(da.data, dask_array_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs assert
?
xarray/tests/test_duck_array_ops.py
Outdated
assert actual.dtype == float | ||
|
||
# without nan | ||
da = construct_dataarray(dim_num, dtype, contains_nan=False, dask=dask) | ||
actual = getattr(da, func)(skipna=skipna) | ||
if dask: | ||
isinstance(da.data, dask_array_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also needs assert
@@ -1508,8 +1508,8 @@ def test_reduce_funcs(self): | |||
assert_identical(v.all(dim='x'), Variable([], False)) | |||
|
|||
v = Variable('t', pd.date_range('2000-01-01', periods=3)) | |||
with pytest.raises(NotImplementedError): | |||
v.argmax(skipna=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope, no idea! :)
Thanks, @shoyer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me, thanks!
doc/whats-new.rst
Outdated
- Follow up the renamings in dask; from dask.ghost to dask.overlap | ||
By `Keisuke Fujii <https://github.com/fujiisoup>`_. | ||
|
||
By `Tony Tung <https://github.com/ttung>`_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should reference the line above, not your change here
Thanks for the review. Merging. |
Hi, Edit:
|
Thanks, @st-bender, for the bug report. |
whats-new.rst
for all changes andapi.rst
for new API (remove if this change should not be visible to users, e.g., if it is an internal clean-up, or if this is part of a larger project that will be documented later)In #2230, the addition of
min_count
keywords for our reduction methods was discussed, butour
duck_array_ops
module is becoming messy (mainly due to nan-aggregation methods for dask, bottleneck and numpy) and it looks a little hard to maintain them.I tried to refactor them by moving nan-aggregation methods to
nanops
module.I think I still need to take care of more edge cases, but I appreciate any comment for the current implementation.
Note:
In my implementation, bottleneck is not used when
skipna=False
.bottleneck would be advantageous when
skipna=True
as numpy needs to copy the entire array once, but I think numpy's method is still OK ifskipna=False
.