-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Fix behaviour of min_count in reducing functions #4911
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
* Make sure Dask-backed arrays are not computed. * Check some specific examples give the correct output. * Run membership tests on xarray.core.dtypes.NAT_TYPES
* Fix mask checks in xarray.core.nanops._maybe_null_out to run lazily for Dask-backed arrays. * Change xarray.core.dtypes.NAT_TYPES to a set (it is only used for membership checks). * Add dtypes to NAT_TYPES rather than instances. Previously np.float64 was returning true from `dtype in NAT_TYPES` which resulted in min_count being ignored when reducing over all axes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
- Technically that's a breaking change (
xr.DataArray([1]).sum(min_count=1)
currently has dtypeìnt
while it'sfloat
after this change). I am fine with this, maybe the docstring could be updated? Saying that if you setmin_count
you'll get a float array? - There is a failure in
ubuntu-latest py37-min-all-deps
- looks likenp.where
is not yet lazy in dask 2.9.2 (I am not sure when it becomes lazy, though). You could just skip the test for the appropriate dask version, or add a minversion check for raise_if_dask_computes
We can use the |
even better would be |
* use duck_array_ops.where instead of np.where * add docstring and whatsnew messages about sum/prod on integer arrays with skipna=True and min_count != None now returning a float array.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I get it - np.where
does not dispatch to dask.array.where
in the earlier versions...
Not for this PR but I wonder if xr.DataArray([1]).sum(min_count=2)
should actually return NA
.
It would make it more consistent, the current behaviour is
due to the different dtype defaults for skipna. I guess this would require rework of the dispatch logic in |
I think we can merge this before the weekend, unless someone objects? |
Thanks @bcbnz. I see this is your first PR here - welcome to xarray! |
The first commit modifies existing tests to check Dask-backed arrays are not computed. It also adds some specific checks that the correct result (NaN or a number as appropriate) is returned and some tests for checking membership of
xarray.core.dtypes.NAT_TYPES
. After this commit I get 89 test failures, and they seem to cover the cases reported in #4898.The second commit fixes these failures:
The checks of the nan mask in
xarray.core.nanops._maybe_null_out
are changed to usenp.where
which allows lazy evaluation.Previously,
xarray.core.dtypes.NAT_TYPES
was a tuple of datetime64 and timedelta64 instances; I've changed it to a set of the dtypes of these instances. It is only used for the membership check in_maybe_null_out
so a set seems appropriate. The previous use of instances rather than dtypes caused a bug --np.float64 in NAT_TYPES
returned true even though it only contained datetime64/timedelta64. This meant that reducing operations over all axes (axis=None
or...
) with float64 arrays ignored min_count as the membership check in_maybe_null_out
caused it to be skipped.pre-commit run --all-files
whats-new.rst