Skip to content

Fix behaviour of min_count in reducing functions #4911

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Feb 19, 2021
Merged

Fix behaviour of min_count in reducing functions #4911

merged 5 commits into from
Feb 19, 2021

Conversation

bcbnz
Copy link
Contributor

@bcbnz bcbnz commented Feb 15, 2021

The first commit modifies existing tests to check Dask-backed arrays are not computed. It also adds some specific checks that the correct result (NaN or a number as appropriate) is returned and some tests for checking membership of xarray.core.dtypes.NAT_TYPES. After this commit I get 89 test failures, and they seem to cover the cases reported in #4898.

The second commit fixes these failures:

  • The checks of the nan mask in xarray.core.nanops._maybe_null_out are changed to use np.where which allows lazy evaluation.

  • Previously, xarray.core.dtypes.NAT_TYPES was a tuple of datetime64 and timedelta64 instances; I've changed it to a set of the dtypes of these instances. It is only used for the membership check in _maybe_null_out so a set seems appropriate. The previous use of instances rather than dtypes caused a bug -- np.float64 in NAT_TYPES returned true even though it only contained datetime64/timedelta64. This meant that reducing operations over all axes (axis=None or ...) with float64 arrays ignored min_count as the membership check in _maybe_null_out caused it to be skipped.

* Make sure Dask-backed arrays are not computed.
* Check some specific examples give the correct output.
* Run membership tests on xarray.core.dtypes.NAT_TYPES
* Fix mask checks in xarray.core.nanops._maybe_null_out to run lazily
  for Dask-backed arrays.

* Change xarray.core.dtypes.NAT_TYPES to a set (it is only used for
  membership checks).

* Add dtypes to NAT_TYPES rather than instances. Previously np.float64
  was returning true from `dtype in NAT_TYPES` which resulted in
  min_count being ignored when reducing over all axes.
Copy link
Collaborator

@mathause mathause left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

  • Technically that's a breaking change (xr.DataArray([1]).sum(min_count=1) currently has dtype ìnt while it's float after this change). I am fine with this, maybe the docstring could be updated? Saying that if you set min_count you'll get a float array?
  • There is a failure in ubuntu-latest py37-min-all-deps - looks like np.where is not yet lazy in dask 2.9.2 (I am not sure when it becomes lazy, though). You could just skip the test for the appropriate dask version, or add a minversion check for raise_if_dask_computes

@dcherian
Copy link
Contributor

looks like np.where is not yet lazy in dask 2.9.2

We can use the where_method from duck_array_ops instead of np.where

@keewis
Copy link
Collaborator

keewis commented Feb 16, 2021

We can use the where_method from duck_array_ops instead of np.where

even better would be duck_array_ops.where (it is closer to np.where)

* use duck_array_ops.where instead of np.where
* add docstring and whatsnew messages about sum/prod on integer arrays
  with skipna=True and min_count != None now returning a float array.
@bcbnz
Copy link
Contributor Author

bcbnz commented Feb 16, 2021

  • Changed to use duck_array_ops.where

  • Docstring updated and entry added to breaking changes of whats new (I assumed this was appropriate). Note its a bit more nuanced than @mathause stated: for an integer array, skipna defaults to False and so xarray.core.duck_array_ops._create_nan_agg_method does not call nansum. This means you have to force skipna=True to see the difference:

    • master: xr.DataArray([1]).sum(min_count=1).dtype -> int64
    • this PR: xr.DataArray([1]).sum(min_count=1).dtype -> int64
    • master: xr.DataArray([1]).sum(skipna=True, min_count=1).dtype -> int64
    • this PR: xr.DataArray([1]).sum(skipna=True, min_count=1).dtype -> float64

    Hopefully this is clear in the docstring changes, suggestions welcome.

Copy link
Collaborator

@mathause mathause left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I get it - np.where does not dispatch to dask.array.where in the earlier versions...

Not for this PR but I wonder if xr.DataArray([1]).sum(min_count=2) should actually return NA.

@bcbnz
Copy link
Contributor Author

bcbnz commented Feb 16, 2021

Not for this PR but I wonder if xr.DataArray([1]).sum(min_count=2) should actually return NA.

It would make it more consistent, the current behaviour is

  • xr.DataArray([1]).sum(min_count=2) -> 1
  • xr.DataArray([1.0]).sum(min_count=2) -> nan

due to the different dtype defaults for skipna. I guess this would require rework of the dispatch logic in xarray.core.duck_array_ops._create_nan_agg_method to make it call nansum/nanprod when appropriate (e.g., should setting min_count imply skipna=True?).

@dcherian dcherian mentioned this pull request Feb 17, 2021
6 tasks
@mathause
Copy link
Collaborator

I think we can merge this before the weekend, unless someone objects?

@mathause mathause merged commit d61efb6 into pydata:master Feb 19, 2021
@mathause
Copy link
Collaborator

Thanks @bcbnz. I see this is your first PR here - welcome to xarray!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sum and prod with min_count forces evaluation
4 participants