Refactor nanops #2236

Merged: fujiisoup merged 29 commits into pydata:master on Aug 16, 2018

Conversation

@fujiisoup (Member) commented on Jun 18, 2018

In #2230, adding a min_count keyword to our reduction methods was discussed, but our duck_array_ops module is becoming messy (mainly due to the nan-aggregation methods for dask, bottleneck, and numpy) and it is getting hard to maintain.

I tried to refactor it by moving the nan-aggregation methods into a new nanops module.

I think I still need to take care of more edge cases, but I would appreciate any comments on the current implementation.

Note:
In my implementation, bottleneck is not used when skipna=False.
bottleneck is advantageous when skipna=True, since numpy's nan-functions need to copy the entire array once, but I think numpy's methods are still OK when skipna=False.
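A minimal sketch of the trade-off being described (illustrative only; the array shape and NaN fraction are made up, and bottleneck must be installed):

    import numpy as np
    import bottleneck as bn

    a = np.random.rand(1000, 1000)
    a[a < 0.1] = np.nan

    # skipna=True: numpy's nan-aggregations materialize a masked copy of
    # the array before reducing, while bottleneck skips NaNs in one pass.
    np.nansum(a, axis=0)  # copies internally
    bn.nansum(a, axis=0)  # single pass, no copy

    # skipna=False: a plain reduction, no copy in either library, so
    # there is little for bottleneck to gain here.
    a.sum(axis=0)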

min_count = kwds.get('min_count', 1)

if (not isinstance(values, dask_array_type) and _USE_BOTTLENECK
        and not isinstance(axis, tuple)
        and values.dtype.kind in 'uifc'
        and values.dtype.isnative
        and (dtype is None or np.dtype(dtype) == values.dtype)
        and min_count != 1):

A Contributor flagged each continuation line of this condition with:

W503 line break before binary operator
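For context, W503 and its mirror W504 enforce opposite line-break conventions, and both sit in flake8's default ignore list; if the check stays enabled, one way to satisfy W503 is to break after the operators instead (purely illustrative reformatting of the same condition):

    if (not isinstance(values, dask_array_type) and _USE_BOTTLENECK and
            not isinstance(axis, tuple) and
            values.dtype.kind in 'uifc' and
            values.dtype.isnative and
            (dtype is None or np.dtype(dtype) == values.dtype) and
            min_count != 1):
        ...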

with pytest.raises(NotImplementedError):
    v.argmax(skipna=True)

A Contributor flagged the trailing blank line with:

W293 blank line contains whitespace

@@ -1508,8 +1508,8 @@ def test_reduce_funcs(self):
        assert_identical(v.all(dim='x'), Variable([], False))

        v = Variable('t', pd.date_range('2000-01-01', periods=3))
        with pytest.raises(NotImplementedError):
            v.argmax(skipna=True)

@fujiisoup (Member Author): Does anyone remember why we did not implement nanargmax for pd.date_range? It looks like np.nanargmax takes care of this dtype.

A Member replied: nope, no idea! :)

@shoyer (Member) commented on Jun 18, 2018

Very nice!

In my implementation, bottleneck is not used when skipna=False.
bottleneck would be advantageous when skipna=True as numpy needs to copy the entire array once, but I think numpy's method is still OK if skipna=False.

I think this is correct -- bottleneck does not speed up functions that do not skip NaNs.

    have a sentinel missing value (int) or skipna=True has not been
    implemented (object, datetime64 or timedelta64).
min_count : int, default None
    The required number of valid values to perform the operation. If fewer than

A Contributor flagged the min_count description with:

E501 line too long (87 > 79 characters)
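For context, a small illustration of the min_count semantics being documented here (the values are made up):

    import numpy as np
    import xarray as xr

    arr = xr.DataArray([np.nan, np.nan, 1.0])
    arr.sum(min_count=1)  # -> 1.0, at least one valid value is present
    arr.sum(min_count=4)  # -> nan, fewer than 4 valid values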

# without min_count
actual = DataArray.mean.__doc__
expected = dedent("""\
    Reduce this DataArray's data by applying `mean` along some dimension(s).

A Contributor flagged the docstring line with:

E501 line too long (80 > 79 characters)

@fujiisoup changed the title from "[WIP] Refactor nanops" to "Refactor nanops" on Jun 20, 2018

@shoyer (Member) left a comment

I think it would make sense to restructure this a little bit to have two well defined layers:

  • A module of bottleneck/numpy functions that act on numpy arrays only.
  • A module of functions that act on numpy or dask arrays (or these could be moved into duck_array_ops).

if dtype is None:
    func = _dask_or_eager_func(name)
else:
    func = getattr(np_module, name)

A Member commented: This looks a little dangerous -- it could silently convert a dask array into a NumPy array. Can we raise an error instead?

@fujiisoup (Member Author): It looks like a holdover. Removed.


if mask is not None:
    if isinstance(a, dask_array_type):
        return dask_array.where(mask, val, a), mask

A Member commented: This is a little surprising to me. I would rather use duck_array_ops.where() here than explicit type checks. Likewise, instead of the logic above for mask, use duck_array_ops.isnull().

@fujiisoup (Member Author): Done.

@fujiisoup (Member Author):

I think it would make sense to restructure this a little bit to have two well defined layers:

A module of bottleneck/numpy functions that act on numpy arrays only.
A module of functions that act on numpy or dask arrays (or these could be moved into duck_array_ops).

Could you explain this idea in more detail?

@shoyer (Member) commented on Jun 20, 2018

A module of bottleneck/numpy functions that act on numpy arrays only.
A module of functions that act on numpy or dask arrays (or these could be moved into duck_array_ops).

Could you explain this idea in more detail?

OK, let me try:

  1. On numpy arrays, we use bottleneck equivalents of numpy functions when possible, because bottleneck is faster than numpy
  2. On dask arrays, we use dask equivalents of numpy functions.
  3. We also want to add some extra features on top of what numpy/dask/bottleneck provide, e.g., handling of min_count

We could implement this with:

  • nputils.nansum() is equivalent to numpy.nansum() but uses bottleneck.nansum() internally instead when possible.
  • duck_array_ops.nansum() uses nputils.nansum() or dask.array.nansum(), based upon the type of the inputs.
  • duck_array_ops.sum() uses numpy.sum() or dask.array.sum(), based upon the type of the inputs.
  • duck_array_ops.sum_with_mincount() adds min_count and skipna support and is used in the Dataset.sum() implementation. It is written using duck_array_ops.nansum(), duck_array_ops.sum(), duck_array_ops.where() and duck_array_ops.isnull(). (A rough sketch of this layering follows below.)
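A minimal sketch of this layering, assuming bottleneck and dask are optional dependencies (function names follow the bullets above; dtype corner cases and error handling are omitted):

    import numpy as np

    try:
        import bottleneck as bn
        _USE_BOTTLENECK = True
    except ImportError:
        _USE_BOTTLENECK = False

    try:
        import dask.array as dask_array
        dask_array_type = (dask_array.Array,)
    except ImportError:
        dask_array = None
        dask_array_type = ()

    # Layer 1 (nputils): numpy arrays only, bottleneck-accelerated when possible.
    def nputils_nansum(a, axis=None):
        if _USE_BOTTLENECK and a.dtype.kind in 'uifc' and a.dtype.isnative:
            return bn.nansum(a, axis=axis)
        return np.nansum(a, axis=axis)

    # Layer 2 (duck_array_ops): dispatch on numpy vs dask.
    def nansum(a, axis=None):
        if isinstance(a, dask_array_type):
            return dask_array.nansum(a, axis=axis)
        return nputils_nansum(a, axis=axis)

    def isnull(a):
        mod = dask_array if isinstance(a, dask_array_type) else np
        return mod.isnan(a)

    def where(cond, x, y):
        mod = dask_array if isinstance(cond, dask_array_type) else np
        return mod.where(cond, x, y)

    # Layer 3: extra features (min_count) built only on layer-2 primitives.
    def sum_with_mincount(a, axis=None, min_count=None):
        result = nansum(a, axis=axis)
        if min_count is not None:
            valid_count = (~isnull(a)).sum(axis=axis)
            result = where(valid_count >= min_count, result, np.nan)
        return result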

@rpnaut commented on Jun 21, 2018

Hello from me, hopefully contributing something useful.

First, I checked out your code and ran the following example, using a data file uploaded here:
https://swiftbrowser.dkrz.de/public/dkrz_c0725fe8741c474b97f291aac57f268f/GregorMoeller/

import xarray
import matplotlib.pyplot as plt

data = xarray.open_dataset("eObs_gridded_0.22deg_rot_v14.0.TOT_PREC.1950-2016.nc_CutParamTimeUnitCor_FinalEvalGrid")
data
<xarray.Dataset>
Dimensions:       (rlat: 136, rlon: 144, time: 153)
Coordinates:
  * rlon          (rlon) float32 -22.6 -22.38 -22.16 -21.94 -21.72 -21.5 ...
  * rlat          (rlat) float32 -12.54 -12.32 -12.1 -11.88 -11.66 -11.44 ...
  * time          (time) datetime64[ns] 2006-05-01T12:00:00 ...
Data variables:
    rotated_pole  int32 ...
    TOT_PREC      (time, rlat, rlon) float32 ...
Attributes:
    CDI:                       Climate Data Interface version 1.8.0 (http://m...
    Conventions:               CF-1.6
    history:                   Thu Jun 14 12:34:59 2018: cdo -O -s -P 4 remap...
    CDO:                       Climate Data Operators version 1.8.0 (http://m...
    cdo_openmp_thread_number:  4

data_aggreg = data["TOT_PREC"].resample(time="M").sum(min_count=0)
data_aggreg2 = data["TOT_PREC"].resample(time="M").sum(min_count=1)

I have noticed that, in its current state, the min_count option technically only works for DataArrays and not for Datasets. More interesting, however, is the fact that the dimensions are destroyed:

data_aggreg
<xarray.DataArray 'TOT_PREC' (time: 5)>
array([ 551833.25   ,  465640.09375,  328445.90625,  836892.1875 ,  503601.5    ], dtype=float32)
Coordinates:
  * time     (time) datetime64[ns] 2006-05-31 2006-06-30 2006-07-31 ...

No longitude or latitude survives the operation. If I use the sum operator on the full dataset (where maybe the code was not modified?), I get:

>>> data_aggreg = data.resample(time="M").sum()
>>> data_aggreg
<xarray.Dataset>
Dimensions:       (rlat: 136, rlon: 144, time: 5)
Coordinates:
  * time          (time) datetime64[ns] 2006-05-31 2006-06-30 2006-07-31 ...
  * rlon          (rlon) float32 -22.6 -22.38 -22.16 -21.94 -21.72 -21.5 ...
  * rlat          (rlat) float32 -12.54 -12.32 -12.1 -11.88 -11.66 -11.44 ...
Data variables:
    rotated_pole  (time) int64 1 1 1 1 1
    TOT_PREC      (time, rlat, rlon) float32 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...

@rpnaut commented on Jun 21, 2018

Correction! The resample example above also loses the lon-lat dimensions with unmodified code. The resulting numbers are the same.

data_aggreg = data["TOT_PREC"].resample(time="M").sum()
data_aggreg
<xarray.DataArray 'TOT_PREC' (time: 5)>
array([ 551833.25   ,  465640.09375,  328445.90625,  836892.1875 ,  503601.5    ], dtype=float32)
Coordinates:
  * time     (time) datetime64[ns] 2006-05-31 2006-06-30 2006-07-31 ...

Now I am a little bit puzzled. But ...

IT SEEMS TO BE A BUG!

If I do the resample process using the old nomenclature:
data_aggreg = data["TOT_PREC"].resample(dim="time", how="sum", freq="M")
it works.
Do we have a bug in xarray?

@rpnaut commented on Jun 21, 2018

Okay. Using the old resample nomenclature, I tried to compare the results of your modified code (min_count=2) against the tagged version 0.10.7 (no min_count argument). But I am not sure this works either. Do you check the keywords passed through the old resample method?

However, in the comparison I did not see the NaNs I expected over water.

@fujiisoup (Member Author):

@shoyer, thanks for the details. I think I understood your idea.
This sounds like a cleaner solution.
I will update the code again, but it will take some more days (or a week).

@fujiisoup (Member Author):

Thanks, @rpnaut.
Actually, I'm also changing the code around sum, so it looks like my change caused the bug you reported.
I think we do not have good test coverage around Dataset reductions.

I will add tests for this too. Thanks again for testing :)

@fujiisoup (Member Author):

I think this is ready for another review.

@rpnaut commented on Aug 9, 2018

Thanks, @fujiisoup.

I have good news and I have bad news.

A) Your min_count argument still only seems to work when using the new syntax for resample, i.e. data.resample($dim=$freq).sum(). I guess this is because the old syntax is being phased out. Using the old syntax data.resample(dim=$dim, freq=$freq, how=$oper), your code seems to ignore the min_count argument.

B) Your min_count argument is not allowed for type 'dataset', only for type 'dataarray'.
Starting with the dataset located here: https://swiftbrowser.dkrz.de/public/dkrz_c0725fe8741c474b97f291aac57f268f/GregorMoeller/
I got the following message:
data.resample(time="M").sum(min_count=1)
TypeError: sum() got an unexpected keyword argument 'min_count'

Thus, I have tested your implementation only on DataArrays. I take the netCDF array 'TOT_PREC' and try to compute the monthly sum:

In [39]: data = array.open_dataset("eObs_gridded_0.22deg_rot_v14.0.TOT_PREC.1950-2016.nc_CutParamTimeUnitCor_FinalEvalGrid")
In [40]: datamonth = data["TOT_PREC"].resample(time="M").sum()
In [41]: datamonth
Out[41]: 
<xarray.DataArray 'TOT_PREC' (time: 5)>
array([ 551833.25   ,  465640.09375,  328445.90625,  836892.1875 ,  503601.5    ], dtype=float32)
Coordinates:
  time     (time) datetime64[ns] 2006-05-31 2006-06-30 2006-07-31 ...

So, the xarray package is still throwing away the dimensions in the x- and y-direction. It has nothing to do with any min_count argument. THIS MUST BE A BUG IN XARRAY. The aforementioned dimensions only survive when using the old syntax:

In [41]: datamonth = data["TOT_PREC"].resample(dim="time",freq="M",how="sum")
/usr/bin/ipython3:1: FutureWarning: 
.resample() has been modified to defer calculations. Instead of passing 'dim' and how="sum", instead consider using .resample(time="M").sum('time') 
  #!/usr/bin/env python3

In [42]: datamonth
Out[42]: 
<xarray.DataArray 'TOT_PREC' (time: 5, rlat: 136, rlon: 144)>
array([[[  0.      ,   0.      , ...,   0.      ,   0.      ],
        [  0.      ,   0.      , ...,   0.      ,   0.      ],
        ..., 
        [  0.      ,   0.      , ...,  44.900028,  41.400024],
        [  0.      ,   0.      , ...,  49.10001 ,  46.5     ]]], dtype=float32)
Coordinates:
  * time     (time) datetime64[ns] 2006-05-31 2006-06-30 2006-07-31 ...
  * rlon     (rlon) float32 -22.6 -22.38 -22.16 -21.94 -21.72 -21.5 -21.28 ...
  * rlat     (rlat) float32 -12.54 -12.32 -12.1 -11.88 -11.66 -11.44 -11.22 ...

Nevertheless, I have tried your min_count argument at a single grid point (so that the x- and y-dimensions do not matter). In that case, your implementation works fine:

In [46]: pointdata = data.isel(rlon=10,rlat=10)
In [47]: pointdata["TOT_PREC"]
Out[47]: 
<xarray.DataArray 'TOT_PREC' (time: 153)>
array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
       ...
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,
        nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan], dtype=float32)
Coordinates:
    rlon     float32 -20.4
    rlat     float32 -10.34
  * time     (time) datetime64[ns] 2006-05-01T12:00:00 2006-05-02T12:00:00 ...
Attributes:
    standard_name:  precipitation_amount
    long_name:      Precipitation
    units:          kg m-2
    grid_mapping:   rotated_pole
    cell_methods:   time: sum

In [48]: pointdata["TOT_PREC"].resample(time="M").sum(min_count=1)
Out[48]: 
<xarray.DataArray 'TOT_PREC' (time: 5)>
array([ nan,  nan,  nan,  nan,  nan])
Coordinates:
  * time     (time) datetime64[ns] 2006-05-31 2006-06-30 2006-07-31 ...
    rlon     float32 -20.4
    rlat     float32 -10.34

In [49]: pointdata["TOT_PREC"].resample(time="M").sum(min_count=0)
Out[49]: 
<xarray.DataArray 'TOT_PREC' (time: 5)>
array([ 0.,  0.,  0.,  0.,  0.], dtype=float32)
Coordinates:
  * time     (time) datetime64[ns] 2006-05-31 2006-06-30 2006-07-31 ...
    rlon     float32 -20.4
    rlat     float32 -10.34

@rpnaut commented on Aug 9, 2018

To wrap it up: your implementation works for time-series data. There is something strange with time-space data, which should be fixed. Once it is, it is worth testing in my evaluation environment.
Do you have a feeling for why the new syntax gives such strange behaviour? Shall we put the bug on the issue list?

And maybe it would be interesting to have the min_count argument also available for the old syntax in the future, not only the new one. The reason: the dimension name is no longer flexible - it cannot be a variable like dim=${dim}.

@rpnaut commented on Aug 10, 2018

Ok. The strange thing with the spatial dimensions is that the new syntax forces the user to state explicitly which dimension the mathematical operator for resampling (like sum) should be applied along. The syntax is now data.resample(time="M").sum(dim="time", min_count=1).

That is weird - the dimension is given twice. However, when doing so, xarray does not sum up all values it finds in a given DataArray for, e.g., the first month, but only those along the time dimension (see the example below).
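For reference, a minimal sketch of the two calls being compared (variable names as in the comments above):

    # collapses everything into one value per month: rlat/rlon are lost
    data_aggreg = data["TOT_PREC"].resample(time="M").sum(min_count=1)

    # sums along "time" only, preserving the rlat/rlon dimensions
    data_aggreg = data["TOT_PREC"].resample(time="M").sum(dim="time", min_count=1)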

AND NOW, the good news is that I get the following pictures with min_count=0 and min_count=1:

[figure: first_monsum_totprec_xarray_mincounteq0]
[figure: first_monsum_totprec_xarray_mincounteq1]

@fujiisoup (Member Author):

Thanks for testing, @rpnaut.

Your min_count argument is not allowed for type 'dataset' but only for type 'dataarray'.
Starting with the dataset located here:

I think that is not true.
It works on a Dataset, but not on a resampled object. I will raise an issue for this later.


actual = ds.resample(time='1D').sum(min_count=1)
expected = xr.concat([
    ds.isel(time=slice(i*4, (i+1)*4)).sum('time', min_count=1)

A Contributor flagged this line with:

E226 missing whitespace around arithmetic operator

@fujiisoup (Member Author):

I noticed min_count also works on resampled objects.
Your issue might be with the API of resample.

@fujiisoup (Member Author):

Can anyone give this a further review?

@shoyer (Member) left a comment

Some minor comments -- generally this looks like a nice improvement, though!

type """
valid_count = count(value, axis=axis)
value = fillna(value, fill_value)
data = getattr(np, func)(value, axis=axis, **kwargs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is doing eager evaluation by calling the numpy function -- can we call the dask function instead on dask arrays?

Alternatively, we could make this function error on dask arrays? I just don't want to silently compute them.
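A sketch of the suggested dispatch (a hypothetical helper; the names dask_array and dask_array_type follow the surrounding PR code, and the actual implementation may differ):

    def _nan_reduce(value, func, axis=None, **kwargs):
        # Route to the dask implementation on dask arrays instead of
        # eagerly computing through the numpy function.
        if isinstance(value, dask_array_type):
            return getattr(dask_array, func)(value, axis=axis, **kwargs)
        return getattr(np, func)(value, axis=axis, **kwargs)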

valid_count = count(value, axis=axis)
value = fillna(value, fill_value)
data = getattr(np, func)(value, axis=axis, **kwargs)
# dask seems to return non-integer type

A Member commented: It would be good to file an upstream bug report with dask about this.

@fujiisoup (Member Author): This looks like a misunderstanding, or it was already fixed upstream. I removed this line.

if isinstance(value, dask_array_type):
    data = data.astype(int)

if (valid_count == 0).any():

A Member commented: This would also evaluate dask arrays, which can be costly. I don't know if there's much to do about this, but it should be noted.
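One way to keep this lazy, at the cost of always applying the mask, would be to build the result unconditionally with where() instead of branching on a computed scalar (a sketch, assuming a duck-typed where helper and a fill_value from the surrounding code):

    # no eager .any() check: the mask is applied lazily on dask arrays
    # and cheaply on numpy arrays
    data = where(valid_count == 0, fill_value, data)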


if isinstance(a, dask_array_type):
    return dask_array.nanmax(a, axis=axis)
return nputils.nanmax(a, axis=axis)

A Member commented: Maybe this could be more cleanly written as:

    module = dask_array if isinstance(a, dask_array_type) else nputils
    return module.nanmax(a, axis=axis)


if mask is not None:
    mask = np.all(mask, axis=axis)
    if np.any(mask):

A Member commented: It's more dask friendly to use the .all() and .any() methods -- that way at least most of the check can be done in parallel, and only a single value (the boolean scalar) gets evaluated.
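That is, something along these lines (a sketch of the suggestion, not the final code):

    if mask is not None:
        mask = mask.all(axis=axis)   # stays lazy on dask arrays
        if mask.any():               # only this boolean scalar is computed
            ...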

if func == 'var':
    expected = series_reduce(da, func, skipna=skipna, dim=aggdim,
                             ddof=0)
    assert_allclose(actual, expected, rtol=rtol)
    # also check ddof!=0 case
    actual = getattr(da, func)(skipna=skipna, dim=aggdim, ddof=5)
if dask:
    isinstance(da.data, dask_array_type)

A Member commented: Needs assert?

assert actual.dtype == float

# without nan
da = construct_dataarray(dim_num, dtype, contains_nan=False, dask=dask)
actual = getattr(da, func)(skipna=skipna)
if dask:
    isinstance(da.data, dask_array_type)

A Member commented: This also needs an assert.

@fujiisoup (Member Author):

Thanks, @shoyer.
All done.

@shoyer (Member) left a comment

Looks good to me, thanks!

- Follow up the renamings in dask; from dask.ghost to dask.overlap
  By `Keisuke Fujii <https://github.com/fujiisoup>`_.

  By `Tony Tung <https://github.com/ttung>`_.

A Member commented: this should reference the line above, not your change here.

@fujiisoup merged commit 0b9ab2d into pydata:master on Aug 16, 2018

@fujiisoup (Member Author):

Thanks for the review. Merging.

@fujiisoup deleted the refactor_nanops branch on August 16, 2018 06:59
@st-bender (Contributor) commented on Sep 26, 2018

Hi,
just to let you know that .std() does not accept the ddof keyword anymore (it worked in 0.10.8).
Should I open a new bug report?

Edit:
It fails with:

~/Work/miniconda3/envs/stats/lib/python3.6/site-packages/xarray/core/duck_array_ops.py in f(values, axis, skipna, **kwargs)
    234 
    235         try:
--> 236             return func(values, axis=axis, **kwargs)
    237         except AttributeError:
    238             if isinstance(values, dask_array_type):

TypeError: nanstd() got an unexpected keyword argument 'ddof'
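A minimal reproducer, assuming any simple DataArray:

    import numpy as np
    import xarray as xr

    da = xr.DataArray(np.arange(5.0))
    da.std(ddof=1)  # raised TypeError at the time of this report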

@fujiisoup (Member Author):

Thanks, @st-bender, for the bug report.
I copied your comment to #2440.

@fujiisoup mentioned this pull request on Sep 27, 2018