-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Groupby-like API for resampling #1272
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
xarray/core/common.py
Outdated
".resample({dim}=\"{freq}\").{how}() ".format( | ||
dim=dim, freq=freq, how=how | ||
), DeprecationWarning, stacklevel=3) | ||
|
||
if isinstance(dim, basestring): | ||
dim = self[dim] | ||
group = DataArray(dim, [(RESAMPLE_DIM, dim)], name=RESAMPLE_DIM) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: I think this should probably be group = DataArray(dim, [(dim.dims, dim)], name=RESAMPLE_DIM)
but for some reason this works.
closed, label, base, keep_attrs): | ||
"""Implement the original version of .resample() which immediately | ||
executes the desired resampling operation. """ | ||
from .dataarray import DataArray | ||
RESAMPLE_DIM = '__resample_dim__' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There might be better way to handle this, but currently using RESAMPLE_DIM
is actually pretty critical here, and removing it is probably the source of some of the bugs you are seeing with your new implementation. The problem is we need to keep track of the resampled time variable in a dataset along with the original time variable. If we give the resampled group variable the same name, then xarray gets them mixed up.
Smoothed out most of the problems from earlier and missing details. Still not sure if it's wise to refactor most of the resampling logic into a new resample.py, like what was done with rolling, but it still makes some sense to keep things in groupby.py because we're just subclassing existing machinery from there. The only issue now is the signature for init() in |
Yes, use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to think about valid uses (and add some test coverage) for groupby-like resampling that doesn't use aggregation. For example, what happens if you want to use resample with arithmetic, .apply
or to iterate over groups? This means that the ugly '__resample_dim__'
name will leak out into the external API.
We still need a way to deal with redundant dimension/coordinate names, but maybe we can give a slightly friendlier name for the coordinate, e.g., resampled_time
if time
is the name of the resampled coordinate.
xarray/core/groupby.py
Outdated
**kwargs) | ||
result = self.apply(reduce_array, shortcut=shortcut) | ||
|
||
return result.rename({self._resample_dim: self._dim}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe put this .rename({self._resample_dim: self._dim})
business at the end of .apply
instead? That would give a slightly better result for non-reduce uses of the new resample (e.g., for arithmetic).
xarray/tests/test_dataarray.py
Outdated
with self.assertRaisesRegexp(ValueError, 'index must be monotonic'): | ||
array[[2, 0, 1]].resample(time='1D') | ||
|
||
def test_resample_old_api(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would probably be better to remove some of these old API tests in preference to tests using the new API. We don't really need all of them to be sure the old API still works.
xarray/tests/test_dataarray.py
Outdated
array = DataArray(np.ones(10), [('time', times)]) | ||
|
||
# Simple mean | ||
old_mean = array.resample('1D', 'time', how='mean') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use pytest.warns
around each use of the old API to verify that the right warning is raised (and also to ensure the warnings don't get issued when we run the test suite).
doc/time-series.rst
Outdated
|
||
ds.resample(time='6H').reduce(np.mean) | ||
|
||
Resampling does not yet work for upsampling. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to fix this before merging this PR, since it suggests the existing functionality would only exist in deprecated form. Pandas does this with a method called .asfreq
, though this is basically pure sugar since in practice I think it works exactly the same as .first
(or .mean
if only doing pure upsampling).
Thanks for the feedback, @shoyer! Will circle back around to continue
working on this in a few days when I have some free time.
- Daniel
|
Should As written, non-aggregation ("transformation"?) doesn't work because the call in |
I think this is mostly because TimeGrouper has been around far longer than non-aggregating resample. |
@darothen - can you give a summary of what's left to do here? |
I think a pull against the new releases is critical to see what breaks. Beyond that, just code clean up and testing. I can try to bump this higher on my priority list. |
…setResample" subclasses of their equivalent "GroupBy" implementation
…actual resampling dimension name
2995ad7
to
3a05c50
Compare
I did my best to re-base everything to master... plan on spending an hour or so figuring out what's broken and at least restoring the status quo. |
Un-do _combine temporary debugging output
TODO
Alright @jhamman, here's the complete list of work left here. I'll tackle some of it during my commutes this week. |
for how in ['mean', 'sum', 'first', 'last', ]: | ||
method = getattr(actual, how) | ||
result = method() | ||
self.assertDatasetEqual(expected, result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to change, but in future you can use pytest parameters for these sorts of loops, and assert_equal
rather than the self...
methods throughout
7a767d8
to
31e5510
Compare
@shoyer fixed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is good to go! I'll merge this in a day or two once other have had the chance to look at it over.
xarray/core/resample.py
Outdated
"variable '{}', but it is a dask array; dask arrays not " | ||
"yet supprted in resample.interpolate(). Load into " | ||
"memory with Dataset.load() before resampling." | ||
.format(name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
love it -- thanks!
---------- | ||
kind : str {'linear', 'nearest', 'zero', 'slinear', | ||
'quadratic', 'cubic'} | ||
Interpolation scheme to use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a period
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave this another pass through. I found a number of pep8 violations (run git diff upstream/master | flake8 --diff
) that we should fix. One comment of substance: we probably need better test coverage on the new errors.
- Passes
git diff upstream/master | flake8 --diff
xarray/core/common.py
Outdated
label=label, base=base) | ||
resampler = self._resample_cls(self, group=group, dim=dim_name, | ||
grouper=time_grouper, | ||
resample_dim=resample_dim) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix indentation
xarray/core/common.py
Outdated
if isinstance(dim, basestring): | ||
dim_name = dim |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dim_name
is never used
xarray/core/common.py
Outdated
else: | ||
result = f(dim=dim.name, skipna=skipna, keep_attrs=keep_attrs) | ||
else: | ||
result = gb.reduce(how, dim=dim.name, keep_attrs=keep_attrs) | ||
result = result.rename({RESAMPLE_DIM: dim.name}) | ||
return result | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP8: remove extra blank line
xarray/core/dataarray.py
Outdated
@@ -34,7 +35,7 @@ def _infer_coords_and_dims(shape, coords, dims): | |||
"""All the logic for creating a new DataArray""" | |||
|
|||
if (coords is not None and not utils.is_dict_like(coords) and | |||
len(coords) != len(shape)): | |||
len(coords) != len(shape)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix indentation
xarray/core/resample.py
Outdated
"yet supprted in resample.interpolate(). Load into " | ||
"memory with Dataset.load() before resampling." | ||
.format(name) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a unit test that covers this TypeError
? Right now, I think you'll actually get a NameError
because name
is not defined.
xarray/core/resample.py
Outdated
return DataArray(f(new_x), coords, dims, name=dummy.name, | ||
attrs=dummy.attrs) | ||
|
||
ops.inject_reduce_methods(DataArrayResample) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP8: add a second empty line after class definition
xarray/tests/test_dataarray.py
Outdated
# convention from old api | ||
new_api = getattr(resampler, method)(keep_attrs=False) | ||
with pytest.warns(DeprecationWarning): | ||
old_api = array.resample('1D', dim='time', how=method) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation
xarray/tests/test_dataarray.py
Outdated
def test_resample_mean_keep_attrs(self): | ||
array = DataArray(np.arange(10), [('__resample_dim__', times)]) | ||
actual = array.resample('1D', dim='__resample_dim__', how='first') | ||
self.assertRaisesRegexp(ValueError, 'Proxy resampling dimension') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs to be:
with self.assertRaisesRegexp(ValueError, 'Proxy resampling dimension'):
actual = array.resample('1D', dim='__resample_dim__', how='first')
I'm not sure why this is passing actually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was butchered in more ways than one; I fixed the test which revealed some underlying flaws, too, which are also fixed.
xarray/tests/test_dataarray.py
Outdated
ycoord = DataArray(yy.T, {'x': xs, 'y': ys}, ('x', 'y')) | ||
tcoord = DataArray(tt, {'time': times}, ('time', )) | ||
ds = Dataset({'data': array, 'xc': xcoord, | ||
'yc': ycoord, 'tc': tcoord}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation
xarray/tests/test_dataset.py
Outdated
ycoord = DataArray(yy.T, {'x': xs, 'y': ys}, ('x', 'y')) | ||
tcoord = DataArray(tt, {'time': times}, ('time', )) | ||
ds = Dataset({'data': array, 'xc': xcoord, | ||
'yc': ycoord, 'tc': tcoord}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation
@jhamman Gotcha, I'll clean everything up by the end of the week. If that's going to block 0.10.0, let me know and I'll shuffle some things around to prioritize this. |
@darothen - we have a few other PRs to wrap up for 0.10 so end of the week is okay. |
@jhamman Think we're good. I deferred 4 small pep8 issues because they're in parts of the codebase which I don't think I ever touched, and i'm worried they're going to screw up the merge. |
Thanks @darothen - can you resolve the merge conflicts? |
@jhamman done - caught me right while I was compiling GEOS-Chem, and the merge conflicts were very simple. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
two more changes that need to be made...These were caught by travis.
xarray/tests/test_dataarray.py
Outdated
('x', 'y', 'time')) | ||
self.assertDataArrayIdentical(expected, actual) | ||
|
||
def test_upsample_interpolate(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs a @requires_scipy
decorator or put
interp1d = pytest.importorskip('scipy.interpolate.interp1d')
in the first line of the test (instead of the import)
xarray/core/resample.py
Outdated
def _interpolate(self, kind='linear'): | ||
"""Apply scipy.interpolate.interp1d along resampling dimension.""" | ||
from .dataarray import DataArray | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's import interp1d
here, instead of at the module level. scipy
is still a optional dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha - it needed to be imported under DatasetResample
, too.
@darothen - almost there. Two or three more dependency conflicts in the tests. |
@jhamman Ohhh i totally misunderstood the last readout from travis-ci. Dealing with the scipy dependency is easy enough. Nevermind, easy solution is just to use other axis-reversal methods :) |
We just bumped numpy to 1.11, but 1.12 would be too new. Let's just add a |
In it goes. Thanks @darothen! |
This is a work-in-progress to resolve #1269.
Openly welcome feedback/critiques on how I approached this. Subclassing
Data{Array/set}GroupBy
may not be the best way, but it would be easy enough to re-write the necessary helper functions (justapply()
, I think) so that we do not need to inherit form them directly. Additional issues I'm working to resolve:_resample_immediately()
. This may not be the best approach!..._old_api
; these could trivially be placed into their related test cases for the new API.Dataset.resample()
. Oddly enough, if I hard-code keep_attrs=True insidereduce_array()
inDatasetResample::reduce
it works just fine. I haven't figured out where the kwarg is getting lost.test_resample_old_vs_new_api
) fail because the resampling by callingself.groupby_cls
ends up not working - it crashes because the group sizes that get computed are not what it expects. Occurs with both new and old API