WIP: sketch of resample support for CFTimeIndex #2458
Example usage:
>>> import xarray
>>> times = xarray.cftime_range('2000', periods=30, freq='MS')
>>> da = xarray.DataArray(range(30), [('time', times)])
>>> da.resample(time='1AS').mean()
<xarray.DataArray (time: 3)>
array([ 5.5, 17.5, 26.5])
Coordinates:
  * time     (time) object 2001-01-01 00:00:00 ... 2003-01-01 00:00:00
xarray/core/common.py
Outdated
if isinstance(self.indexes[dim], CFTimeIndex):
    # TODO: handle closed, label and base arguments, and the case where
    # frequency is specified without an integer count.
    grouper = self.indexes[dim].shift(n=int(freq[0]), freq=freq[1:])
Conveniently, I think this could be written as self.indexes[dim].shift(1, freq), and it would handle both the case where the frequency is specified with an integer multiple and the case where it is not (internally, shift converts the frequency string to an offset that has the appropriate multiple, so n can always be 1 here):
In [1]: import xarray as xr
In [2]: times = xr.cftime_range('2000', periods=5, freq='D')
In [3]: times
Out[3]:
CFTimeIndex([2000-01-01 00:00:00, 2000-01-02 00:00:00, 2000-01-03 00:00:00,
2000-01-04 00:00:00, 2000-01-05 00:00:00],
dtype='object')
In [4]: times.shift(1, 'D')
Out[4]:
CFTimeIndex([2000-01-02 00:00:00, 2000-01-03 00:00:00, 2000-01-04 00:00:00,
2000-01-05 00:00:00, 2000-01-06 00:00:00],
dtype='object')
In [5]: times.shift(1, '2D')
Out[5]:
CFTimeIndex([2000-01-03 00:00:00, 2000-01-04 00:00:00, 2000-01-05 00:00:00,
2000-01-06 00:00:00, 2000-01-07 00:00:00],
dtype='object')
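For comparison, the positional slicing in the original line, int(freq[0]) and freq[1:], reads only the first character as the count: '12MS' would be split into 1 and '2MS', and 'MS' would fail in int('M'). A minimal standalone sketch of a more tolerant split follows. The helper name split_freq and its regular expression are assumptions made for illustration; this is not how xarray parses frequency strings.

```python
import re


def split_freq(freq):
    """Hypothetical helper: split '12MS' into (12, 'MS').

    A missing integer count is treated as 1, so 'MS' becomes (1, 'MS').
    """
    # Optional digits, then an alphabetic code with an optional '-XXX' suffix.
    match = re.fullmatch(r"(\d*)([A-Za-z]+(?:-[A-Za-z]+)?)", freq)
    if match is None:
        raise ValueError(f"unrecognized frequency string: {freq!r}")
    count, unit = match.groups()
    return (int(count) if count else 1), unit


# Positional slicing would give (1, '2MS') for the first case and
# raise ValueError for the second.
multi_digit = split_freq("12MS")
no_count = split_freq("MS")
```

Passing the whole string to shift, as suggested above, avoids the need for any such parsing in the calling code.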
In general though (e.g. for frequency multiples greater than 1), something as simple as shift will not suffice. I think one will need some sort of binning mechanism like pandas.cut to bin the dates in regular intervals starting from the start date, with the result of a call to xr.cftime_range forming the bin edges.
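The binning idea can be sketched with the standard library alone. This is only an illustration under stated assumptions: datetime.datetime stands in for cftime dates, the hypothetical helper month_edges stands in for a call to xr.cftime_range, and bisect plays the role that pandas.cut would play.

```python
from bisect import bisect_right
from datetime import datetime


def month_edges(start_year, n_months, step):
    """Hypothetical stand-in for xr.cftime_range: first-of-month bin edges
    every `step` months, covering `n_months` months from January."""
    edges = []
    for k in range(0, n_months + step, step):
        years, month_index = divmod(k, 12)
        edges.append(datetime(start_year + years, month_index + 1, 1))
    return edges


def assign_bins(times, edges):
    """Label each time with the left edge of the half-open bin
    [edge_i, edge_{i+1}) that contains it."""
    return [edges[bisect_right(edges, t) - 1] for t in times]


# Thirty month-start timestamps beginning in January 2000.
times = [datetime(2000 + k // 12, k % 12 + 1, 1) for k in range(30)]
edges = month_edges(2000, 30, 6)  # six-month bins
labels = assign_bins(times, edges)
```

Each entry of labels can then serve as the group key for its timestamp, so every six consecutive months share one label.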
@huard that's all I have for now -- I may add some notes over the weekend if I have time. I have looked into doing this some in the past, but I didn't arrive at something I was totally pleased with.
@spencerkclark Thanks! I'm confused about how groupby on grouper actually works in the example above. I'll take a look at cut.
Here's how the magic of grouper works:
Lines 230 to 244 in 0f70a87
if grouper is not None:
    index = safe_cast_to_index(group)
    if not index.is_monotonic:
        # TODO: sort instead of raising an error
        raise ValueError('index must be monotonic for resampling')
    s = pd.Series(np.arange(index.size), index)
    first_items = s.groupby(grouper).first()
    full_index = first_items.index
    if first_items.isnull().any():
        first_items = first_items.dropna()
    sbins = first_items.values.astype(np.int64)
    group_indices = ([slice(i, j)
                      for i, j in zip(sbins[:-1], sbins[1:])] +
                     [slice(sbins[-1], None)])
    unique_coord = IndexVariable(group.name, first_items.index)
Basically, it's just used as the variable over which the groupby operation is done.
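The last step of that snippet, turning the position of each group's first element into slices, can be tried in isolation. In this small sketch the function name slices_from_first_positions is hypothetical, and the list [0, 12, 24] is an assumed stand-in for first_items.values when 30 monthly values are grouped by year.

```python
def slices_from_first_positions(first_positions):
    """Turn the position of each group's first element into contiguous slices."""
    inner = [slice(i, j)
             for i, j in zip(first_positions[:-1], first_positions[1:])]
    # The final group runs to the end of the data.
    return inner + [slice(first_positions[-1], None)]


data = list(range(30))
sbins = [0, 12, 24]  # assumed positions of the first month of each year
group_indices = slices_from_first_positions(sbins)
means = [sum(data[s]) / len(data[s]) for s in group_indices]
print(means)  # [5.5, 17.5, 26.5]
```

The means match the values shown in the example usage at the top of the thread, which suggests that the grouping itself only needs the grouper to identify where each bin starts.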
It's the line first_items = s.groupby(grouper).first() that feels like magic to me. How does groupby know to split into years, when grouper is just a shifted monthly index?
Well, it clearly doesn't work with the shifted monthly index, as Spencer pointed out :).
Inspired by @spencerkclark's suggestion, I tried another version based on
In [7]: times = xarray.cftime_range('2000', periods=30, freq='MS')
In [8]: da = xarray.DataArray(range(30), [('time', times)])
In [9]: times
Out[9]:
CFTimeIndex([2000-01-01 00:00:00, 2000-02-01 00:00:00, 2000-03-01 00:00:00,
2000-04-01 00:00:00, 2000-05-01 00:00:00, 2000-06-01 00:00:00,
2000-07-01 00:00:00, 2000-08-01 00:00:00, 2000-09-01 00:00:00,
2000-10-01 00:00:00, 2000-11-01 00:00:00, 2000-12-01 00:00:00,
2001-01-01 00:00:00, 2001-02-01 00:00:00, 2001-03-01 00:00:00,
2001-04-01 00:00:00, 2001-05-01 00:00:00, 2001-06-01 00:00:00,
2001-07-01 00:00:00, 2001-08-01 00:00:00, 2001-09-01 00:00:00,
2001-10-01 00:00:00, 2001-11-01 00:00:00, 2001-12-01 00:00:00,
2002-01-01 00:00:00, 2002-02-01 00:00:00, 2002-03-01 00:00:00,
2002-04-01 00:00:00, 2002-05-01 00:00:00, 2002-06-01 00:00:00],
dtype='object')
In [10]: da.resample(time='12MS').mean()
Out[10]:
<xarray.DataArray (time: 3)>
array([ 5.5, 17.5, 26.5])
Coordinates:
* time (time) object 2000-01-01 00:00:00 ... 2002-01-01 00:00:00
In [11]: da.resample(time='6MS').mean()
Out[11]:
<xarray.DataArray (time: 5)>
array([ 2.5, 8.5, 14.5, 20.5, 26.5])
Coordinates:
* time (time) object 2000-01-01 00:00:00 ... 2002-01-01 00:00:00
In [12]: da.resample(time='3MS').mean()
Out[12]:
<xarray.DataArray (time: 10)>
array([ 1., 4., 7., 10., 13., 16., 19., 22., 25., 28.])
Coordinates:
* time (time) object 2000-01-01 00:00:00 ... 2002-04-01 00:00:00
Do you think there would be a benefit to implementing a TimeGrouper class based on pandas'?
My instinct would be to first pursue the simple approach that @shoyer has started here. If it turns out that passing a As of yet, while there are a few details that need to be added to Stephan's implementation (e.g., as he notes in the to-do comment, proper handling of the
I never saw clear use-cases for TimeGrouper but I could be convinced.
…On Thu, Oct 4, 2018 at 2:51 PM David Huard ***@***.***> wrote:
Do you think there would be a benefit to implementing a TimeGrouper class
based on panda's ?
Implemented in #2593