-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Add DatetimeAccessor for accessing datetime fields via .dt
attribute
#1356
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fantastic, thanks!
xarray/core/accessors.py
Outdated
def f(self): | ||
from .dataarray import DataArray | ||
values = self.dt | ||
result = libts.get_date_field(values, field) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is using private pandas API, so it's liable to break in the future. Instead, let's wrap things in a pandas.Series
. Something like this should do the trick:
def get_dt_field(array: np.ndarray, name: str):
series = pd.Series(array.ravel())
field_values = getattr(series.dt, name).values
return field_values.reshape(array.shape)
xarray/core/accessors.py
Outdated
... freq='D', periods=100)}) | ||
>>> ds.time.dt | ||
<xarray.core.accessors.DatetimeAccessor at 0x10c369f60> | ||
>>> ds.time.dt.dayofyear[5] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be [:5]
, not [5]
?
xarray/core/accessors.py
Outdated
|
||
All of the pandas fields are accessible here. Note that these fields are not | ||
calendar-aware; if your datetimes are encoded with a non-Gregorian calendar | ||
(e.g. a 360-day calendar), then some fields like `dayofyear` may not be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add "using netcdftime" just to be super explicit about where special calendars come from.
xarray/core/accessors.py
Outdated
def __init__(self, xarray_obj): | ||
if not is_datetime_like(xarray_obj.dtype): | ||
raise TypeError("'dt' accessor only available for " | ||
"DataArray with datetime64 dtype") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add "or timedelta64 dtype"
xarray/tests/test_accessors.py
Outdated
days = self.times.day | ||
hours = self.times.hour | ||
|
||
self.assertArrayEqual(years, self.data.time.dt.year) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use self.assertDataArrayEqual
to check that fields provide data arrays with the appropriate metadata.
xarray/tests/test_accessors.py
Outdated
|
||
def test_not_datetime_type(self): | ||
nontime_data = self.data.copy() | ||
nontime_data['time'].values = \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please avoid explicit backlashes for continuation if possible, per PEP8.
xarray/tests/test_accessors.py
Outdated
nontime_data['time'].values = \ | ||
np.arange(len(self.data.time)).astype('int8') | ||
with self.assertRaisesRegexp(TypeError, 'dt'): | ||
nontime_data.time.dt.year |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing newline on last line
@shoyer I corrected things based on your comments. The last commit is an attempt to refactor things to match the way that methods like rolling/groupby functions are injected into the class; this might be totally superfluous here, but I thought it was worth trying. |
xarray/core/accessors.py
Outdated
|
||
@property | ||
def dt(self): | ||
"""Attribute to cache a view of the underlying datetime-like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want to expose this as public facing API. So I would get ride of this property, probably.
xarray/core/accessors.py
Outdated
('daysinmonth', "The number of days in the month"), | ||
] | ||
|
||
def inject_date_field_accessors(cls): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine, but I actually did like what you did before -- and we'll probably be switching xarray to use that style in the future, instead of injecting things with a loop. Using a function to define the method directly on the class is much more explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, I'll revert back to the explicit style.
xarray/core/accessors.py
Outdated
array for passing to pandas.tslib for date_field operations | ||
""" | ||
if self._dt is None: | ||
self._dt = self._obj.values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something possible worth thinking about, at least a little: What happens if .dt
is called on an object that stores its values in dask array? Ideally, we would handle this transparently, by either calling _get_date_field
directly (on numpy arrays) or using dask.array.map_blocks
to call _get_date_field
(on dask arrays).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I see the potential for an issue here, but from some limited testing I'm not sure where it will arise. For instance, consider this contrived example:
import numpy as np
import pandas as pd
import xarray as xr
from dask import array as da
nt = 100000
data = np.random.rand(10, 10, nt)
lons = np.linspace(0, 11, 10)
lats = np.linspace(0, 20, 10)
times = pd.date_range(start="2000/01/01", freq='D', periods=nt)
datest = xr.DataArray(data, coords=[lons, lats, times],
dims=['lon', 'lat', 'time'], name='data')
tc = da.from_array(da2.time.values, chunks=5000)
datest['time2'] = (('time', ), tc)
datest.time2.dt.month
This works just fine. I tried creating an example where I transform the data underlying datest['time']
into a dask array, but I can't create one that doesn't eagerly load the data into memory, no matter how I try to chunk it. Do dimensions automatically load dask arrays?
Chunking on datest
itself was not a problem, either again, because accessing datest.time
always gave an in-memory array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can only put a dask array in xarray if it isn't an dimension coordinate, so in practice these are pretty rare for datetime64 variables but they do come up occasionally (e.g., consider a 2D variable time
for a collection of weather forecasts with dimensions for reference time and time since the start of the forecast).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. A few more lines of code to add to the example before:
times_arr = np.random.choice(times, size=(10, 10, nt))
dstest = datest.to_dataset()
dstest['all_times'] = (('lon', 'lat', 'time'), times_arr)
dstest = dstest.chunk({'lon': 5, 'lat': 5, 'time': 5000})
So now we're definitely holding a dask array:
>>> dstest.all_times
<xarray.DataArray 'all_times' (lon: 10, lat: 10, time: 100000)>
dask.array<xarray-..., shape=(10, 10, 100000), dtype=datetime64[ns], chunksize=(5, 5, 5000)>
Coordinates:
* lon (lon) float64 0.0 1.222 2.444 3.667 4.889 6.111 7.333 8.556 ...
* lat (lat) float64 0.0 2.222 4.444 6.667 8.889 11.11 13.33 15.56 ...
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
time2 (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
but our datetime accessors still work just fine:
>>> dstest.all_times.dt.month
<xarray.DataArray 'month' (lon: 10, lat: 10, time: 100000)>
array([[[ 3, 5, ..., 11, 5],
[ 2, 12, ..., 8, 4],
...,
[ 9, 11, ..., 6, 11],
[ 6, 1, ..., 12, 8]],
...
Here I cache the values
of the DataArray passed in, and that's what the accessor checks to figure out the requested field. Are those always eagerly evaluated? If so, then it would make sense why everything works out okay - we've stored an array in memory, and we're converting that to a Pandas Series.
Things also work fine if I create a chunked dask array from times_arr
above and store it directly in the Dataset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the current version of the accessor works, but only by loading the entire dask array into memory as a NumPy array first. That's a no-no: operations on dask arrays should be lazy or raise an error if that isn't possible..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I understand now. I'll see what I can do about safely handling dask arrays here; I know there's plenty of examples to emulate across the xarray code base.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the very least, call fail_on_dask_array_input
: https://github.com/pydata/xarray/blob/master/xarray/core/duck_array_ops.py#L55
Fixed a bug where data wasn't computed in correct order
Updated with support for multi-dimensional time data stored as dask array. |
xarray/core/accessors.py
Outdated
attr_values = getattr(accessor, name).values | ||
return attr_values.compute() | ||
|
||
def _ravel_and_access(darr): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is more complicated than it needs to, as evidenced by the nested calls to compute. The great thing about map_blocks
is that you can make everything eager, e.g.,
def _ravel_and_access(darr):
values_as_series = pd.Series(values.ravel())
field_values = getattr(values_as_series.dt, name).values
return field_values.reshape(values.shape)
That's it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow! The time spent on tackling this project is well worth it just for this tidbit alone. I'll circle back to clean up this stuff within the next day or so.
xarray/core/accessors.py
Outdated
dtype=raveled.dtype) | ||
return field_values.reshape(darr.shape).compute() | ||
|
||
return map_blocks(_ravel_and_access, values, dtype=values.dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dtype
needs to be the dtype of the return value, not the input. For most of these accessors, this is just np.int64
(for strings, use dtype=object
).
xarray/tests/test_accessors.py
Outdated
int_data = np.arange(len(self.data.time)).astype('int8') | ||
nontime_data['time'].values = int_data | ||
with self.assertRaisesRegexp(TypeError, 'dt'): | ||
nontime_data.time.dt.year |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use simply nontime_data.time.dt
to make the test a little more stringent.
xarray/tests/test_accessors.py
Outdated
|
||
# Safely pre-compute comparison fields by passing through Pandas | ||
# machinery | ||
def _getattr_and_reshape(arr, attr): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's OK if this unit test relies on the correctness of computing date fields on NumPy arrays -- you already check that in other tests. So I would get rid of this and simply rely on these fields already working, e.g., just use self.times_arr.dt.year
.
xarray/core/accessors.py
Outdated
self._obj = xarray_obj | ||
self._dt = self._obj.data | ||
|
||
_field_ops = ['year', 'month', 'day', 'hour', 'minute', 'second', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need this variable anymore.
Finished clean-up, added some documentation, etc. I mangled resolving a merge conflict with my update to wrt to the virtual variables, I think some more thinking is necessary so we can come up with a plan of approach. Do we want to deprecate this feature entirely? Do we just want to wrap the datetime component virtual variables to the |
This is a good question, possibly worth raising in a separate issue for discussion. I think they are a convenient shortcut, but they shouldn't be the primary interface.
Yes, certainly. We definitely don't want to have two separate implementations of datetime components in the code base. |
xarray/core/accessors.py
Outdated
def f(self): | ||
from .dataarray import DataArray | ||
result = _get_date_field(self._dt, name) | ||
return DataArray(result, name=name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can also just do type(self._obj)
here instead of DataArray
, and the result will probably work better on subclasses
xarray/core/accessors.py
Outdated
def _tslib_field_accessor(name, docstring=None): | ||
def f(self): | ||
from .dataarray import DataArray | ||
result = _get_date_field(self._dt, name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not really sure why you need self._dt
. Why not just use self._obj.data
here?
xarray/core/accessors.py
Outdated
Name of datetime field to access | ||
|
||
""" | ||
def _access_through_series(values, name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For serialization in dask, it's better to pass top-level functions to map_blocks
. So I'd just move this to the top level.
xarray/core/accessors.py
Outdated
Array-like container of datetime-like values | ||
name : str | ||
Name of datetime field to access | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add Returns
@@ -100,7 +115,8 @@ __ http://pandas.pydata.org/pandas-docs/stable/api.html#time-date-components | |||
ds['time.month'] | |||
ds['time.dayofyear'] | |||
|
|||
xarray adds ``'season'`` to the list of datetime components supported by pandas: | |||
For use as a derived coordinate, xarray adds ``'season'`` to the list of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to move the logic for season
and the other datetime virtual variables to use .dt
.
xarray/core/accessors.py
Outdated
return _access_through_series(values, name) | ||
|
||
|
||
@register_dataarray_accessor('dt') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thought: since this isn't being defined outside of xarray, it might actually make sense to import this class from datarray.py
and simply add it to the DataArray class directly:
class DataArray:
...
dt = property(DatatimeAccessor)
That has the advantage of being more transparent about whether it comes from.
Fixed imports to facilitate more direct implementation of DateTimeAccessor as a property in DataArray Moved _access_through_series to a top-level function in accessors.py so that dask serialization will hopefully work a bit better
There's a test-case relating to #367 (test_virtual_variable_same_name) which is causing me a bit of grief as I re-factor the virtual variable logic. Should we really be able to access variables like Two options for fixing:
|
This would be my preferred fix. I agree that |
Turns out it was easy enough to add an accessor for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, once you fix a few minor points. You'll need to merge master to get this in "what's new".
xarray/core/dataset.py
Outdated
else: | ||
data = getattr(date, var_name) | ||
data = getattr(ref_var) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is broken -- I think this clause should be removed instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a typo; I intended the whole if-then-else block to read:
if is_datetime_like(ref_var.dtype):
ref_var = xr.DataArray(ref_var)
data = getattr(ref_var.dt, var_name).data
else:
data = getattr(ref_var, var_name).data
We only access datetime virtual fields for now, but this leaves the possibility of other types of virtual field access at some point in the future.
xarray/core/dataset.py
Outdated
data = seasons[(month // 3) % 4] | ||
if is_datetime_like(ref_var.dtype): | ||
ref_var = xr.DataArray(ref_var) | ||
data = getattr(ref_var.dt, var_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably should use getattr(ref_var.dt, var_name).data
to avoid passing a DataArray instance into xarray.Variable
.
xarray/core/accessors.py
Outdated
""" | ||
if isinstance(values, dask_array_type): | ||
from dask.array import map_blocks | ||
out_dtype = object if name == "weekday_name" else np.int64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should also be object
for season and time.
Actually, rather than put the logic here, can you add a dtype
argument to _tslib_field_accessor
and pass it on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - went ahead and did that. dtype
now passes through the whole chain of functions, but is only used if we end up calling map_blocks
on a dask array.
Alrighty, patched and ready for a final look-over! I appreciate the help and patience, @shoyer! |
Thanks @darothen! This is really nice |
* Start on implementing and testing NetCDFTimeIndex * TST Move to using pytest fixtures to structure tests * Address initial review comments * Address second round of review comments * Fix failing python3 tests * Match test method name to method name * First attempts at integrating NetCDFTimeIndex into xarray This is a first pass at the following: - Resetting the logic for decoding datetimes such that `np.datetime64` objects are never used for non-standard calendars - Adding logic to use a `NetCDFTimeIndex` whenever `netcdftime.datetime` objects are used in an array being cast as an index (so if one reads in a Dataset from a netCDF file or creates one in Python, which is indexed by a time coordinate that uses `netcdftime.datetime` objects a NetCDFTimeIndex will be used rather than a generic object-based index) - Adding logic to encode `netcdftime.datetime` objects when saving out to netCDF files * Cleanup * Fix DataFrame and Series test failures for NetCDFTimeIndex These were related to a recent minor upstream change in pandas: https://github.com/pandas-dev/pandas/blame/master/pandas/core/indexing.py#L1433 * First pass at making NetCDFTimeIndex compatible with #1356 * Address initial review comments * Restore test_conventions.py * Fix failing test in test_utils.py * flake8 * Update for standalone netcdftime * Address stickler-ci comments * Skip test_format_netcdftime_datetime if netcdftime not installed * A start on documentation * Fix failing zarr tests related to netcdftime encoding * Simplify test_decode_standard_calendar_single_element_non_ns_range * Address a couple review comments * Use else clause in _maybe_cast_to_netcdftimeindex * Start on adding enable_netcdftimeindex option * Continue parametrizing tests in test_coding_times.py * Update time-series.rst for enable_netcdftimeindex option * Use :py:func: in rst for xarray.set_options * Add a what's new entry and test that resample raises a TypeError * Move what's new entry to the version 0.10.3 section * Add version-dependent pathway for importing netcdftime.datetime * Make NetCDFTimeIndex and date decoding/encoding compatible with datetime.datetime * Remove logic to make NetCDFTimeIndex compatible with datetime.datetime * Documentation edits * Ensure proper enable_netcdftimeindex option is used under lazy decoding Prior to this, opening a dataset with enable_netcdftimeindex set to True and then accessing one of its variables outside the context manager would lead to it being decoded with the default enable_netcdftimeindex (which is False). This makes sure that lazy decoding takes into account the context under which it was called. * Add fix and test for concatenating variables with a NetCDFTimeIndex Previously when concatenating variables indexed by a NetCDFTimeIndex the index would be wrongly converted to a generic pd.Index * Further namespace changes due to netcdftime/cftime renaming * NetCDFTimeIndex -> CFTimeIndex * Documentation updates * Only allow use of CFTimeIndex when using the standalone cftime Also only allow for serialization of cftime.datetime objects when using the standalone cftime package. * Fix errant what's new changes * flake8 * Fix skip logic in test_cftimeindex.py * Use only_use_cftime_datetimes option in num2date * Require standalone cftime library for all new functionality Add tests/fixes for dt accessor with cftime datetimes * Improve skipping logic in test_cftimeindex.py * Fix skipping logic in test_cftimeindex.py for when cftime or netcdftime are not available. Use existing requires_cftime decorator where possible (i.e. only on tests that are not parametrized via pytest.mark.parametrize) * Fix skip logic in Python 3.4 build for test_cftimeindex.py * Improve error messages when for when the standalone cftime is not installed * Tweak skip logic in test_accessors.py * flake8 * Address review comments * Temporarily remove cftime from py27 build environment on windows * flake8 * Install cftime via pip for Python 2.7 on Windows * flake8 * Remove unnecessary new lines; simplify _maybe_cast_to_cftimeindex * Restore test case for #2002 in test_coding_times.py I must have inadvertently removed it during a merge. * Tweak dates out of range warning logic slightly to preserve current default * Address review comments
git diff upstream/master | flake8 --diff
This uses the
register_dataarray_accessor
to add attributes similar to those in pandastimeseries which let the users quickly access datetime fields from an underlying array of datetime-like values. The referenced issue (#358) also asks about adding similar accessors for
.str`, but this is a more complex topic - I think a compelling use-case would help in figuring out what the critical functionality isVirtual time fields
Presumably this could be used to augment
Dataset._get_virtual_variable()
. A season field would need to be added as a special field to the accessor.