Add DatetimeAccessor for accessing datetime fields via .dt attribute (#1356)

shoyer merged 23 commits into pydata:master

Conversation
xarray/core/accessors.py
Outdated
    def f(self):
        from .dataarray import DataArray
        values = self.dt
        result = libts.get_date_field(values, field)
This is using private pandas API, so it's liable to break in the future. Instead, let's wrap things in a pandas.Series. Something like this should do the trick:
def get_dt_field(array: np.ndarray, name: str):
series = pd.Series(array.ravel())
field_values = getattr(series.dt, name).values
return field_values.reshape(array.shape)
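For instance, a quick sanity check of that approach on a small 2-D datetime64 array (illustrative only, not part of the PR diff):

```python
import pandas as pd

# Wrap the raw datetime64 values in a Series so only public pandas API
# (Series.dt) is used, rather than pandas.tslib internals.
times = pd.date_range('2000-01-01', periods=6, freq='D').values.reshape(2, 3)
series = pd.Series(times.ravel())
dayofyear = series.dt.dayofyear.values.reshape(times.shape)
print(dayofyear)
# [[1 2 3]
#  [4 5 6]]
```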
xarray/core/accessors.py
Outdated
    ... freq='D', periods=100)})
    >>> ds.time.dt
    <xarray.core.accessors.DatetimeAccessor at 0x10c369f60>
    >>> ds.time.dt.dayofyear[5]
I think this should be [:5], not [5]?
xarray/core/accessors.py
Outdated
    All of the pandas fields are accessible here. Note that these fields are not
    calendar-aware; if your datetimes are encoded with a non-Gregorian calendar
    (e.g. a 360-day calendar), then some fields like `dayofyear` may not be
add "using netcdftime" just to be super explicit about where special calendars come from.
xarray/core/accessors.py
Outdated
    def __init__(self, xarray_obj):
        if not is_datetime_like(xarray_obj.dtype):
            raise TypeError("'dt' accessor only available for "
                            "DataArray with datetime64 dtype")
xarray/tests/test_accessors.py
Outdated
    days = self.times.day
    hours = self.times.hour

    self.assertArrayEqual(years, self.data.time.dt.year)
Use self.assertDataArrayEqual to check that fields provide data arrays with the appropriate metadata.
xarray/tests/test_accessors.py
Outdated
    def test_not_datetime_type(self):
        nontime_data = self.data.copy()
        nontime_data['time'].values = \
Please avoid explicit backslashes for continuation if possible, per PEP 8.
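For example, the continuation can be rewritten with an intermediate variable or with implicit parentheses (a generic illustration, not the PR's exact test code):

```python
import numpy as np

# Discouraged (PEP 8): explicit backslash continuation
values = \
    np.arange(10).astype('int8')

# Preferred: implicit continuation inside parentheses,
# or simply an intermediate variable
values = (np.arange(10)
          .astype('int8'))
```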
xarray/tests/test_accessors.py
Outdated
    nontime_data['time'].values = \
        np.arange(len(self.data.time)).astype('int8')
    with self.assertRaisesRegexp(TypeError, 'dt'):
        nontime_data.time.dt.year
@shoyer I corrected things based on your comments. The last commit is an attempt to refactor things to match the way that methods like the rolling/groupby functions are injected into the class; this might be totally superfluous here, but I thought it was worth trying.
xarray/core/accessors.py
Outdated
    @property
    def dt(self):
        """Attribute to cache a view of the underlying datetime-like
I don't think we want to expose this as public-facing API, so I would get rid of this property.
xarray/core/accessors.py
Outdated
        ('daysinmonth', "The number of days in the month"),
    ]

    def inject_date_field_accessors(cls):
This is fine, but I actually did like what you did before -- and we'll probably be switching xarray to use that style in the future, instead of injecting things with a loop. Using a function to define the method directly on the class is much more explicit.
Cool, I'll revert back to the explicit style.
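A minimal, self-contained illustration of the two styles being discussed (the names here are made up for the example, not the PR's code):

```python
import pandas as pd

def _field_property(name, doc):
    """Return a property that pulls the named datetime field out of the
    wrapped values via pandas' public Series.dt API."""
    def f(self):
        return getattr(pd.Series(self._values).dt, name).values
    f.__doc__ = doc
    return property(f)

class DemoAccessor:
    def __init__(self, values):
        self._values = values

    # "Explicit" style: each accessor is spelled out in the class body.
    year = _field_property('year', 'The year of the datetime')
    month = _field_property('month', 'The month as January=1, December=12')

# "Injection" style: equivalent, but attributes are attached in a loop
# after the class definition, so they are harder to discover by reading
# the class body.
for _name in ['day', 'hour']:
    setattr(DemoAccessor, _name, _field_property(_name, 'The ' + _name))

acc = DemoAccessor(pd.date_range('2000-01-01 06:00', periods=3).values)
print(acc.year, acc.day, acc.hour)
```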
xarray/core/accessors.py
Outdated
        array for passing to pandas.tslib for date_field operations
        """
        if self._dt is None:
            self._dt = self._obj.values
Something possibly worth thinking about, at least a little: what happens if .dt is called on an object that stores its values in a dask array? Ideally, we would handle this transparently, by either calling _get_date_field directly (on numpy arrays) or using dask.array.map_blocks to call _get_date_field (on dask arrays).
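A rough sketch of what that dispatch could look like, assuming a top-level helper along the lines suggested earlier (the names and the crude dask check are illustrative, not the final implementation):

```python
import numpy as np
import pandas as pd

def _access_through_series(values, name):
    """Coerce the array (or dask block) to a pandas Series and pull out
    the named datetime field (e.g. 'year', 'month'), keeping the shape."""
    values_as_series = pd.Series(values.ravel())
    field_values = getattr(values_as_series.dt, name).values
    return field_values.reshape(values.shape)

def _get_date_field(values, name, dtype=np.int64):
    """Eager on NumPy arrays; lazy via map_blocks on dask arrays."""
    if hasattr(values, 'chunks'):  # stand-in for isinstance(values, dask_array_type)
        from dask.array import map_blocks
        # extra positional args (name) are forwarded to the function;
        # dtype must describe the *output* blocks, not the input
        return map_blocks(_access_through_series, values, name, dtype=dtype)
    return _access_through_series(values, name)
```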
Yeah, I see the potential for an issue here, but from some limited testing I'm not sure where it will arise. For instance, consider this contrived example:
import numpy as np
import pandas as pd
import xarray as xr
from dask import array as da

nt = 100000
data = np.random.rand(10, 10, nt)
lons = np.linspace(0, 11, 10)
lats = np.linspace(0, 20, 10)
times = pd.date_range(start="2000/01/01", freq='D', periods=nt)
datest = xr.DataArray(data, coords=[lons, lats, times],
                      dims=['lon', 'lat', 'time'], name='data')
tc = da.from_array(datest.time.values, chunks=5000)
datest['time2'] = (('time', ), tc)
datest.time2.dt.month

This works just fine. I tried creating an example where I transform the data underlying datest['time'] into a dask array, but I can't create one that doesn't eagerly load the data into memory, no matter how I try to chunk it. Do dimensions automatically load dask arrays?
Chunking on datest itself was not a problem either, again because accessing datest.time always gave an in-memory array.
You can only put a dask array in xarray if it isn't a dimension coordinate, so in practice these are pretty rare for datetime64 variables, but they do come up occasionally (e.g., consider a 2D variable time for a collection of weather forecasts, with dimensions for reference time and time since the start of the forecast).
Gotcha. A few more lines of code to add to the example above:

times_arr = np.random.choice(times, size=(10, 10, nt))
dstest = datest.to_dataset()
dstest['all_times'] = (('lon', 'lat', 'time'), times_arr)
dstest = dstest.chunk({'lon': 5, 'lat': 5, 'time': 5000})

So now we're definitely holding a dask array:
>>> dstest.all_times
<xarray.DataArray 'all_times' (lon: 10, lat: 10, time: 100000)>
dask.array<xarray-..., shape=(10, 10, 100000), dtype=datetime64[ns], chunksize=(5, 5, 5000)>
Coordinates:
* lon (lon) float64 0.0 1.222 2.444 3.667 4.889 6.111 7.333 8.556 ...
* lat (lat) float64 0.0 2.222 4.444 6.667 8.889 11.11 13.33 15.56 ...
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
    time2    (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...

but our datetime accessors still work just fine:

>>> dstest.all_times.dt.month
<xarray.DataArray 'month' (lon: 10, lat: 10, time: 100000)>
array([[[ 3, 5, ..., 11, 5],
[ 2, 12, ..., 8, 4],
...,
[ 9, 11, ..., 6, 11],
[ 6, 1, ..., 12, 8]],
    ...

Here I cache the values of the DataArray passed in, and that's what the accessor checks to figure out the requested field. Are those always eagerly evaluated? If so, then it would make sense why everything works out okay - we've stored an array in memory, and we're converting that to a Pandas Series.
Things also work fine if I create a chunked dask array from times_arr above and store it directly in the Dataset.
Yes, the current version of the accessor works, but only by loading the entire dask array into memory as a NumPy array first. That's a no-no: operations on dask arrays should be lazy or raise an error if that isn't possible.
OK, I understand now. I'll see what I can do about safely handling dask arrays here; I know there are plenty of examples to emulate across the xarray code base.
At the very least, call fail_on_dask_array_input: https://github.com/pydata/xarray/blob/master/xarray/core/duck_array_ops.py#L55
Fixed a bug where data wasn't computed in the correct order

Updated with support for multi-dimensional time data stored as a dask array.
xarray/core/accessors.py
Outdated
        attr_values = getattr(accessor, name).values
        return attr_values.compute()

    def _ravel_and_access(darr):
This is more complicated than it needs to be, as evidenced by the nested calls to compute. The great thing about map_blocks is that you can make everything eager, e.g.,

def _ravel_and_access(values):
    values_as_series = pd.Series(values.ravel())
    field_values = getattr(values_as_series.dt, name).values
    return field_values.reshape(values.shape)

That's it!
Wow! The time spent on tackling this project is well worth it just for this tidbit alone. I'll circle back to clean up this stuff within the next day or so.
xarray/core/accessors.py
Outdated
                                   dtype=raveled.dtype)
        return field_values.reshape(darr.shape).compute()

    return map_blocks(_ravel_and_access, values, dtype=values.dtype)
dtype needs to be the dtype of the return value, not the input. For most of these accessors, this is just np.int64 (for strings, use dtype=object).
xarray/tests/test_accessors.py
Outdated
    int_data = np.arange(len(self.data.time)).astype('int8')
    nontime_data['time'].values = int_data
    with self.assertRaisesRegexp(TypeError, 'dt'):
        nontime_data.time.dt.year
I would simply use nontime_data.time.dt to make the test a little more stringent.
xarray/tests/test_accessors.py
Outdated
    # Safely pre-compute comparison fields by passing through Pandas
    # machinery
    def _getattr_and_reshape(arr, attr):
It's OK if this unit test relies on the correctness of computing date fields on NumPy arrays -- you already check that in other tests. So I would get rid of this and simply rely on these fields already working, e.g., just use self.times_arr.dt.year.
xarray/core/accessors.py
Outdated
        self._obj = xarray_obj
        self._dt = self._obj.data

    _field_ops = ['year', 'month', 'day', 'hour', 'minute', 'second',
You don't need this variable anymore.
Finished clean-up, added some documentation, etc. I mangled resolving a merge conflict with my update with respect to the virtual variables; I think some more thinking is necessary so we can come up with a plan of approach. Do we want to deprecate this feature entirely? Do we just want to wrap the datetime component virtual variables to the .dt accessor?
This is a good question, possibly worth raising in a separate issue for discussion. I think they are a convenient shortcut, but they shouldn't be the primary interface.
Yes, certainly. We definitely don't want to have two separate implementations of datetime components in the code base.
xarray/core/accessors.py
Outdated
    def f(self):
        from .dataarray import DataArray
        result = _get_date_field(self._dt, name)
        return DataArray(result, name=name,
Can also just do type(self._obj) here instead of DataArray; the result will probably work better on subclasses.
xarray/core/accessors.py
Outdated
def _tslib_field_accessor(name, docstring=None):
    def f(self):
        from .dataarray import DataArray
        result = _get_date_field(self._dt, name)
I'm not really sure why you need self._dt. Why not just use self._obj.data here?
xarray/core/accessors.py
Outdated
        Name of datetime field to access

    """
    def _access_through_series(values, name):
For serialization in dask, it's better to pass top-level functions to map_blocks. So I'd just move this to the top level.
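(Background on why: the standard pickle module cannot serialize locally defined functions, which can bite when dask needs to ship the function to workers. A quick, generic demonstration, nothing xarray-specific:)

```python
import pickle

def top_level(x):
    # a module-level function pickles by reference, so it travels easily
    return x + 1

def make_closure():
    def inner(x):
        return x + 1
    return inner

pickle.dumps(top_level)  # works

try:
    pickle.dumps(make_closure())
except (pickle.PicklingError, AttributeError) as err:
    print('nested function is not picklable:', err)
```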
        Array-like container of datetime-like values
    name : str
        Name of datetime field to access

    ds['time.dayofyear']

    xarray adds ``'season'`` to the list of datetime components supported by pandas:
    For use as a derived coordinate, xarray adds ``'season'`` to the list of
It would be nice to move the logic for season and the other datetime virtual variables to use .dt.
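One possible shape for that, reusing the `seasons[(month // 3) % 4]` trick that already lives in `_get_virtual_variable` (a sketch, not the PR's code):

```python
import numpy as np

def _season_from_months(months):
    """Map month numbers (1-12) onto meteorological season labels."""
    seasons = np.array(['DJF', 'MAM', 'JJA', 'SON'])
    months = np.asarray(months)
    return seasons[(months // 3) % 4]

# month 12 -> (12 // 3) % 4 == 0 -> 'DJF'; month 4 -> 1 -> 'MAM'; etc.
print(_season_from_months([1, 4, 7, 10, 12]))
# ['DJF' 'MAM' 'JJA' 'SON' 'DJF']
```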
xarray/core/accessors.py
Outdated
        return _access_through_series(values, name)


@register_dataarray_accessor('dt')
Another thought: since this isn't being defined outside of xarray, it might actually make sense to import this class from dataarray.py and simply add it to the DataArray class directly:

class DataArray:
    ...
    dt = property(DatetimeAccessor)

That has the advantage of being more transparent about where it comes from.
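(For anyone puzzled by `property(DatetimeAccessor)`: `property` just takes the accessor class as the getter, so each attribute access constructs a fresh accessor around the instance. A toy illustration, unrelated to xarray's actual classes:)

```python
class _UpperAccessor:
    def __init__(self, obj):
        self._obj = obj

    def shout(self):
        return self._obj.value.upper()

class Thing:
    def __init__(self, value):
        self.value = value

    # property(callable) uses the callable as the getter, so
    # Thing(...).up evaluates to _UpperAccessor(the_instance)
    up = property(_UpperAccessor)

print(Thing('hi').up.shout())  # -> HI
```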
Fixed imports to facilitate more direct implementation of DateTimeAccessor as a property in DataArray

Moved _access_through_series to a top-level function in accessors.py so that dask serialization will hopefully work a bit better
There's a test case relating to #367 (test_virtual_variable_same_name) which is causing me a bit of grief as I refactor the virtual variable logic. Should we really be able to access variables like

Two options for fixing:
This would be my preferred fix. I agree that
Turns out it was easy enough to add an accessor for
shoyer left a comment:

Looks good, once you fix a few minor points. You'll need to merge master to get this in "what's new".
xarray/core/dataset.py
Outdated
            data = getattr(ref_var.dt, var_name)
        else:
            data = getattr(date, var_name)
            data = getattr(ref_var)
This is broken -- I think this clause should be removed instead.
I think it's a typo; I intended the whole if-then-else block to read:

if is_datetime_like(ref_var.dtype):
    ref_var = xr.DataArray(ref_var)
    data = getattr(ref_var.dt, var_name).data
else:
    data = getattr(ref_var, var_name).data

We only access datetime virtual fields for now, but this leaves the possibility of other types of virtual field access at some point in the future.
xarray/core/dataset.py
Outdated
            data = seasons[(month // 3) % 4]
        if is_datetime_like(ref_var.dtype):
            ref_var = xr.DataArray(ref_var)
            data = getattr(ref_var.dt, var_name)
Probably should use getattr(ref_var.dt, var_name).data to avoid passing a DataArray instance into xarray.Variable.
xarray/core/accessors.py
Outdated
| """ | ||
| if isinstance(values, dask_array_type): | ||
| from dask.array import map_blocks | ||
| out_dtype = object if name == "weekday_name" else np.int64 |
Should also be object for season and time.
Actually, rather than put the logic here, can you add a dtype argument to _tslib_field_accessor and pass it on?
Yes - went ahead and did that. dtype now passes through the whole chain of functions, but is only used if we end up calling map_blocks on a dask array.
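Roughly, the accessor factory then ends up looking something like this (a sketch assuming the `_get_date_field` dispatch helper sketched earlier in the thread; not the verbatim final code):

```python
import numpy as np

def _tslib_field_accessor(name, docstring=None, dtype=np.int64):
    """Build a property returning the named datetime field as a new
    DataArray; dtype is only consulted on the dask (map_blocks) path."""
    def f(self):
        result = _get_date_field(self._obj.data, name, dtype)
        return type(self._obj)(result, name=name,
                               coords=self._obj.coords, dims=self._obj.dims)
    f.__name__ = name
    f.__doc__ = docstring
    return property(f)

# e.g. string-valued fields would pass dtype=object:
# weekday_name = _tslib_field_accessor('weekday_name',
#                                      "The name of the day of the week",
#                                      dtype=object)
```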
Alrighty, patched and ready for a final look-over! I appreciate the help and patience, @shoyer!
Thanks @darothen! This is really nice
* Start on implementing and testing NetCDFTimeIndex
* TST Move to using pytest fixtures to structure tests
* Address initial review comments
* Address second round of review comments
* Fix failing python3 tests
* Match test method name to method name
* First attempts at integrating NetCDFTimeIndex into xarray. This is a first pass at the following:
  - Resetting the logic for decoding datetimes such that `np.datetime64` objects are never used for non-standard calendars
  - Adding logic to use a `NetCDFTimeIndex` whenever `netcdftime.datetime` objects are used in an array being cast as an index (so if one reads in a Dataset from a netCDF file or creates one in Python, which is indexed by a time coordinate that uses `netcdftime.datetime` objects, a NetCDFTimeIndex will be used rather than a generic object-based index)
  - Adding logic to encode `netcdftime.datetime` objects when saving out to netCDF files
* Cleanup
* Fix DataFrame and Series test failures for NetCDFTimeIndex. These were related to a recent minor upstream change in pandas: https://github.com/pandas-dev/pandas/blame/master/pandas/core/indexing.py#L1433
* First pass at making NetCDFTimeIndex compatible with #1356
* Address initial review comments
* Restore test_conventions.py
* Fix failing test in test_utils.py
* flake8
* Update for standalone netcdftime
* Address stickler-ci comments
* Skip test_format_netcdftime_datetime if netcdftime not installed
* A start on documentation
* Fix failing zarr tests related to netcdftime encoding
* Simplify test_decode_standard_calendar_single_element_non_ns_range
* Address a couple review comments
* Use else clause in _maybe_cast_to_netcdftimeindex
* Start on adding enable_netcdftimeindex option
* Continue parametrizing tests in test_coding_times.py
* Update time-series.rst for enable_netcdftimeindex option
* Use :py:func: in rst for xarray.set_options
* Add a what's new entry and test that resample raises a TypeError
* Move what's new entry to the version 0.10.3 section
* Add version-dependent pathway for importing netcdftime.datetime
* Make NetCDFTimeIndex and date decoding/encoding compatible with datetime.datetime
* Remove logic to make NetCDFTimeIndex compatible with datetime.datetime
* Documentation edits
* Ensure proper enable_netcdftimeindex option is used under lazy decoding. Prior to this, opening a dataset with enable_netcdftimeindex set to True and then accessing one of its variables outside the context manager would lead to it being decoded with the default enable_netcdftimeindex (which is False). This makes sure that lazy decoding takes into account the context under which it was called.
* Add fix and test for concatenating variables with a NetCDFTimeIndex. Previously when concatenating variables indexed by a NetCDFTimeIndex the index would be wrongly converted to a generic pd.Index
* Further namespace changes due to netcdftime/cftime renaming
* NetCDFTimeIndex -> CFTimeIndex
* Documentation updates
* Only allow use of CFTimeIndex when using the standalone cftime. Also only allow for serialization of cftime.datetime objects when using the standalone cftime package.
* Fix errant what's new changes
* flake8
* Fix skip logic in test_cftimeindex.py
* Use only_use_cftime_datetimes option in num2date
* Require standalone cftime library for all new functionality. Add tests/fixes for dt accessor with cftime datetimes
* Improve skipping logic in test_cftimeindex.py
* Fix skipping logic in test_cftimeindex.py for when cftime or netcdftime are not available. Use existing requires_cftime decorator where possible (i.e. only on tests that are not parametrized via pytest.mark.parametrize)
* Fix skip logic in Python 3.4 build for test_cftimeindex.py
* Improve error messages for when the standalone cftime is not installed
* Tweak skip logic in test_accessors.py
* flake8
* Address review comments
* Temporarily remove cftime from py27 build environment on windows
* flake8
* Install cftime via pip for Python 2.7 on Windows
* flake8
* Remove unnecessary new lines; simplify _maybe_cast_to_cftimeindex
* Restore test case for #2002 in test_coding_times.py. I must have inadvertently removed it during a merge.
* Tweak dates out of range warning logic slightly to preserve current default
* Address review comments
git diff upstream/master | flake8 --diff

This uses the register_dataarray_accessor to add attributes, similar to those in pandas' timeseries, which let users quickly access datetime fields from an underlying array of datetime-like values. The referenced issue (#358) also asks about adding similar accessors for .str, but this is a more complex topic - I think a compelling use-case would help in figuring out what the critical functionality is.

Virtual time fields

Presumably this could be used to augment Dataset._get_virtual_variable(). A season field would need to be added as a special field to the accessor.
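To make the relationship concrete, a small usage example of the accessor alongside the existing virtual-variable shortcut (assuming a released xarray with this PR merged):

```python
import pandas as pd
import xarray as xr

ds = xr.Dataset({'time': pd.date_range(start='2000-01-01', freq='6H', periods=8)})

# New accessor-based access to datetime components...
print(ds.time.dt.dayofyear.values)  # [1 1 1 1 2 2 2 2]
print(ds.time.dt.hour.values)       # [ 0  6 12 18  0  6 12 18]

# ...which mirrors the existing virtual-variable shortcut
print(ds['time.hour'].values)       # [ 0  6 12 18  0  6 12 18]
```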