Add DatetimeAccessor for accessing datetime fields via .dt attribute #1356


Merged
merged 23 commits into from
Apr 29, 2017

Conversation


@darothen darothen commented Apr 6, 2017

This uses `register_dataarray_accessor` to add attributes similar to those in pandas' timeseries accessors, which let users quickly access datetime fields from an underlying array of datetime-like values. The referenced issue (#358) also asks about adding similar accessors for `.str`, but this is a more complex topic - I think a compelling use-case would help in figuring out what the critical functionality is.

Virtual time fields

Presumably this could be used to augment Dataset._get_virtual_variable(). A season field would need to be added as a special field to the accessor.
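The `season` field isn't one of the pandas date components, so the accessor would have to derive it from month numbers. A minimal sketch of that mapping (the helper name `infer_season` is hypothetical; the `(month // 3) % 4` trick follows the meteorological DJF/MAM/JJA/SON convention used elsewhere in xarray):

```python
import numpy as np
import pandas as pd

def infer_season(times):
    # Hypothetical helper: derive meteorological seasons from months.
    # Dec/Jan/Feb -> DJF, Mar/Apr/May -> MAM, Jun/Jul/Aug -> JJA, etc.
    seasons = np.array(['DJF', 'MAM', 'JJA', 'SON'])
    months = pd.Series(np.asarray(times).ravel()).dt.month.values
    return seasons[(months // 3) % 4]

times = pd.date_range('2000-01-01', periods=4, freq='MS')
print(infer_season(times.values))  # ['DJF' 'DJF' 'MAM' 'MAM']
```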


@shoyer shoyer left a comment


This is fantastic, thanks!

def f(self):
from .dataarray import DataArray
values = self.dt
result = libts.get_date_field(values, field)
@shoyer (Member):

This is using private pandas API, so it's liable to break in the future. Instead, let's wrap things in a pandas.Series. Something like this should do the trick:

def get_dt_field(array: np.ndarray, name: str):
    series = pd.Series(array.ravel())
    field_values = getattr(series.dt, name).values
    return field_values.reshape(array.shape)
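For illustration, the suggested helper only touches public pandas API and round-trips the array shape; a self-contained version of the snippet above with explicit imports:

```python
import numpy as np
import pandas as pd

def get_dt_field(array, name):
    # Wrap in a pandas.Series so only the public .dt API is used
    series = pd.Series(array.ravel())
    field_values = getattr(series.dt, name).values
    return field_values.reshape(array.shape)

times = pd.date_range('2000-01-01', periods=4).values.reshape(2, 2)
print(get_dt_field(times, 'day'))  # [[1 2]
                                   #  [3 4]]
```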

... freq='D', periods=100)})
>>> ds.time.dt
<xarray.core.accessors.DatetimeAccessor at 0x10c369f60>
>>> ds.time.dt.dayofyear[5]
@shoyer (Member):

I think this should be [:5], not [5]?


All of the pandas fields are accessible here. Note that these fields are not
calendar-aware; if your datetimes are encoded with a non-Gregorian calendar
(e.g. a 360-day calendar), then some fields like `dayofyear` may not be
@shoyer (Member):

add "using netcdftime" just to be super explicit about where special calendars come from.

def __init__(self, xarray_obj):
if not is_datetime_like(xarray_obj.dtype):
raise TypeError("'dt' accessor only available for "
"DataArray with datetime64 dtype")
@shoyer (Member):

add "or timedelta64 dtype"
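A sketch of what such a dtype predicate could look like (the real `is_datetime_like` lives in xarray's internals; this illustrative version just shows covering both dtypes):

```python
import numpy as np

def is_datetime_like(dtype):
    # Accept both datetime64 and timedelta64 dtypes
    return (np.issubdtype(dtype, np.datetime64)
            or np.issubdtype(dtype, np.timedelta64))

print(is_datetime_like(np.dtype('datetime64[ns]')))   # True
print(is_datetime_like(np.dtype('timedelta64[ns]')))  # True
print(is_datetime_like(np.dtype('int8')))             # False
```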

days = self.times.day
hours = self.times.hour

self.assertArrayEqual(years, self.data.time.dt.year)
@shoyer (Member):

Use self.assertDataArrayEqual to check that fields provide data arrays with the appropriate metadata.


def test_not_datetime_type(self):
nontime_data = self.data.copy()
nontime_data['time'].values = \
@shoyer (Member):

please avoid explicit backslashes for continuation if possible, per PEP 8.

nontime_data['time'].values = \
np.arange(len(self.data.time)).astype('int8')
with self.assertRaisesRegexp(TypeError, 'dt'):
nontime_data.time.dt.year
@shoyer (Member):

missing newline on last line

@darothen commented Apr 7, 2017

@shoyer I corrected things based on your comments. The last commit is an attempt to refactor things to match the way that methods like rolling/groupby functions are injected into the class; this might be totally superfluous here, but I thought it was worth trying.


@property
def dt(self):
"""Attribute to cache a view of the underlying datetime-like
@shoyer (Member):

I don't think we want to expose this as public-facing API. So I would get rid of this property, probably.

('daysinmonth', "The number of days in the month"),
]

def inject_date_field_accessors(cls):
@shoyer (Member):

This is fine, but I actually did like what you did before -- and we'll probably be switching xarray to use that style in the future, instead of injecting things with a loop. Using a function to define the method directly on the class is much more explicit.

@darothen (Author):

Cool, I'll revert back to the explicit style.

array for passing to pandas.tslib for date_field operations
"""
if self._dt is None:
self._dt = self._obj.values
@shoyer (Member):

Something possibly worth thinking about, at least a little: what happens if .dt is called on an object that stores its values in a dask array? Ideally, we would handle this transparently, by either calling _get_date_field directly (on numpy arrays) or using dask.array.map_blocks to call _get_date_field (on dask arrays).

@darothen (Author):

Yeah, I see the potential for an issue here, but from some limited testing I'm not sure where it will arise. For instance, consider this contrived example:

import numpy as np
import pandas as pd
import xarray as xr
from dask import array as da

nt = 100000
data = np.random.rand(10, 10, nt)
lons = np.linspace(0, 11, 10)
lats = np.linspace(0, 20, 10)
times = pd.date_range(start="2000/01/01", freq='D', periods=nt)

datest = xr.DataArray(data, coords=[lons, lats, times], 
                  dims=['lon', 'lat', 'time'], name='data')

tc = da.from_array(datest.time.values, chunks=5000)
datest['time2'] = (('time', ), tc)
datest.time2.dt.month

This works just fine. I tried creating an example where I transform the data underlying datest['time'] into a dask array, but I can't create one that doesn't eagerly load the data into memory, no matter how I try to chunk it. Do dimensions automatically load dask arrays?

Chunking on datest itself was not a problem either, again because accessing datest.time always gave an in-memory array.

@shoyer (Member):

You can only put a dask array in xarray if it isn't a dimension coordinate, so in practice these are pretty rare for datetime64 variables, but they do come up occasionally (e.g., consider a 2D variable time for a collection of weather forecasts, with dimensions for reference time and time since the start of the forecast).

@darothen (Author):

Gotcha. A few more lines of code to add to the earlier example:

times_arr = np.random.choice(times, size=(10, 10, nt))

dstest = datest.to_dataset()
dstest['all_times'] = (('lon', 'lat', 'time'), times_arr)

dstest = dstest.chunk({'lon': 5, 'lat': 5, 'time': 5000})

So now we're definitely holding a dask array:

>>> dstest.all_times
<xarray.DataArray 'all_times' (lon: 10, lat: 10, time: 100000)>
dask.array<xarray-..., shape=(10, 10, 100000), dtype=datetime64[ns], chunksize=(5, 5, 5000)>
Coordinates:
  * lon      (lon) float64 0.0 1.222 2.444 3.667 4.889 6.111 7.333 8.556 ...
  * lat      (lat) float64 0.0 2.222 4.444 6.667 8.889 11.11 13.33 15.56 ...
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
    time2    (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...

but our datetime accessors still work just fine:

>>> dstest.all_times.dt.month
<xarray.DataArray 'month' (lon: 10, lat: 10, time: 100000)>
array([[[ 3,  5, ..., 11,  5],
        [ 2, 12, ...,  8,  4],
        ..., 
        [ 9, 11, ...,  6, 11],
        [ 6,  1, ..., 12,  8]],
...

Here I cache the values of the DataArray passed in, and that's what the accessor inspects to figure out the requested field. Are those always eagerly evaluated? If so, that would explain why everything works out okay - we've stored an array in memory, and we're converting that to a pandas Series.

Things also work fine if I create a chunked dask array from times_arr above and store it directly in the Dataset.

@shoyer (Member):

Yes, the current version of the accessor works, but only by loading the entire dask array into memory as a NumPy array first. That's a no-no: operations on dask arrays should be lazy, or raise an error if that isn't possible.

@darothen (Author):

Ok I understand now. I'll see what I can do about safely handling dask arrays here; I know there's plenty of examples to emulate across the xarray code base.


@darothen (Author):

Updated with support for multi-dimensional time data stored as a dask array.

attr_values = getattr(accessor, name).values
return attr_values.compute()

def _ravel_and_access(darr):
@shoyer (Member):

This is more complicated than it needs to be, as evidenced by the nested calls to compute. The great thing about map_blocks is that you can make everything eager, e.g.,

def _ravel_and_access(values):
    values_as_series = pd.Series(values.ravel())
    field_values = getattr(values_as_series.dt, name).values
    return field_values.reshape(values.shape)

That's it!

@darothen (Author):

Wow! The time spent on tackling this project is well worth it just for this tidbit alone. I'll circle back to clean up this stuff within the next day or so.

dtype=raveled.dtype)
return field_values.reshape(darr.shape).compute()

return map_blocks(_ravel_and_access, values, dtype=values.dtype)
@shoyer (Member):

dtype needs to be the dtype of the return value, not the input. For most of these accessors, this is just np.int64 (for strings, use dtype=object).

int_data = np.arange(len(self.data.time)).astype('int8')
nontime_data['time'].values = int_data
with self.assertRaisesRegexp(TypeError, 'dt'):
nontime_data.time.dt.year
@shoyer (Member):

I would use simply nontime_data.time.dt to make the test a little more stringent.


# Safely pre-compute comparison fields by passing through Pandas
# machinery
def _getattr_and_reshape(arr, attr):
@shoyer (Member):

It's OK if this unit test relies on the correctness of computing date fields on NumPy arrays -- you already check that in other tests. So I would get rid of this and simply rely on these fields already working, e.g., just use self.times_arr.dt.year.

self._obj = xarray_obj
self._dt = self._obj.data

_field_ops = ['year', 'month', 'day', 'hour', 'minute', 'second',
@shoyer (Member):

you don't need this variable anymore.

@darothen (Author):

Finished clean-up, added some documentation, etc. I mangled the commit text while resolving a merge conflict with my update to whats-new.rst (5ae4e08), but other than that I think we're getting closer to finishing this.

With respect to the virtual variables, I think some more thinking is necessary so we can come up with a plan of approach. Do we want to deprecate this feature entirely? Do we just want to wrap the datetime component virtual variables to the .dt accessor if they're datetime-like? We could very easily do the latter for 0.9.3, but maybe we should target a future major release to deprecate the virtual variables and instead encourage adding a few specialized (but commonly-used) accessors to xarray?

@shoyer commented Apr 14, 2017

With respect to the virtual variables, I think some more thinking is necessary so we can come up with a plan of approach. Do we want to deprecate this feature entirely?

This is a good question, possibly worth raising in a separate issue for discussion. I think they are a convenient shortcut, but they shouldn't be the primary interface.

Do we just want to wrap the datetime component virtual variables to the .dt accessor if they're datetime-like?

Yes, certainly. We definitely don't want to have two separate implementations of datetime components in the code base.

def f(self):
from .dataarray import DataArray
result = _get_date_field(self._dt, name)
return DataArray(result, name=name,
@shoyer (Member):

can also just do type(self._obj) here instead of DataArray, and the result will probably work better on subclasses

def _tslib_field_accessor(name, docstring=None):
def f(self):
from .dataarray import DataArray
result = _get_date_field(self._dt, name)
@shoyer (Member):

I'm not really sure why you need self._dt. Why not just use self._obj.data here?

Name of datetime field to access

"""
def _access_through_series(values, name):
@shoyer (Member):

For serialization in dask, it's better to pass top-level functions to map_blocks. So I'd just move this to the top level.

Array-like container of datetime-like values
name : str
Name of datetime field to access

@shoyer (Member):

Add Returns

@@ -100,7 +115,8 @@ __ http://pandas.pydata.org/pandas-docs/stable/api.html#time-date-components
ds['time.month']
ds['time.dayofyear']

xarray adds ``'season'`` to the list of datetime components supported by pandas:
For use as a derived coordinate, xarray adds ``'season'`` to the list of
@shoyer (Member):

It would be nice to move the logic for season and the other datetime virtual variables to use .dt.

return _access_through_series(values, name)


@register_dataarray_accessor('dt')
@shoyer (Member):

Another thought: since this isn't being defined outside of xarray, it might actually make sense to import this class in dataarray.py and simply add it to the DataArray class directly:

class DataArray:
    ...
    dt = property(DatetimeAccessor)

That has the advantage of being more transparent about where it comes from.
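The trick here is that `property` accepts any callable as its getter, including a class, so attribute access constructs a fresh accessor bound to the instance. A toy illustration (all names are made up, not from this PR):

```python
class Accessor:
    # Hypothetical accessor: receives the host object on construction
    def __init__(self, obj):
        self._obj = obj

    @property
    def doubled(self):
        return self._obj.value * 2

class Host:
    def __init__(self, value):
        self.value = value

    # property(Accessor): `host.acc` evaluates to Accessor(host)
    acc = property(Accessor)

print(Host(21).acc.doubled)  # 42
```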

darothen added 3 commits April 17, 2017 12:20
Fixed imports to facilitate more direct implementation of DateTimeAccessor as a property in DataArray
Moved _access_through_series to a top-level function in accessors.py so that dask serialization will hopefully work a bit better
@darothen (Author):

There's a test case relating to #367 (test_virtual_variable_same_name) which is causing me a bit of grief as I refactor the virtual variable logic. Should we really be able to access variables like ds['time.time']? This seems to break the logic of what a virtual variable does, and was implemented to help out with time GroupBys and resampling (something I'll eventually get around to finishing up a refactor for - #1272).

Two options for fixing:

  1. add a "time" field to DateTimeAccessor.
  2. add some additional if-then-else logic to _get_virtual_variable.

@shoyer commented Apr 17, 2017

add a "time" field to DateTimeAccessor.

This would be my preferred fix.

I agree that ds['time.time'] is a little confusing but think it's unambiguous: it says get the "time" component (an array with datetime.time elements) from the "time" variable (an array of dtype=datetime64), i.e., equivalent to df['time'].dt.time in pandas.
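For reference, the pandas behaviour being matched (a throwaway Series, not code from this PR): `.dt.time` extracts `datetime.time` objects from datetime64 values:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(['2000-01-01 06:30', '2000-01-01 18:45']))
# .dt.time drops the date part, leaving datetime.time objects
print(s.dt.time.tolist())  # [datetime.time(6, 30), datetime.time(18, 45)]
```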

@darothen (Author):

Turns out it was easy enough to add an accessor for ds['time.time']; that's already provided via pandas.


@shoyer shoyer left a comment


Looks good, once you fix a few minor points. You'll need to merge master to get this in "what's new".

else:
data = getattr(date, var_name)
data = getattr(ref_var)
@shoyer (Member):

This is broken -- I think this clause should be removed instead.

@darothen (Author):

I think it's a typo; I intended the whole if-then-else block to read:

        if is_datetime_like(ref_var.dtype):
            ref_var = xr.DataArray(ref_var)
            data = getattr(ref_var.dt, var_name).data
        else:
            data = getattr(ref_var, var_name).data

We only access datetime virtual fields for now, but this leaves the possibility of other types of virtual field access at some point in the future.

data = seasons[(month // 3) % 4]
if is_datetime_like(ref_var.dtype):
ref_var = xr.DataArray(ref_var)
data = getattr(ref_var.dt, var_name)
@shoyer (Member):

probably should use getattr(ref_var.dt, var_name).data to avoid passing a DataArray instance into xarray.Variable.

"""
if isinstance(values, dask_array_type):
from dask.array import map_blocks
out_dtype = object if name == "weekday_name" else np.int64
@shoyer (Member):

Should also be object for season and time.

Actually, rather than put the logic here, can you add a dtype argument to _tslib_field_accessor and pass it on?

@darothen (Author):

Yes - went ahead and did that. dtype now passes through the whole chain of functions, but is only used if we end up calling map_blocks on a dask array.

@darothen (Author):

Alrighty, patched and ready for a final look-over! I appreciate the help and patience, @shoyer!

@shoyer shoyer merged commit 8f6a68e into pydata:master Apr 29, 2017
@shoyer commented Apr 29, 2017

Thanks @darothen! This is really nice.

@shoyer shoyer mentioned this pull request May 11, 2017
4 tasks
spencerkclark added a commit to spencerkclark/xarray that referenced this pull request May 11, 2017
jhamman pushed a commit that referenced this pull request May 13, 2018
* Start on implementing and testing NetCDFTimeIndex

* TST Move to using pytest fixtures to structure tests

* Address initial review comments

* Address second round of review comments

* Fix failing python3 tests

* Match test method name to method name

* First attempts at integrating NetCDFTimeIndex into xarray

This is a first pass at the following:
- Resetting the logic for decoding datetimes such that `np.datetime64` objects
are never used for non-standard calendars
- Adding logic to use a `NetCDFTimeIndex` whenever `netcdftime.datetime`
objects are used in an array being cast as an index (so if one reads in a
Dataset from a netCDF file or creates one in Python, which is indexed by a time
coordinate that uses `netcdftime.datetime` objects a NetCDFTimeIndex will be
used rather than a generic object-based index)
- Adding logic to encode `netcdftime.datetime` objects when saving out to
netCDF files

* Cleanup

* Fix DataFrame and Series test failures for NetCDFTimeIndex

These were related to a recent minor upstream change in pandas:
https://github.com/pandas-dev/pandas/blame/master/pandas/core/indexing.py#L1433

* First pass at making NetCDFTimeIndex compatible with #1356

* Address initial review comments

* Restore test_conventions.py

* Fix failing test in test_utils.py

* flake8

* Update for standalone netcdftime

* Address stickler-ci comments

* Skip test_format_netcdftime_datetime if netcdftime not installed

* A start on documentation

* Fix failing zarr tests related to netcdftime encoding

* Simplify test_decode_standard_calendar_single_element_non_ns_range

* Address a couple review comments

* Use else clause in _maybe_cast_to_netcdftimeindex

* Start on adding enable_netcdftimeindex option

* Continue parametrizing tests in test_coding_times.py

* Update time-series.rst for enable_netcdftimeindex option

* Use :py:func: in rst for xarray.set_options

* Add a what's new entry and test that resample raises a TypeError

* Move what's new entry to the version 0.10.3 section

* Add version-dependent pathway for importing netcdftime.datetime

* Make NetCDFTimeIndex and date decoding/encoding compatible with datetime.datetime

* Remove logic to make NetCDFTimeIndex compatible with datetime.datetime

* Documentation edits

* Ensure proper enable_netcdftimeindex option is used under lazy decoding

Prior to this, opening a dataset with enable_netcdftimeindex set to True
and then accessing one of its variables outside the context manager would
lead to it being decoded with the default enable_netcdftimeindex
(which is False).  This makes sure that lazy decoding takes into account
the context under which it was called.

* Add fix and test for concatenating variables with a NetCDFTimeIndex

Previously when concatenating variables indexed by a NetCDFTimeIndex
the index would be wrongly converted to a generic pd.Index

* Further namespace changes due to netcdftime/cftime renaming

* NetCDFTimeIndex -> CFTimeIndex

* Documentation updates

* Only allow use of CFTimeIndex when using the standalone cftime

Also only allow for serialization of cftime.datetime objects when
using the standalone cftime package.

* Fix errant what's new changes

* flake8

* Fix skip logic in test_cftimeindex.py

* Use only_use_cftime_datetimes option in num2date

* Require standalone cftime library for all new functionality

Add tests/fixes for dt accessor with cftime datetimes

* Improve skipping logic in test_cftimeindex.py

* Fix skipping logic in test_cftimeindex.py for when cftime or netcdftime
are not available.  Use existing requires_cftime decorator where possible
(i.e. only on tests that are not parametrized via pytest.mark.parametrize)

* Fix skip logic in Python 3.4 build for test_cftimeindex.py

* Improve error messages for when the standalone cftime is not installed

* Tweak skip logic in test_accessors.py

* flake8

* Address review comments

* Temporarily remove cftime from py27 build environment on windows

* flake8

* Install cftime via pip for Python 2.7 on Windows

* flake8

* Remove unnecessary new lines; simplify _maybe_cast_to_cftimeindex

* Restore test case for #2002 in test_coding_times.py

I must have inadvertently removed it during a merge.

* Tweak dates out of range warning logic slightly to preserve current default

* Address review comments

Successfully merging this pull request may close these issues.

add .dt and .str accessors to DataArray (like pandas.Series)
2 participants