Add DatetimeAccessor for accessing datetime fields via .dt attribute #1356


Merged
merged 23 commits into from
Apr 29, 2017

Conversation


@darothen darothen commented Apr 6, 2017

This uses `register_dataarray_accessor` to add attributes similar to those in pandas' timeseries accessors, which let users quickly access datetime fields from an underlying array of datetime-like values. The referenced issue (#358) also asks about adding similar accessors for `.str`, but this is a more complex topic - I think a compelling use-case would help in figuring out what the critical functionality is.

Virtual time fields

Presumably this could be used to augment Dataset._get_virtual_variable(). A season field would need to be added as a special field to the accessor.
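The `season` field isn't one of the pandas date components, so the accessor would have to derive it from month numbers. A minimal sketch of that mapping (the helper name `infer_season` is hypothetical; the `(month // 3) % 4` trick follows the meteorological DJF/MAM/JJA/SON convention used elsewhere in xarray):

```python
import numpy as np
import pandas as pd

def infer_season(times):
    # Hypothetical helper: derive meteorological seasons from months.
    # Dec/Jan/Feb -> DJF, Mar/Apr/May -> MAM, Jun/Jul/Aug -> JJA, etc.
    seasons = np.array(['DJF', 'MAM', 'JJA', 'SON'])
    months = pd.Series(np.asarray(times).ravel()).dt.month.values
    return seasons[(months // 3) % 4]

times = pd.date_range('2000-01-01', periods=4, freq='MS')
print(infer_season(times.values))  # ['DJF' 'DJF' 'MAM' 'MAM']
```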


@shoyer shoyer left a comment


This is fantastic, thanks!

def f(self):
from .dataarray import DataArray
values = self.dt
result = libts.get_date_field(values, field)
@shoyer (Member):

This is using private pandas API, so it's liable to break in the future. Instead, let's wrap things in a pandas.Series. Something like this should do the trick:

def get_dt_field(array: np.ndarray, name: str):
    series = pd.Series(array.ravel())
    field_values = getattr(series.dt, name).values
    return field_values.reshape(array.shape)
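For illustration, the suggested helper only touches public pandas API and round-trips the array shape; a self-contained version of the snippet above with explicit imports:

```python
import numpy as np
import pandas as pd

def get_dt_field(array, name):
    # Wrap in a pandas.Series so only the public .dt API is used
    series = pd.Series(array.ravel())
    field_values = getattr(series.dt, name).values
    return field_values.reshape(array.shape)

times = pd.date_range('2000-01-01', periods=4).values.reshape(2, 2)
print(get_dt_field(times, 'day'))  # [[1 2]
                                   #  [3 4]]
```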

... freq='D', periods=100)})
>>> ds.time.dt
<xarray.core.accessors.DatetimeAccessor at 0x10c369f60>
>>> ds.time.dt.dayofyear[5]
@shoyer (Member):

I think this should be [:5], not [5]?


All of the pandas fields are accessible here. Note that these fields are not
calendar-aware; if your datetimes are encoded with a non-Gregorian calendar
(e.g. a 360-day calendar), then some fields like `dayofyear` may not be
@shoyer (Member):

add "using netcdftime" just to be super explicit about where special calendars come from.

def __init__(self, xarray_obj):
if not is_datetime_like(xarray_obj.dtype):
raise TypeError("'dt' accessor only available for "
"DataArray with datetime64 dtype")
@shoyer (Member):

add "or timedelta64 dtype"
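A sketch of what such a dtype predicate could look like (the real `is_datetime_like` lives in xarray's internals; this illustrative version just shows covering both dtypes):

```python
import numpy as np

def is_datetime_like(dtype):
    # Accept both datetime64 and timedelta64 dtypes
    return (np.issubdtype(dtype, np.datetime64)
            or np.issubdtype(dtype, np.timedelta64))

print(is_datetime_like(np.dtype('datetime64[ns]')))   # True
print(is_datetime_like(np.dtype('timedelta64[ns]')))  # True
print(is_datetime_like(np.dtype('int8')))             # False
```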

days = self.times.day
hours = self.times.hour

self.assertArrayEqual(years, self.data.time.dt.year)
@shoyer (Member):

Use self.assertDataArrayEqual to check that fields provide data arrays with the appropriate metadata.


def test_not_datetime_type(self):
nontime_data = self.data.copy()
nontime_data['time'].values = \
@shoyer (Member):

please avoid explicit backslashes for continuation if possible, per PEP 8.

nontime_data['time'].values = \
np.arange(len(self.data.time)).astype('int8')
with self.assertRaisesRegexp(TypeError, 'dt'):
nontime_data.time.dt.year
@shoyer (Member):

missing newline on last line

@darothen commented Apr 7, 2017

@shoyer I corrected things based on your comments. The last commit is an attempt to refactor things to match the way that methods like rolling/groupby functions are injected into the class; this might be totally superfluous here, but I thought it was worth trying.


@property
def dt(self):
"""Attribute to cache a view of the underlying datetime-like
@shoyer (Member):

I don't think we want to expose this as public-facing API. So I would get rid of this property, probably.

('daysinmonth', "The number of days in the month"),
]

def inject_date_field_accessors(cls):
@shoyer (Member):

This is fine, but I actually did like what you did before -- and we'll probably be switching xarray to use that style in the future, instead of injecting things with a loop. Using a function to define the method directly on the class is much more explicit.

@darothen (Author):

Cool, I'll revert back to the explicit style.

array for passing to pandas.tslib for date_field operations
"""
if self._dt is None:
self._dt = self._obj.values
@shoyer (Member):

Something possibly worth thinking about, at least a little: what happens if .dt is called on an object that stores its values in a dask array? Ideally, we would handle this transparently, by either calling _get_date_field directly (on numpy arrays) or using dask.array.map_blocks to call _get_date_field (on dask arrays).

@darothen (Author):

Yeah, I see the potential for an issue here, but from some limited testing I'm not sure where it will arise. For instance, consider this contrived example:

import numpy as np
import pandas as pd
import xarray as xr
from dask import array as da

nt = 100000
data = np.random.rand(10, 10, nt)
lons = np.linspace(0, 11, 10)
lats = np.linspace(0, 20, 10)
times = pd.date_range(start="2000/01/01", freq='D', periods=nt)

datest = xr.DataArray(data, coords=[lons, lats, times], 
                  dims=['lon', 'lat', 'time'], name='data')

tc = da.from_array(datest.time.values, chunks=5000)
datest['time2'] = (('time', ), tc)
datest.time2.dt.month

This works just fine. I tried creating an example where I transform the data underlying datest['time'] into a dask array, but I can't create one that doesn't eagerly load the data into memory, no matter how I try to chunk it. Do dimensions automatically load dask arrays?

Chunking on datest itself was not a problem either, again because accessing datest.time always gave an in-memory array.

@shoyer (Member):

You can only put a dask array in xarray if it isn't a dimension coordinate, so in practice these are pretty rare for datetime64 variables, but they do come up occasionally (e.g., consider a 2D variable time for a collection of weather forecasts, with dimensions for reference time and time since the start of the forecast).

@darothen (Author):

Gotcha. A few more lines of code to add to the earlier example:

times_arr = np.random.choice(times, size=(10, 10, nt))

dstest = datest.to_dataset()
dstest['all_times'] = (('lon', 'lat', 'time'), times_arr)

dstest = dstest.chunk({'lon': 5, 'lat': 5, 'time': 5000})

So now we're definitely holding a dask array:

>>> dstest.all_times
<xarray.DataArray 'all_times' (lon: 10, lat: 10, time: 100000)>
dask.array<xarray-..., shape=(10, 10, 100000), dtype=datetime64[ns], chunksize=(5, 5, 5000)>
Coordinates:
  * lon      (lon) float64 0.0 1.222 2.444 3.667 4.889 6.111 7.333 8.556 ...
  * lat      (lat) float64 0.0 2.222 4.444 6.667 8.889 11.11 13.33 15.56 ...
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
    time2    (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...

but our datetime accessors still work just fine:

>>> dstest.all_times.dt.month
<xarray.DataArray 'month' (lon: 10, lat: 10, time: 100000)>
array([[[ 3,  5, ..., 11,  5],
        [ 2, 12, ...,  8,  4],
        ..., 
        [ 9, 11, ...,  6, 11],
        [ 6,  1, ..., 12,  8]],
...

Here I cache the values of the DataArray passed in, and that's what the accessor inspects to figure out the requested field. Are those always eagerly evaluated? If so, that would explain why everything works out okay - we've stored an array in memory, and we're converting that to a pandas Series.

Things also work fine if I create a chunked dask array from times_arr above and store it directly in the Dataset.

@shoyer (Member):

Yes, the current version of the accessor works, but only by loading the entire dask array into memory as a NumPy array first. That's a no-no: operations on dask arrays should be lazy, or raise an error if that isn't possible.

@darothen (Author):

Ok I understand now. I'll see what I can do about safely handling dask arrays here; I know there's plenty of examples to emulate across the xarray code base.


@darothen (Author):

Updated with support for multi-dimensional time data stored as a dask array.

attr_values = getattr(accessor, name).values
return attr_values.compute()

def _ravel_and_access(darr):
@shoyer (Member):

This is more complicated than it needs to be, as evidenced by the nested calls to compute. The great thing about map_blocks is that you can make everything eager, e.g.,

def _ravel_and_access(values):
    values_as_series = pd.Series(values.ravel())
    field_values = getattr(values_as_series.dt, name).values
    return field_values.reshape(values.shape)

That's it!

@darothen (Author):

Wow! The time spent on tackling this project is well worth it just for this tidbit alone. I'll circle back to clean up this stuff within the next day or so.

dtype=raveled.dtype)
return field_values.reshape(darr.shape).compute()

return map_blocks(_ravel_and_access, values, dtype=values.dtype)
@shoyer (Member):

dtype needs to be the dtype of the return value, not the input. For most of these accessors, this is just np.int64 (for strings, use dtype=object).

int_data = np.arange(len(self.data.time)).astype('int8')
nontime_data['time'].values = int_data
with self.assertRaisesRegexp(TypeError, 'dt'):
nontime_data.time.dt.year
@shoyer (Member):

I would use simply nontime_data.time.dt to make the test a little more stringent.


# Safely pre-compute comparison fields by passing through Pandas
# machinery
def _getattr_and_reshape(arr, attr):
@shoyer (Member):

It's OK if this unit test relies on the correctness of computing date fields on NumPy arrays -- you already check that in other tests. So I would get rid of this and simply rely on these fields already working, e.g., just use self.times_arr.dt.year.

self._obj = xarray_obj
self._dt = self._obj.data

_field_ops = ['year', 'month', 'day', 'hour', 'minute', 'second',
@shoyer (Member):

you don't need this variable anymore.

@darothen (Author):

Finished clean-up, added some documentation, etc. I mangled the commit text while resolving a merge conflict with my update to whats-new.rst (5ae4e08), but other than that I think we're getting closer to finishing this.

With respect to the virtual variables, I think some more thinking is necessary so we can come up with a plan of approach. Do we want to deprecate this feature entirely? Do we just want to wrap the datetime component virtual variables to the .dt accessor if they're datetime-like? We could very easily do the latter for 0.9.3, but maybe we should target a future major release to deprecate the virtual variables and instead encourage adding a few specialized (but commonly-used) accessors to xarray?

@shoyer commented Apr 14, 2017

With respect to the virtual variables, I think some more thinking is necessary so we can come up with a plan of approach. Do we want to deprecate this feature entirely?

This is a good question, possibly worth raising in a separate issue for discussion. I think they are a convenient shortcut, but they shouldn't be the primary interface.

Do we just want to wrap the datetime component virtual variables to the .dt accessor if they're datetime-like?

Yes, certainly. We definitely don't want to have two separate implementations of datetime components in the code base.

def f(self):
from .dataarray import DataArray
result = _get_date_field(self._dt, name)
return DataArray(result, name=name,
@shoyer (Member):

can also just do type(self._obj) here instead of DataArray, and the result will probably work better on subclasses

def _tslib_field_accessor(name, docstring=None):
def f(self):
from .dataarray import DataArray
result = _get_date_field(self._dt, name)
@shoyer (Member):

I'm not really sure why you need self._dt. Why not just use self._obj.data here?

Name of datetime field to access

"""
def _access_through_series(values, name):
@shoyer (Member):

For serialization in dask, it's better to pass top-level functions to map_blocks. So I'd just move this to the top level.

Array-like container of datetime-like values
name : str
Name of datetime field to access

@shoyer (Member):

Add Returns

@@ -100,7 +115,8 @@ __ http://pandas.pydata.org/pandas-docs/stable/api.html#time-date-components
ds['time.month']
ds['time.dayofyear']

xarray adds ``'season'`` to the list of datetime components supported by pandas:
For use as a derived coordinate, xarray adds ``'season'`` to the list of
@shoyer (Member):

It would be nice to move the logic for season and the other datetime virtual variables to use .dt.

return _access_through_series(values, name)


@register_dataarray_accessor('dt')
@shoyer (Member):

Another thought: since this isn't being defined outside of xarray, it might actually make sense to import this class in dataarray.py and simply add it to the DataArray class directly:

class DataArray:
    ...
    dt = property(DatetimeAccessor)

That has the advantage of being more transparent about where it comes from.
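The trick here is that `property` accepts any callable as its getter, including a class, so attribute access constructs a fresh accessor bound to the instance. A toy illustration (all names are made up, not from this PR):

```python
class Accessor:
    # Hypothetical accessor: receives the host object on construction
    def __init__(self, obj):
        self._obj = obj

    @property
    def doubled(self):
        return self._obj.value * 2

class Host:
    def __init__(self, value):
        self.value = value

    # property(Accessor): `host.acc` evaluates to Accessor(host)
    acc = property(Accessor)

print(Host(21).acc.doubled)  # 42
```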

darothen added 3 commits April 17, 2017 12:20
Fixed imports to facilitate more direct implementation of DateTimeAccessor as a property in DataArray
Moved _access_through_series to a top-level function in accessors.py so that dask serialization will hopefully work a bit better
@darothen (Author):

There's a test case relating to #367 (test_virtual_variable_same_name) which is causing me a bit of grief as I refactor the virtual variable logic. Should we really be able to access variables like ds['time.time']? This seems to break the logic of what a virtual variable does, and was implemented to help out with time GroupBys and resampling (something I'll eventually get around to finishing up a refactor for - #1272).

Two options for fixing:

  1. add a "time" field to DateTimeAccessor.
  2. add some additional if-then-else logic to _get_virtual_variable.

@shoyer commented Apr 17, 2017

add a "time" field to DateTimeAccessor.

This would be my preferred fix.

I agree that ds['time.time'] is a little confusing but think it's unambiguous: it says get the "time" component (an array with datetime.time elements) from the "time" variable (an array of dtype=datetime64), i.e., equivalent to df['time'].dt.time in pandas.
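For reference, the pandas behaviour being matched (a throwaway Series, not code from this PR): `.dt.time` extracts `datetime.time` objects from datetime64 values:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(['2000-01-01 06:30', '2000-01-01 18:45']))
# .dt.time drops the date part, leaving datetime.time objects
print(s.dt.time.tolist())  # [datetime.time(6, 30), datetime.time(18, 45)]
```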

@darothen (Author):

Turns out it was easy enough to add an accessor for ds['time.time']; that's already provided via pandas.


@shoyer shoyer left a comment


Looks good, once you fix a few minor points. You'll need to merge master to get this in "what's new".

else:
data = getattr(date, var_name)
data = getattr(ref_var)
@shoyer (Member):

This is broken -- I think this clause should be removed instead.

@darothen (Author):

I think it's a typo; I intended the whole if-then-else block to read:

        if is_datetime_like(ref_var.dtype):
            ref_var = xr.DataArray(ref_var)
            data = getattr(ref_var.dt, var_name).data
        else:
            data = getattr(ref_var, var_name).data

We only access datetime virtual fields for now, but this leaves the possibility of other types of virtual field access at some point in the future.

data = seasons[(month // 3) % 4]
if is_datetime_like(ref_var.dtype):
ref_var = xr.DataArray(ref_var)
data = getattr(ref_var.dt, var_name)
@shoyer (Member):

probably should use getattr(ref_var.dt, var_name).data to avoid passing a DataArray instance into xarray.Variable.

"""
if isinstance(values, dask_array_type):
from dask.array import map_blocks
out_dtype = object if name == "weekday_name" else np.int64
@shoyer (Member):

Should also be object for season and time.

Actually, rather than put the logic here, can you add a dtype argument to _tslib_field_accessor and pass it on?

@darothen (Author):

Yes - went ahead and did that. dtype now passes through the whole chain of functions, but is only used if we end up calling map_blocks on a dask array.

@darothen (Author):

Alrighty, patched and ready for a final look-over! I appreciate the help and patience, @shoyer!

@shoyer shoyer merged commit 8f6a68e into pydata:master Apr 29, 2017
@shoyer commented Apr 29, 2017

Thanks @darothen! This is really nice.

@shoyer shoyer mentioned this pull request May 11, 2017
4 tasks
spencerkclark added a commit to spencerkclark/xarray that referenced this pull request May 11, 2017
jhamman pushed a commit that referenced this pull request May 13, 2018
* Start on implementing and testing NetCDFTimeIndex

* TST Move to using pytest fixtures to structure tests

* Address initial review comments

* Address second round of review comments

* Fix failing python3 tests

* Match test method name to method name

* First attempts at integrating NetCDFTimeIndex into xarray

This is a first pass at the following:
- Resetting the logic for decoding datetimes such that `np.datetime64` objects
are never used for non-standard calendars
- Adding logic to use a `NetCDFTimeIndex` whenever `netcdftime.datetime`
objects are used in an array being cast as an index (so if one reads in a
Dataset from a netCDF file or creates one in Python, which is indexed by a time
coordinate that uses `netcdftime.datetime` objects a NetCDFTimeIndex will be
used rather than a generic object-based index)
- Adding logic to encode `netcdftime.datetime` objects when saving out to
netCDF files

* Cleanup

* Fix DataFrame and Series test failures for NetCDFTimeIndex

These were related to a recent minor upstream change in pandas:
https://github.com/pandas-dev/pandas/blame/master/pandas/core/indexing.py#L1433

* First pass at making NetCDFTimeIndex compatible with #1356

* Address initial review comments

* Restore test_conventions.py

* Fix failing test in test_utils.py

* flake8

* Update for standalone netcdftime

* Address stickler-ci comments

* Skip test_format_netcdftime_datetime if netcdftime not installed

* A start on documentation

* Fix failing zarr tests related to netcdftime encoding

* Simplify test_decode_standard_calendar_single_element_non_ns_range

* Address a couple review comments

* Use else clause in _maybe_cast_to_netcdftimeindex

* Start on adding enable_netcdftimeindex option

* Continue parametrizing tests in test_coding_times.py

* Update time-series.rst for enable_netcdftimeindex option

* Use :py:func: in rst for xarray.set_options

* Add a what's new entry and test that resample raises a TypeError

* Move what's new entry to the version 0.10.3 section

* Add version-dependent pathway for importing netcdftime.datetime

* Make NetCDFTimeIndex and date decoding/encoding compatible with datetime.datetime

* Remove logic to make NetCDFTimeIndex compatible with datetime.datetime

* Documentation edits

* Ensure proper enable_netcdftimeindex option is used under lazy decoding

Prior to this, opening a dataset with enable_netcdftimeindex set to True
and then accessing one of its variables outside the context manager would
lead to it being decoded with the default enable_netcdftimeindex
(which is False).  This makes sure that lazy decoding takes into account
the context under which it was called.

* Add fix and test for concatenating variables with a NetCDFTimeIndex

Previously when concatenating variables indexed by a NetCDFTimeIndex
the index would be wrongly converted to a generic pd.Index

* Further namespace changes due to netcdftime/cftime renaming

* NetCDFTimeIndex -> CFTimeIndex

* Documentation updates

* Only allow use of CFTimeIndex when using the standalone cftime

Also only allow for serialization of cftime.datetime objects when
using the standalone cftime package.

* Fix errant what's new changes

* flake8

* Fix skip logic in test_cftimeindex.py

* Use only_use_cftime_datetimes option in num2date

* Require standalone cftime library for all new functionality

Add tests/fixes for dt accessor with cftime datetimes

* Improve skipping logic in test_cftimeindex.py

* Fix skipping logic in test_cftimeindex.py for when cftime or netcdftime
are not available.  Use existing requires_cftime decorator where possible
(i.e. only on tests that are not parametrized via pytest.mark.parametrize)

* Fix skip logic in Python 3.4 build for test_cftimeindex.py

* Improve error messages for when the standalone cftime is not installed

* Tweak skip logic in test_accessors.py

* flake8

* Address review comments

* Temporarily remove cftime from py27 build environment on windows

* flake8

* Install cftime via pip for Python 2.7 on Windows

* flake8

* Remove unnecessary new lines; simplify _maybe_cast_to_cftimeindex

* Restore test case for #2002 in test_coding_times.py

I must have inadvertently removed it during a merge.

* Tweak dates out of range warning logic slightly to preserve current default

* Address review comments

Successfully merging this pull request may close these issues.

add .dt and .str accessors to DataArray (like pandas.Series)
2 participants