Fix concatenating Variables with dtype=datetime64 #134
This is a more direct fix for the bug @akleeman reported pydata#125. Previously, it did not always work to modify Variable.values in-place (as done in Variable.concat), because as_array_or_item could return a copy of an ndarray instead of a view. @akleeman I didn't incorporate your tests for the
@@ -73,7 +87,10 @@ def _as_compatible_data(data):
        # check pd.Index first since it's (currently) an ndarray subclass
        data = PandasIndexAdapter(data)
    elif isinstance(data, np.ndarray):
        data = NumpyArrayAdapter(utils.as_safe_array(data))
        if data.dtype.kind == 'M':
            data = data.astype('datetime64[ns]')
Does this mean any non-index that holds dates will get automatically copied?
Good catch. We should be using np.asarray or set copy=False with astype:
By default, astype always returns a newly allocated array. If this is set to false, and the dtype, order, and subok requirements are satisfied, the input array is returned instead of a copy.
I don't know why I thought astype wouldn't make a copy unless necessary.
Yes, it looks like np.asarray(data, 'datetime64[ns]') works.
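For reference, a minimal sketch of the copy semantics discussed above, using only NumPy. The array contents are made up for illustration; this is not code from the patch.

```python
import numpy as np

# Illustrative data only: two timestamps already stored with ns precision.
data = np.array(['2000-01-01', '2000-01-02'], dtype='datetime64[ns]')

# astype copies by default, even when the dtype already matches.
copied = data.astype('datetime64[ns]')

# With copy=False the input array itself is returned when no cast is needed.
not_copied = data.astype('datetime64[ns]', copy=False)

# np.asarray likewise returns its input when the dtype already matches.
as_array = np.asarray(data, 'datetime64[ns]')

# A different precision still forces a conversion, so a new array is made.
data_us = np.array(['2000-01-01'], dtype='datetime64[us]')
converted = np.asarray(data_us, 'datetime64[ns]')

print(copied is data, not_copied is data, as_array is data, converted is data_us)
```

So arrays that are already datetime64[ns] would not be copied with either alternative; only arrays with another precision pay for the conversion.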
The reorganization does make things cleaner, but the behavior changed relative to #125. In particular, while this patch fixes concatenation with datetime64 times, it doesn't work with datetime.datetime objects:
and not sure if this was broken before or not ...
With regard to datetime.datetime being converted into integers, the issue is that pandas currently does dtype inference when indexing an Index (pandas-dev/pandas#6370). Fortunately the next version of pandas (0.14), due out in a few weeks, stops doing this. I had made some attempts to fix this previously, but they were untested and only sort of worked. I think this latest commit should fix that up. Let me check on that second issue...
The latter is a (newly exposed) bug in Looks like we'll need to add a work-around for now...
This is a work-around for this upstream issue: pandas-dev/pandas#7176
@akleeman I have a work-around up and ready for testing.
Yeah, all fixed. In #125 I went the route of forcing datetimes to be datetime64[ns]. This is probably part of a broader conversation, but doing so might save some future headaches. Of course ... it would also restrict us to nanosecond precision. Basically I feel like we should either force datetimes to be datetime64[ns] or make sure that operations on datetime objects preserve their type. Probably worth getting this in and picking that conversation back up if needed. In which case, could you add tests which make sure variables with datetime objects are still datetime objects after concatenation? If those start getting cast to datetime64[ns] it'll start to get confusing for users.
I agree, we should either ensure datetime64[ns] or ensure that operations on datetime objects preserve dtype. The latest commit should verify this. My main reason for not doing the former is that I thought it would be nice (in theory) to support using plain datetime objects if datetime64[ns] does not have a long enough time range for some users. But for most users, I suspect they would indeed rather have datetime64[ns].
Also worth considering: how should datetime64[us] datetimes be handled? Currently they get cast to [ns], which could get confusing since datetime objects are not cast.
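To put rough numbers on the trade-off, here is a small NumPy-only sketch. The sample dates are invented for illustration, and the snippet says nothing about how the library itself handles these cases.

```python
import datetime
import numpy as np

# datetime64[ns] stores a signed 64-bit count of nanoseconds since 1970,
# so it reaches roughly 292 years on either side of the epoch.
NS_PER_YEAR = 1e9 * 60 * 60 * 24 * 365.25
half_span_years = (2**63 - 1) / NS_PER_YEAR

# The latest instant representable with nanosecond precision.
latest = np.datetime64(2**63 - 1, 'ns')

# Casting microsecond data to nanoseconds keeps in-range values intact.
times_us = np.array(['2000-01-01T00:00:00.000001'], dtype='datetime64[us]')
times_ns = np.asarray(times_us, 'datetime64[ns]')

# Plain datetime objects outside that range stay in an object array.
old = np.array([datetime.datetime(1500, 1, 1)])

print(round(half_span_years, 1), latest, times_ns.dtype, old.dtype)
```

The year 1500 example is the kind of value that could motivate keeping support for plain datetime objects, since it falls outside what datetime64[ns] can represent.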
The reason I cast all datetime64 to datetime64[ns] is because pandas will not let you make an Index of datetime64 objects with anything other than ns precision. If you try to make it an Index with dtype=object you'll actually get an array of datetime.datetime objects:
But I do agree this is not terribly consistent nor fully thought through. And it should certainly be well-documented.
Fix concatenating Variables with dtype=datetime64
This is an alternative to #125 which I think is a little cleaner.
Basically, there was a bug where Variable.values for datetime64 arrays always made a copy of values. This made it impossible to edit variable values in-place. @akleeman would appreciate your thoughts.
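The effect can be reproduced with a toy stand-in. The two classes below are hypothetical and only mimic a values property; they are not the actual Variable implementation.

```python
import numpy as np


class CopyingHolder:
    """Hypothetical container whose .values always returns a copy."""

    def __init__(self, data):
        self._data = data

    @property
    def values(self):
        # astype copies by default, so callers never see the stored array.
        return self._data.astype('datetime64[ns]')


class ViewingHolder:
    """Hypothetical container whose .values avoids copying when possible."""

    def __init__(self, data):
        self._data = data

    @property
    def values(self):
        # np.asarray returns the stored array when the dtype already matches.
        return np.asarray(self._data, 'datetime64[ns]')


new_value = np.datetime64('1999-12-31', 'ns')

copying = CopyingHolder(np.array(['2000-01-01'], dtype='datetime64[ns]'))
copying.values[0] = new_value  # writes into a temporary copy, silently lost

viewing = ViewingHolder(np.array(['2000-01-01'], dtype='datetime64[ns]'))
viewing.values[0] = new_value  # writes into the stored array

print(copying.values[0], viewing.values[0])
```

Only the second assignment is visible afterwards, which is the behavior an in-place concat would need.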