BUG: first() changes datetime64 data #9311
Comments
This looks like a bug to me. I can reproduce this on master. |
I have isolated it to this line. https://github.com/pydata/pandas/blob/master/pandas/core/internals.py#L1818 I'm going to see if I can determine if this is a numpy bug or if we just need to pass some additional parameters to numpy.ndarray.astype. |
Interesting... numpy casting defaults to "unsafe" for backward compatibility reasons. When I pass in "safe":

    result = result.astype('M8[ns]', casting='safe')
    TypeError: Cannot cast array from dtype('float64') to dtype('<M8[ns]') according to the rule 'safe'

So numpy is even aware that this can't be done safely. |
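The casting behaviour described above can be reproduced outside pandas. The following is a minimal sketch, assuming only NumPy; the variable names are illustrative and the timestamp is the one from the reported frame. It is not the code in pandas/core/internals.py.

```python
import numpy as np

# One of the timestamps from the report, at nanosecond resolution.
original = np.array(["2011-01-20T12:50:28.593448"], dtype="M8[ns]")

# Mimic the round trip discussed in this thread: int64 -> float64 -> M8[ns].
as_float = original.view("int64").astype("float64")

# The default rule is casting="unsafe", so this succeeds silently.
unsafe = as_float.astype("M8[ns]")
changed = bool(unsafe[0] != original[0])

# NumPy itself reports that float64 -> M8[ns] is not a safe cast.
safe_allowed = bool(np.can_cast(np.float64, np.dtype("M8[ns]"), casting="safe"))

# With casting="safe" the same conversion raises instead of altering data.
try:
    as_float.astype("M8[ns]", casting="safe")
    raised = False
except TypeError:
    raised = True

print(changed, safe_allowed, raised)
```

The point of the sketch is only that the unsafe cast goes through without complaint while the safe one refuses, which is what lets the existing fallback path take over.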
If I leave casting="safe", then all is well, as it raises an error and fails over safely:

    def _groupby_function(name, alias, npfunc, numeric_only=True,
                          _convert=False):
        def f(self):
            self._set_selection_from_grouper()
            try:
                return self._cython_agg_general(alias, numeric_only=numeric_only)
            except AssertionError as e:
                raise SpecificationError(str(e))
            except Exception:
                result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
                if _convert:
                    result = result.convert_objects()
                return result
        f.__doc__ = "Compute %s of group values" % name
        f.__name__ = name
        return f

Result:

       a                dateCreated
    0  1 2011-01-20 12:50:28.593448
    1  1 2011-01-15 12:50:28.502376
    2  1 2011-01-15 12:50:28.472790
    3  1 2011-01-15 12:50:28.445286

                     dateCreated
    a
    1 2011-01-20 12:50:28.593448

I'll work up a PR and a test... and see if I break all the things. |
The real question is why this is being cast to float64 in the first place. We might need to duplicate an internal cython routine so that there is an int64 version that we can safely use without needing any casting. |
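To see why the float64 detour matters, here is a small standard-library sketch (not pandas code; the names are illustrative). A nanosecond timestamp from 2011 is a little above 2**60, and at that magnitude adjacent float64 values are 256 apart, so most nanosecond counts cannot survive a trip through a float.

```python
from datetime import datetime, timezone

# Timestamp taken from the report, expressed as integer nanoseconds since the epoch.
stamp = datetime(2011, 1, 20, 12, 50, 28, 593448, tzinfo=timezone.utc)
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
delta = stamp - epoch
ns = (delta.days * 86_400 + delta.seconds) * 1_000_000_000 + delta.microseconds * 1_000

# Python floats are IEEE 754 doubles, the same representation as float64.
round_tripped = int(float(ns))

# Between 2**60 and 2**61 the spacing of doubles is 2**(60 - 52) = 256.
spacing = 2 ** (60 - 52)

print(ns, round_tripped, ns - round_tripped, spacing)
```

An int64 code path avoids the problem entirely because the nanosecond count is never rounded.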
@shoyer I'll take a peek at when that gets converted to float64. I didn't catch that at all, but you're right. It should be able to safely convert an int64. |
It runs an ensure_float on this line: https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L1508 |
Alright, looks like the cython in generated.pyx needs to have functions for group_nth_ for int8, int16, int32, and int64. Sound right to you @shoyer? |
@iwschris That file is generated from |
Cool. I have a branch started that should fix this, and I'll enable integers for now, and see how things go from there. |
@iwschris this would be a nice fix |
First Attempt: Don't convert Integers To Floats in Cython

So, this is tightly wound around the idea that there is no integer nan. Once I enabled integers in generate_code.py, the resulting generated.pyx simply converted integers to float64, which is precisely the thing we're trying to solve here. I modified the cython templates so that integers could remain integers, but obviously the problem of how to represent integer nan still remained. Just to see if it solved the problem at hand, on the integer groupby templates, I used 0 for nan, and the immediate problem was solved. Using 0 for nan is not an acceptable solution though, and I broke about 20 tests, where the integer to float64 conversion has already been assumed.

Suggested Solution: Coerce safely with a fall-back

This prioritizes data safety over performance, but only in a very narrow band where we know that there is an issue. I'm proposing that we change this line in core/internals.py to this:

    result = result.astype('M8[ns]', casting='safe')

All the fallback code already exists, and the change breaks no tests. Thoughts? |
@iwschris What does the fallback code look like? I'm guessing that skips Cython entirely? I would rather be correct than fast, so that would probably be OK, but we should know what the performance impact is first. Are there any major regressions in the vbench performance suite? (see here for instructions) Our usual approach to handling missing values for datetime64 is to check for values equal to |
@shoyer that's right, it skips Cython. I'll do the vbench stuff and post the results here. As far as iNaT goes, the issue here is that by the time we get into those Cython functions the datetime64 has already been converted to an Integer. |
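As a rough illustration of the iNaT idea, the sketch below picks the first non-missing value per group while staying in integer arithmetic. It is a hypothetical pure-Python stand-in, not the Cython routine from generated.pyx; `group_first` and `INAT` are names made up for this example, and the only assumption carried over from pandas is that iNaT is the minimum int64 value.

```python
# Sentinel for a missing datetime64 value: the smallest int64 (what pandas calls iNaT).
INAT = -2**63


def group_first(keys, values, missing=INAT):
    """Return {key: first non-missing value}; a group with no data keeps the sentinel."""
    result = {}
    for key, value in zip(keys, values):
        # Only overwrite while the group has not yet seen a real value.
        if result.get(key, missing) == missing:
            result[key] = value
    return result


keys = [1, 1, 2]
values = [INAT, 1295527828593448000, INAT]
firsts = group_first(keys, values)
print(firsts)
```

Because nothing is ever converted to float, the nanosecond count for group 1 comes back unchanged, and group 2 still reports a missing value through the sentinel.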
@iwschris it's actually probably fine to do
|
Excellent. I'll check it out then. Thanks! |
Hah... the relevant part is here:
...point taken. I'll keep working on the Cython :) |
Cython is fixed, all tests passing, vbench is looking good as well. I'll have the PR in today. |
Here it is: #9345 |
I have a dataframe that contains a datetime64 column that seems to be losing some digits after a groupby.first(). I know that under the hood these are stored at nanosecond precision, but I need them to remain equal, since I'm merging back onto the original frame later on. Is this expected behavior?
Output is:
When I compare the datetime64 in the first row to the datetime64 after the groupby.first(), the two are not equal.