BUG: Fix issue with incorrect groupby handling of NaT #10625
Conversation
test would be nice
I have tried to add a test and update the doc. I am not very sure I have done this correctly :)
I don't know what the next step here is. Am I supposed to do something now, or am I waiting for someone to have a look at what I have done? There is a "This branch has conflicts that must be resolved" message. Can I see what the conflicts are and fix them?
@larvian rebase your branch on upstream master to resolve the conflicts
@MaximilianR Thanks! I will try to figure out how to do that :)
lots of tips here
Also, can you add tests for
I think you can use
Hm... I can't get this rebase to work. It gives conflicts which seem to come back after I fix them :(
We can help walk you through the git problems. Could be easier in the chat room: https://gitter.im/pydata/pandas
Thanks Tom!
Just curious, what's next with this PR?
@@ -2608,6 +2608,8 @@ def _cython_agg_blocks(self, how, numeric_only=True):
    for block in data.blocks:
        values = block._try_operate(block.values)
I'd like to do this differently: move the `._try_operate` call to inside `.aggregate` (it's line 1533). Then you can do the `iNaT` to `np.nan` conversion there.
Ok. Then I assume I would do the `._try_operate` after checking `is_datetime_or_timedelta_dtype`, so as not to lose the dtype. But `._try_operate` on `DatetimeBlock` is just `values.view('i8')`, and the code just after `is_datetime_or_timedelta_dtype` is `values = values.view('int64')`. I assume both are not needed, so I can remove `._try_operate`. Hope this doesn't break something else.
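For reference, the view being discussed can be sketched with numpy alone (a minimal illustration, not the pandas source): reinterpreting the bytes of a `datetime64[ns]` array as `int64` turns `NaT` into the sentinel integer.

```python
import numpy as np

# Reinterpret a datetime64[ns] array's bytes as int64 without copying.
# This is the effect of values.view('i8'); NaT becomes the sentinel
# integer -9223372036854775808 (pandas' tslib.iNaT).
dt = np.array(["2015-07-24", "NaT"], dtype="datetime64[ns]")
as_i8 = dt.view("int64")
print(as_i8[1])  # -9223372036854775808
```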
def test_first_last_max_min_on_time_data(self):
    from datetime import timedelta as td
    DF_dt_test=DataFrame({'dt':[nan,'2015-07-24 10:10','2015-07-25 11:11','2015-07-23 12:12',nan],'td':[nan,td(days=1),td(days=2),td(days=3),nan]})
    DF_dt_test.dt=pd.to_datetime(DF_dt_test.dt)
wrap this line
use df as a variable name (or maybe df_test)
add the issue number as a comment
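The behavior the test guards can also be seen in a small standalone sketch (a hypothetical example, not part of the PR's test suite): with the fix, a missing timestamp no longer poisons the aggregation result.

```python
import numpy as np
import pandas as pd

# With the fix, missing timestamps (NaT) are skipped by the aggregation
# instead of being treated as the huge negative sentinel integer.
df = pd.DataFrame({
    "key": ["a", "a", "a"],
    "dt": pd.to_datetime([np.nan, "2015-07-24 10:10", "2015-07-23 12:12"]),
})
result = df.groupby("key")["dt"].min()
print(result["a"])  # Timestamp('2015-07-23 12:12:00'), not NaT
```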
can you rebase / update
Sure... rebase done. Queued for Travis CI
For groupby, the timestamps get converted to the integer value tslib.iNaT, which is -9223372036854775808. The aggregation is then done using this value, with incorrect results as a consequence. The solution proposed here is to replace this value with np.nan in case it is a datetime or timedelta.
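The bug can be illustrated with a numpy-only sketch (the array below stands in for the i8 view of a datetime column; values and names are illustrative):

```python
import numpy as np

# tslib.iNaT: the int64 sentinel pandas uses for NaT.
iNaT = np.int64(-9223372036854775808)

# A group of timestamps viewed as i8, with one missing value.
group = np.array([1437732600, iNaT, 1437645000], dtype="int64")

# Aggregating the raw integers treats iNaT as a real (very small) value,
# so min() wrongly returns the sentinel:
wrong_min = group.min()          # == iNaT

# Masking the sentinel to nan first, as the patch proposes, skips it:
as_float = group.astype("float64")
as_float[group == iNaT] = np.nan
right_min = np.nanmin(as_float)  # 1437645000.0
```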
Rebased yesterday again and it passed Travis CI.
to avoid conflicts, don't insert the line at the end of the notes, put it somewhere else above. This is why I have left lots of blank space (which gradually gets filled up).
ping when green.
This one is green now
BUG: Fix issue with incorrect groupby handling of NaT
thanks!
this added line has broken cython aggregation. You cannot assign

before this patch:

In [1]: from string import ascii_lowercase
In [2]: np.random.seed(2718281)
In [3]: n = 1 << 21
In [4]: dr = date_range('2015-08-30', periods=n // 10, freq='T')
In [5]: df = DataFrame({
   ...:     '1st': np.random.choice(list(ascii_lowercase), n),
   ...:     '2nd': np.random.randint(0, 5, n),
   ...:     '3rd': np.random.choice(dr, n)})
In [6]: df.loc[np.random.choice(n, n // 10), '3rd'] = np.nan
In [7]: gr = df.groupby(['1st', '2nd'])
In [8]: %timeit gr.count()
The slowest run took 21.22 times longer than the fastest. This could mean that an intermediate result is being cached
1 loops, best of 3: 13.3 ms per loop
In [9]: %timeit gr.count()
100 loops, best of 3: 13.8 ms per loop
In [10]: pd.__version__
Out[10]: '0.16.2+521.g207efc2'

with this patch:

In [8]: %timeit gr.count()
1 loops, best of 3: 144 ms per loop
In [9]: %timeit gr.count()
10 loops, best of 3: 149 ms per loop
In [10]: pd.__version__
Out[10]: '0.16.2+522.g9c2d1a6'
yes, that should be iNaT. odd that the tests didn't pick this up
@jreback it is modifying values which are already
this should already be a float here
ahh, I see: it is raising, being caught, and taking the slow path... ok, thanks for the catch
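The failure mode being described can be reproduced with numpy alone (a sketch; the pandas-internal fallback mechanism is my reading of the discussion above, not code from the PR): assigning nan into an int64 array raises, and catching that error is what pushed count off the fast path.

```python
import numpy as np

# nan is a float and cannot be stored in an int64 array; the assignment raises.
arr = np.array([1, 2, 3], dtype="int64")
fell_back = False
try:
    arr[0] = np.nan          # ValueError: cannot convert float NaN to integer
except ValueError:
    fell_back = True         # pandas caught this and took the slower path
```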
@behzadnouri @jreback Thanks for pointing out the problem.
closes #10590
For groupby, the timestamps get converted to the integer value tslib.iNaT, which is -9223372036854775808. The aggregation is then done using this value, with incorrect results as a consequence. The solution proposed here is to replace this value with np.nan in case it is a datetime64[ns].