BUG: Fix issue with incorrect groupby handling of NaT #10625
Conversation
test would be nice
I have tried to add a test and update the doc. I am not very sure I have done this correctly :)
I don't know what the next step here is. Am I supposed to do something now, or am I waiting for someone to have a look at what I have done? There is a "This branch has conflicts that must be resolved" message. Can I see what the conflicts are and fix them?
@larvian rebase your branch on upstream master to resolve the conflicts
@MaximilianR Thanks! I will try to figure out how to do that :)
lots of tips here
Also, can you add tests for
I think you can use
Hm... I can't get this rebase to work. It gives conflicts which seem to come back after I fix them :(
We can help walk you through the git problems. Could be easier in the chat room: https://gitter.im/pydata/pandas
Thanks Tom!
Just curious, what's next with this PR?
@@ -2608,6 +2608,8 @@ def _cython_agg_blocks(self, how, numeric_only=True):
    for block in data.blocks:
        values = block._try_operate(block.values)
I'd like to do this differently: move the `._try_operate` call to inside `.aggregate` (it's line 1533). Then you can do the `iNaT` to `np.nan` conversion there.
Ok. Then I assume I would do the `._try_operate` after checking `is_datetime_or_timedelta_dtype`, so as not to lose the dtype. But `._try_operate` on `DatetimeBlock` is just `values.view('i8')`, and the code just after `is_datetime_or_timedelta_dtype` is `values = values.view('int64')`. I assume both are not needed, so I can remove `._try_operate`. Hope this doesn't break something else.
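For reference, the view being discussed can be sketched with numpy alone (a minimal illustration, not the pandas source): reinterpreting the bytes of a `datetime64[ns]` array as `int64` turns `NaT` into the sentinel integer.

```python
import numpy as np

# Reinterpret a datetime64[ns] array's bytes as int64 without copying.
# This is the effect of values.view('i8'); NaT becomes the sentinel
# integer -9223372036854775808 (pandas' tslib.iNaT).
dt = np.array(["2015-07-24", "NaT"], dtype="datetime64[ns]")
as_i8 = dt.view("int64")
print(as_i8[1])  # -9223372036854775808
```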
def test_first_last_max_min_on_time_data(self):
    from datetime import timedelta as td
    DF_dt_test=DataFrame({'dt':[nan,'2015-07-24 10:10','2015-07-25 11:11','2015-07-23 12:12',nan],'td':[nan,td(days=1),td(days=2),td(days=3),nan]})
    DF_dt_test.dt=pd.to_datetime(DF_dt_test.dt)
wrap this line
use df as a variable name (or maybe df_test)
add the issue number as a comment
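The behavior the test guards can also be seen in a small standalone sketch (a hypothetical example, not part of the PR's test suite): with the fix, a missing timestamp no longer poisons the aggregation result.

```python
import numpy as np
import pandas as pd

# With the fix, missing timestamps (NaT) are skipped by the aggregation
# instead of being treated as the huge negative sentinel integer.
df = pd.DataFrame({
    "key": ["a", "a", "a"],
    "dt": pd.to_datetime([np.nan, "2015-07-24 10:10", "2015-07-23 12:12"]),
})
result = df.groupby("key")["dt"].min()
print(result["a"])  # Timestamp('2015-07-23 12:12:00'), not NaT
```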
can you rebase / update
Sure... rebase done. Queued for Travis CI
For groupby, the timestamps get converted to the integer value tslib.iNaT, which is -9223372036854775808. The aggregation is then done using this value, with incorrect results as a consequence. The solution proposed here is to replace this value with np.nan in case it is a datetime or timedelta.
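The bug can be illustrated with a numpy-only sketch (the array below stands in for the i8 view of a datetime column; values and names are illustrative):

```python
import numpy as np

# tslib.iNaT: the int64 sentinel pandas uses for NaT.
iNaT = np.int64(-9223372036854775808)

# A group of timestamps viewed as i8, with one missing value.
group = np.array([1437732600, iNaT, 1437645000], dtype="int64")

# Aggregating the raw integers treats iNaT as a real (very small) value,
# so min() wrongly returns the sentinel:
wrong_min = group.min()          # == iNaT

# Masking the sentinel to nan first, as the patch proposes, skips it:
as_float = group.astype("float64")
as_float[group == iNaT] = np.nan
right_min = np.nanmin(as_float)  # 1437645000.0
```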
Rebased yesterday again and it passed Travis CI.
to avoid conflicts, don't insert the line at the end of the notes, put it somewhere else above. This is why I have left lots of blank space (which gradually gets filled up).
ping when green.
This one is green now
BUG: Fix issue with incorrect groupby handling of NaT
thanks!
this added line has broken cython aggregation. You cannot assign

before this patch:

In [1]: from string import ascii_lowercase
In [2]: np.random.seed(2718281)
In [3]: n = 1 << 21
In [4]: dr = date_range('2015-08-30', periods=n // 10, freq='T')
In [5]: df = DataFrame({
   ...:     '1st': np.random.choice(list(ascii_lowercase), n),
   ...:     '2nd': np.random.randint(0, 5, n),
   ...:     '3rd': np.random.choice(dr, n)})
In [6]: df.loc[np.random.choice(n, n // 10), '3rd'] = np.nan
In [7]: gr = df.groupby(['1st', '2nd'])
In [8]: %timeit gr.count()
The slowest run took 21.22 times longer than the fastest. This could mean that an intermediate result is being cached
1 loops, best of 3: 13.3 ms per loop
In [9]: %timeit gr.count()
100 loops, best of 3: 13.8 ms per loop
In [10]: pd.__version__
Out[10]: '0.16.2+521.g207efc2'

with this patch:

In [8]: %timeit gr.count()
1 loops, best of 3: 144 ms per loop
In [9]: %timeit gr.count()
10 loops, best of 3: 149 ms per loop
In [10]: pd.__version__
Out[10]: '0.16.2+522.g9c2d1a6'
yes, that should be iNaT. odd that the tests didn't pick this up
@jreback it is modifying values which are already
this should already be a float here
ahh, I see: it is raising, being caught, and taking the slow path... ok, thanks for the catch
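The failure mode being described can be reproduced with numpy alone (a sketch; the pandas-internal fallback mechanism is my reading of the discussion above, not code from the PR): assigning nan into an int64 array raises, and catching that error is what pushed count off the fast path.

```python
import numpy as np

# nan is a float and cannot be stored in an int64 array; the assignment raises.
arr = np.array([1, 2, 3], dtype="int64")
fell_back = False
try:
    arr[0] = np.nan          # ValueError: cannot convert float NaN to integer
except ValueError:
    fell_back = True         # pandas caught this and took the slower path
```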
@behzadnouri @jreback Thanks for pointing out the problem.
closes #10590
For groupby, the timestamps get converted to the integer value tslib.iNaT, which is -9223372036854775808. The aggregation is then done using this value, with incorrect results as a consequence. The solution proposed here is to replace this value with np.nan in case it is a datetime64[ns].