BUG: first() changes datetime64 data #9311

chrisbyboston · 2015-01-20T16:52:32Z

I have a dataframe that contains a datetime64 column that seems to be losing some digits after a groupby.first(). I know that under the hood these are stored at nanosecond precision, but I need them to remain equal, since I'm merging back onto the original frame later on. Is this expected behavior?

import pandas as pd
from pandas import Timestamp

data = [
 {'a': 1,
  'dateCreated': Timestamp('2011-01-20 12:50:28.593448')},
 {'a': 1,
  'dateCreated': Timestamp('2011-01-15 12:50:28.502376')},
 {'a': 1,
  'dateCreated': Timestamp('2011-01-15 12:50:28.472790')},
 {'a': 1,
  'dateCreated': Timestamp('2011-01-15 12:50:28.445286')}]

df = pd.DataFrame(data)

Output is:

In [6]: df
Out[6]:
   a                dateCreated
0  1 2011-01-20 12:50:28.593448
1  1 2011-01-15 12:50:28.502376
2  1 2011-01-15 12:50:28.472790
3  1 2011-01-15 12:50:28.445286

In [7]: df.groupby('a').first()
Out[7]:
                    dateCreated
a
1 2011-01-20 12:50:28.593447936

When I compare the datetime64 in the first row to the datetime64 after the groupby.first(), the two are not equal.

shoyer · 2015-01-21T10:00:55Z

This looks like a bug to me.

I can reproduce this on master.

chrisbyboston · 2015-01-21T13:14:09Z

I have isolated it to this line.

https://github.com/pydata/pandas/blob/master/pandas/core/internals.py#L1818

I'm going to see if I can determine if this is a numpy bug or if we just need to pass some additional parameters to numpy.ndarray.astype.

chrisbyboston · 2015-01-21T16:25:49Z

Interesting...

numpy casting defaults to "unsafe" for backward compatibility reasons. When I pass in "safe":

result = result.astype('M8[ns]', casting='safe')

TypeError: Cannot cast array from dtype('float64') to dtype('<M8[ns]') according to the rule 'safe'

So numpy is even aware that this can't be done safely.

chrisbyboston · 2015-01-21T16:51:51Z

If I leave casting="safe", then all is well, as it raises an error and fails over safely:

def _groupby_function(name, alias, npfunc, numeric_only=True,
                      _convert=False):
    def f(self):
        self._set_selection_from_grouper()
        try:
            return self._cython_agg_general(alias, numeric_only=numeric_only)
        except AssertionError as e:
            raise SpecificationError(str(e))
        except Exception:
            result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
            if _convert:
                result = result.convert_objects()
            return result

    f.__doc__ = "Compute %s of group values" % name
    f.__name__ = name

    return f

Result:

   a                dateCreated
0  1 2011-01-20 12:50:28.593448
1  1 2011-01-15 12:50:28.502376
2  1 2011-01-15 12:50:28.472790
3  1 2011-01-15 12:50:28.445286


                 dateCreated
a
1 2011-01-20 12:50:28.593448

I'll work up a PR and a test... and see if I break all the things.

shoyer · 2015-01-21T19:12:27Z

The real question is why this is being cast to float64 in the first place. We might need to duplicate an internal cython routine so that there is an int64 version that we can safely use without needing any casting.

chrisbyboston · 2015-01-21T19:22:17Z

@shoyer I'll take a peek at when that gets converted to float64. I didn't catch that at all, but you're right. It should be able to safely convert an int64.

chrisbyboston · 2015-01-21T19:39:15Z

It runs an ensure_float on this line: https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L1508

chrisbyboston · 2015-01-21T21:42:20Z

Alright, looks like the cython in generated.pyx needs to have functions for group_nth_ for int8, int16, int32, and int64.

Sound right to you @shoyer?

shoyer · 2015-01-21T22:11:20Z

@iwschris That file is generated from pandas/src/generate_code.py, and it looks like integers were explicitly excluded, probably because something did not think they were necessary for groupby functions: https://github.com/pydata/pandas/blob/master/pandas/src/generate_code.py#L2407

chrisbyboston · 2015-01-21T22:55:24Z

Cool. I have a branch started that should fix this, and I'll enable integers for now, and see how things go from there.

jreback · 2015-01-22T02:12:34Z

@iwschris this would be a nice fix

chrisbyboston · 2015-01-22T12:58:18Z

First Attempt: Don't convert Integers To Floats in Cython

So, this is tightly wound around the idea that there is no integer nan. Once I enabled integers in generate_code.py, the resulting generated.pyx simply converted integers to float64, which is precisely the thing we're trying to solve here.

I modified the cython templates so that integers could remain integers, but obviously the problem of how to represent integer nan still remained. Just to see if it solved the problem at hand, on the integer groupby templates, I used 0 for nan, and the immediate problem was solved. Using 0 for nan is not an acceptable solution though, and I broke about 20 tests, where the integer to float64 conversion has already been assumed.

Suggested Solution: Coerce safely with a fall-back

This prioritizes data safety over performance, but only in a very narrow band where we know that there is an issue. I'm proposing that we change this line in core/internals.py to this:

result = result.astype('M8[ns]', casting='safe')

All the fallback code already exists, and the change breaks no tests.

Thoughts?

shoyer · 2015-01-22T18:25:05Z

@iwschris What does the fallback code look like? I'm guessing that skips Cython entirely?

I would rather be correct than fast, so that would probably be OK, but we should know what the performance impact is first. Are there any major regressions in the vbench performance suite? (see here for instructions)

Our usual approach to handling missing values for datetime64 is to check for values equal to iNaT. For example, take a look at the group_count functions in generate_code.py. This would be the preferred solution.

chrisbyboston · 2015-01-22T18:45:06Z

@shoyer that's right, it skips Cython.

I'll do the vbench stuff and post the results here.

As far as iNaT goes, the issue here is that by the time we get into those Cython functions the datetime64 has already been converted to an Integer.

shoyer · 2015-01-22T18:54:41Z

@iwschris it's actually probably fine to do iNaT comparisons even for non-datetime types. You'll notice some of the existing code does exactly that. iNaT is the smallest int64 number:

In [2]: pd.tslib.iNaT
Out[2]: -9223372036854775808

chrisbyboston · 2015-01-22T19:00:10Z

Excellent. I'll check it out then. Thanks!

chrisbyboston · 2015-01-22T23:48:05Z

Hah... the relevant part is here:

timeseries_custom_bday_cal_decr              |   0.0296 |   0.0197 |   1.5040 |
stat_ops_series_std                          |   0.6110 |   0.3927 |   1.5560 |
strings_title                                |  10.8200 |   6.6644 |   1.6236 |
groupby_last_float64                         |   6.6760 |   2.6947 |   2.4775 |
groupby_multi_count                          | 978.3040 |   5.6086 | 174.4281 |
groupby_last_datetimes                       | 2054.3030 |   9.3013 | 220.8609 |
groupby_first_datetimes                      | 2042.0907 |   8.5020 | 240.1893 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

...point taken. I'll keep working on the Cython :)

chrisbyboston · 2015-01-23T16:22:31Z

Cython is fixed, all tests passing, vbench is looking good as well. I'll have the PR in today.

chrisbyboston · 2015-01-23T16:36:29Z

Here it is: #9345

shoyer added Bug Groupby labels Jan 21, 2015

jreback added this to the 0.16.0 milestone Jan 22, 2015

jreback added the Dtype Conversions Unexpected or buggy dtype conversions label Jan 22, 2015

chrisbyboston mentioned this issue Jan 25, 2015

BUG: Fixes GH9311 groupby on datetime64 #9345

Merged

jreback mentioned this issue Jan 25, 2015

groupby first can return values not in group #9300

Closed

jreback closed this as completed in #9345 Feb 14, 2015

jreback mentioned this issue Mar 3, 2015

DataFrame.groupby() causes loss of precision on large integer values #5260

Closed

jreback mentioned this issue Apr 2, 2015

groupby() changes values of pandas.Timestamp (pandas 0.15.2) #9788

Closed

jreback mentioned this issue Sep 3, 2015

Unexpected behaviour when grouping datetime column containing null-values, SeriesGroupby #10979

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: first() changes datetime64 data #9311

BUG: first() changes datetime64 data #9311

chrisbyboston commented Jan 20, 2015

shoyer commented Jan 21, 2015

chrisbyboston commented Jan 21, 2015

chrisbyboston commented Jan 21, 2015

chrisbyboston commented Jan 21, 2015

shoyer commented Jan 21, 2015

chrisbyboston commented Jan 21, 2015

chrisbyboston commented Jan 21, 2015

chrisbyboston commented Jan 21, 2015

shoyer commented Jan 21, 2015

chrisbyboston commented Jan 21, 2015

jreback commented Jan 22, 2015

chrisbyboston commented Jan 22, 2015

shoyer commented Jan 22, 2015

chrisbyboston commented Jan 22, 2015

shoyer commented Jan 22, 2015

chrisbyboston commented Jan 22, 2015

chrisbyboston commented Jan 22, 2015

chrisbyboston commented Jan 23, 2015

chrisbyboston commented Jan 23, 2015

BUG: first() changes datetime64 data #9311

BUG: first() changes datetime64 data #9311

Comments

chrisbyboston commented Jan 20, 2015

shoyer commented Jan 21, 2015

chrisbyboston commented Jan 21, 2015

chrisbyboston commented Jan 21, 2015

chrisbyboston commented Jan 21, 2015

shoyer commented Jan 21, 2015

chrisbyboston commented Jan 21, 2015

chrisbyboston commented Jan 21, 2015

chrisbyboston commented Jan 21, 2015

shoyer commented Jan 21, 2015

chrisbyboston commented Jan 21, 2015

jreback commented Jan 22, 2015

chrisbyboston commented Jan 22, 2015

First Attempt: Don't convert Integers To Floats in Cython

Suggested Solution: Coerce safely with a fall-back

shoyer commented Jan 22, 2015

chrisbyboston commented Jan 22, 2015

shoyer commented Jan 22, 2015

chrisbyboston commented Jan 22, 2015

chrisbyboston commented Jan 22, 2015

chrisbyboston commented Jan 23, 2015

chrisbyboston commented Jan 23, 2015