Skip to content

BUG: first() changes datetime64 data #9311

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
chrisbyboston opened this issue Jan 20, 2015 · 19 comments · Fixed by #9345
Closed

BUG: first() changes datetime64 data #9311

chrisbyboston opened this issue Jan 20, 2015 · 19 comments · Fixed by #9345
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Groupby
Milestone

Comments

@chrisbyboston
Copy link

I have a dataframe that contains a datetime64 column that seems to be losing some digits after a groupby.first(). I know that under the hood these are stored at nanosecond precision, but I need them to remain equal, since I'm merging back onto the original frame later on. Is this expected behavior?

import pandas as pd
from pandas import Timestamp

data = [
 {'a': 1,
  'dateCreated': Timestamp('2011-01-20 12:50:28.593448')},
 {'a': 1,
  'dateCreated': Timestamp('2011-01-15 12:50:28.502376')},
 {'a': 1,
  'dateCreated': Timestamp('2011-01-15 12:50:28.472790')},
 {'a': 1,
  'dateCreated': Timestamp('2011-01-15 12:50:28.445286')}]

df = pd.DataFrame(data)

Output is:

In [6]: df
Out[6]:
   a                dateCreated
0  1 2011-01-20 12:50:28.593448
1  1 2011-01-15 12:50:28.502376
2  1 2011-01-15 12:50:28.472790
3  1 2011-01-15 12:50:28.445286

In [7]: df.groupby('a').first()
Out[7]:
                    dateCreated
a
1 2011-01-20 12:50:28.593447936

When I compare the datetime64 in the first row to the datetime64 after the groupby.first(), the two are not equal.

@shoyer
Copy link
Member

shoyer commented Jan 21, 2015

This looks like a bug to me.

I can reproduce this on master.

@chrisbyboston
Copy link
Author

I have isolated it to this line.

https://github.com/pydata/pandas/blob/master/pandas/core/internals.py#L1818

I'm going to see if I can determine if this is a numpy bug or if we just need to pass some additional parameters to numpy.ndarray.astype.

@chrisbyboston
Copy link
Author

Interesting...

numpy casting defaults to "unsafe" for backward compatibility reasons. When I pass in "safe":

result = result.astype('M8[ns]', casting='safe')

TypeError: Cannot cast array from dtype('float64') to dtype('<M8[ns]') according to the rule 'safe'

So numpy is even aware that this can't be done safely.

@chrisbyboston
Copy link
Author

If I leave casting="safe", then all is well, as it raises an error and fails over safely:

def _groupby_function(name, alias, npfunc, numeric_only=True,
                      _convert=False):
    def f(self):
        self._set_selection_from_grouper()
        try:
            return self._cython_agg_general(alias, numeric_only=numeric_only)
        except AssertionError as e:
            raise SpecificationError(str(e))
        except Exception:
            result = self.aggregate(lambda x: npfunc(x, axis=self.axis))
            if _convert:
                result = result.convert_objects()
            return result

    f.__doc__ = "Compute %s of group values" % name
    f.__name__ = name

    return f

Result:

   a                dateCreated
0  1 2011-01-20 12:50:28.593448
1  1 2011-01-15 12:50:28.502376
2  1 2011-01-15 12:50:28.472790
3  1 2011-01-15 12:50:28.445286


                 dateCreated
a
1 2011-01-20 12:50:28.593448

I'll work up a PR and a test... and see if I break all the things.

@shoyer
Copy link
Member

shoyer commented Jan 21, 2015

The real question is why this is being cast to float64 in the first place. We might need to duplicate an internal cython routine so that there is an int64 version that we can safely use without needing any casting.

@chrisbyboston
Copy link
Author

@shoyer I'll take a peek at when that gets converted to float64. I didn't catch that at all, but you're right. It should be able to safely convert an int64.

@chrisbyboston
Copy link
Author

@chrisbyboston
Copy link
Author

Alright, looks like the cython in generated.pyx needs to have functions for group_nth_ for int8, int16, int32, and int64.

Sound right to you @shoyer?

@shoyer
Copy link
Member

shoyer commented Jan 21, 2015

@iwschris That file is generated from pandas/src/generate_code.py, and it looks like integers were explicitly excluded, probably because something did not think they were necessary for groupby functions: https://github.com/pydata/pandas/blob/master/pandas/src/generate_code.py#L2407

@chrisbyboston
Copy link
Author

Cool. I have a branch started that should fix this, and I'll enable integers for now, and see how things go from there.

@jreback
Copy link
Contributor

jreback commented Jan 22, 2015

@iwschris this would be a nice fix

@jreback jreback added this to the 0.16.0 milestone Jan 22, 2015
@jreback jreback added the Dtype Conversions Unexpected or buggy dtype conversions label Jan 22, 2015
@chrisbyboston
Copy link
Author

First Attempt: Don't convert Integers To Floats in Cython

So, this is tightly wound around the idea that there is no integer nan. Once I enabled integers in generate_code.py, the resulting generated.pyx simply converted integers to float64, which is precisely the thing we're trying to solve here.

I modified the cython templates so that integers could remain integers, but obviously the problem of how to represent integer nan still remained. Just to see if it solved the problem at hand, on the integer groupby templates, I used 0 for nan, and the immediate problem was solved. Using 0 for nan is not an acceptable solution though, and I broke about 20 tests, where the integer to float64 conversion has already been assumed.

Suggested Solution: Coerce safely with a fall-back

This prioritizes data safety over performance, but only in a very narrow band where we know that there is an issue. I'm proposing that we change this line in core/internals.py to this:

result = result.astype('M8[ns]', casting='safe')

All the fallback code already exists, and the change breaks no tests.

Thoughts?

@shoyer
Copy link
Member

shoyer commented Jan 22, 2015

@iwschris What does the fallback code look like? I'm guessing that skips Cython entirely?

I would rather be correct than fast, so that would probably be OK, but we should know what the performance impact is first. Are there any major regressions in the vbench performance suite? (see here for instructions)

Our usual approach to handling missing values for datetime64 is to check for values equal to iNaT. For example, take a look at the group_count functions in generate_code.py. This would be the preferred solution.

@chrisbyboston
Copy link
Author

@shoyer that's right, it skips Cython.

I'll do the vbench stuff and post the results here.

As far as iNaT goes, the issue here is that by the time we get into those Cython functions the datetime64 has already been converted to an Integer.

@shoyer
Copy link
Member

shoyer commented Jan 22, 2015

@iwschris it's actually probably fine to do iNaT comparisons even for non-datetime types. You'll notice some of the existing code does exactly that. iNaT is the smallest int64 number:

In [2]: pd.tslib.iNaT
Out[2]: -9223372036854775808

@chrisbyboston
Copy link
Author

Excellent. I'll check it out then. Thanks!

@chrisbyboston
Copy link
Author

Hah... the relevant part is here:

timeseries_custom_bday_cal_decr              |   0.0296 |   0.0197 |   1.5040 |
stat_ops_series_std                          |   0.6110 |   0.3927 |   1.5560 |
strings_title                                |  10.8200 |   6.6644 |   1.6236 |
groupby_last_float64                         |   6.6760 |   2.6947 |   2.4775 |
groupby_multi_count                          | 978.3040 |   5.6086 | 174.4281 |
groupby_last_datetimes                       | 2054.3030 |   9.3013 | 220.8609 |
groupby_first_datetimes                      | 2042.0907 |   8.5020 | 240.1893 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

...point taken. I'll keep working on the Cython :)

@chrisbyboston
Copy link
Author

Cython is fixed, all tests passing, vbench is looking good as well. I'll have the PR in today.

@chrisbyboston
Copy link
Author

Here it is: #9345

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions Groupby
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants