Regr/period range large value/issue 36430 #36535

Merged

nrebena
Contributor

@nrebena nrebena commented Sep 21, 2020

Checklist

Solution

The culprit was the multiplication unix_date * 24 * 3600 * 10**9 / factor. For example:

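import numpy as np

unix_date = 106752
# The product needs more than 63 bits, so it does not fit in a signed 64-bit integer
np.log2(unix_date * 24 * 3600 * 10**9)
# 63.00000011936912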

That probably led to an integer overflow somewhere, which would explain the observed behaviour.

Splitting the multiplication did the trick.
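To make the overflow concrete, here is a minimal pure-Python sketch. It assumes the C-level int64 multiplication wraps around in two's complement (typical in practice, though signed overflow is formally undefined in C). The helper names `wrap_int64` and `day_start_nanos` are hypothetical and only serve this illustration; they are not part of pandas.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
NANOS_IN_DAY = 24 * 3600 * 10**9


def wrap_int64(value: int) -> int:
    """Reduce a Python int to the value a wrapping signed 64-bit integer would hold."""
    return (value - INT64_MIN) % 2**64 + INT64_MIN


def day_start_nanos(unix_date: int) -> int:
    """Nanoseconds from the epoch to the start of unix_date, with int64 wraparound."""
    return wrap_int64(unix_date * NANOS_IN_DAY)


last_safe = day_start_nanos(106751)  # exact product still fits in int64
first_bad = day_start_nanos(106752)  # exact product exceeds INT64_MAX and wraps

print(last_safe > 0, first_bad < 0)
```

Under that wraparound assumption, the product for unix_date = 106752 turns negative, which is consistent with the incorrect values reported in the issue.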

@nrebena nrebena force-pushed the regr/period-range-large-value/issue-36430 branch from 591f209 to 2c3c15d Compare September 21, 2020 22:01
Member

@WillAyd WillAyd left a comment

Thanks for root causing - pretty tricky

@@ -33,9 +33,10 @@ Fixed regressions
- Fixed regression in :class:`IntegerArray` unary plus and minus operations raising a ``TypeError`` (:issue:`36063`)
- Fixed regression in :meth:`Series.__getitem__` incorrectly raising when the input was a tuple (:issue:`35534`)
- Fixed regression in :meth:`Series.__getitem__` incorrectly raising when the input was a frozenset (:issue:`35747`)
- Fixed regression in :meth:`read_excel` with ``engine="odf"`` caused ``UnboundLocalError`` in some cases where cells had nested child nodes (:issue:`36122`,:issue:`35802`)
- Fixed regression in :class:`DataFrame` and :class:`Series` comparisons between numeric arrays and strings (:issue:`35700`,:issue:`36377`)
- Fixed regression in :meth:`read_excel` with ``engine="odf"`` caused ``UnboundLocalError`` in some cases where cells had nested child nodes (:issue:`36122`, :issue:`35802`)
Member

Can you revert this? Minor nit but it's confusing to include here; can just do a separate PR to clean whatsnew

Contributor Author

Sure thing.

@@ -886,7 +886,10 @@ cdef int64_t get_time_nanos(int freq, int64_t unix_date, int64_t ordinal) nogil:
# We must have freq == FR_HR
factor = 10**9 * 3600

sub = ordinal - unix_date * 24 * 3600 * 10**9 / factor
# Fix issue #36430
nanos_in_day = 24 * 3600 * 10**9
Member

Can you add the appropriate suffixes to the constants here? This seems suspect that it would make a difference at all; wonder if the suffixes alone would fix

Contributor Author

Do you mean declaring this way: cdef const nanos_in_day = 24 * 3600 * 10**9 ?
This does not compile.

@WillAyd WillAyd added the Period Period data type label Sep 21, 2020
@jreback jreback added the Regression Functionality that used to work in a prior pandas version label Sep 21, 2020
@jreback
Contributor

jreback commented Sep 21, 2020

Looks good. Ping when @WillAyd's comments are addressed and CI is green.

@dsaxton dsaxton added this to the 1.1.3 milestone Sep 22, 2020
@@ -886,7 +886,10 @@ cdef int64_t get_time_nanos(int freq, int64_t unix_date, int64_t ordinal) nogil:
# We must have freq == FR_HR
factor = 10**9 * 3600

sub = ordinal - unix_date * 24 * 3600 * 10**9 / factor
Member

so the trouble is that there is an overflow going on somewhere in this expression?

Member

Yea I am fairly certain that 24 * 3600 * 10**9 will overflow - these are likely interpreted by the compiler to just be of type int, but that multiplication could very well exceed the limits of an int type. Adding the ULL suffix I think would be ideal

More details on how decimal literals are assigned types here:
https://stackoverflow.com/a/41407498/621736

Contributor Author

Using uint64_t instead of int64_t will work for the given example, but will then fail with an integer overflow for date ranges earlier than the epoch, so we must stick with a signed integer here.
As for how this change works: it just changes the order of the operations so that unix_date is not multiplied by 24 * 3600 * 10**9, but by 24 * 3600 * 10**9 / factor, which is smaller and does not result in an integer overflow (except for values in the really far future for the use case described in the issue, after the year 2*10**15).
So the real fix here is maybe just to add parentheses in the right place; see the new commit shortly.
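The reordering described above can be sketched in pure Python for the hourly case (factor = 10**9 * 3600). This is an illustration under stated assumptions, not the merged Cython code: int64 arithmetic is simulated with two's-complement wraparound, division truncates toward zero as in C, and `wrap_int64`, `c_div`, `sub_old`, and `sub_new` are hypothetical names.

```python
INT64_MIN = -(2**63)
NANOS_IN_DAY = 24 * 3600 * 10**9
FACTOR = 10**9 * 3600  # nanoseconds per hour, as in the freq == FR_HR branch


def wrap_int64(value: int) -> int:
    """Simulate two's-complement wraparound of a signed 64-bit integer."""
    return (value - INT64_MIN) % 2**64 + INT64_MIN


def c_div(a: int, b: int) -> int:
    """Integer division that truncates toward zero, as C does."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def sub_old(ordinal: int, unix_date: int) -> int:
    # Old order: multiply up to nanoseconds first, then divide by factor.
    product = wrap_int64(unix_date * NANOS_IN_DAY)
    return wrap_int64(ordinal - c_div(product, FACTOR))


def sub_new(ordinal: int, unix_date: int) -> int:
    # New order: divide first, so unix_date is only multiplied by 24.
    per_day = c_div(NANOS_IN_DAY, FACTOR)
    return wrap_int64(ordinal - wrap_int64(unix_date * per_day))


hour = 5
for unix_date in (-1, 106751, 106752):
    ordinal = unix_date * 24 + hour  # hourly ordinal for 05:00 on that day
    print(unix_date, sub_old(ordinal, unix_date), sub_new(ordinal, unix_date))
```

In this simulation both orderings agree before the threshold and for a pre-epoch date, while only the reordered version still recovers the hour at unix_date = 106752.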

Contributor

In [64]: 24 * 3600 * 10**9
Out[64]: 86400000000000

In [65]: np.iinfo(np.int64).max
Out[65]: 9223372036854775807

This is a fairly standard number. I agree that it could overflow if multiplied by a large number, but it is OK here.
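As a rough check of how large the multiplier can get before the product leaves the int64 range, the following sketch divides the int64 maximum by the number of nanoseconds in a day. The variable names are illustrative only.

```python
from datetime import date, timedelta

INT64_MAX = 2**63 - 1
NANOS_IN_DAY = 24 * 3600 * 10**9

# Largest day count whose nanosecond value still fits in a signed 64-bit integer
max_safe_days = INT64_MAX // NANOS_IN_DAY
last_safe_date = date(1970, 1, 1) + timedelta(days=max_safe_days)

print(max_safe_days)   # 106751
print(last_safe_date)  # 2262-04-11
```

So the constant alone is far from the limit, but any unix_date past 106751 (dates after 2262-04-11) pushes the full product beyond int64, which matches the unix_date = 106752 case in the PR description.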

@nrebena nrebena force-pushed the regr/period-range-large-value/issue-36430 branch from 2c3c15d to f0e1e27 Compare September 22, 2020 19:41
@nrebena
Contributor Author

nrebena commented Sep 22, 2020

@jreback @WillAyd green and comments addressed, though I am not sure about the const comment.

I also explained in my reply how the fix works, and why we should not use unsigned long long here.

@jreback
Contributor

jreback commented Sep 22, 2020

@WillAyd anything further?

@WillAyd WillAyd merged commit 3ee6242 into pandas-dev:master Sep 22, 2020
@WillAyd
Member

WillAyd commented Sep 22, 2020

Great thanks @nrebena - nice PR!

@simonjayhawkins
Member

@meeseeksdev backport 1.1.x

Labels
Period Period data type Regression Functionality that used to work in a prior pandas version
Development

Successfully merging this pull request may close these issues.

REGR: period_range giving incorrect values for large datetimes
6 participants