BUG: Fix Timestamp type checks to work with subclassed datetime (#25851) #25853

Merged on Apr 5, 2019 (31 commits)

Conversation

ArtificialQualia
Contributor

@ArtificialQualia ArtificialQualia commented Mar 23, 2019

PyDateTime_CheckExact is being used as a proxy for isinstance(obj, Timestamp) checks for performance reasons. However, if a subclass of the standard library's datetime is being used, these checks are not sufficient to determine whether an object is a Timestamp.

As discussed in #25746, any solution will be less performant than PyDateTime_CheckExact. The best solution found was to use isinstance(obj, _Timestamp), and that is what I implemented here.
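To illustrate the failure mode, here is a minimal pure-Python sketch. `FakeTimestamp` and `CustomDatetime` are hypothetical stand-ins (not pandas classes), and `type(obj) is datetime` is only an approximate Python-level analogue of the C-API `PyDateTime_CheckExact`. The sketch also assumes the old proxy treated any datetime that is not exactly `datetime` as a Timestamp.

```python
from datetime import datetime


class FakeTimestamp(datetime):
    """Hypothetical stand-in for pandas' _Timestamp (a datetime subclass)."""


class CustomDatetime(datetime):
    """Hypothetical user-defined datetime subclass, unrelated to pandas."""


def is_timestamp_by_exact_check(obj):
    # Assumed shape of the old proxy: a datetime that is not *exactly*
    # datetime is taken to be a Timestamp.
    return isinstance(obj, datetime) and type(obj) is not datetime


def is_timestamp_by_isinstance(obj):
    # The approach described in this PR: test against the Timestamp base class.
    return isinstance(obj, FakeTimestamp)


plain = datetime(2019, 3, 23)
custom = CustomDatetime(2019, 3, 23)
stamp = FakeTimestamp(2019, 3, 23)

for obj in (plain, custom, stamp):
    print(type(obj).__name__,
          is_timestamp_by_exact_check(obj),
          is_timestamp_by_isinstance(obj))
```

The exact-type proxy misclassifies `CustomDatetime` as a Timestamp, while the `isinstance` check does not.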

Of course, a few additional changes were necessary to be able to use that check.

Since asv isn't well suited to Cython code, I performance-tested each modified function by calling it 10,000 times inside a wrapper function and timing that wrapper with %timeit. The results are below:

                       ----------- PR ---------         -------- Master -------
                        datetime     timestamp           datetime    timestamp
array_to_datetime        34.3 ms       34.5 ms            34.2 ms      34.5 ms
localize_pydatetime        15 ms       67.7 ms            15.2 ms      57.5 ms
normalize_date           12.7 ms         91 ms            12.8 ms      78.3 ms
Timedelta addition       1.05 ms         23 ms            1.04 ms      22.9 ms
Timestamp constructor    11.1 ms       9.85 ms            10.9 ms      9.79 ms
tz_localize                  N/A       63.7 ms                N/A      53.5 ms
normalize                    N/A       90.6 ms                N/A      78.5 ms

As you can see, some functions are slower when called with a Timestamp, while others are unaffected. Where there is an impact, the PR is roughly 15-20% slower than master.
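For reference, the relative slowdown can be recomputed from the Timestamp columns of the table. This is just arithmetic on the reported numbers; the helper name `slowdown_pct` is made up for this sketch.

```python
# Timings (ms) for Timestamp inputs, copied from the table: (PR, master).
timings_ms = {
    "localize_pydatetime": (67.7, 57.5),
    "normalize_date": (91.0, 78.3),
    "tz_localize": (63.7, 53.5),
    "normalize": (90.6, 78.5),
}


def slowdown_pct(pr_ms, master_ms):
    """Relative slowdown of the PR versus master, in percent."""
    return (pr_ms - master_ms) / master_ms * 100.0


slowdowns = {name: slowdown_pct(pr, master)
             for name, (pr, master) in timings_ms.items()}

for name, pct in slowdowns.items():
    print(f"{name:<22}{pct:5.1f}% slower")
```

Only the four rows that differ noticeably between PR and master are included; the remaining rows are within measurement noise of each other.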

Note that this fix also (potentially) fixes #25734, but if we want a separate test case for that issue I could modify my existing PR for that.

@pep8speaks

pep8speaks commented Mar 23, 2019

Hello @ArtificialQualia! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-04-05 15:33:46 UTC

@codecov

codecov bot commented Mar 23, 2019

Codecov Report

Merging #25853 into master will decrease coverage by 49.63%.
The diff coverage is 100%.

@@             Coverage Diff             @@
##           master   #25853       +/-   ##
===========================================
- Coverage   91.45%   41.81%   -49.64%     
===========================================
  Files         172      172               
  Lines       52892    52892               
===========================================
- Hits        48373    22119    -26254     
- Misses       4519    30773    +26254
Flag         Coverage Δ
#multiple    ?
#single      41.81% <100%> (ø) ⬆️

Impacted Files                        Coverage Δ
pandas/_libs/tslibs/__init__.py       100% <100%> (ø) ⬆️
pandas/io/formats/latex.py            0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py        0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py            0% <0%> (-100%) ⬇️
pandas/tseries/converter.py           0% <0%> (-100%) ⬇️
pandas/io/formats/html.py             0% <0%> (-99.36%) ⬇️
pandas/core/groupby/categorical.py    0% <0%> (-95.46%) ⬇️
pandas/io/sas/sas7bdat.py             0% <0%> (-91.17%) ⬇️
pandas/io/sas/sas_xport.py            0% <0%> (-90.15%) ⬇️
pandas/core/tools/numeric.py          10.44% <0%> (-89.56%) ⬇️
... and 130 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6e0f9a9...4447746.

@codecov

codecov bot commented Mar 23, 2019

Codecov Report

Merging #25853 into master will decrease coverage by 0.12%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #25853      +/-   ##
==========================================
- Coverage   40.85%   40.72%   -0.13%     
==========================================
  Files         175      175              
  Lines       52552    52551       -1     
==========================================
- Hits        21468    21401      -67     
- Misses      31084    31150      +66
Flag       Coverage Δ
#single    40.72% <100%> (-0.13%) ⬇️

Impacted Files                        Coverage Δ
pandas/core/arrays/datetimelike.py    42.73% <100%> (+0.1%) ⬆️
pandas/io/gbq.py                      25% <0%> (-50%) ⬇️
pandas/io/formats/format.py           30.44% <0%> (-4.14%) ⬇️
pandas/io/formats/csvs.py             67.06% <0%> (-1.2%) ⬇️
pandas/core/internals/blocks.py       51.2% <0%> (-1.11%) ⬇️
pandas/util/testing.py                48.26% <0%> (-1.06%) ⬇️
pandas/core/dtypes/cast.py            49% <0%> (-0.34%) ⬇️
pandas/core/series.py                 45.66% <0%> (-0.21%) ⬇️
pandas/core/frame.py                  34.51% <0%> (-0.18%) ⬇️
pandas/core/arrays/datetimes.py       65.93% <0%> (-0.16%) ⬇️
... and 3 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0b03130...ded5e69.

@gfyoung added the Datetime and Internals labels Mar 25, 2019
@ArtificialQualia
Contributor Author

The comments below identify areas where _Timestamp has changed to avoid circular imports.

@ArtificialQualia
Contributor Author

@jbrockmendel separated out _Timestamp as you suggested. Obviously had to make a number of other changes to support that, let me know if they are acceptable, or what other changes you'd like.

Didn't mean to make most of those comments 'reviews', but they do need to be reviewed anyways I suppose!

Helper to create a Timedelta so that the
Timedelta class doesn't have to be imported elsewhere
"""
return Timedelta(*args, **kwargs)
Member

I'd rather just have the runtime import in Timedelta.resolution (I think the only place this is used). Or for that matter just move resolution from _Timestamp to Timestamp.

Contributor Author

It's used in two places in _Timestamp: _Timestamp.__sub__ and _Timestamp.resolution. If we want to go down the runtime import path, we could either add the Timedelta import in both functions in _Timestamp, or add a _Timestamp import into timedeltas._binary_op_method_timedeltalike.

As for moving the _Timestamp functions to Timestamp, _Timestamp.resolution can be moved with no issues. However, _Timestamp.__sub__ cannot. Not sure where the problem is there, but if you want I can try to figure that one out.

Contributor

Right, that's what I mean: it doesn't need to be in pandas/_libs/tslibs/__init__.py, rather just import it directly where needed. It's OK where it's living.
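For readers unfamiliar with the "runtime import" option discussed in this thread, a minimal sketch of the general pattern follows. The stdlib `timedelta` is used purely as a stand-in for the pandas `Timedelta` import, and the function body is illustrative rather than the actual implementation.

```python
def resolution():
    # Deferred (runtime) import: the import statement runs when the function
    # is called rather than when this module is first loaded, so loading this
    # module does not require the other module to be importable yet.
    from datetime import timedelta  # stand-in for the Timedelta import

    return timedelta(microseconds=1)  # illustrative value only


print(resolution())
```

The trade-off is a small per-call cost for the import lookup, in exchange for not needing a helper re-exported through a package-level `__init__.py`.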

elif PyDelta_Check(other):
    nanos = (other.days * 24 * 60 * 60 * 1000000 +
             other.seconds * 1000000 +
             other.microseconds) * 1000
Contributor Author

Ugh, this comment got dropped again due to 'review'. I've ensured this was the last missing comment I had.

As the comment above implies, this logic was copied from delta_to_nanoseconds, and it has also been cleaned up by removing a few unnecessary ifs.

delta_to_nanoseconds is a very independent function though, and if we want to move that to a separate file that would work too.
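The conversion in the snippet above can be tried out in plain Python. This is a sketch for illustration only: `timedelta_to_nanos` is a hypothetical name, and it covers only the `datetime.timedelta` branch shown in the diff.

```python
from datetime import timedelta


def timedelta_to_nanos(delta: timedelta) -> int:
    """Convert a datetime.timedelta to an integer number of nanoseconds."""
    # timedelta stores days, seconds and microseconds separately; fold them
    # into microseconds first, then scale to nanoseconds.
    total_micros = (delta.days * 24 * 60 * 60 * 1_000_000
                    + delta.seconds * 1_000_000
                    + delta.microseconds)
    return total_micros * 1000


print(timedelta_to_nanos(timedelta(days=1, seconds=1, microseconds=1)))
print(timedelta_to_nanos(timedelta(microseconds=-1)))
```

Negative durations work because `timedelta` normalizes them to a negative `days` value with non-negative `seconds` and `microseconds`, and the sum recombines them correctly.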

@jreback
Contributor

jreback commented Apr 5, 2019

@ArtificialQualia can you merge master and rename _timestamp.pyx -> c_timestamp.pyx, otherwise lgtm.

@ArtificialQualia
Contributor Author

Master merged and file renamed.

The build seems to be failing because of Google's cloud API? Not sure what that is about.

I think this merge might want to wait on #25938 so that function can be imported and used rather than duplicating logic in _Timestamp.

@jreback
Contributor

jreback commented Apr 5, 2019

@ArtificialQualia yeah don't worry about the gbq failure.

OK, will have you rebase after #25938.

@jreback jreback added this to the 0.25.0 milestone Apr 5, 2019
@jreback
Contributor

jreback commented Apr 5, 2019

@ArtificialQualia if you'd merge master; ping on green.

@ArtificialQualia
Contributor Author

@jreback merged master and green

@@ -8,3 +8,4 @@
from .timestamps import Timestamp
from .timedeltas import delta_to_nanoseconds, ints_to_pytimedelta, Timedelta
from .tzconversion import tz_convert_single
from .c_timestamp import maybe_integer_op_deprecated # isort:skip
Contributor

do we really need this function here? this should not be exposed to the rest of the codebase (or directly imported if it actually is needed)

Contributor

maybe just expose it in timestamp.pyx itself to make it easier to import

Contributor Author

It was exposed previously, and is only used in pandas/core/arrays/datetimelike.py. _Timestamp itself makes use of it, so it has to be in c_timestamp.pyx otherwise we would have a circular import.

I can change the import in datetimelike.py to directly import it if you want.

Contributor

Sorry, made the comment in the wrong place. Yes, just import it directly in datetimelike.py and don't put it in tslibs/__init__.py.

int64_t value, nanosecond
object freq
list _date_attributes
cpdef bint _get_start_end_field(self, str field)
Contributor

future PR we should rename the 'private' functions to public if they are actually exposed (the _ leading ones)

@jreback
Contributor

jreback commented Apr 5, 2019

lgtm. ping on green!

@ArtificialQualia
Contributor Author

@jreback green

@jreback jreback merged commit 181f972 into pandas-dev:master Apr 5, 2019
@jreback
Contributor

jreback commented Apr 5, 2019

thanks @ArtificialQualia nice patch! keep em coming!

@mroeschke
Member

mroeschke commented Apr 5, 2019

Still a -1 to backport this to 0.24.x @jreback? My coworkers could use this patch to upgrade to 0.24 as the regression was introduced recently.

Labels: Datetime, Internals
Development

Successfully merging this pull request may close these issues.

Fix PyDateTime_CheckExact checks as proxy for Timestamp checks in cython files
6 participants