PERF: Parse certain dates in Cython instead of falling back to dateutil.parse #25922
Performance benefits, shown by
|
Codecov Report
@@ Coverage Diff @@
## master #25922 +/- ##
==========================================
- Coverage 91.8% 91.79% -0.01%
==========================================
Files 174 174
Lines 52536 52536
==========================================
- Hits 48231 48227 -4
- Misses 4305 4309 +4
|
Codecov Report
@@ Coverage Diff @@
## master #25922 +/- ##
==========================================
- Coverage 91.99% 91.98% -0.01%
==========================================
Files 175 175
Lines 52383 52383
==========================================
- Hits 48188 48185 -3
- Misses 4195 4198 +3
|
xref #12667 I like the spirit of this PR, but I am a little wary of the creep in scope. pandas does have some date parsing utilities, especially for ISO dates, but I feel that more niche formats should be dispatched to dateutil, i.e. maybe dateutil should inherit this performance improvement. Thoughts @jbrockmendel? |
I agree with @mroeschke about scope creep. Do we have strong reason to believe that performance on this particular set of formats is a major pain point for users? |
As far as I can tell, As for being a pain point - this format is quite widely used in certain real-world data sets, like Mortgage or NYC Taxi Rides, so yes, it is a pain point in some cases (Mortgage dataset that takes dozens of minutes to be loaded because of this format being parsed slowly is the reality I measured). |
dateutil is pure-python, but it also has lots of room for performance improvement. Also IIRC @pganssle has been considering adding C/cython/rust backend(s).
That logic can apply to many formats, most of which we don't want to be responsible for internally.
I think if you pass the |
are there cases where this fails to parse but we still call dateutil.parse? e.g. which are they (add to the doc-string).
certainly ok with parsing common cases, just want to note it.
There's no direct way of passing
So I'm stating that using |
0123245 to 8a05e93
Does this not need additional tests? I can't comment on the scoping stuff, but one of the hardest parts of changing any auto-magical interface is that it's very hard to tell if the things you are doing are going to break something someone is counting on. Even something like this with a fast-path, you should probably be doing a decent amount of testing of the desired behavior and similar, related behavior - hard to even say what that would do. Since you are trying to replace |
Great question, it most likely does need tests, we'll add some. |
I have added tests and fixed a bug. Thanks @pganssle |
Glad that it helped to add tests. Maybe it's a job for a different PR, but I do still think it may be worth adding hypothesis-based tests testing that |
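The kind of agreement check suggested here can be sketched without any third-party dependency. The sketch below is only an illustration: it uses the standard library's `random` module rather than hypothesis, and `fast_parse`, `reference_parse` and `check_agreement` are hypothetical names, not functions from pandas or dateutil. It generates random dates, formats them as MM/DD/YYYY, and asserts that a fast path and a trusted reference parser return the same result.

```python
import random
from datetime import date, datetime, timedelta


def fast_parse(text):
    # Hypothetical stand-in for the fast path under test.
    month, day, year = text.split("/")
    return datetime(int(year), int(month), int(day))


def reference_parse(text):
    # Stand-in for the slower parser whose behaviour must be preserved.
    return datetime.strptime(text, "%m/%d/%Y")


def check_agreement(trials=1000, seed=0):
    # Draw random dates with four-digit years and compare both parsers.
    rng = random.Random(seed)
    start = date(1000, 1, 1)
    span = (date(9999, 12, 31) - start).days
    for _ in range(trials):
        d = start + timedelta(days=rng.randrange(span + 1))
        text = f"{d.month:02d}/{d.day:02d}/{d.year:04d}"
        # Include the offending input in the assertion message.
        assert fast_parse(text) == reference_parse(text), text
    return trials
```

A property-based library such as hypothesis would additionally shrink failing inputs to a minimal example, which is the main reason to prefer it over a hand-rolled loop like this one.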
Note that this change changes the behaviour for dates like pandas/pandas/_libs/tslibs/parsing.pyx Lines 195 to 298 in 4814a28
It never replaces |
@vnlitvin While I think defaulting to "today" was not a good choice in If you are going to make a backwards-incompatible change to the behavior of the library, that should be very prominently documented. This change is only documented as a performance improvement. |
That is because I didn't suspect that |
…eted as a float number, on the other hand 09.2019 can be parsed as date
It turned out that there is no change in behaviour
This pattern is already handled by _parse_delimited_date
…ypothesis_delimited_date
…Y%m%d', '%y%m%d' for test_hypothesis_delimited_date; created _helper_hypothesis_delimited_date func
… -> test_datetime
…ar < 1000 not processing in _parse_delimited_date now
dbf26e3 to 2cd971a
@jreback Done & green |
thanks @vnlitvin and @anmyachev, nice patch! these involved ones can take some time to review (and write, of course). |
This tremendously speeds up parsing of certain date patterns: MM/YYYY, MM/DD/YYYY and DD/MM/YYYY (where separator is one of /, -, and \), as previously they were parsed using dateutil.parse, which is implemented in Python.
git diff upstream/master -u -- "*.py" | flake8 --diff
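To make the idea of a fast path for delimited dates concrete, here is a minimal pure-Python sketch. It is not the Cython code from this PR: the function name `parse_delimited_date`, the `SEPARATORS` set, the choice of day 1 for MM/YYYY input, and the rule of returning None so that a caller can fall back to a general parser are all assumptions made for illustration.

```python
from datetime import datetime
from typing import Optional

# Hypothetical separator set for illustration; the PR's exact set may differ.
SEPARATORS = "/-"


def parse_delimited_date(text: str, dayfirst: bool = False) -> Optional[datetime]:
    """Parse MM/YYYY, MM/DD/YYYY or DD/MM/YYYY; return None if not recognised."""
    for sep in SEPARATORS:
        parts = text.split(sep)
        # Accept only two or three purely numeric ASCII fields.
        if len(parts) in (2, 3) and all(p.isascii() and p.isdigit() for p in parts):
            break
    else:
        return None  # no separator matched: the caller should fall back

    if len(parts) == 2:
        month, year = parts
        day = "1"  # assumption: a missing day becomes the first of the month
        if len(month) > 2 or len(year) != 4:
            return None
    else:
        first, second, year = parts
        if len(first) > 2 or len(second) > 2 or len(year) != 4:
            return None
        day, month = (first, second) if dayfirst else (second, first)

    try:
        return datetime(int(year), int(month), int(day))
    except ValueError:
        return None  # out-of-range values, e.g. month 25


print(parse_delimited_date("04/2019"))
print(parse_delimited_date("25-04-2019", dayfirst=True))
```

Returning None instead of raising keeps the fast path strictly additive: any input it does not recognise can still be handed to the slower general-purpose parser.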