-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
PDEP-4: consistent parsing of datetimes #48621
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Dear @MarcoGorelli, Ahmet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a general question: How would this impact functions like read_csv
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 lgtm.
|
||
## Motivation and Scope | ||
|
||
Pandas date parsing is very flexibible, but arguably too much so - see |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo on flexible
Pandas date parsing is very flexibible, but arguably too much so - see | ||
https://github.com/pandas-dev/pandas/issues/12585 and linked issues for how | ||
much confusion this causes. Pandas can swap format midway, and though this | ||
is document, it regularly breaks users' expectations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
documented
Thanks for taking a look - in read_csv, if parsing a date column fails, then the column is just kept as
So, if Example current behaviour
new behaviour
|
Thank you for this! It looks like a really good solution. I don't know the code well enough to know how feasible this is, but in case of format inference, would it be possible to inform the user of what format has been inferred? At least maybe in case of errors. The user might not realise that there is a |
Cheers! And yeah if there's an error and the user is using
|
if |
True, it would just show Maybe there could be a |
Would it make sense to add
|
I don't think that what is proposed here (failing to parse date strings that don't match the inferred format) is the behavior you would want all the time. It might be a good option to have, controlled by a flag parameter. I say that because the current behavior of The behavior proposed here is already available with the |
@scott-r Disagree with
The harm is silently inferring two different formats, when in reality there is only one. |
@lordgrenville To be clear, I used the phrase "never harmful" because Separate from that, I agree with you and @MarcoGorelli that sometimes you might want to enforce the restriction that all the date strings must adhere to a single format. That can be done today with the |
@scott-r the current behaviour is unexpected to many and causes problems, see the many comments in #12585 and the many linked issues The behaviour of And sure, mixed formats might be a valid use case, but they can still be parsed using
so I'd be -1 on adding another boolean argument
Thanks Jeff! I'll wait to see if others have objections, then I'll change the status to accepted |
- if no ``format`` is specified, ``pandas`` will guess the format from the first non-NaN row | ||
and parse the rest of the input according to that format. Errors will be handled | ||
according to the ``errors`` argument - there will be no silent switching of format; | ||
- ``infer_datetime_format`` will be deprecated; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you were parsing 11-12-10
, would the default be November 12, 2010, or December 11, 2010 or December 10, 2011? I think we should be explicit in this proposal on how dayfirst
and yearfirst
interacts with this particular behavior, especially since the docs say that those 2 parameters are not strict.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont know what infer_datetime_format
currently does but is it not very restrictive, and not always possible, to infer from just the first element?
If that were to be the case I support the comment of outlining how to operate with year first and day first. As a European my colleagues and I hate US mm/dd/yy and I think this might also need outlining some specifics.
The advantage of providing a consistent format input is that better inference could be made from multiple samples and this is very useful for structured data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks both for taking a look!
@Dr-Irv this proposal wouldn't change how dayfirst
and yearfirst
operate. The format will try to be guessed in accordance with these parameters, just like it is on main
- the difference is that with this proposal, the format guessed from the first non-NaN row will be used to parse the rest of the Series
@attack68 in the rare case that it's not possible to guess the format from the first element, then a UserWarning would be raised, check lines 49-55 of this PR
You're very right to bring up mm/dd/yy
👍 - indeed the vast majority of the world doesn't use that format. That's why the current behaviour is so dangerous. For example, suppose your data is in %d-%m-%Y %H:%M
format:
On main
, the first row's date would be parsed as mm-dd-yyyy
, whilst the second one as dd-mm-yyyy
. No error, no warning, this is very easy to miss (and I almost did once in a prod setting 😳 ):
In [1]: pd.to_datetime(['12-01-2000 00:00', '13-01-2000 00:00'])
Out[1]: DatetimeIndex(['2000-12-01', '2000-01-13'], dtype='datetime64[ns]', freq=None)
With this PDEP, you could just check the format of your first row, and you'd know the rest of the Series was parsed in accordance to that. If it can't be, then with errors='raise'
(the default), you'd get an error
ValueError: time data '13-01-2000 00:00' does not match format '%m-%d-%Y %H:%M' (match)
and you'd see that the guessed format wasn't right. You could get around that either by explicitly passing format
, or with dayfirst=True
:
In [2]: pd.to_datetime(['12-01-2000 00:00', '13-01-2000 00:00'], dayfirst=True)
Out[2]: DatetimeIndex(['2000-01-12', '2000-01-13'], dtype='datetime64[ns]', freq=None)
Totally agree on better documenting this, and that inference could be optimised by using multiple samples to guess - first, I just wanted to get agreement that we want to_datetime
to parse consistently
@MarcoGorelli What if we were to change the current function of |
Thanks for taking a look @srotondo I don't think that would fix the confusion - in #12585 and linked issues, people are using And if someone wants to retain the current functionality of guessing the format for each element individually, they could still do that with Furthermore, I'd be -1 on adding yet another boolean argument thus increasing complexity even more |
What if you keep
So people with code that doesn't specify |
@MarcoGorelli I suppose if the current functionality is still available through |
@Dr-Irv sure, that'd be an option, but if there's gonna be a breaking change, I'd suggest we take the chance to simplify - having |
+1 on the big picture. Implementation-wise, is the idea to pretty much replace our usage of dateutil.parser? |
However, it's read as "1st of December, 13th of January". No warning or error is thrown. | ||
|
||
Currently, the only way to ensure consistent parsing is by explicitly passing | ||
``format=``. The argument ``infer_datetime_format`` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor but related, but it would be good to mention that infer_datetime_format
should be strict with respect to being mixed with format
In [1]: pd.to_datetime(["2022-01-01"], infer_datetime_format=True, format="%Y-%m-%d")
Out[1]: DatetimeIndex(['2022-01-01'], dtype='datetime64[ns]', freq=None)
# Format doesn't match the input
In [2]: pd.to_datetime(["2022-01-01"], infer_datetime_format=True, format="%m-%d-%Y")
Out[2]: DatetimeIndex(['2022-01-01'], dtype='datetime64[ns]', freq=None)
i.e. it's not great that format != None
and infer_datetime_format=True
If
|
Thanks
I'm also on board with the suggestion to make |
Thanks all for your feedback, I'll incorporated some points into the document There's been a couple of approvals, so for now I've changed the status to accepted |
There's been 3 explicit approvals from core members, no "requested changes", and this has been open all week, so merging now Thanks all for the discussion, aiming to start working on this soon-ish |
I couldn't review before merged, but happy with the proposal here, really nice improvement. I'm curious what's the suggested workaround mentioned in the PDEP for parsing columns with different formats. Should we mention it in the PDEP? Or the idea is to write it in the logs when implemented? Also, seems like few formats are broken when rendered: https://pandas.pydata.org/pdeps/0004-consistent-to-datetime-parsing.html |
Thanks - yeah it's mentioned at the end of "detailed description"
|
One way I read this question (and not sure if it's the right way) is if a frame has two columns each with a different (but self-consistent) format. They would be parsed individually, is that correct @MarcoGorelli? |
Ah apologies, I'd misunderstood I presume you mean in data = io.StringIO('13-01-2000 00:00,13 Jan 2000\n14-01-2000 00:00,14 Jan 2000\n')
print(pd.read_csv(data, header=None, parse_dates=[0, 1])) would give
Each columns were parsed according to the format inferred from its respective first row. If the first row of the first column had been This wouldn't deviate from the current behaviour |
Actually what I had in mind was answered in the first answer, I don't think my question was very clear. I missed the workaround when reading the PDEP, sorry. |
## Abstract | ||
|
||
The suggestion is that: | ||
- ``to_datetime`` becomes strict and uses the same datetime format to parse all elements in its input. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add a blank line before the list, that's markdown standard (GitHub comments allows it, but it's not allowed in the markdown spec)
|
||
The whatsnew notes read | ||
|
||
> In the next major version release, 2.0, several larger API changes are being considered without a formal deprecation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we don't have style for blockquote
in our website. I created #48758 to fix it.
* pdep-4: initial draft * note about making guess_datetime_format public, out of scope work, dayfirst/yearfirst Co-authored-by: MarcoGorelli <>
@MarcoGorelli was PDEP-4 fully implemented? Should we change its state to Also, I see there was a revision. Would it make sense to add the link to the revision PR to |
Here's my proposal to address #12585