PDEP-4: consistent parsing of datetimes #48621

MarcoGorelli · 2022-09-18T13:36:32Z

Here's my proposal to address #12585

ahmetanildindar · 2022-09-18T15:22:32Z

Dear @MarcoGorelli,
I have read your commit and it seems ok to me.

Ahmet

phofl

Just a general question: How would this impact functions like read_csv?

jreback

+1 lgtm.

jreback · 2022-09-18T15:39:28Z

web/pandas/pdeps/0004-consistent-to-datetime-parsing.md

+
+## Motivation and Scope
+
+Pandas date parsing is very flexibible, but arguably too much so - see


typo on flexible

jreback · 2022-09-18T15:39:48Z

web/pandas/pdeps/0004-consistent-to-datetime-parsing.md

+Pandas date parsing is very flexibible, but arguably too much so - see
+https://github.com/pandas-dev/pandas/issues/12585 and linked issues for how
+much confusion this causes. Pandas can swap format midway, and though this
+is document, it regularly breaks users' expectations.


MarcoGorelli · 2022-09-18T15:50:12Z

Just a general question: How would this impact functions like read_csv?

Thanks for taking a look - in read_csv, if parsing a date column fails, then the column is just kept as object

If a column or index cannot be represented as an array of datetimes, say because of an unparsable value or a mixture of timezones, the column or index will be returned unaltered as an object data type

So, if read_csv would previously have parsed a column whilst switching format midway, with this PDEP it wouldn't parse it and would return it unaltered as object data type

Example

current behaviour
Swaps format midway

In [2]: pd.read_csv(io.StringIO('12-01-2000 00:00\n13-01-2000 00:00'), header=None, parse_dates=[0])[0]
Out[2]:
0   2000-12-01
1   2000-01-13
Name: 0, dtype: datetime64[ns]

new behaviour
Can't parse according to first row, so returns column unaltered as object

In [2]: pd.read_csv(io.StringIO('12-01-2000 00:00\n13-01-2000 00:00'), header=None, parse_dates=[0])[0]
Out[2]:
0    12-01-2000 00:00
1    13-01-2000 00:00
Name: 0, dtype: object

lordgrenville · 2022-09-18T18:13:37Z

Thank you for this! It looks like a really good solution.

I don't know the code well enough to know how feasible this is, but in case of format inference, would it be possible to inform the user of what format has been inferred? At least maybe in case of errors. The user might not realise that there is a ['12-01-2000 00:00:00', '13-01-2000 00:00:00'] lurking somewhere in the data.

MarcoGorelli · 2022-09-18T18:32:31Z

Cheers! And yeah if there's an error and the user is using errors='coerce' (the default), then pd.to_datetime(['12-01-2000 00:00:00', '13-01-2000 00:00:00']) would give

ValueError: time data '13-01-2000 00:00:00' does not match format '%m-%d-%Y %H:%M:%S' (match)

jreback · 2022-09-18T18:41:11Z

if
errors=coerce then this wouldn't raise

MarcoGorelli · 2022-09-18T18:55:55Z

True, it would just show DatetimeIndex(['2000-12-01', 'NaT'], dtype='datetime64[ns]', freq=None)

Maybe there could be a verbose argument which prints out the inferred format, but I'd suggest we keep that to a separate discussion. For now, I think that if people just call .head() on the output, they can tell if the format has been inferred correctly

rhshadrach · 2022-09-18T21:33:32Z

Maybe there could be a verbose argument which prints out the inferred format, but I'd suggest we keep that to a separate discussion. For now, I think that if people just call .head() on the output, they can tell if the format has been inferred correctly

Would it make sense to add guess_datetime_format to the public API instead (possibly renamed)?

from pandas._libs.tslibs.parsing import guess_datetime_format

print(guess_datetime_format('12-01-2000 00:00:00'))

# %m-%d-%Y %H:%M:%S

scott-r · 2022-09-18T23:43:35Z

I don't think that what is proposed here (failing to parse date strings that don't match the inferred format) is the behavior you would want all the time. It might be a good option to have, controlled by a flag parameter.

I say that because the current behavior of infer_datetime_format is never harmful. It can speed up the parsing of a long list of dates, but it will never cause errors from date strings that could have been parsed correctly without this option. In real-world cases where the date format may not be 100% consistent, I think you want the parsing to succeed whenever it can. If it's really true that infer_datetime_format is purely a performance boost with no behavior change (and this is true as far as I know), maybe it should be enabled all the time (i.e. remove the option) or at least make it true by default.

The behavior proposed here is already available with the format option, but only if you already know what the format is. I can see where it would be valuable to be able to strictly enforce an unknown (inferred) format. So perhaps this behavior could be enabled with a flag such as infer_strict_format (default false). I can see the potential for confusion with the existing infer_datetime_format but that problem could be ameliorated, as suggested above, by removing the existing flag.

lordgrenville · 2022-09-19T06:43:20Z

@scott-r Disagree with

...the current behavior of infer_datetime_format is never harmful

The harm is silently inferring two different formats, when in reality there is only one.

scott-r · 2022-09-19T08:28:52Z

@scott-r Disagree with

...the current behavior of infer_datetime_format is never harmful

The harm is silently inferring two different formats, when in reality there is only one.

@lordgrenville To be clear, I used the phrase "never harmful" because infer_datetime_format controls a performance optimization that will not produce any more, or fewer, parsing errors than there would be if it were not used. Therefore, it's not clear to me why it even needs to exist as an option; it should be safe to enable it at all times.

Separate from that, I agree with you and @MarcoGorelli that sometimes you might want to enforce the restriction that all the date strings must adhere to a single format. That can be done today with the format option (when the single format is known) or by a new option (which I called infer_strict_format) that would enable the behavior @MarcoGorelli is proposing when the single format is not known at the outset. My only quibble is that there are valid use cases for both the current and proposed behavior; I don't think it's safe to enable the new behavior unconditionally.

MarcoGorelli · 2022-09-19T08:55:37Z

@scott-r the current behaviour is unexpected to many and causes problems, see the many comments in #12585 and the many linked issues

The behaviour of infer_datetime_format is also unexpected to many, see here, here and here, and many more

And sure, mixed formats might be a valid use case, but they can still be parsed using apply (where at least it's clear that each format will be guessed individually), e.g.

pd.Series(['12-01-2000 00:00:00', '13 January 2000']).apply(pd.to_datetime)

so I'd be -1 on adding another boolean argument

+1 lgtm.

Thanks Jeff! I'll wait to see if others have objections, then I'll change the status to accepted

Dr-Irv · 2022-09-19T14:48:09Z

web/pandas/pdeps/0004-consistent-to-datetime-parsing.md

+- if no ``format`` is specified, ``pandas`` will guess the format from the first non-NaN row
+  and parse the rest of the input according to that format. Errors will be handled
+  according to the ``errors`` argument - there will be no silent switching of format;
+- ``infer_datetime_format`` will be deprecated;


If you were parsing 11-12-10, would the default be November 12, 2010, or December 11, 2010 or December 10, 2011? I think we should be explicit in this proposal on how dayfirst and yearfirst interacts with this particular behavior, especially since the docs say that those 2 parameters are not strict.

I dont know what infer_datetime_format currently does but is it not very restrictive, and not always possible, to infer from just the first element?

If that were to be the case I support the comment of outlining how to operate with year first and day first. As a European my colleagues and I hate US mm/dd/yy and I think this might also need outlining some specifics.

The advantage of providing a consistent format input is that better inference could be made from multiple samples and this is very useful for structured data.

Thanks both for taking a look!

@Dr-Irv this proposal wouldn't change how dayfirst and yearfirst operate. The format will try to be guessed in accordance with these parameters, just like it is on main - the difference is that with this proposal, the format guessed from the first non-NaN row will be used to parse the rest of the Series

@attack68 in the rare case that it's not possible to guess the format from the first element, then a UserWarning would be raised, check lines 49-55 of this PR

You're very right to bring up mm/dd/yy 👍 - indeed the vast majority of the world doesn't use that format. That's why the current behaviour is so dangerous. For example, suppose your data is in %d-%m-%Y %H:%M format:

On main, the first row's date would be parsed as mm-dd-yyyy, whilst the second one as dd-mm-yyyy. No error, no warning, this is very easy to miss (and I almost did once in a prod setting 😳 ):

In [1]: pd.to_datetime(['12-01-2000 00:00', '13-01-2000 00:00']) Out[1]: DatetimeIndex(['2000-12-01', '2000-01-13'], dtype='datetime64[ns]', freq=None)

With this PDEP, you could just check the format of your first row, and you'd know the rest of the Series was parsed in accordance to that. If it can't be, then with errors='raise' (the default), you'd get an error

ValueError: time data '13-01-2000 00:00' does not match format '%m-%d-%Y %H:%M' (match)

and you'd see that the guessed format wasn't right. You could get around that either by explicitly passing format, or with dayfirst=True:

In [2]: pd.to_datetime(['12-01-2000 00:00', '13-01-2000 00:00'], dayfirst=True) Out[2]: DatetimeIndex(['2000-01-12', '2000-01-13'], dtype='datetime64[ns]', freq=None)

Totally agree on better documenting this, and that inference could be optimised by using multiple samples to guess - first, I just wanted to get agreement that we want to_datetime to parse consistently

srotondo · 2022-09-19T15:44:16Z

The behaviour of infer_datetime_format is also unexpected to many, see here, here and here, and many more

@MarcoGorelli What if we were to change the current function of infer_datetime_format to what you are proposing, which would cause parsing errors when the format is different so that it matches expectations, then added a different flag (perhaps called try_fast_parsing or something) with the current behavior of infer_datetime_format? That way we could fix any confusion with the current functionality, but we also wouldn't lose the functionality and options to_datetime currently has.

MarcoGorelli · 2022-09-19T15:53:41Z

Thanks for taking a look @srotondo

I don't think that would fix the confusion - in #12585 and linked issues, people are using to_datetime without infer_datetime_format yet are still expecting consistent parsing

And if someone wants to retain the current functionality of guessing the format for each element individually, they could still do that with .apply(pd.to_datetime)

Furthermore, I'd be -1 on adding yet another boolean argument thus increasing complexity even more

Dr-Irv · 2022-09-19T16:04:31Z

I don't think that would fix the confusion - in #12585 and linked issues, people are using to_datetime without infer_datetime_format yet are still expecting consistent parsing

And if someone wants to retain the current functionality of guessing the format for each element individually, they could still do that with .apply(pd.to_datetime)

What if you keep infer_datetime_format, but change its type to str and the default behavior. Values could be (open to changing these):

consistent : the new default, based on this proposal. Use first item to determine format, retain that format throughout
inconsistent: the format may change based on how each element is parsed (current behavior with value of True ?)
None: Do not infer, and only use the value of format (current default with False ?)

So people with code that doesn't specify infer_datetime_format will get the new behavior. If someone wants the old behavior, they can ask for it. If they do specify it as True or False, we raise.

srotondo · 2022-09-19T16:15:05Z

And if someone wants to retain the current functionality of guessing the format for each element individually, they could still do that with .apply(pd.to_datetime)

@MarcoGorelli I suppose if the current functionality is still available through .apply(), that's probably fine, but do take care that any change you make to to_datetime doesn't affect the outcome of .apply(to_datetime) and also make it very clear in the documentation that that is how you get the old behavior. I'm just not very comfortable with completely removing the old functionality from to_datetime since even though you stated that many people expect this proposed behavior, it's possible some people use and rely on the current behavior. But as long as you can clearly and easily find how to get that old behavior, it's probably alright. Just please be sure that the current behavior isn't completely removed and is still accessible somehow.

MarcoGorelli · 2022-09-19T16:46:20Z

@Dr-Irv sure, that'd be an option, but if there's gonna be a breaking change, I'd suggest we take the chance to simplify - having infer_datetime_format take three different options feels like too much complexity

jbrockmendel · 2022-09-19T17:00:16Z

+1 on the big picture. Implementation-wise, is the idea to pretty much replace our usage of dateutil.parser?

mroeschke · 2022-09-19T17:03:22Z

web/pandas/pdeps/0004-consistent-to-datetime-parsing.md

+However, it's read as "1st of December, 13th of January". No warning or error is thrown.
+
+Currently, the only way to ensure consistent parsing is by explicitly passing
+``format=``. The argument ``infer_datetime_format``


Minor but related, but it would be good to mention that infer_datetime_format should be strict with respect to being mixed with format

In [1]: pd.to_datetime(["2022-01-01"], infer_datetime_format=True, format="%Y-%m-%d") Out[1]: DatetimeIndex(['2022-01-01'], dtype='datetime64[ns]', freq=None) # Format doesn't match the input In [2]: pd.to_datetime(["2022-01-01"], infer_datetime_format=True, format="%m-%d-%Y") Out[2]: DatetimeIndex(['2022-01-01'], dtype='datetime64[ns]', freq=None)

i.e. it's not great that format != None and infer_datetime_format=True

mroeschke · 2022-09-19T17:10:19Z

If infer_datetime_format was deprecated, I do like @rhshadrach's suggestion #48621 (comment) of making guess_datetime_format public such that existing "functionality" isn't lost while making to_datetime stricter as the prior behavior would be essentially equivalent to:

to_datetime(arg, format=pd.tools.guess_datetime_format(arg[0]))

MarcoGorelli · 2022-09-19T18:09:24Z

+1 on the big picture. Implementation-wise, is the idea to pretty much replace our usage of dateutil.parser?

Thanks

dateutil.parser would still be used within guess_datetime_format, but then subsequent rows would be parsed with that guessed format rather than repeatedly calling dateutil.parser and risk having it silently switch format

I'm also on board with the suggestion to make guess_datetime_format public

…yfirst/yearfirst

MarcoGorelli · 2022-09-20T09:50:13Z

Thanks all for your feedback, I'll incorporated some points into the document

There's been a couple of approvals, so for now I've changed the status to accepted

MarcoGorelli · 2022-09-23T10:37:09Z

There's been 3 explicit approvals from core members, no "requested changes", and this has been open all week, so merging now

Thanks all for the discussion, aiming to start working on this soon-ish

datapythonista · 2022-09-24T12:03:07Z

I couldn't review before merged, but happy with the proposal here, really nice improvement.

I'm curious what's the suggested workaround mentioned in the PDEP for parsing columns with different formats. Should we mention it in the PDEP? Or the idea is to write it in the logs when implemented?

Also, seems like few formats are broken when rendered: https://pandas.pydata.org/pdeps/0004-consistent-to-datetime-parsing.html

MarcoGorelli · 2022-09-24T12:52:31Z

Thanks - yeah it's mentioned at the end of "detailed description"

If a user has dates in a mixed format, they can still use flexible parsing and accept the risks that poses, e.g.:
In [3]: pd.Series(['12-01-2000 00:00:00', '13-01-2000 00:00:00']).apply(pd.to_datetime)
Out[3]:
0   2000-12-01
1   2000-01-13
dtype: datetime64[ns]

rhshadrach · 2022-09-24T13:28:59Z

I'm curious what's the suggested workaround mentioned in the PDEP for parsing columns with different formats.

One way I read this question (and not sure if it's the right way) is if a frame has two columns each with a different (but self-consistent) format. They would be parsed individually, is that correct @MarcoGorelli?

MarcoGorelli · 2022-09-24T13:49:41Z

Ah apologies, I'd misunderstood

I presume you mean in read_csv? If so, then yes, e.g.:

data = io.StringIO('13-01-2000 00:00,13 Jan 2000\n14-01-2000 00:00,14 Jan 2000\n')
print(pd.read_csv(data, header=None, parse_dates=[0, 1]))

would give

           0          1
0 2000-01-13 2000-01-13
1 2000-01-14 2000-01-14

Each columns were parsed according to the format inferred from its respective first row.

If the first row of the first column had been 12-01-2000 00:00, then the inferred format would've been month-first, and so read_csv would've returned it unaltered as object, whilst the second column would still have been successfully converted

This wouldn't deviate from the current behaviour

datapythonista · 2022-09-24T14:42:02Z

Actually what I had in mind was answered in the first answer, I don't think my question was very clear. I missed the workaround when reading the PDEP, sorry.

datapythonista · 2022-09-24T12:04:08Z

web/pandas/pdeps/0004-consistent-to-datetime-parsing.md

+## Abstract
+
+The suggestion is that:
+- ``to_datetime`` becomes strict and uses the same datetime format to parse all elements in its input.


We should add a blank line before the list, that's markdown standard (GitHub comments allows it, but it's not allowed in the markdown spec)

datapythonista · 2022-09-24T12:09:46Z

web/pandas/pdeps/0004-consistent-to-datetime-parsing.md

+
+The whatsnew notes read
+
+> In the next major version release, 2.0, several larger API changes are being considered without a formal deprecation.


Looks like we don't have style for blockquote in our website. I created #48758 to fix it.

* pdep-4: initial draft * note about making guess_datetime_format public, out of scope work, dayfirst/yearfirst Co-authored-by: MarcoGorelli <>

datapythonista · 2023-02-27T13:55:54Z

@MarcoGorelli was PDEP-4 fully implemented? Should we change its state to Implemented?

Also, I see there was a revision. Would it make sense to add the link to the revision PR to Discussion field, next to the original one?

pdep-4: initial draft

e59b6ff

MarcoGorelli added the PDEP pandas enhancement proposal label Sep 18, 2022

MarcoGorelli requested review from a team September 18, 2022 13:36

This was referenced Sep 18, 2022

Inconsistent date parsing of to_datetime #12585

Closed

ENH: make infer_datetime_format strict #48596

Closed

phofl reviewed Sep 18, 2022

View reviewed changes

jreback approved these changes Sep 18, 2022

View reviewed changes

MarcoGorelli mentioned this pull request Sep 18, 2022

BUG: Behavior with fallback between raise and coerce #46071 #47745

Closed

5 tasks

ssche mentioned this pull request Sep 19, 2022

BUG: pandas.to_datetime() does not respect exact format string with ISO8601 #12649

Closed

Dr-Irv reviewed Sep 19, 2022

View reviewed changes

mroeschke reviewed Sep 19, 2022

View reviewed changes

rhshadrach approved these changes Sep 19, 2022

View reviewed changes

note about making guess_datetime_format public, out of scope work, da…

851137d

…yfirst/yearfirst

pnacht mentioned this pull request Sep 20, 2022

Add OpenSSF Scorecards GitHub Action #48570

Merged

mroeschke approved these changes Sep 20, 2022

View reviewed changes

MarcoGorelli merged commit 7d852a9 into pandas-dev:main Sep 23, 2022

datapythonista reviewed Sep 24, 2022

View reviewed changes

hongyeehh mentioned this pull request Apr 13, 2023

TST: test error fixes with pandas 2.0 mie-lab/trackintel#470

Merged

mroeschke mentioned this pull request May 7, 2024

CLN: Remove deprecated read_csv(infer_datetime_format=) #58620

Merged


		## Motivation and Scope

		Pandas date parsing is very flexibible, but arguably too much so - see


		The whatsnew notes read

		> In the next major version release, 2.0, several larger API changes are being considered without a formal deprecation.

PDEP-4: consistent parsing of datetimes #48621

PDEP-4: consistent parsing of datetimes #48621

Conversation

MarcoGorelli commented Sep 18, 2022 • edited Loading

ahmetanildindar commented Sep 18, 2022

phofl left a comment

Choose a reason for hiding this comment

jreback left a comment

Choose a reason for hiding this comment

jreback Sep 18, 2022

Choose a reason for hiding this comment

jreback Sep 18, 2022

Choose a reason for hiding this comment

MarcoGorelli commented Sep 18, 2022 • edited Loading

lordgrenville commented Sep 18, 2022

MarcoGorelli commented Sep 18, 2022

jreback commented Sep 18, 2022

MarcoGorelli commented Sep 18, 2022 • edited Loading

rhshadrach commented Sep 18, 2022 • edited Loading

scott-r commented Sep 18, 2022

lordgrenville commented Sep 19, 2022

scott-r commented Sep 19, 2022

MarcoGorelli commented Sep 19, 2022 • edited Loading

Dr-Irv Sep 19, 2022

Choose a reason for hiding this comment

attack68 Sep 19, 2022

Choose a reason for hiding this comment

MarcoGorelli Sep 19, 2022

Choose a reason for hiding this comment

srotondo commented Sep 19, 2022 • edited Loading

MarcoGorelli commented Sep 19, 2022

Dr-Irv commented Sep 19, 2022

srotondo commented Sep 19, 2022

MarcoGorelli commented Sep 19, 2022

jbrockmendel commented Sep 19, 2022

mroeschke Sep 19, 2022

Choose a reason for hiding this comment

mroeschke commented Sep 19, 2022

MarcoGorelli commented Sep 19, 2022

MarcoGorelli commented Sep 20, 2022

MarcoGorelli commented Sep 23, 2022

datapythonista commented Sep 24, 2022

MarcoGorelli commented Sep 24, 2022

rhshadrach commented Sep 24, 2022

MarcoGorelli commented Sep 24, 2022 • edited Loading

datapythonista commented Sep 24, 2022

datapythonista Sep 24, 2022 • edited Loading

Choose a reason for hiding this comment

datapythonista Sep 24, 2022

Choose a reason for hiding this comment

datapythonista commented Feb 27, 2023

MarcoGorelli commented Sep 18, 2022 •

edited

Loading

MarcoGorelli commented Sep 18, 2022 •

edited

Loading

MarcoGorelli commented Sep 18, 2022 •

edited

Loading

rhshadrach commented Sep 18, 2022 •

edited

Loading

MarcoGorelli commented Sep 19, 2022 •

edited

Loading

srotondo commented Sep 19, 2022 •

edited

Loading

MarcoGorelli commented Sep 24, 2022 •

edited

Loading

datapythonista Sep 24, 2022 •

edited

Loading