Inconsistent date parsing of to_datetime #12585

jorisvandenbossche · 2016-03-10T10:30:31Z

As this comes up regularly:

BUG: Unexpected pd.to_datetime result #12583
pd.to_datetime/pd.DatetimeIndex unexpected behavior #12501
Bug: inconsistent date parsing by to_datetime #7348
Determine datetime_format from more than one element #15075
pandas.io.tests.test_parsers.Test*.test_yy_format fails in 0.18.0 #12611 deal with yearfirst/datefirst in read_csv
COMPAT: dateutil 2.5.2 broke sometests #12730, compat with dateutil 2.5.2 dayfirst=True, yearfirst=True fixes
Greedy date inference in read_csv leads to inconsistent data #14301 raise if inconsistency in read_csv with dayfirst/yearfirst
performance read_csv : using day first 23x to 35x slower than setting the format explicitly #25848
The problem

Depending on the string formats, you can get inconsistent parsing (meaning: different interpretation of parts of the strings as days vs months between different values in the passed list/array of strings).
For example:

In [22]: pd.to_datetime(["31/12/2014", "10/03/2011"])
Out[22]: DatetimeIndex(['2014-12-31', '2011-10-03'])

where the first is parsed as 'dayfirst' and the second not. Above particular case can be solved by passing dayfirst=True, but in any case, the default behaviour of to_datetime results in inconsistent parsing here.
Another typical case if where you do provide dayfirst=True but have a malformed date, then it will fall back to 'not dayfirst' (#3341).

The reason

The reason is the flexibility of dateutil.parser.parse. This is in the first place also a feature, as it lets you parse messy datetime colums. But when it comes to flexibility regarding dayfirst/yearfirst, this gives unexpected results.
So dateutil will 'try' to use dayfirst=True, but when this does not give a valid date, it will ignore dayfirst=True and parse the string anyway:

In [3]: dateutil.parser.parse('1/12/2012', dayfirst=True)
Out[3]: datetime.datetime(2012, 12, 1, 0, 0)

In [4]: dateutil.parser.parse('1/13/2012', dayfirst=True)
Out[4]: datetime.datetime(2012, 1, 13, 0, 0)

The same is true for dayfirst=False (the default) as this will also try with 'day first' if the other way around failed.
The issue for the lack of strictness of dayfirst is actually #3341.

Possible solutions

In some cases passing dayfirst=True is sufficient (if you don't have messy or malformed dates).
The typical answer is saying: "if you want to be sure to have strict and consistent parsing, provide a format argument.
We should probably stress this more in the docstring of to_datetime

But, regardless of the two points above, fact is that the default behaviour of to_datetime (without passing any arguments) can do inconsistent parsing and so give unexpected results.
Possible ways to have better default behaviour:

Have the possibility to specify that dateutil should be strict about dayfirst. I raised an issue for that here: Possibility to have a strict dayfirst dateutil/dateutil#214
Changing the default of infer_datetime_format to True would solve it in some cases (where the first datetime has the correct format you want), but certainly not all.
The idea of raising a warning when inconsistent parsing happened, is AFAIK not possible as pandas does not know how dateutil parsed the dates. And always warning when no format is passed or could be inferred (so in practice, when dateutil is used), is maybe to drastically, as the flexible parsing is also a feature.

The text was updated successfully, but these errors were encountered:

jreback · 2016-03-10T13:04:26Z

So we ONLY hit the dateutil path (and hence use dayfirst/yearfirst settings) IF we fail to parse ISO8601 (IOW an exception is raised).

So easiest thing to do is track in array_to_datetime when/if this happens.

If we don't trigger the exception at all, then all dates are consistent (as ISO 8601, meaning dayfirst=False, yearfirst=False).
If we have at least 1 time we hit the exception (but not all), then we have a potential ambiguity (meaning
some dates were parsed as ISO 8601 and some maybe were not)
If we hit the exception EVERY time, then we have NO ISO8601 dates and I think the dayfirst/yearfirst flags would make this unambiguous (somewhat depends on dateutil).

So this is a relatively easy first step, we could easily raise an ambiguity error and force the user to figure it out (unless coerceing of course).

We could also (and maybe wait for an ambiguity error to occur), try several formats to see what parses.

These options take into account all dates, to make this more performance we might want to take a sample (or a sample of the uniques is best).

dolaameng · 2016-03-11T05:17:42Z

The dayfirst parameter to the reading functions such as read_csv is even more confusing, as multiple datetime columns with different formats might co-exist in the same dataset.

lordgrenville · 2018-08-16T13:51:38Z

Since mixed date formats within one series is unusual, why not raise a warning every time there are ambiguous values?

jorisvandenbossche · 2018-08-16T14:37:12Z

Since mixed date formats within one series is unusual, why not raise a warning every time there are ambiguous values?

I am not sure there is way to get from dateutil how (assuming which format) it parsed the date, so then it will also be difficult to warn for different formats.

erpda · 2020-02-25T18:48:00Z

I am new to Pandas and github - please be gentle.

If I understand the thread correctly - this is an unaddressed issue for 3.5 years? Don't you think that explicitly coding how a date should be read is an unreasonable solution? It's like importing integers and coding "randomly_scramble_my_number=False". This error is enough to ban use of Pandas in some large organisations :(

jreback · 2020-02-25T22:42:02Z

I am new to Pandas and github - please be gentle.

If I understand the thread correctly - this is an unaddressed issue for 3.5 years? Don't you think that explicitly coding how a date should be read is an unreasonable solution? It's like importing integers and coding "randomly_scramble_my_number=False". This error is enough to ban use of Pandas in some large organisations :(

@erpda

There are 3000+ open issues in pandas. How should the handful of mostly volunteer folks prioritize? if large orgs want to just take and not help fix then I am not sure what to say.

open source is about finding, reporting, AND helping to fix. there are a many orgs and many many individual contributors to pandas and open source generally that have built an amazing eco system.

Would love for large orgs to step up and seriously fund open source. just taking and not giving back at all IMHO is terrible.

ahmetanildindar · 2020-05-06T10:03:25Z

Dear all,

It seems to_datetime is causing significant problem.
Here are the codes I've ran. When compare the EARTHQUAKEDATE fields of the original and copied and to_datetime applied dataframes, I see the values are unfortunately different.

Original is in DD/MM/YYYY hh:mm:ss format
EARTHQUAKEDATE 05/10/1977 05:34:43

But modified is in YYYY/DD/MM hh:mm:ss format
EARTHQUAKEDATE 1977-05-10 05:34:43

I'd would expect the result as YYYY-MM-DD hh:mm:ss.

I'd like to do anything I can for the solution.

Craskermasker · 2020-07-01T13:55:00Z

I've stumbled across this issue in working with our scientific dataset. Specifying the dayfirst argument to read_csv sorted us out. I would agree with the suggestion of printing a warning when inconsistent datetime parsing may be occurring. Can this be done after parsing by checking for inconsistent mapping from the string input to the datetime output across the whole series? Would this introduce a performance penalty?

I'm keen to contribute a fix but I'm a newbie to the pandas project.

jreback · 2020-07-02T01:01:04Z

@Craskermasker happy to have a contribution towards this

paalge · 2020-07-02T13:36:27Z

I barely caught this error when trying to process a datasett, and I feel that it should be addressed.
My first thought was to make dayfirst True as default, but this does not work as if you input 10.20.2020 it converts it to 2020-10-20 without throwing an error 😮.

The only solution I see for solving this is to force the use of format, including making it a mandatory argument when using parse_dates in read_csv/read_tsv

In read_csv there should be a check that all dates are converted using the same format.From digging through the source code I would expect it to be this needs to be implemented in

pandas/pandas/_libs/tslibs/parsing.pyx

Line 591 in 2990c1d

def try_parse_date_and_time(

In the mean time I have reverted to using numpy.genfromtxt

MarcoGorelli · 2021-08-27T08:33:26Z

Just to give an update on this, the example from this issue will now show a warning:

In [1]: import pandas as pd

In [2]: pd.to_datetime(["31/12/2014", "10/03/2011"])
<ipython-input-2-9e192223af3c>:1: UserWarning: Parsing '31/12/2014' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
  pd.to_datetime(["31/12/2014", "10/03/2011"])
Out[2]: DatetimeIndex(['2014-12-31', '2011-10-03'], dtype='datetime64[ns]', freq=None)

What's missing is to show a warning when the date string isn't a delimited date string (e.g. "31/12/2014 00:00:00") - this would involve adding a warning to when du_parse is called, and would result in many false positives, but maybe that's OK

matthew-brett · 2021-08-27T11:11:30Z

Just to say, for others on this thread- the warning happens for all date-time conversions, not just for those with the problem in question - where the parser changes its mind mid-conversion about what the days and months are.

In [4]: pd.to_datetime(["31/12/2014"])
<ipython-input-4-9e280f89aa98>:1: UserWarning: Parsing '31/12/2014' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
  pd.to_datetime(["31/12/2014"])

This is helpful - thank you - because it reminds users to specify the format or add the keyword argument, for all date-time parsing. But, because it is not specific to the change-mind case, and appears for all date-time parsing, I suspect a lot of users will get used to ignoring the warning, and will not realize what has happened in the mid-parsing switch.

matthew-brett · 2021-09-23T17:41:02Z

Just another tiny example to complement the original:

[ins] In [5]: pd.to_datetime(['01-06-20', '13-06-20', '01-07-20', '13-07-20'])                                                         
Out[5]: DatetimeIndex(['2020-01-06', '2020-06-13', '2020-01-07', '2020-07-13'], dtype='datetime64[ns]', freq=None)

Because the parser does not have a rule to keep it consistent, it thinks that the month comes before the day for the first and third dates, and the day comes before the month for the second and fourth.

MarcoGorelli · 2022-09-16T13:06:06Z

From the latest dev meeting, it was agreed that it would be good to aim for pd.to_datetime to be strict and to not allow mixed formats. Question is - how to do this?

Background: pandas uses its own parsing when the date string format is standard (e.g. yyyy-mm-dd), and defers to dateutil when pandas can't guess. dateutil is extremely flexible, though isn't strict, and it doesn't return the format it guessed (dateutil/dateutil#1138).

When parsing dates using pandas' own parsers, pandas could already be strict - however, that'd be inconsistent with the non-strict parsing by dateutil, and users wouldn't know which one was being used nor what to expect. A couple of solutions could then be:

strict pandas parsers, emit UserWarning if dateutil is used, OR
strict pandas parsers, error if pandas parser can't guess format - remove dateutil

Approach 1: I looked into this in #47828. However, there are parts of the code where dateutil is hit as part of a try-expect block, where the warning would be a false-positive. Example:

https://github.com/pandas-dev/pandas/blob/main/pandas/_libs/tslibs/period.pyx#L2581-L2588

Either this code would need refactoring, or the warning could be filtered in the try block and only emitted if the else block is reached. I'd be -1 on this

Approach 2: I'm leaning towards this at the moment. If users have non-standard dates they need to parse, they could use dateutil themselves and accept the guesswork, or explicitly pass format=

EDIT: there's a 3rd approach, which I've outlined here, which would obliviate the need to remove dateutil or to emit a warning from it

MarcoGorelli · 2022-09-18T13:45:33Z

I've put together a PDEP for this: #48621

In addition to the core and triage teams, I'd encourage others who have commented here (@lordgrenville, @garfieldthecat, @paalge, @Craskermasker, @ahmetanildindar , @matthew-brett) to have a look - if you have concerns about this proposal, I'd appreciate your input

jorisvandenbossche added Datetime Datetime data dtype Master Tracker High level tracker for similar issues labels Mar 10, 2016

This was referenced Mar 10, 2016

BUG: Unexpected pd.to_datetime result #12583

Closed

Bug: inconsistent date parsing by to_datetime #7348

Closed

hayd mentioned this issue Mar 10, 2016

BUG/ENH: to_datetime dayfirst ought to be strict #3341

Closed

jorisvandenbossche mentioned this issue Mar 10, 2016

Possibility to have a strict dayfirst dateutil/dateutil#214

Open

This was referenced Mar 14, 2016

pandas.io.tests.test_parsers.Test*.test_yy_format fails in 0.18.0 #12611

Closed

PERF: parse non-ambiguous date formats in c #12667

Closed

jreback added this to the 0.18.1 milestone Mar 29, 2016

jreback mentioned this issue Apr 6, 2016

Garbled dates in pandas 0.18.0 #12808

Closed

jreback modified the milestones: 0.18.2, 0.18.1 Apr 26, 2016

jorisvandenbossche modified the milestones: 0.20.0, 0.19.0 Aug 13, 2016

chris-b1 mentioned this issue Sep 26, 2016

Greedy date inference in read_csv leads to inconsistent data #14301

Closed

jreback mentioned this issue Jan 6, 2017

Determine datetime_format from more than one element #15075

Closed

jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017

chris-b1 mentioned this issue Jul 11, 2017

Weird datetime string default parsing on to_datetime #16882

Closed

jreback modified the milestones: Next Major Release, High Level Issue Tracking Sep 24, 2017

chris-b1 mentioned this issue Jan 26, 2018

Inconsistent date parsing with read_csv #19402

Closed

TomAugspurger removed the Master Tracker High level tracker for similar issues label Jul 6, 2018

TomAugspurger removed this from the High Level Issue Tracking milestone Jul 6, 2018

mroeschke added the Bug label Mar 31, 2020

simonjayhawkins mentioned this issue Jun 9, 2020

BUG: to_datetime do not keep the date format throughout the column when using inferred format #34546

Closed

1 task

simonjayhawkins mentioned this issue Jul 24, 2020

BUG: DateTime Index format bug when importing from .csv file #35402

Closed

3 tasks

arw2019 mentioned this issue Jul 27, 2020

WIP BUG: Inconsistent date parsing of to_datetime #35428

Closed

6 tasks

jreback mentioned this issue Sep 7, 2020

BUG: Inconsistent conversion of to_datetime #36190

Closed

3 tasks

jreback mentioned this issue Nov 17, 2020

BUG: pandas read_csv silently switches date parsing scheme in the middle of reading file corrupting data #37895

Closed

jreback mentioned this issue Jul 17, 2021

BUG: to_datetime(): Unstable default dateformat assumptions #42584

Closed

This was referenced Aug 3, 2021

QST:Somehow parse dates in pandas is messing the data as shown here: https://youtu.be/m_4pbEyXSds #42793

Closed

Inconsistent date parsing of to_datetime #42908

Merged

MarcoGorelli mentioned this issue Aug 27, 2021

Warn on du parse #43247

Closed

4 tasks

benoit9126 mentioned this issue Apr 19, 2022

BUG: UserWarning about lacking infer_datetime_format with pd.to_datetime #46210

Closed

3 tasks

MarcoGorelli mentioned this issue Jul 23, 2022

WIP warn if parsing with du_parse #47828

Closed

5 tasks

This was referenced Sep 17, 2022

ENH: make infer_datetime_format strict #48596

Closed

PDEP-4: consistent parsing of datetimes #48621

Merged

ssche mentioned this issue Sep 18, 2022

BUG: pandas.to_datetime() does not respect exact format string with ISO8601 #12649

Closed

MarcoGorelli mentioned this issue Oct 17, 2022

PDEP0004: implementation #49024

Merged

6 tasks

MarcoGorelli closed this as completed in #49024 Dec 13, 2022

MarcoGorelli mentioned this issue Dec 13, 2022

WARN: warning shown when parsing delimited date string even if users can't do anything about it #50231

Closed

3 tasks

MarcoGorelli mentioned this issue Apr 21, 2023

feat(python): Allow %f in strptime format strings pola-rs/polars#8404

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistent date parsing of to_datetime #12585

Inconsistent date parsing of to_datetime #12585

jorisvandenbossche commented Mar 10, 2016 •

edited by mroeschke

Loading

jreback commented Mar 10, 2016

dolaameng commented Mar 11, 2016

lordgrenville commented Aug 16, 2018

jorisvandenbossche commented Aug 16, 2018

erpda commented Feb 25, 2020 •

edited

Loading

jreback commented Feb 25, 2020 •

edited

Loading

ahmetanildindar commented May 6, 2020 •

edited

Loading

Craskermasker commented Jul 1, 2020

jreback commented Jul 2, 2020

paalge commented Jul 2, 2020 •

edited

Loading

MarcoGorelli commented Aug 27, 2021

matthew-brett commented Aug 27, 2021

matthew-brett commented Sep 23, 2021

MarcoGorelli commented Sep 16, 2022 •

edited

Loading

MarcoGorelli commented Sep 18, 2022

Inconsistent date parsing of to_datetime #12585

Inconsistent date parsing of to_datetime #12585

Comments

jorisvandenbossche commented Mar 10, 2016 • edited by mroeschke Loading

jreback commented Mar 10, 2016

dolaameng commented Mar 11, 2016

lordgrenville commented Aug 16, 2018

jorisvandenbossche commented Aug 16, 2018

erpda commented Feb 25, 2020 • edited Loading

jreback commented Feb 25, 2020 • edited Loading

ahmetanildindar commented May 6, 2020 • edited Loading

Craskermasker commented Jul 1, 2020

jreback commented Jul 2, 2020

paalge commented Jul 2, 2020 • edited Loading

MarcoGorelli commented Aug 27, 2021

matthew-brett commented Aug 27, 2021

matthew-brett commented Sep 23, 2021

MarcoGorelli commented Sep 16, 2022 • edited Loading

MarcoGorelli commented Sep 18, 2022

jorisvandenbossche commented Mar 10, 2016 •

edited by mroeschke

Loading

erpda commented Feb 25, 2020 •

edited

Loading

jreback commented Feb 25, 2020 •

edited

Loading

ahmetanildindar commented May 6, 2020 •

edited

Loading

paalge commented Jul 2, 2020 •

edited

Loading

MarcoGorelli commented Sep 16, 2022 •

edited

Loading