-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Inconsistent date parsing of to_datetime #12585
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
So we ONLY hit the dateutil path (and hence use So easiest thing to do is track in
So this is a relatively easy first step, we could easily raise an ambiguity error and force the user to figure it out (unless coerceing of course). We could also (and maybe wait for an ambiguity error to occur), try several formats to see what parses. These options take into account all dates, to make this more performance we might want to take a sample (or a sample of the uniques is best). |
The |
Since mixed date formats within one series is unusual, why not raise a warning every time there are ambiguous values? |
I am not sure there is way to get from |
I am new to Pandas and github - please be gentle. If I understand the thread correctly - this is an unaddressed issue for 3.5 years? Don't you think that explicitly coding how a date should be read is an unreasonable solution? It's like importing integers and coding "randomly_scramble_my_number=False". This error is enough to ban use of Pandas in some large organisations :( |
There are 3000+ open issues in pandas. How should the handful of mostly volunteer folks prioritize? if large orgs want to just take and not help fix then I am not sure what to say. open source is about finding, reporting, AND helping to fix. there are a many orgs and many many individual contributors to pandas and open source generally that have built an amazing eco system. Would love for large orgs to step up and seriously fund open source. just taking and not giving back at all IMHO is terrible. |
Dear all, It seems to_datetime is causing significant problem. Original is in DD/MM/YYYY hh:mm:ss format But modified is in YYYY/DD/MM hh:mm:ss format I'd would expect the result as YYYY-MM-DD hh:mm:ss. I'd like to do anything I can for the solution. |
I've stumbled across this issue in working with our scientific dataset. Specifying the dayfirst argument to read_csv sorted us out. I would agree with the suggestion of printing a warning when inconsistent datetime parsing may be occurring. Can this be done after parsing by checking for inconsistent mapping from the string input to the datetime output across the whole series? Would this introduce a performance penalty? I'm keen to contribute a fix but I'm a newbie to the pandas project. |
@Craskermasker happy to have a contribution towards this |
I barely caught this error when trying to process a datasett, and I feel that it should be addressed. The only solution I see for solving this is to force the use of In pandas/pandas/_libs/tslibs/parsing.pyx Line 591 in 2990c1d
In the mean time I have reverted to using |
Just to give an update on this, the example from this issue will now show a warning: In [1]: import pandas as pd
In [2]: pd.to_datetime(["31/12/2014", "10/03/2011"])
<ipython-input-2-9e192223af3c>:1: UserWarning: Parsing '31/12/2014' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
pd.to_datetime(["31/12/2014", "10/03/2011"])
Out[2]: DatetimeIndex(['2014-12-31', '2011-10-03'], dtype='datetime64[ns]', freq=None) What's missing is to show a warning when the date string isn't a delimited date string (e.g. |
Just to say, for others on this thread- the warning happens for all date-time conversions, not just for those with the problem in question - where the parser changes its mind mid-conversion about what the days and months are. In [4]: pd.to_datetime(["31/12/2014"])
<ipython-input-4-9e280f89aa98>:1: UserWarning: Parsing '31/12/2014' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
pd.to_datetime(["31/12/2014"]) This is helpful - thank you - because it reminds users to specify the format or add the keyword argument, for all date-time parsing. But, because it is not specific to the change-mind case, and appears for all date-time parsing, I suspect a lot of users will get used to ignoring the warning, and will not realize what has happened in the mid-parsing switch. |
Just another tiny example to complement the original: [ins] In [5]: pd.to_datetime(['01-06-20', '13-06-20', '01-07-20', '13-07-20'])
Out[5]: DatetimeIndex(['2020-01-06', '2020-06-13', '2020-01-07', '2020-07-13'], dtype='datetime64[ns]', freq=None) Because the parser does not have a rule to keep it consistent, it thinks that the month comes before the day for the first and third dates, and the day comes before the month for the second and fourth. |
From the latest dev meeting, it was agreed that it would be good to aim for Background: pandas uses its own parsing when the date string format is standard (e.g. yyyy-mm-dd), and defers to When parsing dates using pandas' own parsers, pandas could already be strict - however, that'd be inconsistent with the non-strict parsing by dateutil, and users wouldn't know which one was being used nor what to expect. A couple of solutions could then be:
Approach 1: I looked into this in #47828. However, there are parts of the code where Either this code would need refactoring, or the warning could be filtered in the Approach 2: I'm leaning towards this at the moment. If users have non-standard dates they need to parse, they could use dateutil themselves and accept the guesswork, or explicitly pass EDIT: there's a 3rd approach, which I've outlined here, which would obliviate the need to remove dateutil or to emit a warning from it |
I've put together a PDEP for this: #48621 In addition to the core and triage teams, I'd encourage others who have commented here (@lordgrenville, @garfieldthecat, @paalge, @Craskermasker, @ahmetanildindar , @matthew-brett) to have a look - if you have concerns about this proposal, I'd appreciate your input |
As this comes up regularly:
yearfirst/datefirst
inread_csv
dayfirst=True, yearfirst=True
fixesread_csv
withdayfirst/yearfirst
The problem
Depending on the string formats, you can get inconsistent parsing (meaning: different interpretation of parts of the strings as days vs months between different values in the passed list/array of strings).
For example:
where the first is parsed as 'dayfirst' and the second not. Above particular case can be solved by passing
dayfirst=True
, but in any case, the default behaviour ofto_datetime
results in inconsistent parsing here.Another typical case if where you do provide
dayfirst=True
but have a malformed date, then it will fall back to 'not dayfirst' (#3341).The reason
The reason is the flexibility of
dateutil.parser.parse
. This is in the first place also a feature, as it lets you parse messy datetime colums. But when it comes to flexibility regardingdayfirst
/yearfirst
, this gives unexpected results.So dateutil will 'try' to use
dayfirst=True
, but when this does not give a valid date, it will ignoredayfirst=True
and parse the string anyway:The same is true for
dayfirst=False
(the default) as this will also try with 'day first' if the other way around failed.The issue for the lack of strictness of dayfirst is actually #3341.
Possible solutions
dayfirst=True
is sufficient (if you don't have messy or malformed dates).format
argument.We should probably stress this more in the docstring of
to_datetime
But, regardless of the two points above, fact is that the default behaviour of
to_datetime
(without passing any arguments) can do inconsistent parsing and so give unexpected results.Possible ways to have better default behaviour:
dateutil
should be strict aboutdayfirst
. I raised an issue for that here: Possibility to have a strict dayfirst dateutil/dateutil#214infer_datetime_format
toTrue
would solve it in some cases (where the first datetime has the correct format you want), but certainly not all.The text was updated successfully, but these errors were encountered: