BUG: pandas read_csv silently switches date parsing scheme in the middle of reading file corrupting data #37895

Closed
jontis opened this issue Nov 16, 2020 · 3 comments
Labels
Datetime (Datetime data dtype), Duplicate Report (Duplicate issue or pull request), IO CSV (read_csv, to_csv)

Comments

@jontis
jontis commented Nov 16, 2020

In pandas 1.1.4

import pandas as pd

# Write two day/month/year dates to a CSV, then read them back with date parsing enabled
pd.DataFrame(data={'timestamp': ['13/06/2018', '12/06/2018']}).to_csv('test.csv', index=False)
pd.read_csv('test.csv', parse_dates=['timestamp'])

Out[84]: 
   timestamp
0 2018-06-13
1 2018-12-06

Problem description

I provoked pandas here (initially by mistake) by letting it parse a file with dates in day/month/year format using its default month/day/year parser. To my horror, it parses the values that form a valid date with the default parser and automatically uses the alternative parser on the ones that fail, giving a mix of date interpretations without any warning.
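One way to sidestep the per-row guessing, sketched here as a workaround rather than a fix for the behaviour reported: read the column as plain strings and parse it afterwards with a single explicit format. The in-memory CSV text and the choice of `%d/%m/%Y` are assumptions made for illustration.

```python
import io

import pandas as pd

# Same two values as in the report, held in memory instead of test.csv.
csv_text = "timestamp\n13/06/2018\n12/06/2018\n"

# Read the column as strings so read_csv performs no date inference.
df = pd.read_csv(io.StringIO(csv_text), dtype={"timestamp": str})

# Apply one explicit day/month/year format to every row.
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%d/%m/%Y")

print(df)
```

With an explicit format, both rows are interpreted as June 2018 dates, and a value that does not fit the format raises instead of being reinterpreted.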

Expected Output

Pandas should have failed with an error, or a very severe warning when a date failed to parse with the same parser as the other dates. It should not be allowed to use different parsing schemes within the same data file.
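The fail-loudly behaviour asked for here can be approximated in user code today. A minimal sketch, assuming a hypothetical helper `parse_strict` (not part of pandas) that wraps `pd.to_datetime` with one fixed format and `errors="raise"`:

```python
import pandas as pd


def parse_strict(values, fmt):
    """Parse every value with one format; raise if any value does not match.

    Hypothetical helper for illustration, not a pandas API.
    """
    return pd.to_datetime(pd.Series(values), format=fmt, errors="raise")


raw = ["13/06/2018", "12/06/2018"]

# Month/day/year cannot represent "13/06/2018", so the whole parse fails.
try:
    parse_strict(raw, "%m/%d/%Y")
    failed = False
except ValueError as exc:
    failed = True
    print(f"strict parse rejected the column: {exc}")

# Day/month/year fits every row, so the parse succeeds consistently.
parsed = parse_strict(raw, "%d/%m/%Y")
print(parsed)
```

The point of the design is that one mismatching row rejects the whole column, rather than that row being silently parsed under a different scheme.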

pandas : 1.1.4
numpy : 1.19.4
pytz : 2020.4
dateutil : 2.8.1

@jontis jontis added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 16, 2020
@jreback
Contributor

jreback commented Nov 16, 2020

this is a duplicate issue
pls have a look - and welcome patches

@jontis
Author

jontis commented Nov 16, 2020

I have noticed some issues and discussions about pandas being "flexible" and being able to infer date format and handle different date formats within files. I may consider it a questionable strength to be able to read different date formats from different feature columns, but switching format within the same column is very dangerous and should be off by default.
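To illustrate why a format must hold for the whole column, here is a rough check that reports which candidate formats parse every row. The function name `formats_matching_all_rows` and the candidate list are made up for this sketch; only `pd.to_datetime` with `format` and `errors="coerce"` is actual pandas API.

```python
import pandas as pd


def formats_matching_all_rows(values, candidate_formats):
    """Return the candidate formats that parse every value in the column."""
    series = pd.Series(values, dtype="object")
    matching = []
    for fmt in candidate_formats:
        # Rows that do not fit the format become NaT instead of raising.
        parsed = pd.to_datetime(series, format=fmt, errors="coerce")
        if parsed.notna().all():
            matching.append(fmt)
    return matching


candidates = ["%m/%d/%Y", "%d/%m/%Y"]

# Only day/month/year fits both rows from the report.
print(formats_matching_all_rows(["13/06/2018", "12/06/2018"], candidates))

# Both formats fit here, so the column is genuinely ambiguous.
print(formats_matching_all_rows(["12/06/2018", "01/02/2018"], candidates))
```

An empty result means no single format fits the column, and more than one result means the data alone cannot settle the format. In both cases the format needs to be specified by hand.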

@jreback
Contributor

jreback commented Nov 17, 2020

duplicate of #12585
bugs - even long standing ones are fixed by community contributions which we are happy to take

@jreback jreback added IO CSV read_csv, to_csv and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 17, 2020
@jreback jreback added this to the No action milestone Nov 17, 2020
@jreback jreback added Duplicate Report Duplicate issue or pull request Datetime Datetime data dtype labels Nov 17, 2020
@jreback jreback closed this as completed Nov 17, 2020
2 participants