Greedy date inference in read_csv leads to inconsistent data #14301

njr0 · 2016-09-26T16:30:41Z

A small, complete example of the issue

import pandas as pd

df = pd.read_csv('greedy.txt', parse_dates=['EuroDate'])

print(df)

with "greedy.txt" as

EuroDate
10/9/2016
30/9/2016

Expected Output

    EuroDate
0 2016-09-10
1 2016-09-30

Actual Output

    EuroDate
0 2016-10-09
1 2016-09-30

greedy.txt

So read_csv() has interpreted the first line as a US-format date, then realised that the second line cannot be a US-formatted date, so switched to European format. But it has not gone back and reevaluated the first line in light of its new information. So the resulting data is inconsistent, and pandas knows this.
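The row-by-row behaviour described above can be reproduced outside pandas. The sketch below uses `dateutil.parser.parse` directly as a stand-in for the per-value fallback; whether `read_csv` takes exactly this code path in a given pandas version is an assumption, not something confirmed in this thread.

```python
from dateutil import parser

# Each value is parsed on its own, with no knowledge of the rest of the column.
first = parser.parse("10/9/2016")   # a month-first reading is valid, so it is used
second = parser.parse("30/9/2016")  # month 30 is impossible, so day-first is used

print(first.date(), second.date())
```

The two calls never share state, which is why the first row is not revisited once the second row rules out the US reading.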

Obviously, I appreciate that:

- CSV files are a disaster.
- This code is asking Pandas to infer the dates.
- Going back and re-evaluating previous data in light of new information is slow and annoying.
- It won't always be possible to do anything except interpret a field as a string if there is inconsistent data.
- Some datasets will include dates in multiple formats (e.g. if humans have entered them free-form), and in those cases it might just be useful for Pandas to take its best guess on a row-by-row basis.

However, I contend that in this case the behaviour is incorrect (because there is a consistent interpretation of the column as a date, which is in fact clear by the second record). Even if some people don't regard this as a bug, I contend that it is at the very least dangerous and likely to cause serious (and sometimes baffling) errors. In my view, it would be much better to go back and reinterpret the data according to the information now available or to fail. If even this is considered too much, at the very least Pandas should issue a prominent warning that it has interpreted different rows in the column using different date formats.
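As a rough illustration of the "reinterpret or fail" behaviour requested above, here is a minimal sketch that accepts a column only if exactly one candidate format fits every value. The helper name `parse_consistently` and the two-format candidate list are hypothetical; this is not how pandas is implemented.

```python
from datetime import datetime

# Hypothetical candidate formats: US (month first) and European (day first).
CANDIDATE_FORMATS = ("%m/%d/%Y", "%d/%m/%Y")


def parse_consistently(values):
    """Return (format, dates) for the one format that fits every value, else raise."""
    matches = []
    for fmt in CANDIDATE_FORMATS:
        try:
            parsed = [datetime.strptime(v, fmt) for v in values]
        except ValueError:
            continue  # at least one value does not fit this format
        matches.append((fmt, parsed))
    if not matches:
        raise ValueError("no single candidate format fits every value")
    if len(matches) > 1:
        raise ValueError("column is ambiguous: more than one format fits every value")
    return matches[0]


fmt, dates = parse_consistently(["10/9/2016", "30/9/2016"])
print(fmt, [d.date().isoformat() for d in dates])
```

With the two rows from `greedy.txt`, only the day-first format survives, so both rows are read as September dates. A column where every value fits both formats raises instead of silently guessing.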

Output of `pd.show_versions()`

## INSTALLED VERSIONS

commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 19.4
Cython: 0.24
numpy: 1.11.1
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None


chris-b1 · 2016-09-26T16:48:21Z

xref #12585, this is essentially a symptom of it.

Note that there is a dayfirst= argument on read_csv for this exact case
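For reference, a minimal sketch of the `dayfirst=` suggestion applied to the example from the report. The file contents are inlined through `io.StringIO` here purely so the snippet is self-contained; the original report reads them from `greedy.txt`.

```python
import io

import pandas as pd

# Same two rows as greedy.txt in the report, inlined for a self-contained example.
csv_text = "EuroDate\n10/9/2016\n30/9/2016\n"

# dayfirst=True tells the parser to read ambiguous values as DD/MM/YYYY.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["EuroDate"], dayfirst=True)

print(df)
```

With `dayfirst=True`, both rows should come back as September dates, matching the expected output in the report.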

jreback · 2016-09-26T17:44:22Z

yes this is a duplicate of that issue.

@njr0 your comments are appreciated and a pull-request is welcome! (for now best to maybe raise an error if inconsistency exists and let the user be more explicit).

jreback closed this as completed Sep 26, 2016

jreback added the Datetime and Compat labels Sep 26, 2016

jreback added this to the No action milestone Sep 26, 2016

jreback added the Duplicate Report label Sep 26, 2016

jorisvandenbossche mentioned this issue Sep 26, 2016

Inconsistent date parsing of to_datetime #12585

Closed

8 tasks
