Greedy date inference in read_csv leads to inconsistent data #14301
Labels
Compat
pandas objects compatability with Numpy or Python functions
Datetime
Datetime data dtype
Duplicate Report
Duplicate issue or pull request
A small, complete example of the issue
with "greedy.txt" as
Expected Output
Actual Output
greedy.txt
So
read_csv()
has interpreted the first line as a US-format date, then realised that the second line cannot be a US-formatted date, so switched to European format. But it has not gone back and reevaluated the first line in light of its new information. So the resulting data is inconsistent, and pandas knows this.Obviously, I appreciate that
However, I contend that in this case the behaviour is incorrect (because there is a consistent interpretation of the column as a date, which is in fact clear by the second record). Even if some people don't regard this as a bug, I contend that it is at the very least dangerous and likely to cause serious (and sometimes baffling) errors. In my view, it would be much better to go back and reinterpret the data according to the information now available or to fail. If even this is considered too much, at the very least Pandas should issue a prominent warning that it has interpreted different rows in the column using different date formats.
Output of
pd.show_versions()
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 19.4
Cython: 0.24
numpy: 1.11.1
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None
The text was updated successfully, but these errors were encountered: