Skip to content

CSV parse_dates not working if multiline header is specified. #8991

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
roldugin opened this issue Dec 4, 2014 · 3 comments · Fixed by #44408
Closed

CSV parse_dates not working if multiline header is specified. #8991

roldugin opened this issue Dec 4, 2014 · 3 comments · Fixed by #44408
Labels
Milestone

Comments

@roldugin
Copy link

roldugin commented Dec 4, 2014

I can't seem to be able to parse dates of multiheader files... Here's an example illustrating the problem I'm facing:

import pandas as pd
from io import StringIO

csv = """date,time,value
YYYY/MM/DD,HH:MM,Smp
2014/12/01,00:00,1
2014/12/01,01:00,2"""

pd.read_csv(StringIO(csv), header=[0,1], parse_dates=[['date','time']])
# ValueError: 'date' is not in list
@jreback
Copy link
Contributor

jreback commented Dec 4, 2014

This is a bit tricky. The problem is the column names are are really tuples.

e.g. this works but is ugly

In [22]: df = pd.read_csv(StringIO(csv), header=[0,1], parse_dates=[[('date','YYYY/MM/DD'),('time','HH:MM')]])

In [23]: df
Out[23]: 
  ('date', 'YYYY/MM/DD')_('time', 'HH:MM')  (value, Smp)
0                      2014-12-01 00:00:00             1
1                      2014-12-01 01:00:00             2

In [24]: df.dtypes
Out[24]: 
('date', 'YYYY/MM/DD')_('time', 'HH:MM')    datetime64[ns]
(value, Smp)                                         int64
dtype: object

This is a much better way to approach this problem (and much faster)


In [10]: df = pd.read_csv(StringIO(csv), header=[0,1])

In [11]: df
Out[11]: 
         date   time value
   YYYY/MM/DD  HH:MM   Smp
0  2014/12/01  00:00     1
1  2014/12/01  01:00     2

In [12]: pd.to_datetime(df.iloc[:,0] + ' ' + df.iloc[:,0])
Out[12]: 
0   2014-12-01 01:14:00
1   2014-12-01 01:14:00
Name: (date, YYYY/MM/DD), dtype: datetime64[ns]

so i'll call this an enhancement

@jreback jreback added IO CSV read_csv, to_csv Dtype Conversions Unexpected or buggy dtype conversions MultiIndex labels Dec 4, 2014
@jreback jreback added this to the 0.16.0 milestone Dec 4, 2014
@roldugin
Copy link
Author

roldugin commented Dec 4, 2014

Thank you for the fast reply, @jreback and for clearing this up. I like the idea of doing it after CSV parsing.

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@mroeschke mroeschke added Enhancement and removed Dtype Conversions Unexpected or buggy dtype conversions labels Apr 11, 2021
@phofl
Copy link
Member

phofl commented Nov 12, 2021

The approach suggested by @jreback stopped working sometime in the past:

data = """a,b,c
1,2,3
5,6,7
4,,6.0"""

result = pd.read_csv(
    StringIO(data),
    engine="python",
    parse_dates=[("a", "1")],
    header=[0, 1],
)

This throws:

Traceback (most recent call last):
  File "/home/developer/.config/JetBrains/PyCharm2021.2/scratches/scratch.py", line 16, in <module>
    result = pd.read_csv(
  File "/home/developer/PycharmProjects/pandas/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/developer/PycharmProjects/pandas/pandas/io/parsers/readers.py", line 639, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/developer/PycharmProjects/pandas/pandas/io/parsers/readers.py", line 535, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/developer/PycharmProjects/pandas/pandas/io/parsers/readers.py", line 888, in __init__
    self._engine = self._make_engine(self.engine)
  File "/home/developer/PycharmProjects/pandas/pandas/io/parsers/readers.py", line 1139, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "/home/developer/PycharmProjects/pandas/pandas/io/parsers/python_parser.py", line 154, in __init__
    self._validate_parse_dates_presence(self.columns)
  File "/home/developer/PycharmProjects/pandas/pandas/io/parsers/base_parser.py", line 280, in _validate_parse_dates_presence
    raise ValueError(
ValueError: Missing column provided to 'parse_dates': '1, a'

Process finished with exit code 1

Edit: Same behavior for c engine

@phofl phofl added Bug and removed Enhancement labels Nov 12, 2021
@jreback jreback modified the milestones: Contributions Welcome, 1.4 Nov 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants