Skip to content

to_datetime function not working with %Y.%m.%d %H format #21422

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
johan12345 opened this issue Jun 11, 2018 · 14 comments · Fixed by #50242
Closed

to_datetime function not working with %Y.%m.%d %H format #21422

johan12345 opened this issue Jun 11, 2018 · 14 comments · Fixed by #50242
Labels
Bug Datetime Datetime data dtype

Comments

@johan12345
Copy link
Contributor

Code Sample, a copy-pastable example if possible

import pandas as pd
import datetime as dt
print(dt.datetime.strptime('2012.01.01 1', '%Y.%m.%d %H')) 
# -> 2012-01-01 01:00:00
print(pd.to_datetime('2012.01.01 1', format='%Y.%m.%d %H')) 
# -> ValueError: time data '2012.01.01 1' doesn't match format specified

Problem description

When using to_datetime to parse a date that only includes an hour component, but not minutes and seconds, with a format that is otherwise similar to ISO8601 (such as the format '%Y.%m.%d %H'), a ValueError is raised (see above). This behavior is unexpected as strptime can parse the same date without any problem, using the same format string (see above).

I suspect that the problem is in the _format_is_iso function of pandas._libs.tslibs.parsing, where it is just checked if the ISO format starts with the format given - so this format is recognized as being ISO-like. In this case, the format passed to to_datetime is ignored and tslib.array_to_datetime function is used to parse the date instead, which doesn't seem to be able to handle this kind of format.

My current workaround is to modify the dates to also have a minutes component (append ':00' to every string) so that they can be parsed.

Expected Output

2012-01-01 01:00:00 (same as when using strptime)

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-6-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: de_DE.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.5.2
Cython: None
numpy: 1.14.2
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.0
openpyxl: 2.5.0
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@chris-b1
Copy link
Contributor

Thanks for the report - issue seems to be here - raising an error if we reach the end of the string and the hour wasn't 2 digits, which looks unnecessary.

PR to fix welcome - to start with I think could just delete this conditional and see if any tests break (as well as adding new tests)

@chris-b1 chris-b1 added Bug Datetime Datetime data dtype labels Jun 11, 2018
@chris-b1 chris-b1 added this to the Next Major Release milestone Jun 11, 2018
@uds5501
Copy link
Contributor

uds5501 commented Jun 13, 2018

@chris-b1 deleting this conditional does solve @johan12345 's issue. But fails multiple tests. I will see if can modify those tests.

uds5501 added a commit to uds5501/pandas that referenced this issue Jun 13, 2018
BUG : to_datetime function not working with %Y.%m.%d %H format pandas-dev#21422
@WillAyd
Copy link
Member

WillAyd commented Jun 14, 2018

@jbrockmendel brings up a good point in questioning this in the associated PR. While I understand that the datetime module is parsing that single digit hour for the user, I'm not sure that can be guaranteed across all platforms. In fact, the Python documentation on "%H" from C89 states that it represents a "Hour (24-hour clock) as a zero-padded decimal number", which the example is not.

I think if pandas tries to make guarantees about how this gets parsed for non-standard directives that we'd be opening up a can of worms, so I'd vote for no action here.

Source:

https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

@chris-b1
Copy link
Contributor

It appears in POSIX std, strptime will handle missing leading 0s for %H - we already do the same for %m, %d, etc, so I think it's reasonable to expand our parser.

http://pubs.opengroup.org/onlinepubs/009695399/functions/strptime.html

I really only worry about ambigous cases, is there is any risk "2016.05.06 1" is ambiguous with something other than pd.Timestamp('2016-05-05 01:00:00')? (honestly asking)

@jbrockmendel
Copy link
Member

FWIW this would be breaking with both numpy and dateutil.

@chris-b1
Copy link
Contributor

we broke with numpy a while back, and its parser is limited so not worried about that, but the dateutil one is more interesting, didn't know that.

from dateutil.parser import parser

parser().parse("2016.05.06 1")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-f30f9e318c90> in <module>()
----> 1 parser().parse("2016.05.06 1")

~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\dateutil\parser.py in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
    557 
    558         if res is None:
--> 559             raise ValueError("Unknown string format")
    560 
    561         if len(res) == 0:

ValueError: Unknown string format

parser().parse("2016.05.06 01")
Out[11]: datetime.datetime(2016, 5, 6, 1, 0)

@WillAyd
Copy link
Member

WillAyd commented Jun 14, 2018

@chris-b1 good info. Any concerns with Windows? Not sure where to find that information

@chris-b1
Copy link
Contributor

Windows being windows doesn't actually have a strptime in its c-runtime, but looks like std::get_time which it would have follows roughly the same spec with respect to leadings 0s.
http://en.cppreference.com/w/cpp/io/manip/get_time

@uds5501
Copy link
Contributor

uds5501 commented Jun 14, 2018

@chris-b1 Is it really worth messing with present parser? As @WillAyd mentioned, the bugs that may follow can be tricky to tackle with considering following a different path from numpy and dateutil parser (which isn't that helpful either, I reckon)

@chris-b1
Copy link
Contributor

To me it's only a question of whether we consider "2016.05.06 1" an ISO 8601 (loosely as we use it) formatted string, i.e. parses even if no format is passed.

Even if the answer to that is no, we need to match python if an explicit format spec is passed, so some change is necessary either way. It's a bug for pd.to_datetime(..., format='format') not to match datetime.strptime(..., 'format')

@johan12345
Copy link
Contributor Author

@chris-b1 Right - It is okay that the date is not parseable without specifying an explicit format, but when specifying the actual format, the behavior should be the same as with strptime.

Thanks everyone for the quick reactions, even before I had time to look into it in detail myself!

@mnjenga2

This comment was marked as off-topic.

@ghost

This comment was marked as off-topic.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@MarcoGorelli
Copy link
Member

This'll be addressed in #50242

thanks for the report!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Datetime Datetime data dtype
Projects
None yet
8 participants