Timezones silently dropped in parsing #18702

Closed
jbrockmendel opened this issue Dec 9, 2017 · 6 comments · Fixed by #51477
Labels
Bug Timezones Timezone data dtype

Comments

@jbrockmendel
Member

TLDR: pandas should pass a tzinfos kwarg to the dateutil parser using sensible defaults.

dateutil has a bug that silently drops most timezones. That bug is inherited by pandas. The following is run on a machine located in US/Pacific:

>>> pd.Timestamp('2017-12-08 08:20 PM PST')     # <-- only parsed correctly because of locale
Timestamp('2017-12-08 20:20:00-0800', tz='tzlocal()')
>>> pd.Timestamp('2017-12-08 08:20 PM EST')     # <-- timezone silently dropped
Timestamp('2017-12-08 20:20:00')

There is a partial fix in progress over at dateutil, the most likely outcome of which is that these cases will raise in the future unless a tzinfos kwarg is explicitly passed to dateutil.parser.parse. The issue for pandas is then to decide on what tzinfos to pass (a suggestion to handle the most common use cases by default within dateutil went nowhere).

The tzinfos kwarg is a dictionary mapping a timezone abbreviation (a string) to a tzinfo object, e.g.

import dateutil.tz

unambiguous_tzinfos = {
    'PDT': dateutil.tz.gettz('US/Pacific'),
    'PT': dateutil.tz.gettz('US/Pacific'),
    'MDT': dateutil.tz.gettz('US/Mountain'),
    'MT': dateutil.tz.gettz('US/Mountain'),
    'ET': dateutil.tz.gettz('US/Eastern'),
    'CET': dateutil.tz.gettz('Europe/Amsterdam'),
    'NZDT': dateutil.tz.gettz('Pacific/Auckland')}

This example includes only abbreviations for which there are no other alternatives listed here. So e.g. "CST" is excluded since it could also be "China Standard Time", "EST" is excluded since it could refer to "Australian Eastern Standard Time". Note this is only a subset of the unambiguous abbreviations.
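
For illustration, here is a minimal sketch of how such a dictionary can be handed to dateutil.parser.parse directly. The two-entry mapping and the canonical zone names (America/Los_Angeles, America/New_York) are assumptions chosen for the example, not a proposed pandas default.

```python
from datetime import timedelta

from dateutil import parser, tz

# Illustrative subset only; which abbreviations to include is the open question.
tzinfos = {
    "PT": tz.gettz("America/Los_Angeles"),
    "ET": tz.gettz("America/New_York"),
}

# With tzinfos supplied, the "ET" suffix is resolved instead of being dropped.
aware = parser.parse("2017-12-08 08:20 PM ET", tzinfos=tzinfos)

print(aware.tzinfo is not None)
print(aware.utcoffset())
```

The wall-clock time is left as written in the string; only the tzinfo is attached, so the offset follows whatever rules the mapped zone has on that date.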

@jreback
Contributor

jreback commented Dec 9, 2017

hmm ok, I would rather hand off non-ISO 8601 parsing to dateutil directly, so this would qualify. note that this only applies when format is not passed, and in a very limited set of cases.

jreback added this to the Next Major Release milestone Dec 9, 2017
@jbrockmendel
Member Author

I'd prefer that dateutil handle this internally too; my hope is that consensus will develop over there once more people report that it doesn't Just Work. But until then, it's still a nontrivial question of exactly what we want to recognize by default and whether/how to let users customize it.

I see two viable options:

  1. The most convenient thing to do -- at least in my comfortably Anglo-centric seat -- would be to pass defaults for a) abbreviations that are unambiguous and b) abbreviations for the most common timezones, e.g. assume CDT means "Central Daylight Time" and not "Cuba Daylight Time". Users who want to override that would need to do the parsing step before passing to the Timestamp/to_datetime constructor.

  2. Same as 1, but allow users a mechanism to override the tzinfos dict that pandas passes to dateutil.
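
A rough sketch of what option 2 could look like, written against dateutil alone. The names DEFAULT_TZINFOS and parse_with_tzinfos are hypothetical and exist only for this example; nothing here is an existing pandas API, and the default entries are assumptions.

```python
from datetime import timedelta

from dateutil import parser, tz

# Hypothetical defaults: "CDT" is assumed to mean US Central Daylight Time.
DEFAULT_TZINFOS = {
    "CDT": tz.gettz("America/Chicago"),
    "PDT": tz.gettz("America/Los_Angeles"),
}


def parse_with_tzinfos(text, tzinfos=None):
    """Parse text; caller-supplied entries take precedence over the defaults."""
    merged = {**DEFAULT_TZINFOS, **(tzinfos or {})}
    return parser.parse(text, tzinfos=merged)


# Default interpretation of "CDT".
central = parse_with_tzinfos("2017-07-01 12:00 CDT")

# A user who means Cuba Daylight Time overrides just that one key.
cuba = parse_with_tzinfos(
    "2017-07-01 12:00 CDT",
    tzinfos={"CDT": tz.gettz("America/Havana")},
)

print(central.utcoffset(), cuba.utcoffset())
```

Merging per key keeps the convenient defaults for everything the user does not mention, which is the main difference from option 1, where overriding means parsing outside of pandas entirely.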

@jreback
Contributor

jreback commented Dec 9, 2017

we shouldn’t be hard coding any time zones
i would think u can simply pull out the string and just try to localize

@jbrockmendel
Member Author

> i would think u can simply pull out the string and just try to localize

Can you expand on that? Are you suggesting users should do this before passing to Timestamp/to_datetime?

@jreback
Contributor

jreback commented Dec 10, 2017

of course not

when parsing if u hit something that looks like a tz
rather than an offset u can simply take the string and localize

@jbrockmendel
Member Author

> of course not

Good. That seemed unlikely (and altogether silly).

> when parsing if u hit something that looks like a tz rather than an offset u can simply take the string and localize

It's the "simply" that I'm having trouble with here. This sounds like you're suggesting the parsing be done within pandas, which I thought was what we're trying to avoid. Can you give an example of what you have in mind?
