Skip to content

tz_localize should support is_dst input array #7943

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
rockg opened this issue Aug 5, 2014 · 14 comments · Fixed by #7963
Closed

tz_localize should support is_dst input array #7943

rockg opened this issue Aug 5, 2014 · 14 comments · Fixed by #7963
Labels
Milestone

Comments

@rockg
Copy link
Contributor

rockg commented Aug 5, 2014

When storing datetimes with timezone information in mysql I split out the is_dst flag into a separate column. Then when reconstructing the Timestamps I am either forced to iterate through each row and call pytz.timezone.localize on every Timestamp which is very slow or do some magic with localizing what I can and then manually dealing with the fall transition time (note that infer_dst won't work because there could be many rows that have transitions in them). I would much rather create the DatetimeIndex from the column of dates and then call tz_localize with the is_dst column. This would then appropriately set the offset.

di = DatetimeIndex(frame['DateColumn'])
di = di.tz_localize(TimeZone, is_dst_flat=frame['IsDstColumn'])

Thoughts?

@jreback
Copy link
Contributor

jreback commented Aug 5, 2014

we already support infer_dst? how does that interact?

@rockg
Copy link
Contributor Author

rockg commented Aug 5, 2014

infer_dst only works when there are only two dates for the fall dst transition time. If there are more than two, then it cannot infer simply based on order. In that case, something else needs to be done to explicitly tell what times are dst times and what times are not.

@jreback
Copy link
Contributor

jreback commented Aug 5, 2014

ok, makes sense. would making infer_dst='force' make sense? or you simply want another keyword? (is_dst=None). I suppose its an error if is_dst is not None and infer_dst=True ?

@rockg
Copy link
Contributor Author

rockg commented Aug 5, 2014

I would like an is_dst keyword and this also mirrors pytz's localize which takes is_dst as a parameter. I agree that it is an error if is_dst is not None and infer_dst=True (they are mutually exclusive). I'll take a stab add adding it to localize. Should be straightforward if I recall.

@jreback
Copy link
Contributor

jreback commented Aug 5, 2014

ok, sounds good

@jreback jreback added this to the 0.15.0 milestone Aug 5, 2014
@ischwabacher
Copy link
Contributor

I suspect that is_dst is too entrenched for me to have a chance of painting this bikeshed my color, but I am against is_dst as a name because not all time zone discontinuities are DST boundaries. I think in this case, it is probably conceptually clearer to split off DST information as a timedelta64 offset rather than an is_dst flag, which among other things makes converting to UTC trivial.

That said, since the problem of converting local time and time zone to UTC (essentially inverting the time zone, if we view it as a map from UTC to local time) is simply one of selecting the correct UTC time from a list of zero, one or two times, and since in the case of an ambiguous time at a DST boundary the DST time must always come first, I propose that if we use is_dst as a parameter, it should mean "select the first (as opposed to the last) UTC time that maps to this local time, regardless of DST", and is_dst=True should be the default.

I also propose that we not put any effort into supporting the case where there are three or more UTC times corresponding to a given local time, because this case requires pathological government action and I don't know of any instance where it has happened.

@rockg
Copy link
Contributor Author

rockg commented Aug 7, 2014

Your second paragraph is not exactly correct and why this is needed. If you only had two ambiguous times, then the infer_dst argument is all that is needed and works like you say. However, when there are more times it becomes a problem. For example, when reading database records where there are multiple items that have ambiguous times associated with them, rather than discretizing among those items and inferring the right dst/non-dst time from for each item it will be much faster and easier to already use a pre-computed column when the data was input in my particular case. Here's a simple example of data I'm looking at.

Item,Time,IS_DST
1,2014-11-02 00:00,1
1,2014-11-02 01:00,1
1,2014-11-02 01:00,0
2,2014-11-02 00:00,1
2,2014-11-02 01:00,1
2,2014-11-02 01:00,0

@ischwabacher
Copy link
Contributor

What I'm proposing is that (if you are in America/Chicago) you will have this:

Item,Time,Offset
1,2014-11-02 00:00,-05:00:00
1,2014-11-02 01:00,-05:00:00
1,2014-11-02 01:00,-06:00:00
2,2014-11-02 00:00,-05:00:00
2,2014-11-02 01:00,-05:00:00
2,2014-11-02 01:00,-06:00:00

Which is just as good.

@rockg
Copy link
Contributor Author

rockg commented Aug 7, 2014

Yes, I know that's your first point...I was more addressing your subsequent points.

I agree that it's just as good and contains the same information. My data is structured with bools and hence I want to make that work (it was only a couple lines of code to make this work). I don't know offhand how to make the offsets in the underlying tz_localize code. I don't see any harm in making both work if possible.

@mtrbean
Copy link
Contributor

mtrbean commented Aug 7, 2014

To @ischwabacher's point, I'm not against adding is_dst= to tz_localize(). But I also agree that a better solution to @rockg's problem is to store timestamps in UTC and timezone separately in two columns in a database table.

I always store timestamp in UTC in database to avoid ambiguity, unless the database has explicit support for timestamp with time zone like postgresql.

@rockg
Copy link
Contributor Author

rockg commented Aug 7, 2014

While storing data in UTC is unambiguous, it makes it very difficult when people are working with data in the database. With my work, having the data stored in the actual time zone has saved many errors with users not having to manually do the dst offset in their heads...it's so much easier and not worth the hassle for only 2 out of 8760 hours.

@ischwabacher
Copy link
Contributor

Yes, I know that's your first point...I was more addressing your subsequent points.

...sorry, I can't count.

My second paragraph was talking about resolving one individual time, not inferring DST from the whole vector. So 2013-11-3 01:00:00 in time zone America/Chicago could resolve to 2013-11-3 01:00:00-0500 (DST) or 2013-11-3 01:00:00-0600 (not DST), and the DST time comes first chronologically. So letting is_dst=True mean "choose the first time that this could be interpreted as" gives the correct answer. I'm not talking about "let the first occurrence of this time in my vector be interpreted one way and the next the other".

But not all time zone discontinuities are DST boundaries (and not all DST boundaries are time zone discontinuities)! Here's what pytz has to say on the subject:

Although localize() handles many cases, it is still not possible to handle all. In cases where countries change their timezone definitions, cases like the end-of-daylight-saving-time occur with no way of resolving the ambiguity. For example, in 1915 Warsaw switched from Warsaw time to Central European time. So at the stroke of midnight on August 5th 1915 the clocks were wound back 24 minutes creating an ambiguous time period that cannot be specified without referring to the timezone abbreviation or the actual UTC offset. In this case midnight happened twice, neither time during a daylight saving time period:

>>> warsaw = pytz.timezone('Europe/Warsaw')
>>> loc_dt1 = warsaw.localize(datetime(1915, 8, 4, 23, 59, 59), is_dst=False)
>>> loc_dt1.strftime(fmt)
'1915-08-04 23:59:59 WMT+0124'
>>> loc_dt2 = warsaw.localize(datetime(1915, 8, 5, 00, 00, 00), is_dst=False)
>>> loc_dt2.strftime(fmt)
'1915-08-05 00:00:00 CET+0100'
>>> str(loc_dt2 - loc_dt1)
'0:24:01'

Clearly, we need a way to handle this case; also clearly, the mechanism should be the same as the handling of the DST case because the same basic thing is going on: the time zone offset is jumping down as we cross a boundary. My most preferred solution to this problem is to make it easy and natural to use local time + offset to specify time, but this will be a pain in the butt sometimes (and you don't want to make your users remember that it's 24 minutes, especially since pytz rounds to the nearest minute and the correct value in Zoneinfo is probably different). My next preferred solution is to have a keyword that works the way I described is_dst working in my first comment (i.e., is_dst=True here will yield WMT, while is_dst=False will yield CET, regardless of the fact that neither side was actually DST), but has a different name. My next preferred solution is calling it is_dst, and my least preferred solution is not handling this at all.

EDIT: Just noticed that the proposed commit does essentially what I suggested. I still prefer something like ambiguous=['raise', 'NaT', ndarray(dtype=bool), 'infer'], but an is_dst that properly handles non-DST discontinuities is good enough.

@rockg
Copy link
Contributor Author

rockg commented Aug 9, 2014

It has frustrated me about tz_localize on the timestamp not having a is_dst like localize in pytz. Now it will have a flag and we can resolve individual times (I thought about putting it on the Timestamp constructor, but that was much more complicated than tz_localize).

If anybody is actually dealing with 1915 hourly Warsaw times I will be very interested. I think we have a decent solution to cover 99.99% of the cases people use this for. I do like the ambiguous better than is_dst (I was using is_dst because that's what pytz calls it and most people would be familiar with that).

@ischwabacher
Copy link
Contributor

Offsets sometimes change for geopolitical reasons. Europe/Simferopol isn't a good example because it was invaded from the East and in the spring, but I'm sure you can imagine other territories changing hands at inconvenient times of year.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants