Datetime parsing (PDEP-4): allow mixture of ISO formatted strings? #50411
This is also a bit inconsistent with the case when we cannot infer the format. In such a case, we actually still do parse each string individually (just raise a warning about doing that). For example from the original PR:
|
would |
that works for the example you posted, and is performant:
In [8]: pd.to_datetime(["2022-01-01T09:00:00", "2022-01-02T09:00:00.123"], exact=False)
Out[8]: DatetimeIndex(['2022-01-01 09:00:00', '2022-01-02 09:00:00.123000'], dtype='datetime64[ns]', freq=None) |
That only worked because of the exact ordering that I used. Switching the values:
In [15]: pd.to_datetime(["2022-01-01T09:00:00.123", "2022-01-02T09:00:00"], exact=False)
...
ValueError: time data "2022-01-02T09:00:00" at position 1 doesn't match format "%Y-%m-%dT%H:%M:%S.%f"
In that case, you could still provide the shorter format manually:
But, I am wondering: isn't that another bug? I would have expected it would ignore the sub-second values (and so wanted to say that this is not a full replacement ..). But apparently we still do parse the seconds, while with |
I don't think it's a bug: neither stdlib nor pyarrow would allow that:
In [9]: pa.compute.strptime('2022-01-01T09:00:00.123', '%Y-%m-%dT%H:%M:%S', unit='ns')
...
ArrowInvalid: Failed to parse string: '2022-01-01T09:00:00.123' as a scalar of type timestamp[ns]
In [10]: dt.datetime.strptime('2022-01-01T09:00:00.123', '%Y-%m-%dT%H:%M:%S')
...
ValueError: unconverted data remains: .123
yup, sounds fine to me |
I opened #50412 for the
To be clear, I'm not saying that those two examples should parse that. But so pandas does if you pass
Depending on the outcome of #50412, that's not actually a solution |
Ah I'd misunderstood which part you were considering a bug - #50412 makes it clear, thanks! OK I'll get back to you on this, will need to think about it carefully |
OK, it should work to do DatetimeIndex(dates). I don't think we need to add extra parameters then. This seems like a rare use case anyway -
In [6]: dataset = pl.DataFrame({"date": ["2020-01-02", "2020-01-03", "2020-01-04 00:00:00"], "index": [1, 2, 3]})
In [7]: dataset.with_column(pl.col("date").str.strptime(pl.Datetime))
Out[7]:
shape: (3, 2)
┌─────────────────────┬───────┐
│ date ┆ index │
│ --- ┆ --- │
│ datetime[μs] ┆ i64 │
╞═════════════════════╪═══════╡
│ 2020-01-02 00:00:00 ┆ 1 │
│ 2020-01-03 00:00:00 ┆ 2 │
│ 2020-01-04 00:00:00 ┆ 3 │
└─────────────────────┴───────┘ |
We OK to close this one then? EDIT: I'll add better docs for this, but I don't think it warrants an extra parameter |
I would find it surprising that this works for |
Long-term (3.0?), here's what I'd really like to get to:
For 2.0, I'd suggest noting in the An alternative would be to add an extra argument to
@jbrockmendel @mroeschke any thoughts on this? |
Semantically In my mind, always thought |
Sure, thanks for commenting How about if the So in the above example:
|
Yeah that would satisfy my aspiration for (FWIW my bold opinion is I think |
OK thanks @jorisvandenbossche if you have no objections I'll make a PR (either this or next week, but definitely before 2.0) |
Lots going on here.
Without any non-shared keywords passed, I would expect DatetimeIndex and to_datetime to accept the same things and behave the same. That is not necessarily a good thing zen-of-python-wise. If that were to be changed, I'd want to make one strictly more strict than the other, not mix-and-match (i.e. avoid "DatetimeIndex accepts this but not that, to_datetime the other way around").
I'd be on board with getting keywords out of the DatetimeIndex constructor and telling users to use to_datetime instead (also analogous for TDI/PI), but that might necessitate adding keywords to to_datetime, which isn't great. (Deprecating string parsing in Timestamp is indeed bold, not sure what I think about that.)
Also: I'm finding more and more places where the dateutil parser is not strict enough. I'm leaning towards just supporting a small(ish) collection of formats and doing the parsing internally. |
Yeah, totally - I'd love to get to the point when But for 2.0, what do you think we should do about mixed ISO8601 formats? OK with just allowing I don't think allowing mixed formats in |
I haven't paid much attention. Is the OP still the best summary of the topic? Since we're low on time, what's the least invasive option? |
Summary: there's no performant way to parse mixed ISO8601 strings with Least invasive option I can think of: allow |
Similar to @jbrockmendel's comment above, I would be +1 on making |
Sure, sounds good (maybe for 3.0) OK to support |
A few reactions:
|
As I wrote above, that sounds good to me (we have something similar in pyarrow). |
This would be performant, sorry - what wouldn't be performant is the currently-suggested alternative (`ser.apply(lambda x: to_datetime(x))`). And yes, it should be strict with respect to ISO8601 (so, the issue which PDEP4 addressed wouldn't be undone by this). Cool, I'll make a PR then, thanks all for your inputs |
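For reference, the slow alternative mentioned above can be sketched roughly as follows. This is only an illustration; the sample strings are taken from earlier in this thread and the variable names are made up. Each element goes through its own `to_datetime` call, so a format is worked out per string rather than once for the whole array:

```python
import pandas as pd

# Sample data: ISO8601 strings with different sub-second resolution.
ser = pd.Series(["2022-01-01T09:00:00.123", "2022-01-02T09:00:00"])

# One to_datetime call per element: this tolerates the mixed resolution,
# but it gives up the vectorised parsing path, which is why it is slow
# on large inputs.
parsed = ser.apply(lambda x: pd.to_datetime(x))
print(parsed)
```

This is fine for a handful of values, but the per-element Python overhead is what the discussion is trying to avoid.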
Sounds good, great! |
-> opened #50894 for this |
I may be late to the party but just encountered this when troubleshooting pyarrow xfails: if a user calls to_datetime without passing |
What do you think should happen if
|
I'd expect it to attempt guessing a format for a fastpath, and fall back to go through array_to_datetime
Haven't looked at it too closely, but tests.extension.test_arrow.TestBaseParsing.test_EA_types has xfailed cases for
|
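To make the "guess a format for a fast path, then fall back" idea concrete, here is a minimal standard-library sketch. It is not how pandas implements it: `CANDIDATE_FORMATS`, `guess_format` and `parse_with_fallback` are hypothetical names, and `datetime.fromisoformat` merely stands in for an element-wise fallback parser.

```python
from datetime import datetime

# Hypothetical candidate formats, for illustration only.
CANDIDATE_FORMATS = (
    "%Y-%m-%dT%H:%M:%S.%f",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d",
)


def guess_format(value):
    """Return the first candidate format that parses `value`, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            continue
        return fmt
    return None


def parse_with_fallback(values):
    """Try one guessed format for all values; fall back to per-element parsing."""
    fmt = guess_format(values[0]) if values else None
    if fmt is not None:
        try:
            # Fast path: every element must match the guessed format exactly.
            return [datetime.strptime(v, fmt) for v in values]
        except ValueError:
            pass  # At least one element did not match; use the fallback.
    # Fallback: parse each element independently as ISO 8601.
    return [datetime.fromisoformat(v) for v in values]


result = parse_with_fallback(["2022-01-01T09:00:00.123", "2022-01-02T09:00:00"])
print(result)
```

With the two strings above, the format guessed from the first element does not match the second, so the fallback path is taken. The trade-off is that a silent fallback hides exactly the kind of inconsistency that strict parsing is meant to surface.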
Then we'd be back to |
Yep, that's a tough one. Was this breakage discussed-and-dismissed in the lead up to PDEP4? In cases where PDEP4 breaks user code (e.g. the Jan 13 case above) we should at the very least have an exception message that tells users how to fix it. |
I'm OK with mixed ISO8601 parsing. It's mixed not-necessarily-ISO8601 that I'm really keen to avoid |
pandas.Index.get_loc deprecated since pandas 1.4. Use pandas.Index.get_indexer instead. Since pandas 2, to_datetime will not read sequences of timestamps if the formats are not identical, e.g. some timestamps have fractional seconds and others do not (..., 12:12:48.9, 12:12:49, 12:12:49.1, ...). In such code, specify format="ISO8601" to handle these cases. See: pandas-dev/pandas#50411
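A minimal sketch of the workaround described in the note above, assuming pandas 2.0 or later (where `to_datetime` accepts `format="ISO8601"`). The input strings are the ones used earlier in this thread:

```python
import pandas as pd

# ISO8601 strings whose resolution differs: only the first has
# fractional seconds.
mixed = ["2022-01-01T09:00:00.123", "2022-01-02T09:00:00"]

# Assumption: pandas >= 2.0. With format="ISO8601" each string may be
# any ISO8601 variant, so the format is not inferred from the first
# element alone.
result = pd.to_datetime(mixed, format="ISO8601")
print(result)
```

Strings that are not ISO8601 at all are still expected to raise under this option, so it does not bring back the lenient per-element guessing that PDEP-4 removed.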
In pandas < 2, parsing datetime strings that are all strictly ISO formatted strings (so there is no ambiguity) but have slightly different formatting because of a different resolution, works fine:
With the changes related to datetime parsing (PDEP-4), this will now infer the format from the first element, and then parsing the second element fails:
This is of course expected and can be explained with the changes in behaviour. But I do wonder if we want to allow some way to still parse such data in a fast way.
For this specific case, you can't specify a format since it is inconsistent. And before, this was actually fast because it took a fast ISO parsing path.
With the current dev version of pandas, I don't see any way to parse this in a performant way?
(this also came up in pyarrow's CI, it was an easy fix on our side to update the strings in the test, but still wanted to raise this as a specific case to consider, since with real world data you can't necessarily easily update your data)
cc @MarcoGorelli