Skip to content

Datetime parsing (PDEP-4): allow mixture of ISO formatted strings? #50411

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jorisvandenbossche opened this issue Dec 23, 2022 · 32 comments · Fixed by #50939
Closed

Datetime parsing (PDEP-4): allow mixture of ISO formatted strings? #50411

jorisvandenbossche opened this issue Dec 23, 2022 · 32 comments · Fixed by #50939
Labels
Datetime Datetime data dtype
Milestone

Comments

@jorisvandenbossche
Copy link
Member

In pandas < 2, parsing datetime strings that are all strictly ISO formatted strings (so there is no ambiguity) but have slightly different formatting because of a different resolution, works fine:

>>> pd.to_datetime(["2022-01-01T09:00:00", "2022-01-02T09:00:00.123"])
DatetimeIndex(['2022-01-01 09:00:00', '2022-01-02 09:00:00.123000'], dtype='datetime64[ns]', freq=None)

With the changes related to datetime parsing (PDEP-4), this will now infer the format from the first, and then parsing the second fails:

>>> pd.to_datetime(["2022-01-01T09:00:00", "2022-01-02T09:00:00.123"])
...
ValueError: time data "2022-01-02T09:00:00.123" at position 1 doesn't match format "%Y-%m-%dT%H:%M:%S"

This is of course expected and can be explained with the changes in behaviour. But I do wonder if we want to allow some way to still parse such data in a fast way.

For this specific case, you can't specify a format since it is inconsistent. And before, this was actually fast because it took a fast ISO parsing path.
With the current dev version of pandas, I don't see any way to parse this in a performant way?

(this also came up in pyarrow's CI, it was an easy fix on our side to update the strings in the test, but still wanted to raise this as a specific case to consider, since with real world data you can't necessarily easily update your data)

cc @MarcoGorelli

@jorisvandenbossche jorisvandenbossche added the Datetime Datetime data dtype label Dec 23, 2022
@jorisvandenbossche jorisvandenbossche added this to the 2.0 milestone Dec 23, 2022
@jorisvandenbossche
Copy link
Member Author

This is also a bit inconsistent with the case when we cannot infer the format. In such a case, we actually still do parse each string individually (just raise a warning about doing that). For example from the original PR:

>>> pd.to_datetime(["20010101 123456", "201201"])
  UserWarning: Could not infer format, so each element will be parsed individually by `dateutil`. 
  To ensure parsing is consistent and as-expected, please specify a format.
DatetimeIndex(['2001-01-01 12:34:56', '2001-12-20 00:00:00'], dtype='datetime64[ns]', freq=None)

@MarcoGorelli
Copy link
Member

would format='%Y-%m-%dT%H:%M:%S', exact=False work for you?

@MarcoGorelli
Copy link
Member

that works for the example you posted, and is performant

In [8]: pd.to_datetime(["2022-01-01T09:00:00", "2022-01-02T09:00:00.123"], exact=False)
Out[8]: DatetimeIndex(['2022-01-01 09:00:00', '2022-01-02 09:00:00.123000'], dtype='datetime64[ns]', freq=None)

@jorisvandenbossche
Copy link
Member Author

That only worked because of the exact ordering that I used. Switching the values:

In [15]: pd.to_datetime(["2022-01-01T09:00:00.123", "2022-01-02T09:00:00"], exact=False)
...
ValueError: time data "2022-01-02T09:00:00" at position 1 doesn't match format "%Y-%m-%dT%H:%M:%S.%f"

In that case, you could still provide the shorter format manually:

In [16]: pd.to_datetime(["2022-01-01T09:00:00.123", "2022-01-02T09:00:00"], format="%Y-%m-%dT%H:%M:%S", exact=False)
Out[16]: DatetimeIndex(['2022-01-01 09:00:00.123000', '2022-01-02 09:00:00'], dtype='datetime64[ns]', freq=None)

But, I am wondering: isn't that another bug? I would have expected it would ignore the sub-second values (and so wanted to say that this is not a full replacement ..). But apparently we still do parse the seconds, while with exact=False I expect (based on the docstring explanation" that it finds the format in the full string, ignoring anything else.
But maybe I am just misunderstanding the purpose of exact=False (never used it myself in practice before).

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Dec 23, 2022

I don't think it's a bug: neither stdlib nor pyarrow would allow that:

In [9]: pa.compute.strptime('2022-01-01T09:00:00.123', '%Y-%m-%dT%H:%M:%S', unit='ns')
...
ArrowInvalid: Failed to parse string: '2022-01-01T09:00:00.123' as a scalar of type timestamp[ns]

In [10]: dt.datetime.strptime('2022-01-01T09:00:00.123', '%Y-%m-%dT%H:%M:%S')
...
ValueError: unconverted data remains: .123

In that case, you could still provide the shorter format manually:

yup, sounds fine to me

@jorisvandenbossche
Copy link
Member Author

I opened #50412 for the exact=False inconsistencies.

I don't think it's a bug: neither stdlib nor pyarrow would allow that:

To be clear, I don't say that those two examples should parse that. But so pandas does if you pass exact=False. But let's continue that part of the discussion in #50412

In that case, you could still provide the shorter format manually:

yup, sounds fine to me

Depending on the outcome of #50412, that's not actually a solution

@MarcoGorelli
Copy link
Member

Ah I'd misunderstood which part you were considering a bug - #50412 makes it clear, thanks!

OK I'll get back to you on this, will need to think about it carefully

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Dec 23, 2022

OK, it should work to do

DatetimeIndex(dates)

I don't think we need to add extra parameters then.

This seems like a rare usecase anyway - as a comparison, polars would error too however, polars does allow it (EDIT: my bad, I'd used pl.Date instead of pl.Datetime when writing this comment the first time):

In [6]: dataset = pl.DataFrame({"date": ["2020-01-02", "2020-01-03", "2020-01-04 00:00:00"], "index": [1, 2, 3]})

In [7]: dataset.with_column(pl.col("date").str.strptime(pl.Datetime))
Out[7]: 
shape: (3, 2)
┌─────────────────────┬───────┐
│ dateindex │
│ ------   │
│ datetime[μs]        ┆ i64   │
╞═════════════════════╪═══════╡
│ 2020-01-02 00:00:001     │
│ 2020-01-03 00:00:002     │
│ 2020-01-04 00:00:003     │
└─────────────────────┴───────┘

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Dec 23, 2022

We OK to close this one then?

EDIT: I'll add better docs for this, but I don't think it warrants an extra parameter

@jorisvandenbossche
Copy link
Member Author

OK, it should work to do DatetimeIndex(dates)

I would find it surprising that this works for DatetimeIndex(..) and not pd.to_datetime.
But that's maybe a more general question (for another issue): what is the idea to do about DatetimeIndex? PDEP-4 has been focused on making to_datetime strict, but should we apply everything discussed for to_datetime to DatetimeIndex as well (when passing strings)? Or do we want to explicitly keep DatetimeIndex more flexible?

@MarcoGorelli
Copy link
Member

Long-term (3.0?), here's what I'd really like to get to:

For 2.0, I'd suggest noting in the to_datetime doc (and in the user guide) that if someone has mixed-ISO8601 strings, then they should pass them directly to DatetimeIndex

An alternative would be add an extra argument to to_datetime, but I'm pretty -1 on that:

  • to_datetime is already quite complex
  • this opens the door to allowing other formats to be less strict

@jbrockmendel @mroeschke any thoughts on this?

@mroeschke
Copy link
Member

Semantically In my mind, always thought to_datetime should be the "more-generic" date parsing option compared to DatetimeIndex since to_datetime contains the myriad of parsing options, so I find it weird that DatetimeIndex would contain date-parsing functionality that to_datetime doesn't have.

@MarcoGorelli
Copy link
Member

Sure, thanks for commenting

How about if the format argument in to_datetime can take 'ISO8601', in which case it'll parse any ISO8601 string (without regard to exact)?

So in the above example:

pd.to_datetime(["2022-01-01T09:00:00", "2022-01-02T09:00:00.123"], format='ISO8601')

@mroeschke
Copy link
Member

Yeah that would satisfy my aspiration for to_datetime.

(FWIW my bold opinion is I think Timestamp & DatetimetimeIndex shouldn't accept strings or do any parsing and to_datetime should be meant for that)

@MarcoGorelli
Copy link
Member

OK thanks

@jorisvandenbossche if you have no objections I'll make a PR (either this or next week, but definitely before 2.0)

@jbrockmendel
Copy link
Member

Lots going on here.

Without any non-shared keywords passed, I would expect DatetimeIndex and to_datetime to accept the same things and behave the same. That is not necessarily a good thing zen-of-python-wise. If that were to be changed, I'd want to make one strictly more strict than the other, not mix-and-match (i.e. avoid "DatetimeIndex accepts this but not that, to_datetime the other way around")

I'd be on board with getting keywords out of the DatetimeIndex constructor and telling users to use to_datetime instead (also analogous for TDI/PI), but that might necessitate adding keywords to to_datetime which isn't great.

(Deprecating string parsing in Timestamp is indeed bold, not sure what i think about that)

Also:

I'm finding more and more places where the dateutil parser is not strict enough. I'm leaning towards moving towards just supporting a small(ish) collection of formats and doing the parsing internally.

@MarcoGorelli
Copy link
Member

Yeah, totally - I've love to get to the point when dateutil is only used for guessing formats, and not for parsing, and will try to get things started towards that after 2.0

But for 2.0, what do you think we should do about mixed ISO8601 formats? OK with just allowing format='ISO8601'?

I don't think allowing mixed formats in format=None is an option, as the format might be guessed to be non-ISO8601, and then there'd be no mechanism for allowing mixed formats (at least, not without removing dateutil)

@jbrockmendel
Copy link
Member

But for 2.0, what do you think we should do about mixed ISO8601 formats? OK with just allowing format='ISO8601'?

I haven't paid much attention. Is the OP still the best summary of the topic? Since we're low on time, what's the least invasive option?

@MarcoGorelli
Copy link
Member

Summary: there's no performant way to parse mixed ISO8601 strings with to_datetime, e.g. pd.to_datetime(['2020-01-01', '2020-01-01 00:00:00'])

Least invasive option I can think of: allow format='ISO8601', and then treat that like require_iso8601 used to

@jreback
Copy link
Contributor

jreback commented Jan 20, 2023

similar to @jbrockmendel comment above, I would be +1 on make DatetimeIndex as strict as possible and leave the parsing to to_datetime, e.g. limiting to i8 and datetime64 inputs, ideally removing any string parsing.

@MarcoGorelli
Copy link
Member

Sure, sounds good (maybe for 3.0)

OK to support format='ISO8601' in to_datetime and let it accept anything ISO8601?

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Jan 20, 2023

A few reactions:

  • Lets discuss the exact parsing capabilities we want to see long-term in general for DatetimeIndex (should it match to_datetime, only ISO formatted strings, or no strings at all? ..) in a separate issue
  • For the original issue here, I would certainly not want the (temporary) answer to be: "if you want more flexible parsing, use DatetimeIndex instead of to_datetime" (as suggested in a commen above). I think we all agree that long term DatetimeIndex at least shouldn't have more parsing features as to_datetime, so then we should also not point users to DatetimeIndex on the short-term for that.
  • I don't have a strong opinion on format="ISO8601". If we want to support this use case, I think that is a nice solution for specifying this.
    At the same time I don't know how important this use case is (I only noticed it in the pyarrow tests where we wrote those strings in longer/shorter ISO form just for convenience (to not pad everything with all zeros). I just checked, and the pyarrow csv reader actually has such an option to say "ISO8601" in general, and that will parse a mix of ISO formatted strings.
  • Short term for 2.0, for DatetimeIndex, I think we should still decide if we want to apply the PDEP-4 changes in string parsing also to DatetimeIndex (I will open a separate issue for this)

@jorisvandenbossche
Copy link
Member Author

OK to support format='ISO8601' in to_datetime and let it accept anything ISO8601?

As I wrote above, that sounds good to me (we have something similar in pyarrow).
One question though: above you say this won't be performant? (#50411 (comment)) And can it be strict? (allow a mixture of different ISO formats, but still only ISO compliant formats)

@MarcoGorelli
Copy link
Member

This would be performant, sorry - what wouldn't be performant is the currently-suggested alternative (ser.apply(lambda x: to_datetime(x))`

And yes, it should be strict with respect to ISO8601 (so, the issue which PDEP4 addressed wouldn't be undone by this)

Cool, I'll make a PR then, thanks all for your inputs

@jorisvandenbossche
Copy link
Member Author

Sounds good, great!

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Jan 20, 2023

  • Short term for 2.0, for DatetimeIndex, I think we should still decide if we want to apply the PDEP-4 changes in string parsing also to DatetimeIndex (I will open a separate issue for this)

-> opened #50894 for this

@jbrockmendel
Copy link
Member

I may be late to the party but just encountered this when troubleshooting pyarrow xfails: if a user calls to_datetime without passing format and it fails with an exception that a format is not matched, that seems weird to me. It also isn't obvious what they are supposed to do instead. This is especially an important bc it can be reached via csv parsing when the user isn't calling to_datetime directly.

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Jan 20, 2023

What do you think should happen if format=None (the default) is passed?

read_csv wouldn't throw an error anyway because if it errors then it just doesn't convert, e.g.

In [4]: pd.read_csv(io.StringIO('date,value\n13-01-2000,3\n2000-01-01,4'), parse_dates=['date']).dtypes
<ipython-input-4-ec4c09ad3e02>:1: UserWarning: Parsing dates in %d-%m-%Y format when dayfirst=False (the default) was specified. Pass `dayfirst=True` or specify a format to silence this warning.
  pd.read_csv(io.StringIO('date,value\n13-01-2000,3\n2000-01-01,4'), parse_dates=['date']).dtypes
Out[4]: 
date     object
value     int64
dtype: object

@jbrockmendel
Copy link
Member

What do you think should happen if format=None (the default) is passed?

I'd expect it to attempt guessing a format for a fastpath, and fall back to go through array_to_datetime

read_csv wouldn't throw an error anyway because if it errors then it just doesn't convert, e.g.

Haven't looked at it too closely, but tests.extension.test_arrow.TestBaseParsing.test_EA_types has xfailed cased for pa.types.is_timestamp(pa_dtype) and pa_dtype.unit in ("us", "ns") (that point to #49767) that are fixed if I go into the to_datetime code and do

+    guessed = False
    if format is None:
        format = _guess_datetime_format_for_array(arg, dayfirst=dayfirst)
+        guessed = True

    if format is not None:
+        try:
            return _array_strptime_with_fallback(arg, name, utc, format, exact, errors)
+        except AppropriateException:
+            if not guessed:
+                raise

@MarcoGorelli
Copy link
Member

I'd expect it to attempt guessing a format for a fastpath, and fall back to go through array_to_datetime

Then we'd be back to ['12/01/2000', '13/01/2000'] being parsed as "1st December 2000, 13th January 2000", which would bring back the decade-long bug which PDEP4 fixed

@jbrockmendel
Copy link
Member

which would bring back the decade-long bug which PDEP4 fixed

Yep, thats a tough one. Was this breakage discussed-and-dismissed in the lead up to PDEP4?

In cases where PDEP4 breaks user code (e.g. the jan 13 case above) we should at the very least have an exception message that tells users how to fix it.

@MarcoGorelli
Copy link
Member

I'm OK with mixed Iso8601 parsing

It's mixed not-necessarily-iso8601 that I'm really keen to avoid

thskaggs pushed a commit to usda-ars-ussl/fluxpart that referenced this issue Jan 12, 2024
pandas.Index.get_loc deprecated since pandas 1.4.
Use pandas.Index.get_indexer instead.

since pandas 2, to_datetime will not read sequences of timestamps if
the formats are not identical, e.g. some timestamps have fractional
seconds and others do not (..., 12:12:48.9, 12:12:49, 12:12:49.1, ...).
In such codes, need to specify format="ISO8061" to handle such cases.
see: pandas-dev/pandas#50411
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants