
ENH: allow requiring TZ-aware datetimes in to_datetime #54995

Closed
1 of 3 tasks
lafrech opened this issue Sep 4, 2023 · 2 comments
Labels
Enhancement · Needs Triage (issue that has not been reviewed by a pandas team member)

lafrech commented Sep 4, 2023

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I get timeseries data through an API and I use Pandas to parse the CSV input into a dataframe.

I wouldn't mind users sending mixed timezones as long as the datetimes are TZ-aware. However, I don't want TZ-naive datetimes to be silently interpreted as UTC, because in practice the most likely case, which I have already faced, is users sending local-time data as naive datetimes.

I wish there was an option in to_datetime to raise on naive datetimes.

Currently, I'm doing something like this:

import datetime as dt

import pandas as pd


class TimeseriesDataIODatetimeError(Exception):
    """Application-specific error raised on invalid timestamps"""


def to_utc_index(index):
    """Create UTC datetime index from timezone aware datetime list"""

    # First parse without forcing UTC
    try:
        # Cast to series so that output is a series with an "apply" method we use during next step
        index = pd.to_datetime(pd.Series(index), format="ISO8601")
    except (
        ValueError,
        pd.errors.OutOfBoundsDatetime,
        pd._libs.tslibs.parsing.DateParseError,
    ) as exc:
        raise TimeseriesDataIODatetimeError("Invalid timestamp") from exc

    # Then convert to UTC manually
    # We can't just use tz_convert because it would also silently treat naive datetimes as UTC
    try:
        # astimezone raises TypeError on a TZ-naive Timestamp
        index = index.apply(lambda x: x.astimezone(dt.timezone.utc))
    except TypeError as exc:
        raise TimeseriesDataIODatetimeError(
            "Invalid or TZ-naive timestamp"
        ) from exc

    return pd.DatetimeIndex(index, name="timestamp")


# data_df is the dataframe built from the CSV input
data_df.index = to_utc_index(data_df.index)

This might not be efficient, but it's been working for a while.

Now to_datetime complains about mixed timezones and advocates the use of utc=True.

The problem with using utc=True is that the conversion treats naive datetimes as UTC. After the conversion, it is too late to check timezones. Before the conversion, the data is still a string, so checking the timezone requires either working on the strings (using a regex, or performing format-dependent checks) or instantiating datetimes and checking the TZ on them, which is wasteful performance-wise and a step I'd be happy to avoid.
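A minimal illustration of why the check cannot happen after the conversion: a naive string and an explicitly UTC string yield identical results once utc=True is applied. The sample timestamps are made up for the example.

```python
import pandas as pd

# A TZ-naive timestamp (possibly local time) and an explicitly UTC one.
naive = pd.to_datetime(["2023-09-04T12:00:00"], utc=True)
aware = pd.to_datetime(["2023-09-04T12:00:00+00:00"], utc=True)

# Both results are UTC-aware and equal: nothing in the output records
# that the first input carried no timezone information.
print(naive.tz, aware.tz)
print(naive.equals(aware))
```

If the naive value was really local time, the resulting index is silently wrong by the local UTC offset.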

So currently, I'm left with the alternative of keeping current code and silencing the warning, which is not satisfying.

Having an option to raise on naive datetimes would help me get rid of most of the code I pasted above.

I didn't inspect the code to evaluate whether this could be achieved in to_datetime more efficiently than the check I'm performing on each value, but even if it were no more efficient, not having to do it in user code would already be an improvement.

Feature Description

This could be achieved by adding a raise_if_naive (better names welcome) argument.

Or perhaps an awareness argument (either None, "naive" or "aware") that would let the user specify what is expected in their input, None to disable the feature. This would allow users to accept only naive datetimes, which they may want to combine with utc=False to get a naive index.
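To make the proposed semantics concrete, here is a rough user-side sketch of what an awareness option could mean. Everything in it is an assumption for illustration: to_datetime_checked and its regex are hypothetical and not part of pandas, and the string check only covers ISO 8601 strings with a colon-separated time, so it is one of the format-dependent workarounds rather than a proper solution.

```python
import pandas as pd

# Hypothetical pattern: a time (HH:MM[:SS[.fff]]) directly followed by a
# trailing "Z" or numeric UTC offset. Requiring the time part avoids
# mistaking the "-04" of a date-only string for an offset.
_TIME_WITH_OFFSET = r"\d{2}:\d{2}(?::\d{2}(?:\.\d+)?)?(?:[Zz]|[+-]\d{2}(?::?\d{2})?)$"


def to_datetime_checked(values, awareness=None, **kwargs):
    """Hypothetical wrapper: check awareness on strings, then parse.

    awareness is None (no check), "naive" or "aware".
    Extra keyword arguments are passed through to pd.to_datetime.
    """
    if awareness not in (None, "naive", "aware"):
        raise ValueError("awareness must be None, 'naive' or 'aware'")

    if awareness is not None:
        strings = pd.Series(list(values), dtype="object").astype(str).str.strip()
        has_offset = strings.str.contains(_TIME_WITH_OFFSET, regex=True)
        if awareness == "aware" and not has_offset.all():
            raise ValueError("TZ-naive timestamp found")
        if awareness == "naive" and has_offset.any():
            raise ValueError("TZ-aware timestamp found")

    return pd.to_datetime(values, **kwargs)


# Mixed offsets are accepted and converted to UTC...
ok = to_datetime_checked(
    ["2023-09-04T12:00:00+02:00", "2023-09-04T12:00:00-05:00"],
    awareness="aware",
    utc=True,
)

# ...while a naive value raises instead of being treated as UTC.
try:
    to_datetime_checked(["2023-09-04T12:00:00"], awareness="aware", utc=True)
except ValueError as exc:
    print(exc)
```

A built-in option would not need the regex at all, since the parser already knows whether each value carried an offset.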

Alternative Solutions

I could use a validation layer in front of pandas. In fact I used to validate input data with marshmallow but it is not efficient, so I dropped that part and now I just catch errors in pandas when building the dataframe.

Additional Context

Mixed TZ warning was added in #50887.

mroeschke (Member) commented

Thanks for the request, but it appears this hasn't gotten traction in a while, so closing.

lafrech (Author) commented Aug 24, 2024

It might be a bit specific, and I understand a lot of users don't care (users working on their own data, not with user-supplied data). But I still think this is relevant.

I'm sorry I didn't get the time to look into an implementation myself. Before anyone does, I think it would be nice to get feedback on the idea.

Am I missing any other way to achieve what I'm trying to?
