-
Notifications
You must be signed in to change notification settings - Fork 356
fix: allow reading pyarrow timestamp as iceberg timestamptz #2333
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: allow reading pyarrow timestamp as iceberg timestamptz #2333
Conversation
Fun with timestamps, it is a gift that keeps on giving! :D This is the reference page that I used most of the time: https://cwiki.apache.org/confluence/display/Hive/Different+TIMESTAMP+types The thing is that, as you mentioned, we don't store the timestamp in Iceberg. And according to spec the timezone should not be set. But in Arrow we allow reading UTC timezones, for example, when you import an existing table then you don't want to rewrite all the data. This is a bit of a slippery slope though, I think supporting UTC should be safe (that was the intend of #910), but anything other than UTC is not supported (and probably should be rewritten). Do we know how Java handles this? |
Thanks for the link @Fokko, that was very helpful.
I agree. I included a fix in this PR. It mirrors the logic reading timestamps types here
I think its this. Timestamptz is read as a long and converted to |
@@ -1802,13 +1795,22 @@ def _cast_if_needed(self, field: NestedField, values: pa.Array) -> pa.Array: | |||
pa.types.is_timestamp(target_type) | |||
and target_type.tz == "UTC" | |||
and pa.types.is_timestamp(values.type) | |||
and values.type.tz in UTC_ALIASES | |||
and (values.type.tz in UTC_ALIASES or values.type.tz is None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also allow when type.tz is None, mirrors the logic here
iceberg-python/pyiceberg/io/pyarrow.py
Lines 1350 to 1353 in 8b43eb8
if primitive.tz in UTC_ALIASES: | |
return TimestamptzType() | |
elif primitive.tz is None: | |
return TimestampType() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we only allow UTC to be converted, I think we're safe adding this 👍
Rationale for this change
This PR fix reading pyarrow timestamp as Iceberg timestamptz type. It mirrors the pyarrow logic for dealing with pyarrow timestamp types here
Two changes were made to
ArrowProjectionVisitor._cast_if_needed
promote()
timestamp to timestamptz and fail.None
timezone. This is allowed because we gate on the target type has "UTC" timezone. It mirrors the java logic for reading with default UTC timezone (1, 2)Context
I ran into an interesting edge case while testing metadata virtualization between delta and iceberg.
Delta has both TIMESTAMP and TIMESTAMP_NTZ data types. TIMESTAMP has a timezone while TIMESTAMP_NTZ has no timezone.
While Iceberg has timestamp and timestamptz. timestamp has no timezone and timestamptz has a timezone.
So Delta's TIMESTAMP -> Iceberg timestamptz and Delta's TIMESTAMP_NTZ -> Iceberg timestamp.
Regardless of delta or iceberg, the parquet file stores timestamp without the timezone information
So I end up a parquet file with timestamp column, and an iceberg table with timestamptz column, and pyiceberg cannot read this table.
Its hard to recreate the scenario but i did trace it to the
_to_requested_schema
function. I added a unit test for this case.The issue is that
ArrowProjectionVisitor._cast_if_needed
will try to promotetimestamp
totimstamptz
and this is not a valid promotion.iceberg-python/pyiceberg/io/pyarrow.py
Lines 1779 to 1782 in 640c592
The
elif
case below that can handle this caseiceberg-python/pyiceberg/io/pyarrow.py
Lines 1800 to 1806 in 640c592
So maybe we just need to switch the order of execution...
This was also an interesting read.. https://arrow.apache.org/docs/python/timestamps.html
Are these changes tested?
Are there any user-facing changes?