@kevinjqliu kevinjqliu commented Aug 14, 2025

Rationale for this change

This PR fixes reading a pyarrow timestamp as the Iceberg timestamptz type. It mirrors the existing logic for handling pyarrow timestamp types here

Two changes were made to ArrowProjectionVisitor._cast_if_needed:

  1. Reorder the logic so that the timestamp case is handled first. Otherwise, it tries to promote() timestamp to timestamptz and fails.
  2. Allow casting when the pyarrow value's timezone is None. This is allowed because we gate on the target type having the "UTC" timezone. It mirrors the Java logic for reading with a default UTC timezone (1, 2).

Context

I ran into an interesting edge case while testing metadata virtualization between Delta and Iceberg.

Delta has both TIMESTAMP and TIMESTAMP_NTZ data types: TIMESTAMP has a timezone, while TIMESTAMP_NTZ does not.
Iceberg has timestamp and timestamptz: timestamp has no timezone, while timestamptz has one.

So Delta's TIMESTAMP -> Iceberg timestamptz and Delta's TIMESTAMP_NTZ -> Iceberg timestamp.
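
The correspondence above can be written down as a small lookup table. This is only an illustrative sketch: the dictionary and helper names are hypothetical and are not part of Delta or PyIceberg.

```python
# Hypothetical lookup table for the mapping described above.
DELTA_TO_ICEBERG_TIMESTAMP = {
    "TIMESTAMP": "timestamptz",    # has a timezone on both sides
    "TIMESTAMP_NTZ": "timestamp",  # has no timezone on both sides
}


def iceberg_timestamp_type(delta_type: str) -> str:
    """Return the Iceberg type name for a Delta timestamp type name."""
    try:
        return DELTA_TO_ICEBERG_TIMESTAMP[delta_type.upper()]
    except KeyError:
        raise ValueError(f"not a Delta timestamp type: {delta_type!r}") from None
```

Note how the names cross over: the unsuffixed TIMESTAMP on the Delta side maps to the suffixed timestamptz on the Iceberg side.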

Regardless of Delta or Iceberg, the Parquet file stores the timestamp without timezone information.

So I ended up with a Parquet file with a timestamp column and an Iceberg table with a timestamptz column, and PyIceberg cannot read this table.
It's hard to recreate the scenario, but I did trace it to the _to_requested_schema function. I added a unit test for this case.

The issue is that ArrowProjectionVisitor._cast_if_needed tries to promote timestamp to timestamptz, which is not a valid promotion:

E           pyiceberg.exceptions.ResolveError: Cannot promote timestamp to timestamptz

if field.field_type != file_field.field_type:
    target_schema = schema_to_pyarrow(
        promote(file_field.field_type, field.field_type), include_field_ids=self._include_field_ids
    )

The elif branch below it can handle this case:

elif field.field_type == TimestamptzType():
    if (
        pa.types.is_timestamp(target_type)
        and target_type.tz == "UTC"
        and pa.types.is_timestamp(values.type)
        and values.type.tz in UTC_ALIASES
    ):

So maybe we just need to switch the order of execution...
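
To make the ordering problem concrete, here is a toy model of the two branch orders. It is a sketch under loud assumptions: types are plain strings, `ResolveError` and `promote` are simplified stand-ins for the PyIceberg ones, and the timezone checks inside the timestamptz branch are left out.

```python
class ResolveError(Exception):
    """Stand-in for pyiceberg.exceptions.ResolveError in this sketch."""


def promote(file_type: str, requested_type: str) -> str:
    # Toy promotion table: only two widening promotions are modelled.
    allowed = {("int", "long"), ("float", "double")}
    if (file_type, requested_type) in allowed:
        return requested_type
    raise ResolveError(f"Cannot promote {file_type} to {requested_type}")


def plan_promote_first(file_type: str, requested_type: str) -> str:
    # Original order: the generic type-mismatch branch runs first,
    # so timestamp -> timestamptz never reaches the elif.
    if requested_type != file_type:
        return promote(file_type, requested_type)
    elif requested_type == "timestamptz":
        return "cast to timestamptz (UTC)"
    return "no cast"


def plan_timestamptz_first(file_type: str, requested_type: str) -> str:
    # Reordered: the timestamptz branch is tried before promotion.
    if requested_type == "timestamptz" and file_type in ("timestamp", "timestamptz"):
        return "cast to timestamptz (UTC)"
    elif requested_type != file_type:
        return promote(file_type, requested_type)
    return "no cast"
```

In this model, `plan_promote_first("timestamp", "timestamptz")` raises the same "Cannot promote timestamp to timestamptz" error, while `plan_timestamptz_first` reaches the cast branch and still promotes ordinary cases such as int to long.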

This was also an interesting read: https://arrow.apache.org/docs/python/timestamps.html

Are these changes tested?

Are there any user-facing changes?

@kevinjqliu kevinjqliu requested a review from Fokko August 14, 2025 23:15
@kevinjqliu kevinjqliu added this to the PyIceberg 0.10.0 milestone Aug 15, 2025

Fokko commented Aug 18, 2025

Fun with timestamps, it is a gift that keeps on giving! :D

This is the reference page that I used most of the time: https://cwiki.apache.org/confluence/display/Hive/Different+TIMESTAMP+types

The thing is that, as you mentioned, we don't store the timezone in Iceberg, and according to the spec the timezone should not be set. But in Arrow we allow reading UTC timezones; for example, when you import an existing table, you don't want to rewrite all the data.

This is a bit of a slippery slope, though. I think supporting UTC should be safe (that was the intent of #910), but anything other than UTC is not supported (and probably should be rewritten). Do we know how Java handles this?

@kevinjqliu
Contributor Author

Thanks for the link @Fokko, that was very helpful.

> I think supporting UTC should be safe (that was the intent of #910), but anything other than UTC is not supported (and probably should be rewritten).

I agree. I included a fix in this PR. It mirrors the logic for reading timestamp types here

> Do we know how Java handles this?

I think it's this: timestamptz is read as a long and converted to an OffsetDateTime with the default UTC timezone.
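
As a rough Python analogue of that behavior (not the Java implementation itself), a stored long can be interpreted as an instant in UTC. The sketch assumes the long holds microseconds since the Unix epoch; the helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)


def timestamptz_from_micros(micros: int) -> datetime:
    """Interpret a stored long (assumed microseconds since epoch) as a UTC instant."""
    # Integer timedelta arithmetic avoids float rounding on large values.
    return EPOCH_UTC + timedelta(microseconds=micros)


print(timestamptz_from_micros(0).isoformat())  # 1970-01-01T00:00:00+00:00
```

The stored value carries no timezone of its own; UTC is supplied by the reader, which is why a file value with no timezone can be treated as UTC.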

@@ -1802,13 +1795,22 @@ def _cast_if_needed(self, field: NestedField, values: pa.Array) -> pa.Array:
     pa.types.is_timestamp(target_type)
     and target_type.tz == "UTC"
     and pa.types.is_timestamp(values.type)
-    and values.type.tz in UTC_ALIASES
+    and (values.type.tz in UTC_ALIASES or values.type.tz is None)
Contributor Author

Also allow the cast when type.tz is None; this mirrors the logic here:

if primitive.tz in UTC_ALIASES:
    return TimestamptzType()
elif primitive.tz is None:
    return TimestampType()
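
The timezone part of the relaxed condition can be modelled as a plain predicate. This is a sketch, not the PyIceberg code: the `pa.types.is_timestamp` checks are omitted, the function name is hypothetical, and the alias set below is illustrative (the real UTC_ALIASES may contain different entries).

```python
from typing import Optional

# Illustrative alias set; an assumption, not copied from pyiceberg.
UTC_ALIASES = {"UTC", "+00:00", "Etc/UTC", "Z"}


def can_cast_to_timestamptz(target_tz: Optional[str], value_tz: Optional[str]) -> bool:
    """Model of the tz gate: the target must be UTC, the value UTC-like or tz-less."""
    return target_tz == "UTC" and (value_tz in UTC_ALIASES or value_tz is None)
```

Under this model a value with no timezone or with a UTC alias passes, while a value tagged with any other zone (for example "America/New_York") is still rejected, because the gate only ever admits UTC.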

@kevinjqliu kevinjqliu changed the title from "add unit test for projecting timestamp to timestamptz" to "fix: allow reading pyarrow timestamp as iceberg timestamptz" Aug 19, 2025
@kevinjqliu
Contributor Author

@Fokko please take another look when you get a chance.
Also cc @sungwy since you added #910

@Fokko Fokko left a comment

Since we only allow UTC to be converted, I think we're safe adding this 👍

@kevinjqliu kevinjqliu merged commit bdf19ab into apache:main Aug 19, 2025
10 checks passed
@kevinjqliu kevinjqliu deleted the kevinjqliu/timestamp-to-timestamptz branch August 19, 2025 11:53