Create Iceberg Table from pyarrow Schema with no IDs #278

Closed
@sungwy

Description

Feature Request / Improvement

I see three ways a user would want to create an Iceberg table:

  1. Completely manual - by specifying the schema, field by field
  2. By inferring the schema from an existing strongly-typed file or pyarrow table
  3. By copying the schema of an existing Iceberg table (migration)

The create_table function currently takes a pyiceberg.Schema as input. The existing visitors support patterns (1) and (3), but not (2).

This is because the creation of a pyiceberg.Schema is only supported in the following two ways:

  1. From a pyarrow schema with valid field-id metadata
  2. From a NameMapping that has field IDs. Currently, the only way to create a NameMapping is to construct it field ID by field ID, or to use a utility function on an existing Iceberg Schema.

Therefore, we need to update an existing visitor, or create a new one, to support generating a pyiceberg.Schema from a pyarrow Schema with no IDs.

On #219 the following approaches have been discussed so far:

  1. Update _ConvertToIceberg to create a pyiceberg.Schema from a pyarrow schema by assigning "-1" as the field_id of every field, then use _SetFreshIDs to assign ordered fresh IDs. Unfortunately, this idea does not work, because _SetFreshIDs needs distinct IDs to track each column and assign it a new one.
  2. Create a new visitor, _CreateMappingFromPyArrowSchema, that creates a name mapping from a PyArrow schema and assigns a fresh ID to any field that does not have one. This differs from the existing _CreateMapping visitor, which is a pyiceberg Schema visitor.
  3. Use a separate visitor _ConvertToIcebergWithFreshIds which assigns fresh IDs based on the order of the fields' appearance in the pyarrow schema.
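
To make the problem with approach 1 concrete, here is a minimal sketch. It is not pyiceberg code: it only assumes, for illustration, that a fresh-ID pass tracks columns through a dictionary keyed by the old field ID. The helper `reassign_ids` and the `(old_id, name)` tuple layout are hypothetical.

```python
from itertools import count


def reassign_ids(fields):
    """Map each old field ID to a fresh sequential ID (hypothetical helper)."""
    next_id = count(1)
    old_to_new = {}
    for old_id, _name in fields:
        # The mapping is keyed by the old ID, so two fields that share
        # an old ID overwrite each other's entry.
        old_to_new[old_id] = next(next_id)
    return old_to_new


# Distinct old IDs: every column can be tracked individually.
with_ids = [(10, "id"), (20, "name"), (30, "ts")]
distinct = reassign_ids(with_ids)

# Placeholder -1 on every field: the mapping collapses to a single entry.
without_ids = [(-1, "id"), (-1, "name"), (-1, "ts")]
collapsed = reassign_ids(without_ids)

print(distinct)
print(collapsed)
```

With the placeholder IDs, three columns end up sharing one mapping entry, so they can no longer be told apart.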

When weighing ideas to reduce code duplication in the new visitor, we need to keep in mind that assigning fresh IDs works best in pre-order traversal, which is how _SetFreshIDs works now. All the existing schema visitors discussed above that construct a NameMapping or a pyiceberg Schema run in post-order traversal.
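
The difference between the two traversal orders can be sketched on a toy nested schema. This is an illustration only: the `(name, children)` layout and both helper functions are hypothetical, and the exact numbering that _SetFreshIDs produces may differ. The point is just that pre-order gives a parent its ID before its children, while post-order gives it after.

```python
from itertools import count

# Hypothetical ID-less schema: each field is (name, [child fields]).
schema = [
    ("id", []),
    ("location", [("lat", []), ("lon", [])]),
    ("ts", []),
]


def assign_pre_order(fields, ids, prefix=""):
    """Give each field an ID before visiting its children."""
    out = {}
    for name, children in fields:
        path = prefix + name
        out[path] = next(ids)
        out.update(assign_pre_order(children, ids, path + "."))
    return out


def assign_post_order(fields, ids, prefix=""):
    """Give each field an ID only after its children have theirs."""
    out = {}
    for name, children in fields:
        path = prefix + name
        out.update(assign_post_order(children, ids, path + "."))
        out[path] = next(ids)
    return out


pre = assign_pre_order(schema, count(1))
post = assign_post_order(schema, count(1))

print(pre["location"], pre["location.lat"], pre["location.lon"])
print(post["location"], post["location.lat"], post["location.lon"])
```

In the pre-order result the struct `location` is numbered before `lat` and `lon`; in the post-order result it is numbered after them, which is why a post-order visitor does not naturally yield IDs in order of appearance.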
