Description
This is in part a question and open for discussion.
When building `TableMetadata` through the `TableMetadataBuilder`, all options for building "from scratch" force a reassignment of field IDs:

- Using `TableMetadataBuilder::new`
- Using `TableMetadataBuilder::from_table_creation`, as this is a wrapper over `TableMetadataBuilder::new` using the `TableCreation` struct.

I noticed that it would be possible to get any `TableMetadata` desired by constructing the struct directly, but all of its fields are restricted to `pub(crate)` scope. I suspect the reason for this is safety, i.e. ensuring that creation occurs through the builder pattern, where the relevant checks are performed on the call to `build()`.
Questions:

- Would it be problematic to lift the restriction on the `TableMetadata` fields to be `pub`[^1], or to allow the creation of `TableMetadata` without reassigning field IDs?
- If the above is not possible, is there an example of creating the Iceberg metadata file hierarchy in the correct way?
For extra context, we're currently constructing Iceberg metadata around pre-existing parquet files written by another system; there is no Iceberg catalog or prior metadata JSON. I noticed there is also a `StaticTable`; however, this requires either pre-existing JSON loaded via `FileIO` or an input `TableMetadata`, and the second option brings us back to the issue above.
This reassignment leads to a mismatch between what is shown in the table metadata JSON and the actual parquet file:
**parquet schema**

```
required group field_id=-1 arrow_schema {
  optional binary field_id=2 cpu (String);
  optional binary field_id=3 host1 (String);
  optional int64 field_id=1 time (Timestamp(isAdjustedToUTC=false, timeUnit=microseconds, is_from_converted_type=false, force_set_converted_type=false));
}
```
**iceberg metadata JSON schema snippet**

The reassigned IDs follow the order in which the fields appear within the parquet/arrow `Schema`, rather than the field IDs they were given:
```
"schemas": [
  {
    "schema-id": 0,
    "type": "struct",
    "fields": [
      {
        "id": 1,          <-- field_id=2 in parquet
        "name": "cpu",
        "required": false,
        "type": "string"
      },
      {
        "id": 2,          <-- field_id=3 in parquet
        "name": "host1",
        "required": false,
        "type": "string"
      },
      {
        "id": 3,          <-- field_id=1 in parquet
        "name": "time",
        "required": false,
        "type": "timestamp"
      }
    ]
  }
],
```
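To make the mismatch concrete, here is a minimal standalone sketch of the reassignment behavior described above. This is *not* the actual iceberg-rust code (the `Field` struct and `reassign_field_ids` function are invented for illustration); it only mirrors the observed effect: incoming parquet field IDs are ignored and fields are renumbered 1..N in order of appearance.

```rust
/// Hypothetical, simplified field representation (illustration only).
#[derive(Debug, Clone, PartialEq)]
struct Field {
    id: i32,
    name: String,
}

/// Reassign IDs by position, mirroring the builder behavior described
/// in this issue: the original `id` values are discarded and replaced
/// with 1-based indices in order of appearance.
fn reassign_field_ids(fields: &[Field]) -> Vec<Field> {
    fields
        .iter()
        .enumerate()
        .map(|(i, f)| Field {
            id: (i + 1) as i32,
            name: f.name.clone(),
        })
        .collect()
}

fn main() {
    // Parquet order from the snippet above: cpu (field_id=2),
    // host1 (field_id=3), time (field_id=1).
    let parquet_fields = vec![
        Field { id: 2, name: "cpu".into() },
        Field { id: 3, name: "host1".into() },
        Field { id: 1, name: "time".into() },
    ];

    let iceberg_fields = reassign_field_ids(&parquet_fields);

    // IDs now follow appearance order and diverge from the parquet
    // field_ids: cpu -> 1, host1 -> 2, time -> 3.
    for f in &iceberg_fields {
        println!("{} -> {}", f.name, f.id);
    }
}
```

Under this positional scheme, any parquet file whose field IDs are not already 1..N in appearance order will disagree with the generated metadata JSON, which is exactly the mismatch shown in the snippets above.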
This is also referenced by a question in the Iceberg Slack.
[^1]: Considering this conflicts with the native Java implementation, I would also suspect it is problematic to do in the Rust version.