
don't fully enforce matching schema #210


Merged

Conversation

nicklan
Collaborator

@nicklan nicklan commented May 18, 2024

When we read from a checkpoint, there can be disagreement between the schema we think we've read and what was actually in the parquet. This can cause issues when we try to interact with engine data via expressions. For example here, where we use the "correct" schema and do not mark DVs as nullable. Also, in our arrow_conversions we make assumptions about the names of the fields that mark map keys and values (see here), which also causes issues when the actual materialized names differ.

An example error trying to work with a checkpoint file:

Arrow(
    InvalidArgumentError(
        "Incorrect datatype for StructArray field \"deletionVector\", expected Struct([Field { name: \"storageType\", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"pathOrInlineDv\", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"offset\", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"sizeInBytes\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"cardinality\", data_type: Int64, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }]) got Struct([Field { name: \"storageType\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"pathOrInlineDv\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"offset\", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"sizeInBytes\", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"cardinality\", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"maxRowIndex\", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }])",
    ),
)

This arises for two reasons:

  1. To avoid "parquet corruption", Spark writes a nullable struct with all of its fields marked nullable. For example, since DVs are nullable, all fields of the DV struct are marked nullable. This means the nullability of fields will not match the specified read schema.
  2. We can't guarantee exactly what names the "meta" fields for maps/lists will have in the raw schema.

This PR does as much validation as possible, but doesn't check things we can't control. So for each named field in the output schema it ensures the types are the same, and does so recursively into structs, maps, and lists.

Then it simply uses the schema that the parquet/json reader had already associated with the data.
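To make the shape of that check concrete, here is a minimal sketch of a recursive type-compatibility check over Arrow types. It is a hypothetical illustration, not the PR's actual ensure_data_types (which compares the kernel schema against the Arrow schema); it ignores nullability and the materialized names of list/map entry fields, which is exactly what the real validation has to tolerate:

use arrow_schema::{DataType, Fields};

/// Hypothetical sketch: recursively check that the type we asked for and the type
/// the reader actually produced agree, ignoring nullability and the materialized
/// names of list/map entry fields (which parquet writers are free to choose).
fn types_compatible(requested: &DataType, actual: &DataType) -> bool {
    match (requested, actual) {
        (DataType::Struct(req), DataType::Struct(act)) => struct_compatible(req, act),
        // List element fields may be named "item", "element", etc.; compare types only.
        (DataType::List(req), DataType::List(act)) => {
            types_compatible(req.data_type(), act.data_type())
        }
        // Map entries are a two-field struct of (key, value); names may differ.
        (DataType::Map(req, _), DataType::Map(act, _)) => {
            match (req.data_type(), act.data_type()) {
                (DataType::Struct(rf), DataType::Struct(af)) if rf.len() == 2 && af.len() == 2 => {
                    types_compatible(rf[0].data_type(), af[0].data_type())
                        && types_compatible(rf[1].data_type(), af[1].data_type())
                }
                _ => false,
            }
        }
        // Primitives (and anything not handled above) must match exactly.
        (req, act) => req == act,
    }
}

fn struct_compatible(requested: &Fields, actual: &Fields) -> bool {
    requested.len() == actual.len()
        && requested.iter().zip(actual.iter()).all(|(r, a)| {
            // Field names must line up, but nullability is deliberately not compared.
            r.name() == a.name() && types_compatible(r.data_type(), a.data_type())
        })
}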

This "works". Major issues:

  1. The schema you ask for might not be exactly the schema you get, although differences in nullability are the only "observable" difference from the kernel's perspective.
  2. A bigger issue is that this complicates the writing of expression evaluators for engines, as we'll have to carefully document and explain potential schema mismatches.

After some discussion, we will go with this for the time being.

@scovich
Collaborator

scovich commented May 20, 2024

FYI Spark does document this behavior:

> When reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons.

@roeap
Collaborator

roeap commented May 20, 2024

> FYI Spark does document this behavior:
>
> > When reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons.

After reading this, I am not completely sure how specifically it is meant. When it says columns, does that imply all leaf columns, just top-level columns, or everything recursively? Just applying it to all leaves might yield incorrect data (e.g. map keys), so my guess would be just the roots?

@scovich
Collaborator

scovich commented May 20, 2024

> FYI Spark does document this behavior:
>
> > When reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons.
>
> After reading this, I am not completely sure how specifically it is meant. When it says columns, does that imply all leaf columns, just top-level columns, or everything recursively? Just applying it to all leaves might yield incorrect data (e.g. map keys), so my guess would be just the roots?

My understanding is: any time we have a struct containing non-null fields which is itself a nullable field (whether in a parent struct or at the top level of the schema), parquet cannot express that situation. Writers have to compensate by forcing all non-null fields with a nullable ancestor to be nullable. It is not necessary to make the entire schema nullable, but I'm guessing many writers do so because it's simpler?
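A small sketch of the writer behavior described above, assuming arrow-schema types; relax_nullability is a hypothetical helper for illustration, not code from this PR (it only recurses into structs, for brevity):

use std::sync::Arc;
use arrow_schema::{DataType, Field, FieldRef};

/// Per the behavior described above: a non-null field with a nullable ancestor
/// loses its non-null guarantee when the struct is written out.
fn relax_nullability(field: &Field, ancestor_nullable: bool) -> FieldRef {
    let nullable = field.is_nullable() || ancestor_nullable;
    let data_type = match field.data_type() {
        DataType::Struct(children) => DataType::Struct(
            children.iter().map(|c| relax_nullability(c, nullable)).collect(),
        ),
        other => other.clone(),
    };
    Arc::new(Field::new(field.name(), data_type, nullable))
}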

Comment on lines 250 to 251
// use a `map` here so rustc doesn't have to infer the error type
output_data_type.map(|output_type| {
Collaborator

This is avoiding the double-? issue we hit before?

(but if ensure_data_types is really just a validator rather than a mapper... a lot of this code can probably go away?)

Collaborator Author

It's to fix an error like:

error[E0282]: type annotations needed
   --> kernel/src/engine/arrow_expression.rs:256:21
    |
256 |                     Ok(ArrowField::new(input_field.name(), array.data_type().clone(), array.is_nullable()))
    |                     ^^ cannot infer type of the type parameter `E` declared on the enum `Result`
    |
help: consider specifying the generic arguments
    |
256 |                     Ok::<arrow_schema::Field, E>(ArrowField::new(input_field.name(), array.data_type().clone(), array.is_nullable()))
    |                       ++++++++++++++++++++++++++

I've fixed it slightly differently to avoid having to use the turbofish.
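For context, a self-contained toy illustration of the ways to pin down the error type in this kind of closure. DeltaResult here is a stand-in alias, and none of this is code from the PR:

use std::num::ParseIntError;

// Stand-in for the kernel's error alias, purely for illustration.
type DeltaResult<T> = Result<T, ParseIntError>;

// Map over an existing Result: the error type comes from the receiver, so nothing
// inside the closure needs an annotation (the `map` trick in the snippet above).
fn widen(input: DeltaResult<i32>) -> DeltaResult<i64> {
    input.map(|n| i64::from(n) + 1)
}

// When the closure builds a fresh Result, the error type has to be named somewhere:
fn parse_all(inputs: &[&str]) -> Vec<DeltaResult<i64>> {
    inputs
        .iter()
        .map(|s| {
            // annotate an intermediate binding, as this PR does ...
            let res: DeltaResult<i64> = Ok(i64::from(s.parse::<i32>()?));
            res
            // ... or use the turbofish: Ok::<i64, ParseIntError>(...)
            // ... or annotate the closure: |s| -> DeltaResult<i64> { ... }
        })
        .collect()
}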

@nicklan nicklan force-pushed the align-arrow-and-kernel-schema-in-expressions branch from 2c66a85 to 98c2803 on May 20, 2024 19:54
@nicklan nicklan requested a review from samansmink May 20, 2024 19:55
Comment on lines 192 to 206
if kernel_fields.fields.len() == arrow_fields.len() {
    for (kernel_field, arrow_field) in kernel_fields.fields().zip(arrow_fields.iter()) {
        ensure_data_types(&kernel_field.data_type, arrow_field.data_type())?;
    }
    Ok(())
} else {
    Err(make_arrow_error(format!(
        "Struct types have different numbers of fields. Expected {}, got {}",
        kernel_fields.fields.len(),
        arrow_fields.len()
    )))
Collaborator

nit: Seems like a good place for require! followed by the for-loop?
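For reference, roughly what that restructuring would look like, reusing the identifiers from the snippet above and assuming a require!(condition, error) macro that early-returns Err(error):

require!(
    kernel_fields.fields.len() == arrow_fields.len(),
    make_arrow_error(format!(
        "Struct types have different numbers of fields. Expected {}, got {}",
        kernel_fields.fields.len(),
        arrow_fields.len()
    ))
);
for (kernel_field, arrow_field) in kernel_fields.fields().zip(arrow_fields.iter()) {
    ensure_data_types(&kernel_field.data_type, arrow_field.data_type())?;
}
Ok(())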

})
ensure_data_types(input_field.data_type(), array.data_type())?;
// need to help type inference a bit so it knows what the error type is
let res: DeltaResult<ArrowField> = Ok(ArrowField::new(
Collaborator

nit: I suspect this would also work:

.map(|(array, input_field)| -> DeltaResult<_> {
    ensure_data_types(...)?;
    Ok(ArrowField::new(...))
})

@nicklan nicklan force-pushed the align-arrow-and-kernel-schema-in-expressions branch from 98c2803 to 26749e0 on May 24, 2024 19:52
@nicklan nicklan marked this pull request as ready for review May 24, 2024 19:56
Collaborator

@roeap roeap left a comment

👍

@nicklan nicklan merged commit 3641f77 into delta-io:main May 27, 2024