don't fully enforce matching schema #210
Conversation
FYI Spark does document this behavior:
After reading, I'm not completely sure how specifically this is meant. When it says "columns", does that imply all leaf columns, just top-level columns, or everything recursively? Applying it to all leaves might yield incorrect data (e.g. map keys), so my guess would be just roots.

My understanding is: any time we have a struct containing non-null fields which is itself a nullable field (whether in a parent struct or the top-level schema), parquet cannot express that situation. Writers have to compensate by forcing all non-null fields with a nullable ancestor to be nullable. It is not necessary to make the entire schema nullable, but I'm guessing many writers do so because it's simpler?
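A minimal sketch of that situation, using arrow_schema types (assuming the arrow-rs crate; the field names here are made up for illustration):

use arrow_schema::{DataType, Field, Fields};

// `inner` is declared non-nullable, but its parent `outer` is nullable.
// Per the comment above, writers compensate by relaxing every non-null
// field with a nullable ancestor to nullable before writing.
fn declared_vs_written() -> (Field, Field) {
    let child = |nullable| Field::new("inner", DataType::Int64, nullable);
    let declared = Field::new(
        "outer",
        DataType::Struct(Fields::from(vec![child(false)])),
        true, // nullable ancestor
    );
    // What a writer actually emits: the child is forced nullable too.
    let written = Field::new(
        "outer",
        DataType::Struct(Fields::from(vec![child(true)])),
        true,
    );
    (declared, written)
}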
// use a `map` here so rustc doesn't have to infer the error type
output_data_type.map(|output_type| {
This is avoiding the double-`?` issue we hit before? (But if `ensure_data_types` is really just a validator rather than a mapper... a lot of this code can probably go away?)
It's to fix an error like:
error[E0282]: type annotations needed
--> kernel/src/engine/arrow_expression.rs:256:21
|
256 | Ok(ArrowField::new(input_field.name(), array.data_type().clone(), array.is_nullable()))
| ^^ cannot infer type of the type parameter `E` declared on the enum `Result`
|
help: consider specifying the generic arguments
|
256 | Ok::<arrow_schema::Field, E>(ArrowField::new(input_field.name(), array.data_type().clone(), array.is_nullable()))
| ++++++++++++++++++++++++++
I've fixed it slightly differently to avoid having to use the turbofish.
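A hypothetical minimal reproduction of the same inference failure and the binding-annotation fix (a standalone example, not the PR's actual code):

use std::convert::Infallible;

fn demo() -> Result<Vec<i32>, Infallible> {
    vec![1i32, 2, 3]
        .into_iter()
        .map(|x| {
            // Returning bare `Ok(x)` here trips E0282, since rustc cannot
            // infer the error type of the Result produced by the closure;
            // annotating a binding pins it down without a turbofish.
            let res: Result<i32, Infallible> = Ok(x);
            res
        })
        .collect()
}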
Force-pushed from 2c66a85 to 98c2803.
if kernel_fields.fields.len() == arrow_fields.len() {
    for (kernel_field, arrow_field) in kernel_fields.fields().zip(arrow_fields.iter()) {
        ensure_data_types(&kernel_field.data_type, arrow_field.data_type())?;
    }
    Ok(())
} else {
    Err(make_arrow_error(format!(
        "Struct types have different numbers of fields. Expected {}, got {}",
        kernel_fields.fields.len(),
        arrow_fields.len()
    )))
}
nit: Seems like a good place for `require!` followed by the for-loop?
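A sketch of what that suggestion might look like, assuming a `require!` macro that early-returns the given error when its condition is false (the exact macro signature is an assumption here):

require!(
    kernel_fields.fields.len() == arrow_fields.len(),
    make_arrow_error(format!(
        "Struct types have different numbers of fields. Expected {}, got {}",
        kernel_fields.fields.len(),
        arrow_fields.len()
    ))
);
// With the length check hoisted out, the happy path reads straight through.
for (kernel_field, arrow_field) in kernel_fields.fields().zip(arrow_fields.iter()) {
    ensure_data_types(&kernel_field.data_type, arrow_field.data_type())?;
}
Ok(())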
})
ensure_data_types(input_field.data_type(), array.data_type())?;
// need to help type inference a bit so it knows what the error type is
let res: DeltaResult<ArrowField> = Ok(ArrowField::new(
nit: I suspect this would also work:

.map(|(array, input_field)| -> DeltaResult<_> {
    ensure_data_types(...)?;
    Ok(ArrowField::new(...))
})
Force-pushed from 98c2803 to 26749e0.
👍
When we read from a checkpoint, there can be disagreement between the schema we think we've read and what was actually in the parquet. This can cause issues when we try to interact with engine data via expressions. For example here, where we use the "correct" schema and do not mark DVs (deletion vectors) as nullable. Also, in our arrow_conversions we make assumptions about the names of the fields that mark map keys and values (see here), which also causes issues when the actual materialized names differ.
An example error trying to work with a checkpoint file:
This arises for two reasons:
This PR does as much validation as possible, but doesn't check things we can't control: for each named field in the output schema it ensures the types match, recursing into structs, maps, and lists (see the sketch below).
Then it simply uses the schema that the parquet/json reader had already associated with the data.
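A simplified sketch of such a recursive check (hypothetical, not the PR's actual ensure_data_types), comparing two arrow types by kind while ignoring the field names and nullability that writers may have rewritten:

use arrow_schema::DataType;

fn types_match(expected: &DataType, actual: &DataType) -> bool {
    match (expected, actual) {
        (DataType::Struct(e), DataType::Struct(a)) => {
            e.len() == a.len()
                && e.iter()
                    .zip(a.iter())
                    .all(|(ef, af)| types_match(ef.data_type(), af.data_type()))
        }
        (DataType::List(e), DataType::List(a)) => types_match(e.data_type(), a.data_type()),
        // Map entries are a struct of key/value; recurse into it without
        // caring what the entry fields are actually named.
        (DataType::Map(e, _), DataType::Map(a, _)) => types_match(e.data_type(), a.data_type()),
        (e, a) => e == a,
    }
}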
This "works". Major issues:
After some discussion, we will go with this for the time being.