variant_schema: Initial implementation #24

Closed
sdf-jkl wants to merge 29 commits into datafusion-contrib:main from sdf-jkl:schema

Conversation

@sdf-jkl
Collaborator

@sdf-jkl sdf-jkl commented Dec 22, 2025

Which issue does this PR close?

Rationale for this change

This PR tries to combine schema_of_variant and schema_of_variant_agg into a single UDF, variant_schema.

What changes are included in this PR?

Adds a new ScalarUDF, variant_schema, that extracts an aggregate schema from a scalar Variant or a VariantArray.

Are these changes tested?

  • Tested scalar values with most types and arrays with conflicting and valid schemas.
  • Added some sqllogictests

Collaborator Author

@sdf-jkl sdf-jkl left a comment

A quick review pass after submitting the PR.

/// - A field becomes VARIANT if its values are incompatible
///
#[derive(Debug, PartialEq, Eq, Clone)]
enum VariantSchema {

Because Variant is not a first-class type in Arrow, I had to use an enum to represent the extracted types.
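To make the idea concrete, here is a minimal sketch of what such an enum could look like. Only the name `VariantSchema` and its derives appear in the diff; the variants, the `PrimitiveType` stand-in (used here in place of Arrow's `DataType` so the sketch has no dependencies), and the Databricks-style rendering are assumptions for illustration, not the PR's actual code.

```rust
use std::collections::BTreeMap;
use std::fmt;

/// Hypothetical stand-in for `arrow::datatypes::DataType`.
#[allow(dead_code)]
#[derive(Debug, PartialEq, Eq, Clone)]
enum PrimitiveType {
    Boolean,
    Int64,
    Double,
    String,
}

impl PrimitiveType {
    /// SQL-style name used when rendering the schema.
    fn name(&self) -> &'static str {
        match self {
            PrimitiveType::Boolean => "BOOLEAN",
            PrimitiveType::Int64 => "BIGINT",
            PrimitiveType::Double => "DOUBLE",
            PrimitiveType::String => "STRING",
        }
    }
}

/// Assumed shape of the extracted schema: primitives map onto existing
/// types, while nesting and the VARIANT fallback get their own variants.
#[derive(Debug, PartialEq, Eq, Clone)]
enum VariantSchema {
    Primitive(PrimitiveType),
    Array(Box<VariantSchema>),
    /// BTreeMap keeps field names sorted, so rendering is deterministic.
    Object(BTreeMap<String, VariantSchema>),
    /// Fallback when values are incompatible.
    Variant,
}

impl fmt::Display for VariantSchema {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            VariantSchema::Primitive(p) => f.write_str(p.name()),
            VariantSchema::Array(inner) => write!(f, "ARRAY<{inner}>"),
            VariantSchema::Object(fields) => {
                f.write_str("OBJECT<")?;
                for (i, (name, ty)) in fields.iter().enumerate() {
                    if i > 0 {
                        f.write_str(", ")?;
                    }
                    write!(f, "{name}: {ty}")?;
                }
                f.write_str(">")
            }
            VariantSchema::Variant => f.write_str("VARIANT"),
        }
    }
}
```

A dedicated `Variant` arm is what lets "a field becomes VARIANT if its values are incompatible" be expressed without a matching Arrow type.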

Variant::TimestampNtzMicros(_) => DataType::Timestamp(TimeUnit::Microsecond, None),
Variant::TimestampNanos(_) => DataType::Timestamp(TimeUnit::Nanosecond, Some("utc".into())),
Variant::TimestampNtzNanos(_) => DataType::Timestamp(TimeUnit::Nanosecond, None),
_ => unreachable!("Should be only applied to Primitive Variant, not Object or List"),

Should probably remove the unreachable! here.
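One possible way to drop the `unreachable!` is to make the conversion fallible and return an error for nested values. The sketch below uses small stand-in `Variant`, `DataType`, and `TimeUnit` enums rather than the real crate types, and the function name `primitive_data_type` and the `String` error type are hypothetical.

```rust
/// Stand-in for the real Variant enum; only a few cases are modelled.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Variant {
    Int64(i64),
    TimestampNtzMicros(i64),
    TimestampNanos(i64),
    Object,
    List,
}

/// Stand-ins for the Arrow types referenced in the diff.
#[derive(Debug, Clone, PartialEq, Eq)]
enum TimeUnit {
    Microsecond,
    Nanosecond,
}

#[derive(Debug, Clone, PartialEq, Eq)]
enum DataType {
    Int64,
    Timestamp(TimeUnit, Option<String>),
}

/// Maps a primitive Variant to a data type, returning an error instead of
/// panicking when handed an Object or List.
fn primitive_data_type(v: &Variant) -> Result<DataType, String> {
    match v {
        Variant::Int64(_) => Ok(DataType::Int64),
        Variant::TimestampNtzMicros(_) => {
            Ok(DataType::Timestamp(TimeUnit::Microsecond, None))
        }
        Variant::TimestampNanos(_) => Ok(DataType::Timestamp(
            TimeUnit::Nanosecond,
            Some("utc".to_string()),
        )),
        // Nested cases are listed explicitly rather than with `_`, so adding
        // a new Variant case forces this match to be revisited.
        Variant::Object | Variant::List => {
            Err(format!("expected a primitive Variant, got {v:?}"))
        }
    }
}
```

In a real UDF the error would more likely be the engine's own error type, so a malformed input surfaces as a query error rather than a panic.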

@sdf-jkl
Collaborator Author

sdf-jkl commented Feb 27, 2026

@friendlymatthew I feel like this is such a mess.

I should split this into multiple PRs handling just the initial implementations of variant_schema and variant_schema_agg functions. Later add extras on top of them.

The order should be variant_schema -> variant_schema_agg -> type widening, etc.

@friendlymatthew
Member

> @friendlymatthew I feel like this is such a mess.
>
> I should split this into multiple PRs handling just the initial implementations of variant_schema and variant_schema_agg functions. Later add extras on top of them.
>
> The order should be variant_schema -> variant_schema_agg -> type widening, etc.

Hi, can you explain why we need type widening?

@sdf-jkl
Collaborator Author

sdf-jkl commented Mar 2, 2026

From the Databricks docs for schema_of_variant_agg:

When two fields with the same name have a different type across records, Databricks uses the least common type. When no such type exists, the type is derived as a VARIANT. For example, INT and DOUBLE become DOUBLE, while TIMESTAMP and STRING become VARIANT.

We also want to do the same thing for VariantList in the scalar version variant_schema.
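The widening rule quoted above can be sketched as a pairwise merge folded over all observed types. The `FieldType` enum and the function names are hypothetical, only the two pairs named in the quote (INT with DOUBLE, TIMESTAMP with STRING) come from the docs, and merging array element types recursively is an assumption about how the list case might work.

```rust
/// Hypothetical, reduced set of types for illustration.
#[derive(Debug, Clone, PartialEq, Eq)]
enum FieldType {
    Int,
    Double,
    String,
    Timestamp,
    Array(Box<FieldType>),
    Variant,
}

/// Returns the least common type of two types, or Variant if none exists.
fn least_common_type(a: &FieldType, b: &FieldType) -> FieldType {
    match (a, b) {
        // Identical types need no widening.
        (x, y) if x == y => x.clone(),
        // INT and DOUBLE widen to DOUBLE, as in the quoted docs.
        (FieldType::Int, FieldType::Double) | (FieldType::Double, FieldType::Int) => {
            FieldType::Double
        }
        // Assumption: arrays merge by widening their element types.
        (FieldType::Array(x), FieldType::Array(y)) => {
            FieldType::Array(Box::new(least_common_type(x, y)))
        }
        // Everything else (e.g. TIMESTAMP and STRING) falls back to VARIANT.
        _ => FieldType::Variant,
    }
}

/// Folds the pairwise rule over every observed type, e.g. the elements of
/// one list or the same field across many records. Empty input gives None.
fn merge_all(types: &[FieldType]) -> Option<FieldType> {
    let mut iter = types.iter();
    let first = iter.next()?.clone();
    Some(iter.fold(first, |acc, t| least_common_type(&acc, t)))
}
```

Because `Variant` absorbs everything it is merged with, a single incompatible value is enough to turn the whole field into VARIANT.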

@sdf-jkl sdf-jkl closed this Mar 2, 2026
@sdf-jkl sdf-jkl reopened this Mar 2, 2026
@friendlymatthew
Member

> From the Databricks docs for schema_of_variant_agg:
>
> When two fields with the same name have a different type across records, Databricks uses the least common type. When no such type exists, the type is derived as a VARIANT. For example, INT and DOUBLE become DOUBLE, while TIMESTAMP and STRING become VARIANT.
>
> We also want to do the same thing for VariantList in the scalar version variant_schema.

Makes sense to me. Let me know how you want to proceed. I'm fine with breaking this up into smaller PRs: pushing the scalar version first, then working on variant_agg and type widening later.

@sdf-jkl
Collaborator Author

sdf-jkl commented Mar 2, 2026

I'll do that. Otherwise it's too much code to review.

@sdf-jkl sdf-jkl closed this Mar 2, 2026