[SPARK-52494] Support colon-sign operator syntax to access Variant fields #51190
/**
 * Represents the extraction of data from a field that contains semi-structured data. The
 * semi-structured format can be anything (JSON, key-value delimited, etc), and that information
This can only be VARIANT now.
Done.
@@ -0,0 +1,13 @@
-- Simple field extraction and type casting.
select parse_json('{ "price": 5 }'):price;
nit: we can create a temp view with one or more VARIANT columns, to simplify the other SELECT queries in this test.
Great suggestion. Done.
-- Applying an invalid function.
select parse_json('{ "price": 12345.678 }'):price::decimal(3, 2);
-- Access field in an array and feed it into functions.
select parse_json('{ "item": [ { "model" : "basic", "price" : 6.12 }, { "model" : "medium", "price" : 9.24 } ] }'):item[0].price::double;
Let's test all the valid syntaxes, e.g. ASTERISK, brackets with string, etc.
Added more syntaxes -- PTAL.
BTW I intentionally didn't add things like "using ':' in group by", e.g.:
select multi_field_variant:city::string, count(*) from variant_test_data group by multi_field_variant:city::string
I left these out because the "group by" here has to use the result of an explicit type cast (::string), which makes such tests less interesting. Please let me know if you feel covering those cases would still be helpful, and I can add them.
What changes were proposed in this pull request?
Adds support for accessing fields inside a Variant data type through the colon-sign operator. The syntax is documented here: https://docs.databricks.com/aws/en/sql/language-manual/functions/colonsign
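To make the path part of the operator concrete, here is a minimal plain-Python sketch of how a path such as item[0].price walks a JSON document. It is only a model for illustration, not the Spark implementation: the extract helper and its regular expression are hypothetical, it covers only field names, numeric indexes and bracketed string keys, it does not model casts such as ::double, and returning None for a missing step is an assumption standing in for SQL NULL.

```python
import json
import re

# Hypothetical helper for illustration only; this is not Spark code.
_STEP = re.compile(r"""
    \.?([A-Za-z_][A-Za-z0-9_]*)   # field name, e.g. price or .price
  | \[(\d+)\]                     # array index, e.g. [0]
  | \['([^']*)'\]                 # bracketed string key, e.g. ['price']
""", re.VERBOSE)


def extract(json_text, path):
    """Walk `path` through the parsed JSON; return None when a step is missing."""
    value = json.loads(json_text)
    pos = 0
    while pos < len(path):
        m = _STEP.match(path, pos)
        if m is None:
            raise ValueError(f"cannot parse path at offset {pos}: {path!r}")
        name, index, key = m.groups()
        if index is not None:
            # Array step: only valid on a list and within bounds.
            i = int(index)
            value = value[i] if isinstance(value, list) and i < len(value) else None
        else:
            # Field step: only valid on an object.
            field = name if name is not None else key
            value = value.get(field) if isinstance(value, dict) else None
        if value is None:
            return None
        pos = m.end()
    return value


doc = '{ "item": [ { "model" : "basic", "price" : 6.12 }, { "model" : "medium", "price" : 9.24 } ] }'
print(extract(doc, "item[0].price"))
print(extract(doc, "item[1]['model']"))
```

The two calls mirror the shape of the test query parse_json(...):item[0].price, minus the trailing cast.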
Why are the changes needed?
Provides a convenient way to access fields inside a Variant via SQL.
Does this PR introduce any user-facing change?
Yes. The colon-sign syntax, which previously threw a ParseException, is now supported.
=== In Scala Spark shell:
Before:
After:
=== In PySpark REPL:
Before:
== SQL ==
select parse_json('{ "price": 5 }'):price::int
-----------------------------------^^^
After:
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
No