-
Notifications
You must be signed in to change notification settings - Fork 1.4k
feat(sparksql): Add transform with index parameter support #16211
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat(sparksql): Add transform with index parameter support #16211
Conversation
✅ Deploy Preview for meta-velox canceled.
|
878130c to
7254aa4
Compare
Spark's transform function supports two signatures: 1. transform(array, x -> expr) - element only 2. transform(array, (x, i) -> expr) - element + index Velox's Presto implementation only supports signature #1. This adds a Spark-specific implementation that supports both signatures. Key changes: - Created velox/functions/sparksql/Transform.cpp with SparkTransformFunction that detects lambda arity at runtime and provides index vector when needed - Uses 32-bit integer for index (matching Spark's IntegerType) - Added comprehensive tests for both signatures including null handling Tested with 6 new test cases, all passing.
7254aa4 to
0b50680
Compare
jinchengchenghh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your feature. Please also update the document.
| } | ||
| }; | ||
|
|
||
| // Test transform with element-only lambda: transform(array, x -> x * 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
End with .
| namespace facebook::velox::functions::sparksql { | ||
| namespace { | ||
|
|
||
| /// Spark's transform function supporting both signatures: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
supporting -> supports
| /// | ||
| /// See Spark documentation: | ||
| /// https://spark.apache.org/docs/latest/api/sql/index.html#transform | ||
| class SparkTransformFunction : public exec::VectorFunction { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove Spark prefix since it has been in sparksql namespace
| /// https://spark.apache.org/docs/latest/api/sql/index.html#transform | ||
| class SparkTransformFunction : public exec::VectorFunction { | ||
| public: | ||
| void apply( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we reuse some code of udf_transform, move some of them to common path velox/functions/lib?
- Rename SparkTransformFunction to TransformFunction (in sparksql namespace) - Fix grammar: 'supporting' -> 'supports' - Add period at end of test comment
|
I analyzed the code reuse opportunity. The common utilities are already shared in
The Spark-specific additions are:
Moving the remaining lambda iteration loop to lib would add complexity without significant benefit since it's ~15 lines of straightforward code. The current structure follows the same pattern as other Spark-specific lambda functions like If you'd prefer a different approach (e.g., a templated base class), I'm happy to refactor. |
| std::vector<VectorPtr> lambdaArgs = {flatArray->elements()}; | ||
|
|
||
| // If lambda expects index, create index vector. | ||
| if (withIndex) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the only change with presto TransformFunction is here, am I right? So I would prefer a base TransformFunction and add a virtual addIndexVectors(lambdaArgs) with default empty implementation,
| context.moveOrCopyResult(localResult, rows, result); | ||
| } | ||
|
|
||
| static std::vector<std::shared_ptr<exec::FunctionSignature>> signatures() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And the signatures, call the super function signatures() and add the spark specified function signature
Address reviewer feedback to share code between Presto and Spark transform: - Create TransformFunctionBase in velox/functions/lib/ with virtual addIndexVector() - Presto TransformFunction inherits base directly (no override needed) - Spark TransformFunction overrides addIndexVector() to add index parameter - Spark signatures() calls base and adds the (T, integer, U) signature All 13 transform tests pass (6 Spark + 7 Presto).
cc7afe8 to
c28d2c1
Compare
Summary
Adds Spark-specific
transformfunction that supports the 2-arg lambda signaturetransform(array, (x, i) -> expr)with element + index.Background
Spark's transform function supports two signatures:
transform(array, x -> expr)- element onlytransform(array, (x, i) -> expr)- element + indexVelox's Presto implementation only supports signature #1. This PR adds a Spark-specific implementation that supports both.
Changes
velox/functions/sparksql/Transform.cppwithSparkTransformFunctionTest Plan
All 6 new tests pass:
elementOnly- basic transform without indexelementWithIndex- transform using element + indexwithNulls- null handling in arraysnestedArrays- nested array transformationindexOnly- using only index (ignore element)emptyArray- empty array edge case