feat(sparksql): Add transform with index parameter support#16211
feat(sparksql): Add transform with index parameter support#16211yaooqinn wants to merge 3 commits intofacebookincubator:mainfrom
Conversation
✅ Deploy Preview for meta-velox canceled.
|
878130c to
7254aa4
Compare
Spark's transform function supports two signatures: 1. transform(array, x -> expr) - element only 2. transform(array, (x, i) -> expr) - element + index Velox's Presto implementation only supports signature #1. This adds a Spark-specific implementation that supports both signatures. Key changes: - Created velox/functions/sparksql/Transform.cpp with SparkTransformFunction that detects lambda arity at runtime and provides index vector when needed - Uses 32-bit integer for index (matching Spark's IntegerType) - Added comprehensive tests for both signatures including null handling Tested with 6 new test cases, all passing.
7254aa4 to
0b50680
Compare
jinchengchenghh
left a comment
There was a problem hiding this comment.
Thanks for your feature. Please also update the document.
| } | ||
| }; | ||
|
|
||
| // Test transform with element-only lambda: transform(array, x -> x * 2) |
| namespace facebook::velox::functions::sparksql { | ||
| namespace { | ||
|
|
||
| /// Spark's transform function supporting both signatures: |
There was a problem hiding this comment.
supporting -> supports
| /// | ||
| /// See Spark documentation: | ||
| /// https://spark.apache.org/docs/latest/api/sql/index.html#transform | ||
| class SparkTransformFunction : public exec::VectorFunction { |
There was a problem hiding this comment.
Please remove Spark prefix since it has been in sparksql namespace
| /// https://spark.apache.org/docs/latest/api/sql/index.html#transform | ||
| class SparkTransformFunction : public exec::VectorFunction { | ||
| public: | ||
| void apply( |
There was a problem hiding this comment.
Could we reuse some code of udf_transform, move some of them to common path velox/functions/lib?
- Rename SparkTransformFunction to TransformFunction (in sparksql namespace) - Fix grammar: 'supporting' -> 'supports' - Add period at end of test comment
|
I analyzed the code reuse opportunity. The common utilities are already shared in
The Spark-specific additions are:
Moving the remaining lambda iteration loop to lib would add complexity without significant benefit since it's ~15 lines of straightforward code. The current structure follows the same pattern as other Spark-specific lambda functions like If you'd prefer a different approach (e.g., a templated base class), I'm happy to refactor. |
| std::vector<VectorPtr> lambdaArgs = {flatArray->elements()}; | ||
|
|
||
| // If lambda expects index, create index vector. | ||
| if (withIndex) { |
There was a problem hiding this comment.
Looks like the only change with presto TransformFunction is here, am I right? So I would prefer a base TransformFunction and add a virtual addIndexVectors(lambdaArgs) with default empty implementation,
| context.moveOrCopyResult(localResult, rows, result); | ||
| } | ||
|
|
||
| static std::vector<std::shared_ptr<exec::FunctionSignature>> signatures() { |
There was a problem hiding this comment.
And the signatures, call the super function signatures() and add the spark specified function signature
Address reviewer feedback to share code between Presto and Spark transform: - Create TransformFunctionBase in velox/functions/lib/ with virtual addIndexVector() - Presto TransformFunction inherits base directly (no override needed) - Spark TransformFunction overrides addIndexVector() to add index parameter - Spark signatures() calls base and adds the (T, integer, U) signature All 13 transform tests pass (6 Spark + 7 Presto).
cc7afe8 to
c28d2c1
Compare
Summary
Adds Spark-specific
transformfunction that supports the 2-arg lambda signaturetransform(array, (x, i) -> expr)with element + index.Background
Spark's transform function supports two signatures:
transform(array, x -> expr)- element onlytransform(array, (x, i) -> expr)- element + indexVelox's Presto implementation only supports signature #1. This PR adds a Spark-specific implementation that supports both.
Changes
velox/functions/sparksql/Transform.cppwithSparkTransformFunctionTest Plan
All 6 new tests pass:
elementOnly- basic transform without indexelementWithIndex- transform using element + indexwithNulls- null handling in arraysnestedArrays- nested array transformationindexOnly- using only index (ignore element)emptyArray- empty array edge case