
[SPARK-52451][CONNECT][SQL] Make WriteOperation in SparkConnectPlanner side effect free #51727


Open · wants to merge 2 commits into master

Conversation

@heyihong (Contributor) commented Jul 30, 2025

What changes were proposed in this pull request?

This PR refactors the Spark Connect execution flow to make WriteOperation handling side-effect free by separating the transformation and execution phases. The key changes include:

  1. Unified execution flow: Consolidated ROOT and COMMAND operations through SparkConnectPlanExecution.handlePlan() instead of separate handlers
  2. Pure transformation phase: Introduced transformCommand() that converts WriteOperation to LogicalPlan without side effects. It leverages the new DataFrameWriter methods (saveCommand(), saveAsTableCommand(), insertIntoCommand()), which return logical plans instead of executing immediately.
  3. DataFrameWriter refactoring: Added saveCommand(), saveAsTableCommand(), and insertIntoCommand() to DataFrameWriter, each returning a logical plan, and introduced a new SaveAsV1TableCommand. A sketch of the intended shape follows this list.
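
A minimal sketch of the intended shape, assuming only the method names listed above; the proto accessors and surrounding plumbing shown here are illustrative rather than the exact code in this PR:

```scala
// Sketch only: saveCommand()/saveAsTableCommand() are the new methods named in
// this description; the proto accessors and argument shapes are assumptions.
private def transformCommand(writeOp: proto.WriteOperation): LogicalPlan = {
  // Build the input DataFrame from the relation carried in the proto message.
  val df = Dataset.ofRows(session, transformRelation(writeOp.getInput))
  val writer = df.write
  // ... apply source, mode, options, and partitioning from the proto here ...

  // Pure transformation: each branch returns a LogicalPlan describing the
  // write instead of performing it, so the planner itself has no side effects.
  if (writeOp.hasPath) {
    writer.saveCommand(writeOp.getPath)
  } else {
    writer.saveAsTableCommand(writeOp.getTable.getTableName)
  }
}
// The returned plan is then executed exactly once, through the unified
// SparkConnectPlanExecution.handlePlan() path.
```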

Why are the changes needed?

The current implementation has several issues:

  1. Side effects in transformation: The handleWriteOperation method both transforms and executes write operations, making it difficult to reason about the transformation logic independently.

  2. Code duplication: Separate handling paths for ROOT and COMMAND operations in ExecuteThreadRunner create unnecessary complexity and potential inconsistencies.

Does this PR introduce any user-facing change?

No. This is a purely internal refactoring that maintains the same external behavior and API. All existing Spark Connect client code will continue to work without any changes.

How was this patch tested?

build/sbt "connect-client-jvm/testOnly *ClientE2ETestSuite"

Was this patch authored or co-authored using generative AI tooling?

Cursor 1.3.5

@heyihong force-pushed the SPARK-52451 branch 2 times, most recently from 73aacd0 to b510901 (July 30, 2025 16:28)
@heyihong changed the title from [SPARK-52451][CONNECT] Make WriteOperation in SparkConnectPlanner side effect free to [SPARK-52451][CONNECT][SQL] Make WriteOperation in SparkConnectPlanner side effect free (Jul 30, 2025)

@heyihong force-pushed the SPARK-52451 branch 2 times, most recently from 82c9d71 to 648ff88 (July 30, 2025 21:01)

Review thread on the following diff excerpt:

val refreshTablePlan = RefreshTableCommand(qualifiedIdent)

CompoundBody(
Contributor:
Does this really work with runCommand?

@heyihong (Contributor, Author), Jul 31, 2025:

Yes, CompoundBody is actually executed during the analysis phase, but the tracker doesn't seem to get updated correctly, so I made a small fix in this PR.

Contributor:

I'm a bit worried about using CompoundBody outside of SQL script execution, and the fix in https://github.com/apache/spark/pull/51727/files#diff-a3fb0a56a7d2d08dc87434eff5b43aba0b006b59ed2f25e29bfb6fb4f81ec0c4R164 is a bit suspicious.

Can we create a new command SaveAsTableCommand to do these operations?

@cloud-fan (Contributor):

The idea LGTM; we can simplify it further in the future by using a simple logical plan for each DataFrameWriter API. Then Spark Connect can just use that logical plan, instead of calling DataFrameWriter to generate the logical plan.
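
For illustration only, that follow-up idea might look roughly like the hypothetical node below; none of these names exist in this PR, and the sketch only shows the shape of a dedicated logical plan per DataFrameWriter API:

```scala
// Hypothetical sketch of the follow-up suggestion, not code from this PR:
// one small logical plan per DataFrameWriter API, which Spark Connect could
// build directly instead of going through DataFrameWriter.
case class SaveIntoPathCommand(
    child: LogicalPlan,
    source: String,
    path: String,
    mode: SaveMode,
    options: Map[String, String]) extends UnaryCommand {
  override protected def withNewChildInternal(newChild: LogicalPlan): SaveIntoPathCommand =
    copy(child = newChild)
}
```

Spark Connect could then construct such a node directly from the WriteOperation proto, bypassing DataFrameWriter entirely.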

@heyihong force-pushed the SPARK-52451 branch 3 times, most recently from 563d5d6 to d20692d (August 1, 2025 14:23)