Skip to content

fix process-worker-shutdown-crashed-state#21158

Open
jellyfish0316 wants to merge 3 commits intoPrefectHQ:mainfrom
jellyfish0316:fix/process-worker-shutdown-crashed-state
Open

fix process-worker-shutdown-crashed-state#21158
jellyfish0316 wants to merge 3 commits intoPrefectHQ:mainfrom
jellyfish0316:fix/process-worker-shutdown-crashed-state

Conversation

@jellyfish0316
Copy link
Copy Markdown
Contributor

related to #16746

this PR ensures that active flow runs executed by a process worker are marked as Crashed when the worker shuts down gracefully.

Summary

When a process worker receives a graceful shutdown signal, the flow-run subprocess can be interrupted while the worker is tearing down. Before this change, that shutdown path could remove local tracking for the running subprocess without proposing a terminal state back to the API, leaving the flow run stuck in Running.

This PR updates the runner shutdown/cancellation path so that if a started process-backed flow run is interrupted during worker shutdown, Prefect proposes a Crashed state instead of leaving the run indefinitely Running.

What changed

  • updated the runner's flow-run monitoring path in src/prefect/runner/runner.py
  • when a started process-backed flow run is interrupted during worker shutdown, the runner now proposes a Crashed state during shielded cleanup
  • made bundle subprocess waiting cancellable so shutdown can enter the intended cleanup path
  • preserved existing cancellation and crashed hook behavior for bundle execution paths

Tests

Added regression coverage for both the worker-facing and runner-facing paths:

  • tests/cli/test_worker.py
    • verifies that a real process worker receiving graceful shutdown while a flow run is active results in the flow run becoming Crashed
  • tests/runner/test_runner.py
    • verifies the runner cancellation/crashed behavior does not regress existing bundle execution semantics

Notes

This change is intentionally scoped to the process worker / runner shutdown path.

It does not attempt to solve:

  • generic worker-health-to-run-health coupling
  • failover to another worker
  • zombie detection for hard termination scenarios like SIGKILL or host loss

This change is intentionally implemented at the Prefect layer rather than in AnyIO. While AnyIO and internal cancellation behavior can influence subprocess teardown timing, responsibility for reconciling flow-run state during worker shutdown belongs to the runner.

@github-actions github-actions bot added the enhancement An improvement of an existing feature label Mar 18, 2026
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq bot commented Mar 18, 2026

Merging this PR will not alter performance

✅ 2 untouched benchmarks


Comparing jellyfish0316:fix/process-worker-shutdown-crashed-state (b669b1e) with main (3a03c6c)

Open in CodSpeed

@jellyfish0316
Copy link
Copy Markdown
Contributor Author

Updated this PR to preserve the existing flow-run execute SIGTERM=reschedule behavior.

The original fix correctly marked active process-worker flow runs as Crashed during graceful shutdown, but it also regressed tests/cli/test_flow_run.py::TestSignalHandling::test_flow_run_execute_sigterm_handling[reschedule] by allowing explicitly rescheduled flow runs to be reconciled as Crashed.

This follow-up narrows the shutdown reconciliation logic so that flow runs explicitly rescheduled for resubmission are excluded from the new crash-reporting path.

I re-ran the CI-like local test command against Python 3.14, and the previously failing test_flow_run_execute_sigterm_handling[reschedule] case no longer appears in the failure summary.

@jellyfish0316 jellyfish0316 reopened this Mar 18, 2026
Copy link
Copy Markdown
Member

@desertaxle desertaxle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for opening a PR @jellyfish0316! Your approach (catch cancellation, propose Crashed) is right, but the fix should go in FlowRunExecutor.submit() in src/prefect/runner/_flow_run_executor.py, not the Runner methods, because the Runner is being decomposed into single-responsibility services and eventually phased out.

FlowRunExecutor.submit() has the same gap: its except Exception block doesn't catch CancelledError. Adding a handler there would be ~5 lines in the right place vs ~70 lines across legacy methods. Also, ProcessManager.__aexit__ already kills survivor processes, so _terminate_bundle_process isn't needed.

@jellyfish0316 jellyfish0316 force-pushed the fix/process-worker-shutdown-crashed-state branch from b702a44 to 91a55cf Compare March 18, 2026 18:27
@jellyfish0316
Copy link
Copy Markdown
Contributor Author

Thanks for the detailed feedback.
Will update the PR ASAP — thanks again for the guidance!

Handle process worker shutdown without crashing rescheduled flow runs

Avoid crashing flow runs that were explicitly rescheduled

Handle shutdown crashes in executor and worker submission paths
@jellyfish0316 jellyfish0316 force-pushed the fix/process-worker-shutdown-crashed-state branch from 9c16c58 to dfcda4c Compare March 18, 2026 19:15
@jellyfish0316
Copy link
Copy Markdown
Contributor Author

this revision is not the final version yet.
I’ll take another pass on this tomorrow

@jellyfish0316
Copy link
Copy Markdown
Contributor Author

Updated this PR to move the main shutdown crash handling into FlowRunExecutor.submit() and removed the broader Runner-specific shutdown reconciliation changes from the earlier version.

While validating the real process worker path, I found that ProcessWorker still does not go entirely through FlowRunExecutor today. The actual scheduled worker execution path still reaches Runner.execute_flow_run() through BaseWorker._submit_run_and_capture_errors(), so I added the same cancellation-after-start handling there as a narrow compatibility fix for the current worker path.

I re-ran the targeted regressions locally and confirmed these pass:

  • tests/runner/test__flow_run_executor.py -k cancelled_after_start
  • tests/cli/test_worker.py -k graceful_shutdown_marks_active_flow_run_as_crashed
  • tests/cli/test_flow_run.py -k 'test_flow_run_execute_sigterm_handling and reschedule'
  • tests/runner/test_runner.py -k test_reschedule_flow_runs

@jellyfish0316
Copy link
Copy Markdown
Contributor Author

Hi @desertaxle, just checking if this looks good now or if anything else needs tweaking. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement An improvement of an existing feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants