fix process-worker-shutdown-crashed-state #21158
jellyfish0316 wants to merge 3 commits into PrefectHQ:main
Updated this PR to preserve the existing The original fix correctly marked active process-worker flow runs as This follow-up narrows the shutdown reconciliation logic so that flow runs explicitly rescheduled for resubmission are excluded from the new crash-reporting path. I re-ran the CI-like local test command against Python 3.14, and the previously failing |
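The exclusion described in this comment can be pictured as a simple filter applied before crash reporting. This is only an illustrative sketch: `runs_to_mark_crashed` and both of its parameters are hypothetical names, not identifiers from the Prefect codebase.

```python
def runs_to_mark_crashed(active_run_ids, rescheduled_run_ids):
    """Return the active runs that should be reported as Crashed on shutdown.

    Hypothetical helper: runs that were explicitly rescheduled for
    resubmission are skipped so they keep their rescheduled state.
    """
    rescheduled = set(rescheduled_run_ids)
    return [run_id for run_id in active_run_ids if run_id not in rescheduled]


# "run-b" was rescheduled, so only the other two would be marked Crashed.
crashed = runs_to_mark_crashed(["run-a", "run-b", "run-c"], {"run-b"})
```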
desertaxle
left a comment
Thanks for opening a PR @jellyfish0316! Your approach (catch cancellation, propose Crashed) is right, but the fix should go in FlowRunExecutor.submit() in src/prefect/runner/_flow_run_executor.py, not the Runner methods, because the Runner is being decomposed into single-responsibility services and eventually phased out.
FlowRunExecutor.submit() has the same gap: its except Exception block doesn't catch CancelledError. Adding a handler there would be ~5 lines in the right place vs ~70 lines across legacy methods. Also, ProcessManager.__aexit__ already kills survivor processes, so _terminate_bundle_process isn't needed.
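To illustrate the gap mentioned above: since Python 3.8, `asyncio.CancelledError` derives from `BaseException`, so an `except Exception` block never sees it. The sketch below uses plain `asyncio` and stand-in names (`submit`, `propose_crashed`, `proposed_states`) that are assumptions for illustration, not the actual `FlowRunExecutor` code.

```python
import asyncio

proposed_states = []  # stand-in for states proposed to the API


async def propose_crashed(flow_run_id):
    # Hypothetical helper; a real implementation would call the API.
    proposed_states.append((flow_run_id, "Crashed"))


async def submit(flow_run_id, started):
    try:
        started.set()
        await asyncio.sleep(3600)  # stands in for waiting on the subprocess
    except Exception:
        # Never reached on cancellation: CancelledError is a BaseException.
        proposed_states.append((flow_run_id, "Failed"))
    except asyncio.CancelledError:
        # The extra handler: report the crash, then re-raise so the
        # cancellation still propagates to the caller.
        await propose_crashed(flow_run_id)
        raise


async def main():
    started = asyncio.Event()
    task = asyncio.create_task(submit("run-1", started))
    await started.wait()
    task.cancel()  # simulates the worker shutting down
    try:
        await task
    except asyncio.CancelledError:
        pass
    return task.cancelled()


was_cancelled = asyncio.run(main())
```

Re-raising after reporting keeps the task marked as cancelled, so surrounding shutdown logic behaves as it did before the handler was added.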
b702a44 to
91a55cf
Thanks for the detailed feedback.
Handle process worker shutdown without crashing rescheduled flow runs
Avoid crashing flow runs that were explicitly rescheduled
Handle shutdown crashes in executor and worker submission paths
9c16c58 to
dfcda4c
This revision is not the final version yet.
Updated this PR to move the main shutdown crash handling into While validating the real process worker path, I found that I re-ran the targeted regressions locally and confirmed these pass:
Hi @desertaxle, just checking if this looks good now or if anything else needs tweaking. Thanks!
Related to #16746

This PR ensures that active flow runs executed by a process worker are marked as Crashed when the worker shuts down gracefully.

Summary
When a process worker receives a graceful shutdown signal, the flow-run subprocess can be interrupted while the worker is tearing down. Before this change, that shutdown path could remove local tracking for the running subprocess without proposing a terminal state back to the API, leaving the flow run stuck in Running.
This PR updates the runner shutdown/cancellation path so that if a started process-backed flow run is interrupted during worker shutdown, Prefect proposes a Crashed state instead of leaving the run indefinitely Running.

What changed
src/prefect/runner/runner.py: Crashed state during shielded cleanup

Tests
Added regression coverage for both the worker-facing and runner-facing paths:
tests/cli/test_worker.py: process worker receiving graceful shutdown while a flow run is active results in the flow run becoming Crashed
tests/runner/test_runner.py

Notes
This change is intentionally scoped to the process worker / runner shutdown path.
It does not attempt to solve:
SIGKILL or host loss

This change is intentionally implemented at the Prefect layer rather than in AnyIO. While AnyIO and internal cancellation behavior can influence subprocess teardown timing, responsibility for reconciling flow-run state during worker shutdown belongs to the runner.
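As a rough illustration of the "shielded cleanup" mentioned under What changed, the sketch below uses the standard library's `asyncio.shield` as a stand-in for whatever shielding mechanism the runner actually uses. The names `run_flow`, `propose_crashed`, and `reported` are hypothetical. The point is that a second cancellation arriving during shutdown does not interrupt the state report.

```python
import asyncio

reported = []  # stand-in for Crashed states that reached the API


async def propose_crashed(run_id):
    await asyncio.sleep(0.01)  # simulated API round trip
    reported.append(run_id)


async def run_flow(run_id, started):
    try:
        started.set()
        await asyncio.sleep(3600)  # stands in for the running subprocess
    except asyncio.CancelledError:
        report = asyncio.ensure_future(propose_crashed(run_id))
        try:
            # shield() keeps the report running even if this task is
            # cancelled again while waiting on it.
            await asyncio.shield(report)
        except asyncio.CancelledError:
            # A second cancel arrived; the report is still in flight,
            # so wait for it to finish before propagating.
            await report
        raise


async def main():
    started = asyncio.Event()
    task = asyncio.create_task(run_flow("run-1", started))
    await started.wait()
    task.cancel()           # first cancel: worker begins shutdown
    await asyncio.sleep(0)  # let the task enter its cleanup handler
    task.cancel()           # second cancel lands during cleanup
    try:
        await task
    except asyncio.CancelledError:
        pass


asyncio.run(main())
```

Without the shield, the second cancel would be delivered inside `propose_crashed` itself and the report could be lost.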