fix process-worker-shutdown-crashed-state by jellyfish0316 · Pull Request #21158 · PrefectHQ/prefect

jellyfish0316 · 2026-03-18T10:43:04Z

related to #16746

this PR ensures that active flow runs executed by a process worker are marked as Crashed when the worker shuts down gracefully.

Summary

When a process worker receives a graceful shutdown signal, the flow-run subprocess can be interrupted while the worker is tearing down. Before this change, that shutdown path could remove local tracking for the running subprocess without proposing a terminal state back to the API, leaving the flow run stuck in Running.

This PR updates the runner shutdown/cancellation path so that if a started process-backed flow run is interrupted during worker shutdown, Prefect proposes a Crashed state instead of leaving the run indefinitely Running.

What changed

updated the runner's flow-run monitoring path in src/prefect/runner/runner.py
when a started process-backed flow run is interrupted during worker shutdown, the runner now proposes a Crashed state during shielded cleanup
made bundle subprocess waiting cancellable so shutdown can enter the intended cleanup path
preserved existing cancellation and crashed hook behavior for bundle execution paths

Tests

Added regression coverage for both the worker-facing and runner-facing paths:

tests/cli/test_worker.py
- verifies that a real process worker receiving graceful shutdown while a flow run is active results in the flow run becoming Crashed
tests/runner/test_runner.py
- verifies the runner cancellation/crashed behavior does not regress existing bundle execution semantics

Notes

This change is intentionally scoped to the process worker / runner shutdown path.

It does not attempt to solve:

generic worker-health-to-run-health coupling
failover to another worker
zombie detection for hard termination scenarios like SIGKILL or host loss

This change is intentionally implemented at the Prefect layer rather than in AnyIO. While AnyIO and internal cancellation behavior can influence subprocess teardown timing, responsibility for reconciling flow-run state during worker shutdown belongs to the runner.

codspeed-hq · 2026-03-18T10:47:41Z

Merging this PR will not alter performance

✅ 2 untouched benchmarks

_{Comparing jellyfish0316:fix/process-worker-shutdown-crashed-state (b669b1e) with main (3a03c6c)}

jellyfish0316 · 2026-03-18T13:53:14Z

Updated this PR to preserve the existing flow-run execute SIGTERM=reschedule behavior.

The original fix correctly marked active process-worker flow runs as Crashed during graceful shutdown, but it also regressed tests/cli/test_flow_run.py::TestSignalHandling::test_flow_run_execute_sigterm_handling[reschedule] by allowing explicitly rescheduled flow runs to be reconciled as Crashed.

This follow-up narrows the shutdown reconciliation logic so that flow runs explicitly rescheduled for resubmission are excluded from the new crash-reporting path.

I re-ran the CI-like local test command against Python 3.14, and the previously failing test_flow_run_execute_sigterm_handling[reschedule] case no longer appears in the failure summary.

desertaxle

Thanks for opening a PR @jellyfish0316! Your approach (catch cancellation, propose Crashed) is right, but the fix should go in FlowRunExecutor.submit() in src/prefect/runner/_flow_run_executor.py, not the Runner methods, because the Runner is being decomposed into single-responsibility services and eventually phased out.

FlowRunExecutor.submit() has the same gap: its except Exception block doesn't catch CancelledError. Adding a handler there would be ~5 lines in the right place vs ~70 lines across legacy methods. Also, ProcessManager.__aexit__ already kills survivor processes, so _terminate_bundle_process isn't needed.

jellyfish0316 · 2026-03-18T18:32:58Z

Thanks for the detailed feedback.
Will update the PR ASAP — thanks again for the guidance!

Handle process worker shutdown without crashing rescheduled flow runs Avoid crashing flow runs that were explicitly rescheduled Handle shutdown crashes in executor and worker submission paths

jellyfish0316 · 2026-03-18T19:20:21Z

this revision is not the final version yet.
I’ll take another pass on this tomorrow

jellyfish0316 · 2026-03-19T04:56:38Z

Updated this PR to move the main shutdown crash handling into FlowRunExecutor.submit() and removed the broader Runner-specific shutdown reconciliation changes from the earlier version.

While validating the real process worker path, I found that ProcessWorker still does not go entirely through FlowRunExecutor today. The actual scheduled worker execution path still reaches Runner.execute_flow_run() through BaseWorker._submit_run_and_capture_errors(), so I added the same cancellation-after-start handling there as a narrow compatibility fix for the current worker path.

I re-ran the targeted regressions locally and confirmed these pass:

tests/runner/test__flow_run_executor.py -k cancelled_after_start
tests/cli/test_worker.py -k graceful_shutdown_marks_active_flow_run_as_crashed
tests/cli/test_flow_run.py -k 'test_flow_run_execute_sigterm_handling and reschedule'
tests/runner/test_runner.py -k test_reschedule_flow_runs

src/prefect/runner/runner.py

jellyfish0316 · 2026-03-25T10:32:09Z

Hi @desertaxle, just checking if this looks good now or if anything else needs tweaking. Thanks!

jellyfish0316 requested review from chrisguidry, cicdw, desertaxle and zzstoatzz as code owners March 18, 2026 10:43

github-actions bot added the enhancement An improvement of an existing feature label Mar 18, 2026

jellyfish0316 closed this Mar 18, 2026

jellyfish0316 reopened this Mar 18, 2026

jellyfish0316 closed this Mar 18, 2026

jellyfish0316 reopened this Mar 18, 2026

desertaxle requested changes Mar 18, 2026

View reviewed changes

jellyfish0316 force-pushed the fix/process-worker-shutdown-crashed-state branch from b702a44 to 91a55cf Compare March 18, 2026 18:27

fix process-worker-shutdown-crashed-state

dfcda4c

Handle process worker shutdown without crashing rescheduled flow runs Avoid crashing flow runs that were explicitly rescheduled Handle shutdown crashes in executor and worker submission paths

jellyfish0316 force-pushed the fix/process-worker-shutdown-crashed-state branch from 9c16c58 to dfcda4c Compare March 18, 2026 19:15

Preserve rescheduled flow runs during shutdown

f732526

jellyfish0316 requested a review from desertaxle March 20, 2026 05:21

desertaxle requested changes Mar 20, 2026

View reviewed changes

src/prefect/runner/runner.py Outdated Show resolved Hide resolved

Fix cancellation ordering in runner

b669b1e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix process-worker-shutdown-crashed-state#21158

fix process-worker-shutdown-crashed-state#21158
jellyfish0316 wants to merge 3 commits intoPrefectHQ:mainfrom
jellyfish0316:fix/process-worker-shutdown-crashed-state

jellyfish0316 commented Mar 18, 2026

Uh oh!

codspeed-hq bot commented Mar 18, 2026 •

edited

Loading

Uh oh!

jellyfish0316 commented Mar 18, 2026

Uh oh!

desertaxle left a comment

Uh oh!

jellyfish0316 commented Mar 18, 2026

Uh oh!

jellyfish0316 commented Mar 18, 2026

Uh oh!

jellyfish0316 commented Mar 19, 2026

Uh oh!

Uh oh!

jellyfish0316 commented Mar 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jellyfish0316 commented Mar 18, 2026

Summary

What changed

Tests

Notes

Uh oh!

codspeed-hq bot commented Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will not alter performance

Uh oh!

jellyfish0316 commented Mar 18, 2026

Uh oh!

desertaxle left a comment

Choose a reason for hiding this comment

Uh oh!

jellyfish0316 commented Mar 18, 2026

Uh oh!

jellyfish0316 commented Mar 18, 2026

Uh oh!

jellyfish0316 commented Mar 19, 2026

Uh oh!

Uh oh!

jellyfish0316 commented Mar 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codspeed-hq bot commented Mar 18, 2026 •

edited

Loading