
[train] Exit actor and log appropriately when poll_workers is in terminal state #58287

Merged
justinvyu merged 3 commits into ray-project:master from TimothySeah:tseah/fix-shutting-down-aborted
Nov 17, 2025

@TimothySeah TimothySeah commented Oct 29, 2025

Summary

@justinvyu observed the following logs:

(TrainController pid=109066) [State Transition] RUNNING -> ABORTED.
(TrainController pid=109066) [FailurePolicy] Decision: FailureDecision.RAISE, Error source: controller, Error count / maximum errors allowed: 1/0, Error: Training failed due to controller error:
(TrainController pid=109066) Worker group is not active. Call WorkerGroup.create() to create a new worker group.
(TrainController pid=109066) [State Transition] ABORTED -> SHUTTING_DOWN.

This indicates the following sequence of events:

  1. The controller is RUNNING.
  2. The controller calls _poll_workers, which awaits asyncio.sleep and so yields control to the event loop.
  3. The user presses Ctrl-C.
  4. The asyncio event loop switches to the abort asyncio task, which shuts down the worker group and prints RUNNING -> ABORTED.
  5. The abort asyncio task calls ray.actor.exit_actor, but https://docs.ray.io/en/latest/ray-core/api/doc/ray.actor.exit_actor.html notes: "For asyncio actors, there may be a short delay before the actor exits if the API is called from a background task."
  6. The asyncio event loop switches back to the _poll_workers task, which fails an assert (the "Worker group is not active" log above) and goes through the SHUTTING_DOWN + ERRORED path.
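
The interleaving above can be imitated with plain asyncio and no Ray at all. This is only a rough sketch: `FakeWorkerGroup`, `poll_workers`, and `abort` are hypothetical stand-ins rather than the controller's real code, the failing assert is modeled as a `RuntimeError`, and the delayed actor exit is modeled by the abort coroutine simply returning.

```python
import asyncio


class FakeWorkerGroup:
    """Hypothetical stand-in for the real worker group."""

    def __init__(self):
        self.active = True

    def shutdown(self):
        self.active = False

    def poll(self):
        # Models the check behind the "Worker group is not active" log line.
        if not self.active:
            raise RuntimeError("Worker group is not active.")
        return "ok"


async def poll_workers(group, events):
    await asyncio.sleep(0.01)  # step 2: yields control to the event loop
    events.append("poll woke up")
    return group.poll()  # step 6: runs after the abort has already happened


async def abort(group, events):
    group.shutdown()  # step 4: the abort task shuts down the worker group
    events.append("RUNNING -> ABORTED")
    # Step 5: the actor does not exit immediately, so the loop keeps running.


async def main():
    events = []
    group = FakeWorkerGroup()
    poll_task = asyncio.create_task(poll_workers(group, events))
    await asyncio.sleep(0)  # let the poll task start and reach its sleep
    await abort(group, events)  # step 3: Ctrl-C arrives while poll sleeps
    try:
        await poll_task
    except RuntimeError as exc:
        events.append(f"poll failed: {exc}")
    return events


events = asyncio.run(main())
print(events)
```

The poll task only notices the shutdown after it wakes up, which is why the failure is logged after the ABORTED transition.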

This PR does the following:

  1. Adds a traceback to all ControllerErrors and logs it when making a failure decision, so we can see where "Worker group is not active. Call WorkerGroup.create() to create a new worker group." is coming from. I also sanity-checked that this does not cause UserExceptionWithTraceback to print the traceback twice, because the change only applies to ControllerError.
  2. _poll_workers has the only asyncio.sleep in the Ray Train controller. After waking up, it now exits the actor from the foreground asyncio task if the controller state is terminal, which can happen because of the delayed exit described in step 5 above.
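
A minimal sketch of the second change, under stated assumptions: `State`, `TERMINAL_STATES`, `SketchController`, and `ExitRequested` are invented for illustration, and `ExitRequested` stands in for whatever `ray.actor.exit_actor` does in the real controller. The idea is only that the state is re-checked after the sleep, because another task may have changed it in the meantime.

```python
import asyncio
from enum import Enum


class State(Enum):
    RUNNING = "RUNNING"
    ABORTED = "ABORTED"
    FINISHED = "FINISHED"


# Assumed set of terminal states; the real controller may define more.
TERMINAL_STATES = {State.ABORTED, State.FINISHED}


class ExitRequested(Exception):
    """Hypothetical stand-in for exiting the actor from the foreground task."""


class SketchController:
    def __init__(self):
        self.state = State.RUNNING
        self.polled = False

    async def poll_workers(self, interval_s=0.01):
        await asyncio.sleep(interval_s)
        # The state may have changed while this task was suspended.
        if self.state in TERMINAL_STATES:
            raise ExitRequested(f"state is {self.state.name}")
        self.polled = True
        return "poll status"


async def run_with_abort():
    controller = SketchController()
    task = asyncio.create_task(controller.poll_workers())
    await asyncio.sleep(0)  # let the poll task reach its sleep
    controller.state = State.ABORTED  # abort happens while poll sleeps
    try:
        await task
    except ExitRequested as exc:
        return controller.polled, str(exc)
    return controller.polled, "no exit"


result = asyncio.run(run_with_abort())
print(result)
```

Checking right after the only sleep keeps the guard in one place, since that is the only point where the poll task can be interleaved with the abort task.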

Testing

Unit tests. I was unable to reproduce this in a workspace because the exit_actor behavior is inconsistent.

I created a fake failure to showcase what the logs look like before and after adding a traceback:

Before

(TrainController pid=90972) [FailurePolicy] RAISE
(TrainController pid=90972)   Source: controller
(TrainController pid=90972)   Error count: 1 (max allowed: 0)
(TrainController pid=90972) 
(TrainController pid=90972) Training failed due to controller error:
(TrainController pid=90972) fake error

After

(TrainController pid=15478) [FailurePolicy] RAISE
(TrainController pid=15478)   Source: controller
(TrainController pid=15478)   Error count: 1 (max allowed: 0)
(TrainController pid=15478) 
(TrainController pid=15478) Traceback (most recent call last):
(TrainController pid=15478)   File "/Users/tseah/ray/python/ray/train/v2/_internal/execution/controller/controller.py", line 421, in _step
(TrainController pid=15478)     raise ValueError("fake error")
(TrainController pid=15478) ray.train.ControllerError: Training failed due to controller error:
(TrainController pid=15478) fake error
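
One way such a log could be produced is sketched below. It uses only the standard-library `traceback` module; `SketchControllerError`, `step`, and `format_failure` are hypothetical names, and the real `ControllerError` may attach and format its traceback differently.

```python
import traceback


class SketchControllerError(Exception):
    """Hypothetical stand-in for ray.train.ControllerError."""

    def __init__(self, cause: BaseException):
        super().__init__(f"Training failed due to controller error:\n{cause}")
        # Reuse the cause's traceback so the log shows where it was raised.
        self.with_traceback(cause.__traceback__)


def step():
    raise ValueError("fake error")


def format_failure(error: BaseException) -> str:
    # Render the error together with its attached traceback as one string.
    return "".join(
        traceback.format_exception(type(error), error, error.__traceback__)
    )


try:
    step()
except ValueError as exc:
    error = SketchControllerError(exc)

report = format_failure(error)
print(report)
```

The rendered report starts with the usual traceback header, includes the `step` frame, and ends with the wrapped error message, which is the same overall shape as the "After" log.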

Here is what worker group failures look like without a traceback:

python/ray/train/v2/tests/test_failure_policy.py::test_max_failures[1] ✓
2025-11-17 13:27:49,354 INFO default.py:44 -- [FailurePolicy] RETRY
  Source: worker group
  Error count: 1 (max allowed: 10)

ray.train.WorkerGroupError: Training failed due to worker errors:
Worker group failed

…inal state

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah changed the title [train] Exit actor and log appropriately when poll_workers is in term… [train] Exit actor and log appropriately when poll_workers is in terminal state Oct 29, 2025
@TimothySeah TimothySeah marked this pull request as ready for review October 29, 2025 22:36
@TimothySeah TimothySeah requested a review from a team as a code owner October 29, 2025 22:36
@ray-gardener ray-gardener bot added the train Ray Train Related Issue label Oct 30, 2025
@justinvyu justinvyu left a comment

Thanks!

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah added the go add ONLY when ready to merge, run all tests label Nov 17, 2025
@justinvyu justinvyu enabled auto-merge (squash) November 17, 2025 21:46
@github-actions github-actions bot disabled auto-merge November 17, 2025 22:45
@justinvyu justinvyu merged commit 96bc3b6 into ray-project:master Nov 17, 2025
6 checks passed
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…inal state (ray-project#58287)

1) Add traceback to all `ControllerError`s and log it when making a
failure decision so we can see where `Worker group is not active. Call
WorkerGroup.create() to create a new worker group.` is coming from. **I
also sanity checked that this does not cause
`UserExceptionWithTraceback` to double print the traceback because this
only applies to ControllerError**
2) `_poll_workers` has the only `asyncio.sleep` in the Ray Train
controller. After waking up, it exits the actor from the foreground
asyncio task if the controller state is terminal, which can happen due
to the delayed `exit_actor` described in step 5 of the PR description.

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: YK <1811651+ykdojo@users.noreply.github.com>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
matthewdeng added a commit that referenced this pull request Dec 16, 2025
# Summary

@justinvyu noticed the following logs 

```
(TrainController pid=95437) [State Transition] RUNNING -> ABORTED.
(TrainController pid=95437) [FailurePolicy] RAISE
(TrainController pid=95437)   Source: controller
(TrainController pid=95437)   Error count: 1 (max allowed: 0)
(TrainController pid=95437) 
(TrainController pid=95437) Traceback (most recent call last):
(TrainController pid=95437)   File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/v2/_internal/execution/controller/controller.py", line 433, in _step
(TrainController pid=95437)     worker_group_status: WorkerGroupPollStatus = await self._poll_workers()
(TrainController pid=95437)   File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 493, in _resume_span
(TrainController pid=95437)     return await method(self, *_args, **_kwargs)
(TrainController pid=95437)   File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/v2/_internal/execution/controller/controller.py", line 283, in _poll_workers
(TrainController pid=95437)     ray.actor.exit_actor()
(TrainController pid=95437)   File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/actor.py", line 2492, in exit_actor
(TrainController pid=95437)     raise AsyncioActorExit()
(TrainController pid=95437) ray.train.ControllerError: Training failed due to controller error:
(TrainController pid=95437) 
(TrainController pid=95437) [State Transition] ABORTED -> SHUTTING_DOWN.
```

The problem is that the fallback I implemented in
#58287 didn't work because the
`TrainController` caught the `AsyncioActorExit` raised by
`ray.actor.exit_actor` and handled it as a `ControllerError`. What we
actually want is to finish the abort as soon as possible by re-raising
the exception.
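
The difference between swallowing and re-raising can be sketched without Ray. Everything here is an assumption made for illustration: `SketchActorExit` and `SketchControllerError` are hypothetical stand-ins, and the sketch assumes the exit signal is an ordinary `Exception` subclass, which is consistent with a broad handler having caught it.

```python
class SketchActorExit(Exception):
    """Hypothetical stand-in for the AsyncioActorExit in the traceback."""


class SketchControllerError(Exception):
    """Hypothetical stand-in for ControllerError."""


def poll_workers(state):
    if state == "ABORTED":
        raise SketchActorExit()
    return "poll status"


def step_before(state):
    # A broad handler treats the exit signal like any controller failure.
    try:
        return poll_workers(state)
    except Exception as exc:
        return SketchControllerError(exc)


def step_after(state):
    # The exit signal is re-raised first so the abort can finish.
    try:
        return poll_workers(state)
    except SketchActorExit:
        raise
    except Exception as exc:
        return SketchControllerError(exc)


before = step_before("ABORTED")
try:
    step_after("ABORTED")
    after = "swallowed"
except SketchActorExit:
    after = "propagated"

print(type(before).__name__, after)
```

Ordering matters: the narrow `except` clause has to come before the broad one, otherwise the exit signal is still converted into a controller error.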

# Testing

Unit tests. I didn't add a new unit test for this specifically because
the situation it covers occurs only intermittently and would require a
lot of contrived mocking to reproduce.

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
cszhu pushed a commit that referenced this pull request Dec 17, 2025
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
zzchun pushed a commit to zzchun/ray that referenced this pull request Dec 18, 2025
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Yicheng-Lu-llll pushed a commit to Yicheng-Lu-llll/ray that referenced this pull request Dec 22, 2025
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>