[train] Exit actor and log appropriately when poll_workers is in terminal state #58287
Merged
justinvyu merged 3 commits into ray-project:master on Nov 17, 2025
…inal state Signed-off-by: Timothy Seah <tseah@anyscale.com>
justinvyu
reviewed
Nov 17, 2025
Signed-off-by: Timothy Seah <tseah@anyscale.com>
justinvyu
approved these changes
Nov 17, 2025
Aydin-ab
pushed a commit
to Aydin-ab/ray-aydin
that referenced
this pull request
Nov 19, 2025
…inal state (ray-project#58287)

1) Add traceback to all `ControllerError`s and log it when making a failure decision so we can see where `Worker group is not active. Call WorkerGroup.create() to create a new worker group.` is coming from. **I also sanity checked that this does not cause `UserExceptionWithTraceback` to double print the traceback because this only applies to ControllerError**
2) `_poll_workers` has the only `asyncio.sleep` in the Ray Train controller. After waking up, it exits from the foreground asyncio task if its state is terminal, which can happen due to the issue mentioned in 5).

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
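Item 2) above can be pictured with a small asyncio sketch. It is only an illustration under stated assumptions: `ToyController`, `State`, and `TERMINAL_STATES` are hypothetical names, Ray is not imported, and returning `"exit"` stands in for whatever the real controller does to exit the actor. The point is that a task suspended in `asyncio.sleep` can wake up in a state it did not expect, so it re-checks the state before touching the worker group.

```python
import asyncio
from enum import Enum


class State(Enum):
    RUNNING = "RUNNING"
    ABORTED = "ABORTED"


# Hypothetical set of states after which polling must not continue.
TERMINAL_STATES = {State.ABORTED}


class ToyController:
    """Hypothetical stand-in for a controller; not Ray Train's actual class."""

    def __init__(self, poll_interval_s: float = 0.01):
        self.state = State.RUNNING
        self.poll_interval_s = poll_interval_s
        self.worker_group_active = True

    async def abort(self) -> None:
        # In this sketch, abort runs while the poll task is suspended in sleep.
        self.state = State.ABORTED
        self.worker_group_active = False

    async def poll_workers(self) -> str:
        await asyncio.sleep(self.poll_interval_s)
        # The state may have changed while this task was suspended,
        # so re-check it before using the worker group.
        if self.state in TERMINAL_STATES:
            return "exit"  # placeholder for exiting the actor
        assert self.worker_group_active, "Worker group is not active."
        return "polled"


async def demo() -> str:
    controller = ToyController()
    poll = asyncio.create_task(controller.poll_workers())
    await asyncio.sleep(0)  # yield once so the poll task starts sleeping
    await controller.abort()
    return await poll


result = asyncio.run(demo())
print(result)
```

Without the terminal-state check, the same interleaving would reach the assert and fail with `Worker group is not active.`.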
matthewdeng
added a commit
that referenced
this pull request
Dec 16, 2025
# Summary

@justinvyu noticed the following logs

```
(TrainController pid=95437) [State Transition] RUNNING -> ABORTED.
(TrainController pid=95437) [FailurePolicy] RAISE
(TrainController pid=95437) Source: controller
(TrainController pid=95437) Error count: 1 (max allowed: 0)
(TrainController pid=95437)
(TrainController pid=95437) Traceback (most recent call last):
(TrainController pid=95437) File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/v2/_internal/execution/controller/controller.py", line 433, in _step
(TrainController pid=95437) worker_group_status: WorkerGroupPollStatus = await self._poll_workers()
(TrainController pid=95437) File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 493, in _resume_span
(TrainController pid=95437) return await method(self, *_args, **_kwargs)
(TrainController pid=95437) File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/v2/_internal/execution/controller/controller.py", line 283, in _poll_workers
(TrainController pid=95437) ray.actor.exit_actor()
(TrainController pid=95437) File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/actor.py", line 2492, in exit_actor
(TrainController pid=95437) raise AsyncioActorExit()
(TrainController pid=95437) ray.train.ControllerError: Training failed due to controller error:
(TrainController pid=95437)
(TrainController pid=95437) [State Transition] ABORTED -> SHUTTING_DOWN.
```

The problem is that the fallback I implemented in #58287 didn't work because the `TrainController` caught the `AsyncioActorExit` raised by `ray.actor.exit_actor` and handled it as a `ControllerError`. However, what we actually want is to finish the abort asap by reraising the exception.

# Testing

Unit tests. I didn't add a new unit test for this specifically because the situation it covers happens flakily and would require a lot of contrived mocking to reproduce.

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
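The re-raise described above can be sketched as follows. This is a minimal illustration rather than the controller's actual code: `ActorExit` and `ControllerError` here are hypothetical stand-ins defined locally (Ray is not imported), and `step` and `poll_workers` only mimic the shape of the calls in the log. The idea is to handle the exit signal in its own `except` clause, before the broad handler that wraps everything else.

```python
import asyncio


class ActorExit(Exception):
    """Hypothetical stand-in for the exit exception seen in the log above."""


class ControllerError(Exception):
    """Hypothetical stand-in for a wrapped controller failure."""


async def poll_workers(should_exit: bool) -> str:
    if should_exit:
        raise ActorExit()
    raise RuntimeError("poll failed")


async def step(should_exit: bool) -> str:
    try:
        return await poll_workers(should_exit)
    except ActorExit:
        # Let the exit signal propagate untouched so the abort can finish.
        raise
    except Exception as e:
        # Anything else is treated as a controller failure.
        raise ControllerError(
            f"Training failed due to controller error: {e}"
        ) from e


def outcome(should_exit: bool) -> str:
    try:
        asyncio.run(step(should_exit))
    except ActorExit:
        return "exit propagated"
    except ControllerError:
        return "wrapped as ControllerError"
    return "ok"


print(outcome(True))
print(outcome(False))
```

The order of the `except` clauses matters: if the broad `except Exception` came first, the exit signal would be wrapped like any other error, which is the behavior the log shows.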
Summary
@justinvyu observed the following logs:
This indicates the following sequence of events:
1) The controller is in the `RUNNING` state.
2) The controller calls `_poll_workers` and therefore `asyncio.sleep`s.
3) The state transitions `RUNNING -> ABORTED`.
4) An `asyncio.task` calls `ray.actor.exit_actor`, but https://docs.ray.io/en/latest/ray-core/api/doc/ray.actor.exit_actor.html says "For asyncio actors, there may be a short delay before the actor exits if the API is called from a background task".
5) Control returns to the `_poll_workers` task, which fails an assert (the `Worker group is not active` log above) and goes through the `SHUTTING_DOWN` + `ERRORED` path.

This PR does the following:

1) Add traceback to all `ControllerError`s and log it when making a failure decision so we can see where `Worker group is not active. Call WorkerGroup.create() to create a new worker group.` is coming from. I also sanity checked that this does not cause `UserExceptionWithTraceback` to double print the traceback because this only applies to `ControllerError`.
2) `_poll_workers` has the only `asyncio.sleep` in the Ray Train controller. After waking up, it exits from the foreground asyncio task if its state is terminal, which can happen due to the issue mentioned in 5).

Testing

Unit tests. I was unable to reproduce this in a workspace since the `exit_actor` behavior is inconsistent.

I created a fake failure to showcase what the logs look like before and after adding a traceback:
Before
After
Here is what worker group failures look like without a traceback:
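For change 1) in the summary, one way to keep a traceback with a controller-style error is sketched below. It is an assumption-laden illustration, not the PR's implementation: `ToyControllerError`, `poll_inactive_worker_group`, `make_failure_decision`, and the logger name are all hypothetical, and the log message format is made up for the example. Only the standard-library `traceback` and `logging` modules are used.

```python
import logging
import traceback

logger = logging.getLogger("toy_controller")  # hypothetical logger name


class ToyControllerError(Exception):
    """Hypothetical error type that keeps the traceback of its cause."""

    def __init__(self, cause: BaseException):
        super().__init__(f"Training failed due to controller error: {cause}")
        self.cause = cause
        # Format eagerly so the text is still available wherever the
        # error object ends up being logged.
        self.traceback_str = "".join(
            traceback.format_exception(type(cause), cause, cause.__traceback__)
        )


def poll_inactive_worker_group() -> None:
    # Fake failure, similar in spirit to the one described under Testing.
    raise RuntimeError("Worker group is not active.")


def make_failure_decision() -> ToyControllerError:
    try:
        poll_inactive_worker_group()
    except Exception as e:
        error = ToyControllerError(e)
        # Log the traceback once, at the point where the decision is made.
        logger.error("Controller failure:\n%s", error.traceback_str)
        return error
    raise AssertionError("the fake failure above always raises")


error = make_failure_decision()
print(error.traceback_str)
```

With the traceback attached, the log names the function that raised (`poll_inactive_worker_group` here) instead of showing only the error message.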