Commit 0820e69

[train] TrainController reraises AsyncioActorExit (#59461)
# Summary

@justinvyu noticed the following logs:

```
(TrainController pid=95437) [State Transition] RUNNING -> ABORTED.
(TrainController pid=95437) [FailurePolicy] RAISE
(TrainController pid=95437)   Source: controller
(TrainController pid=95437)   Error count: 1 (max allowed: 0)
(TrainController pid=95437)
(TrainController pid=95437) Traceback (most recent call last):
(TrainController pid=95437)   File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/v2/_internal/execution/controller/controller.py", line 433, in _step
(TrainController pid=95437)     worker_group_status: WorkerGroupPollStatus = await self._poll_workers()
(TrainController pid=95437)   File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 493, in _resume_span
(TrainController pid=95437)     return await method(self, *_args, **_kwargs)
(TrainController pid=95437)   File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/v2/_internal/execution/controller/controller.py", line 283, in _poll_workers
(TrainController pid=95437)     ray.actor.exit_actor()
(TrainController pid=95437)   File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/actor.py", line 2492, in exit_actor
(TrainController pid=95437)     raise AsyncioActorExit()
(TrainController pid=95437) ray.train.ControllerError: Training failed due to controller error:
(TrainController pid=95437)
(TrainController pid=95437) [State Transition] ABORTED -> SHUTTING_DOWN.
```

The problem is that the fallback I implemented in #58287 didn't work because the `TrainController` caught the `AsyncioActorExit` raised by `ray.actor.exit_actor` and handled it as a `ControllerError`. However, what we actually want is to finish the abort asap by reraising the exception.

# Testing

Unit tests. I didn't add a new unit test for this specifically because the situation it covers happens flakily and would require a lot of contrived mocking to reproduce.

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
1 parent bd412da commit 0820e69

1 file changed, +3 -0 lines changed

python/ray/train/v2/_internal/execution/controller/controller.py

Lines changed: 3 additions & 0 deletions
@@ -9,6 +9,7 @@
 
 import ray
 import ray._private.ray_constants as ray_constants
+from ray.exceptions import AsyncioActorExit
 from ray.train.v2._internal.constants import (
     DEFAULT_ENABLE_CONTROLLER_LOGGING,
     DEFAULT_HEALTH_CHECK_INTERVAL_S,
@@ -431,6 +432,8 @@ async def _step(self) -> TrainControllerLoopIterationResult:
         elif isinstance(controller_state, RunningState):
             try:
                 worker_group_status: WorkerGroupPollStatus = await self._poll_workers()
+            except AsyncioActorExit:
+                raise
             except Exception as e:
                 training_failed_error = ControllerError(e)
                 failure_decision = self._failure_policy.make_decision(
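
The effect of the new `except AsyncioActorExit: raise` clause can be illustrated without Ray. The sketch below is only an approximation: `FakeActorExit`, `FakeControllerError`, `poll_workers`, `step_before_fix`, and `step_after_fix` are hypothetical stand-ins rather than Ray APIs, and it assumes the exit signal is an `Exception` subclass, which the traceback suggests since the broad `except Exception` handler caught it.

```python
import asyncio


class FakeActorExit(Exception):
    """Hypothetical stand-in for ray.exceptions.AsyncioActorExit."""


class FakeControllerError(Exception):
    """Hypothetical stand-in for ray.train.ControllerError."""


async def poll_workers():
    # Simulates _poll_workers() calling ray.actor.exit_actor() during an abort.
    raise FakeActorExit()


async def step_before_fix():
    try:
        await poll_workers()
    except Exception as e:
        # The exit signal is swallowed and wrapped as a controller failure.
        return FakeControllerError(e)


async def step_after_fix():
    try:
        await poll_workers()
    except FakeActorExit:
        # Listed before the broad handler, so the exit signal propagates.
        raise
    except Exception as e:
        return FakeControllerError(e)


def run_after_fix():
    # Reports whether the exit signal escaped the step function.
    try:
        asyncio.run(step_after_fix())
    except FakeActorExit:
        return "propagated"
    return "swallowed"


before = asyncio.run(step_before_fix())
after = run_after_fix()
```

Python evaluates `except` clauses in order, so the narrower clause has to come before `except Exception` for the reraise to take effect.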
