Commit 0820e69
[train] TrainController reraises AsyncioActorExit (#59461)
# Summary
@justinvyu noticed the following logs:
```
(TrainController pid=95437) [State Transition] RUNNING -> ABORTED.
(TrainController pid=95437) [FailurePolicy] RAISE
(TrainController pid=95437) Source: controller
(TrainController pid=95437) Error count: 1 (max allowed: 0)
(TrainController pid=95437)
(TrainController pid=95437) Traceback (most recent call last):
(TrainController pid=95437) File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/v2/_internal/execution/controller/controller.py", line 433, in _step
(TrainController pid=95437) worker_group_status: WorkerGroupPollStatus = await self._poll_workers()
(TrainController pid=95437) File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 493, in _resume_span
(TrainController pid=95437) return await method(self, *_args, **_kwargs)
(TrainController pid=95437) File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/v2/_internal/execution/controller/controller.py", line 283, in _poll_workers
(TrainController pid=95437) ray.actor.exit_actor()
(TrainController pid=95437) File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/actor.py", line 2492, in exit_actor
(TrainController pid=95437) raise AsyncioActorExit()
(TrainController pid=95437) ray.train.ControllerError: Training failed due to controller error:
(TrainController pid=95437)
(TrainController pid=95437) [State Transition] ABORTED -> SHUTTING_DOWN.
```
The problem is that the fallback I implemented in
#58287 didn't work because the
`TrainController` caught the `AsyncioActorExit` raised by
`ray.actor.exit_actor` and handled it as a `ControllerError`. What we
actually want is to reraise the exception so that the abort finishes as
soon as possible.
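
To illustrate the behavior change, here is a minimal sketch that does not import Ray. `AsyncioActorExit` and `ControllerError` are local stand-ins for the Ray classes named above, and `poll_workers`, `step_before`, `step_after`, and `outcome` are hypothetical names for illustration only. It assumes the exit exception is an ordinary `Exception` subclass that a broad `except Exception` handler would catch, which matches the logs above. It is not the actual controller code.

```python
import asyncio


class AsyncioActorExit(Exception):
    """Local stand-in for the exception raised by ray.actor.exit_actor (assumption)."""


class ControllerError(Exception):
    """Local stand-in for ray.train.ControllerError (assumption)."""


async def poll_workers():
    # Hypothetical: simulates the abort path in which exit_actor() raises.
    raise AsyncioActorExit()


async def step_before():
    # A broad handler swallows the exit signal and reports a controller error.
    try:
        await poll_workers()
    except Exception as exc:
        return ControllerError(exc)


async def step_after():
    # The exit signal is reraised first; other exceptions are still wrapped.
    try:
        await poll_workers()
    except AsyncioActorExit:
        raise
    except Exception as exc:
        return ControllerError(exc)


def outcome(step_fn):
    """Run one step and describe what happened to the exit signal."""
    try:
        result = asyncio.run(step_fn())
    except AsyncioActorExit:
        return "actor exit propagated"
    return f"handled as {type(result).__name__}"
```

Calling `outcome(step_before)` reports that the exit was handled as a `ControllerError`, while `outcome(step_after)` reports that the exit propagated. The order of the `except` clauses matters: the narrower handler has to come before the broad one.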
# Testing
Unit tests. I didn't add a new unit test for this specifically because
the situation it covers occurs only intermittently and would require a
lot of contrived mocking to reproduce.
---------
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

1 parent bd412da commit 0820e69
1 file changed: 3 additions and 0 deletions in python/ray/train/v2/_internal/execution/controller
Added lines (new file numbering): 12, 435-436