[Train] Skip incrementing failure counter on preemption node died failures #41285
Conversation
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
```python
trial.set_location(_Location())

if exception:
    trial.handle_error(exc=exception)
```
Key change 1: The reason for moving this `trial.handle_error` is:

- `handle_error` is what increments the number of failures.
- Upon a trial failure, we used to check `trial.should_recover` BEFORE incrementing the new failure, so `trial.num_failures` is 1 less than it should be at that check. (Let's say `num_failures=2, max_failures=3` at this point.) `handle_error` would happen afterwards right here. (`num_failures=3` now.)
- The old `should_recover` condition made it impossible for us to try recovering on a preemption error. (Even though `handle_error` would noop for the preemption error, we're already at `num_failures == max_failures = 3`, so `should_recover=False`.)
- It's more intuitive to have `num_failures` updated by the time of `trial.should_recover`, so now we just handle the error separately.
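The ordering described above can be sketched with a toy counter (a hypothetical stand-in class, not Tune's actual `Trial`): count the failure first, then check recoverability with `<=`.

```python
# Toy stand-in for the failure accounting described above
# (hypothetical class, not Tune's actual implementation).
class ToyTrial:
    def __init__(self, max_failures):
        self.num_failures = 0
        self.max_failures = max_failures

    def handle_error(self, preempted=False):
        # Preempted errors are no-ops; everything else counts.
        if not preempted:
            self.num_failures += 1

    def should_recover(self):
        # With num_failures updated first, <= is the natural check:
        # we only pass the limit on the (max_failures + 1)-th failure.
        return self.num_failures <= self.max_failures

trial = ToyTrial(max_failures=3)
for _ in range(3):
    trial.handle_error()
assert trial.should_recover()      # 3 <= 3: still allowed to recover
trial.handle_error(preempted=True)
assert trial.should_recover()      # preemption didn't move the counter
trial.handle_error()
assert not trial.should_recover()  # 4 > 3: give up
```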
Seems like we need to call `trial.handle_error` for all the other places that are currently calling `_schedule_trial_stop`?
In general though I think this does move us in the right direction. In the long term we should clean up the state machine in the tune controller; in its current state it's not really clear where error handling is supposed to take place. 😵💫
Yep, I call it in the 2 other places.
```python
def _handle_ray_actor_error(self, exc: RayActorError):
    exc._preempted = True  # TODO(justinvyu): Test the real integration
    if not exc._preempted:
        # Only count non-preempted actor errors as failures.
        self.run_metadata.num_failures += 1

def _handle_ray_task_error(self, exc: RayTaskError):
    if isinstance(exc.cause, RayActorError):
        # Handle the RayActorError directly (ex: Ray Train worker actor errors)
        return self._handle_ray_actor_error(exc.cause)

    # Increment failures for all user errors (which get raised as RayTaskError)
    self.run_metadata.num_failures += 1
```
Key change 2: this is the actual logic of the PR.
Question: Is it ok to treat `RayTaskError` with a cause of `RayActorError` so broadly like this? One strawman counterexample:

```python
def tune_fn_trainable(config):
    e = RayActorError()
    e._preempted = True
    raise e

tune.Tuner(tune_fn_trainable).fit()
```
Another possibility would be to have the DataParallelTrainer pass through the pre-emption RayActorError as a special case, but I feel like that's more misleading, as it's disguising the coordinator's error with the worker's error.
TODO: use `exc.as_instanceof_cause()` instead of the private `cause` attr once that is fixed by @rkooo567
This seems okay for now... it's clean enough that if new use cases come up in the future, we can separate this logic and improve it further.
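The unwrap-and-dispatch rule from the diff above can be illustrated with plain stand-in classes (hypothetical, not the real `ray.exceptions` types):

```python
# Hypothetical stand-ins for ray.exceptions.RayActorError / RayTaskError,
# only to illustrate the counting rule quoted in the diff above.
class RayActorError(Exception):
    def __init__(self, preempted=False):
        self._preempted = preempted

class RayTaskError(Exception):
    def __init__(self, cause):
        self.cause = cause

def counts_as_failure(exc):
    if isinstance(exc, RayTaskError) and isinstance(exc.cause, RayActorError):
        # Unwrap: e.g. a Train worker's actor error surfaces as a task error.
        exc = exc.cause
    if isinstance(exc, RayActorError):
        # Only non-preempted actor errors count.
        return not exc._preempted
    return True  # all user errors count

assert counts_as_failure(RayActorError(preempted=True)) is False
assert counts_as_failure(RayActorError()) is True
assert counts_as_failure(RayTaskError(RayActorError(preempted=True))) is False
assert counts_as_failure(RayTaskError(ValueError("user bug"))) is True
```

The strawman above (a user function deliberately raising a preempted actor error) would indeed slip through this rule, which is the tradeoff being discussed.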
TODO: add the configurability of whether or not to count preemption errors here.
```diff
         `num_failures` should represent the number of times the trial has
         failed *up to the moment this method is called.* If we've failed
         5 times and `max_failures=5`, then we should recover, since
         we only pass the limit on the 6th failure.

         Note this may return true even when there is no checkpoint, either because
         `self.checkpoint_freq` is `0` or because the trial failed before
         a checkpoint has been made.
         """
-        return (
-            self.run_metadata.num_failures < self.max_failures
-            or self.max_failures < 0
-            or (
-                self.run_metadata.num_failures == self.max_failures
-                and self.temporary_state.num_restore_failures
-                < int(os.environ.get("TUNE_RESTORE_RETRY_NUM", 0))
-            )
-        )
+        return self.run_metadata.num_failures <= self.max_failures or self.max_failures < 0
```
See key change 1 comment.
```python
                self.run_metadata.num_failures == self.max_failures
                and self.temporary_state.num_restore_failures
                < int(os.environ.get("TUNE_RESTORE_RETRY_NUM", 0))
            )
```
I believe this condition is not needed anymore.

`TUNE_RESTORE_RETRY_NUM` configures how many attempts we try to restore before it counts as a real error.

The behavior with this condition removed makes sense to me: if I'm at `num_failures == max_failures`, then I should try up to `TUNE_RESTORE_RETRY_NUM` times to restore. If all of those attempts fail, then we'll increment so that `num_failures > max_failures`, and the run will not try to recover anymore.
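The retry accounting described here can be sketched as a small helper (hypothetical function, not the actual Tune implementation):

```python
import os

# Sketch of the restore-retry accounting described above: up to
# TUNE_RESTORE_RETRY_NUM restore attempts are absorbed by a separate
# counter before one real failure is recorded. Hypothetical helper.
def on_restore_failure(num_failures, num_restore_failures):
    retry_num = int(os.environ.get("TUNE_RESTORE_RETRY_NUM", 0))
    if num_restore_failures >= retry_num:
        # All restore retries used up: count it as a real failure.
        return num_failures + 1, num_restore_failures
    return num_failures, num_restore_failures + 1

os.environ["TUNE_RESTORE_RETRY_NUM"] = "2"
failures, restores = 0, 0
failures, restores = on_restore_failure(failures, restores)  # retry 1
failures, restores = on_restore_failure(failures, restores)  # retry 2
failures, restores = on_restore_failure(failures, restores)  # now it counts
assert (failures, restores) == (1, 2)
```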
python/ray/air/tests/test_errors.py
Outdated
```
- Round 0: Actor error in the training worker. (shouldn't be counted)
- Round 1: User error in the training worker.
- Round 2: Actor error in the coordinator actor. (shouldn't be counted)
- Round 3: No error.
```
Should we just run this as 4 separate jobs and check each one if it failed/counted?
I am not able to figure out how to mock a property on the `RayActorError` that core raises -- any ideas here?
I tried this:
```python
class MockRayActorError(ray.exceptions.RayActorError):
    preempted = True

monkeypatch.setattr(
    ray.tune.execution.tune_controller, "RayActorError", MockRayActorError
)
monkeypatch.setattr(ray.exceptions, "RayActorError", MockRayActorError)
```

I was planning on reworking this test to use the actual `gcs_client.drain_node` API to mock the preemption instead of mocking the attribute. (example here)
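For what it's worth, a read-only property can be overridden at the class level rather than the instance level; a sketch with a stand-in class (`monkeypatch.setattr` on the class does the same thing inside a pytest test):

```python
# Stand-in exception class with a read-only property, mimicking the
# shape of RayActorError.preempted (hypothetical, for illustration only).
class FakeActorError(Exception):
    @property
    def preempted(self):
        return False

# Properties live on the class, not the instance, so patch the class:
FakeActorError.preempted = property(lambda self: True)
assert FakeActorError().preempted is True
```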
python/ray/tune/experiment/trial.py
Outdated
```python
def _handle_restore_error(self, exc: _TuneRestoreError):
    exc = exc.exc
    if self.temporary_state.num_restore_failures >= int(
        os.environ.get("TUNE_RESTORE_RETRY_NUM", 0)
    ):
        # Restore was unsuccessful, try again without checkpoint.
        self.clear_checkpoint()
        self.run_metadata.num_failures += 1
    else:
        self.temporary_state.num_restore_failures += 1
```
Orthogonal to this change, but I'm wondering if we even want to keep this logic... it's not really clear to me why we remove the checkpoint and increase the number of failures.
This `_handle_restore_error` happens when the call to `Trainable.restore` fails:

- This may be caused by a checkpoint download from cloud failing. Retrying without adding to the total failures counter may help here.
- There may be a bug in a user's `load_checkpoint` code. Retrying wouldn't help here.
- Function trainables don't do any logic in `restore`/`load_checkpoint`, leaving it to the user instead -- so this only really applies to class trainables.
The default behavior is a little strange though: `TUNE_RESTORE_RETRY_NUM=0` --> failures during restore clear the checkpoint, count toward `num_failures`, and the run starts from scratch immediately.

If we remove this logic, the behavior becomes: failures during restore are treated normally and keep retrying from the checkpoint until `max_failures`. I think it makes sense to remove this and `restoring_from` so that we have to keep track of less state in total. Let's do that in a separate PR.
It seems the function here treats restoration errors differently from normal training errors. I.e., there is a separate counter for restoration errors, such that a run of `TUNE_RESTORE_RETRY_NUM` consecutive restoration errors counts as one `num_failures`.

I think it still makes sense to keep this logic here. However, it doesn't make much sense for us to clear or modify the latest checkpoint content. The `clear_checkpoint` function is mainly meant for the case where a corrupted checkpoint leads to a restoration error. I agree that might be one cause of the problem, but it is not generally the only reason for a restoration failure. I think the easiest fix here is to remove the `clear_checkpoint` call. We can still keep the `handle_restore_error` function as a special case of all errors.

I.e., `TUNE_RESTORE_RETRY_NUM` restoration failures contribute one `num_failures`, but we don't pre-assume or add special handling to fix the restoration error. It may well be due to a node preemption, which needs no special handling; just by chance it may fail or succeed, so adding a few more retries can already help. We should not clear the latest checkpoint, which introduces extra complexity. In case of a corrupted latest checkpoint, we just let the job fail after `TUNE_RESTORE_RETRY_NUM * TUNE_RESTORE_NUM`.

cc @justinvyu @matthewdeng, if this looks good, I can make a PR to fix it.
```diff
         if self.local_path:
             self.run_metadata.error_filename = EXPR_ERROR_FILE

-        if isinstance(exc, RayTaskError):
+        if isinstance(exc, (RayTaskError, RayActorError)):
```
Hmmm given that we've never logged these before, when does RayActorError actually happen? Would it be a RayActorError or RayTaskError if the trial node gets preempted?
If the trial node gets preempted, it's a `RayActorError`.

- `ray.get(A.task.remote())` -> `RayActorError` if `A`'s node dies.
- `ray.get(A.task.remote())` -> `RayTaskError(OriginalError)` if `A.task` raises an `OriginalError` inside it.
I think it was just an oversight not to log `RayActorError`.
```python
trial.temporary_state.saving_to = None
if trial.is_restoring and exc:
    exc = _TuneRestoreError(exc)
self._schedule_trial_stop(trial, exception=exc)
```
Do we need to call `trial.handle_error(exception)` before this one?

`try_recover` only gets called in `process_trial_failure`, which already calls `handle_error`.
…d failures (ray-project#41285)

Users expect different failure types to be handled differently in step 4 above:

* The current behavior is that the count decrements, regardless of the error type. For example, if 3 pre-emptions happen with `max_failures=3`, then the run will end without continuing to recover through preemptions.
* With `max_failures=-1` or some large value, there will be an infinite number of retries, but this could crash-loop on an application error (ex: a bug in the user code). This can be very expensive.

This PR changes the failure counting of Ray Train/Tune to ignore spot instance preemption failures by default. This behavior is enabled by the new `RayActorError.preempted` flag introduced in ray-project#41102 that is set if the underlying cluster setup handles the cloud preemption signals properly and sets the preempting node to the `DRAINING` status.
@justinvyu Nit:
…d failures (#41285) (#41609)
@justinvyu did we address @zhe-thoughts's last comment above? (It was made post-merge and I didn't see any additional links in this ticket.)

Yeah, see the updated PR description.
Why are these changes needed?
Users expect different failure types to be handled differently:

* The current behavior is that the count decrements regardless of the error type. For example, if 3 preemptions happen with `max_failures=3`, then the run will end without continuing to recover through preemptions.
* With `max_failures=-1` or some large value, there will be an infinite number of retries, but this could crash-loop on an application error (ex: a bug in the user code). This can be very expensive.

This PR changes the failure counting of Ray Train/Tune to ignore spot instance preemption failures by default. This behavior is enabled by the new `RayActorError.preempted` flag introduced in #41102 that is set if the underlying cluster setup handles the cloud preemption signals properly and sets the preempting node to the `DRAINING` status.

Example
Here is an example scenario:
* […] (e.g. `train.report` / `torch.distributed`), while D is preempted.
* […] `RayActorError(preempted=True)`.
* […] `preempted=True`. This allows preemption failures to be retried repeatedly without contributing to `train.FailureConfig(max_failures=X)`.
* […] the `X` max failures by setting the environment variable: `RAY_TRAIN_COUNT_PREEMPTION_AS_FAILURE=1`

Miscellaneous
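The counting rule with the opt-out toggle can be sketched as follows (hypothetical helper, not Ray's implementation; only the `RAY_TRAIN_COUNT_PREEMPTION_AS_FAILURE` env var name comes from the PR):

```python
import os

# Sketch of the rule described above: preempted actor errors don't count
# toward max_failures unless the user opts in via the env toggle.
def next_failure_count(num_failures, preempted):
    count_preemptions = (
        os.environ.get("RAY_TRAIN_COUNT_PREEMPTION_AS_FAILURE", "0") == "1"
    )
    if preempted and not count_preemptions:
        return num_failures
    return num_failures + 1

os.environ.pop("RAY_TRAIN_COUNT_PREEMPTION_AS_FAILURE", None)
assert next_failure_count(2, preempted=True) == 2   # ignored by default
os.environ["RAY_TRAIN_COUNT_PREEMPTION_AS_FAILURE"] = "1"
assert next_failure_count(2, preempted=True) == 3   # counted when opted in
```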
This is the current output in `error.txt`. TODO: the numbering should be fixed, and some indication of ignored errors should be added in.

Related issue number
Checks
- I've signed off every commit (by using the `-s` flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.