[train+tune] Local directory refactor (3/n): Revert to async experiment state snapshotting#43689
Conversation
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
```python
exc_info=True,
)
```

```python
if force:
```
Please correct me if I'm wrong: `self._storage.syncer.sync_up()` returns False if there is an ongoing sync; otherwise it launches a new sync and returns True. (Nit: maybe document this behavior in the docstring? Although it is an internal API.)
The way we achieve a "force sync_up" is to wait for the ongoing sync to finish, so that we can always trigger a new sync when we call `sync_up()` later.
Whether `launched_sync` is True (launched a new sync) or False (there is an ongoing sync), we never interrupt the current sync; we just let it finish before the timeout (1800 s).
Yes, this is correct. We never interrupt an existing sync, but we do a blocking wait on the existing sync if `force=True`.
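The semantics discussed above can be sketched as follows. This is a hypothetical, minimal stand-in (the class name, thread-based implementation, and sleep are illustrative assumptions, not Ray's actual `Syncer`), showing only the behavior agreed on in this thread: `sync_up()` returns False while a sync is in flight, and "force" means blocking on the in-flight sync first so a fresh one can always be launched.

```python
import threading
import time


class Syncer:
    """Hypothetical sketch of the sync-up semantics described above."""

    def __init__(self):
        self._thread = None

    def sync_up(self) -> bool:
        # Returns False if a sync is already in flight; otherwise
        # launches a new background sync and returns True.
        if self._thread is not None and self._thread.is_alive():
            return False
        self._thread = threading.Thread(target=self._do_sync, daemon=True)
        self._thread.start()
        return True

    def wait(self, timeout: float = 1800.0):
        # Blocking wait on the in-flight sync; it is never interrupted.
        if self._thread is not None:
            self._thread.join(timeout=timeout)

    def sync_up_if_needed(self, force: bool = False) -> bool:
        if force:
            # "Force" = wait for the ongoing sync to finish so the next
            # sync_up() call is guaranteed to launch a fresh sync.
            self.wait()
        return self.sync_up()

    def _do_sync(self):
        time.sleep(0.2)  # stand-in for the actual upload
```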
| "The previous sync of the experiment directory to the cloud " | ||
| f"failed with the error: {str(e)}\nSyncing will be retried." | ||
| "Experiment state snapshotting has been triggered multiple " | ||
| f"times in the last {self._excessive_sync_threshold} seconds. " |
`_excessive_sync_threshold = TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S = 5`
This is just a warning telling users not to checkpoint too frequently in general. Why do we specifically mention `num_to_keep` and the force logic here? Even with `force=False`, it is still possible to hit this warning by checkpointing too frequently.
Nit: also, why not suggest increasing the sync-up period instead of reducing the warning period?
If `force=False`, the checkpointing period is `max(10, auto-adjusted period)`, so we shouldn't run into excessive syncs in the default case. So this message is just for the forced case.
> Why not suggest increasing the sync-up period instead of reducing the warning period?

`num_to_keep` will always cause experiment snapshots to be forced, which disregards the checkpoint period, so increasing the value of `TUNE_GLOBAL_CHECKPOINT_S` doesn't actually do anything. So the only fixes are to increase `num_to_keep` or just accept this and suppress the warning. 🙁
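For concreteness, the rate check behind this warning can be sketched as follows. The class and method names here are hypothetical; the source only tells us that the threshold (`_excessive_sync_threshold`, 5 seconds by default) is compared against the spacing of consecutive experiment state snapshots.

```python
class SnapshotRateMonitor:
    """Hypothetical sketch of the excessive-snapshot check: warn when two
    experiment state snapshots land within the threshold of each other."""

    def __init__(self, threshold_s: float = 5.0):
        self._excessive_sync_threshold = threshold_s
        self._last_snapshot_time = None

    def record_snapshot(self, now: float) -> bool:
        """Record a snapshot at time `now`; return True if it should warn."""
        warn = (
            self._last_snapshot_time is not None
            and now - self._last_snapshot_time < self._excessive_sync_threshold
        )
        self._last_snapshot_time = now
        return warn
```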
```python
if (
    self._trial_num_checkpoints_since_last_sync[trial]
    >= self._sync_every_n_trial_checkpoints
):
    self._should_force_sync_up = True
```
I see, this is the workaround for `num_to_keep`:
`_sync_every_n_trial_checkpoints = CheckpointConfig.num_to_keep`
https://github.com/justinvyu/ray/blob/a2fb4423906874ed988a860291c398721d54b736/python/ray/tune/execution/tune_controller.py#L319
We only enable force sync-up every `num_to_keep` checkpoints, and that's why the excessive checkpoint warning is only raised when `num_to_keep` is set.
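Putting the pieces of this thread together, the bookkeeping can be sketched like this. The class name and reset method are hypothetical; the counter, threshold, and flag names come from the quoted diff above.

```python
from collections import defaultdict


class ForceSyncTracker:
    """Hypothetical sketch of the num_to_keep workaround: count checkpoints
    per trial since the last sync, and force a sync once any trial has
    reported num_to_keep (= _sync_every_n_trial_checkpoints) checkpoints."""

    def __init__(self, sync_every_n_trial_checkpoints: int):
        self._sync_every_n_trial_checkpoints = sync_every_n_trial_checkpoints
        self._trial_num_checkpoints_since_last_sync = defaultdict(int)
        self._should_force_sync_up = False

    def on_trial_checkpoint(self, trial: str):
        self._trial_num_checkpoints_since_last_sync[trial] += 1
        if (
            self._trial_num_checkpoints_since_last_sync[trial]
            >= self._sync_every_n_trial_checkpoints
        ):
            self._should_force_sync_up = True

    def on_sync(self):
        # After a successful sync, reset the counters and the force flag.
        self._trial_num_checkpoints_since_last_sync.clear()
        self._should_force_sync_up = False
```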
Currently the experiment `sync_up` behavior is affected by the per-trial `CheckpointManager` behavior.
One idea: in the future, each trial should maintain its own latest checkpoint path, and the experiment state should only keep the status of all the trials. Then we wouldn't have to worry about the checkpoint mismatch problem between the experiment state and the per-trial checkpoint folders.
> Currently the experiment `sync_up` behavior is affected by the per-trial `CheckpointManager` behavior.

Yes, this is a design flaw that we just have to work around for now. We definitely want to learn this lesson when designing a new system.
Overall looks good to me! Left some comments.

thomasdesr left a comment: For usage proto changes.
Why are these changes needed?
A hanging/failing driver file upload will currently block/fail the tune control loop, even though all trials may be running fine. This regression is a side-effect of #43403, which made a behavior change to increase the freshness of the experiment state files in storage. (See the "Experiment checkpoint saving and uploading now happens synchronously." bullet point in that PR description.) Prior to that PR, we would do driver syncing asynchronously -- if it failed, we'd catch and log the error; if it hung, we'd timeout after 30 minutes and log a warning.
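The pre-regression behavior described above can be sketched as follows. This is an illustrative stand-in (the helper names are hypothetical, and Ray's actual syncer is not built on `concurrent.futures`): the upload runs in the background so the control loop never blocks on it, a failure is logged and retried rather than crashing the loop, and a hung upload is only waited on up to a 30-minute timeout.

```python
import concurrent.futures
import logging

logger = logging.getLogger(__name__)

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
DEFAULT_SYNC_TIMEOUT_S = 1800  # give up waiting on a hung upload after 30 min


def launch_driver_sync(upload_fn):
    # Launch the upload in the background; the control loop keeps running.
    return _executor.submit(upload_fn)


def wait_for_driver_sync(future, timeout=DEFAULT_SYNC_TIMEOUT_S):
    # Called only when we must block (e.g. a forced sync): warn on timeout,
    # log-and-continue on failure, never crash the control loop.
    try:
        future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        logger.warning("Driver sync still running after %s seconds.", timeout)
    except Exception:
        logger.warning("Driver sync failed. Syncing will be retried.", exc_info=True)
```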
This PR switches back to asynchronous driver syncing and adds back the `num_to_keep` forceful experiment state snapshot mitigation. After this PR, the change compared to what we had in 2.9.3 is the part in strikethrough:

- Snapshot the experiment state every `TUNE_GLOBAL_CHECKPOINT_S` seconds (every ~10 seconds).
- If `CheckpointConfig(num_to_keep)` is set, then force a new sync to be launched if any trial has reported more than `num_to_keep` checkpoints since the last sync. Force a new sync by waiting on the previous one first (a blocking call) and launching a new one.
  - This is not ideal for `num_to_keep=1` since it blocks the execution loop for too long. This should be reworked soon.
- ~~If we have already synced up within the last `sync_period` (default = 5 minutes), then skip the sync up.~~

Related issue number
Closes #43746
Closes #43748
Closes #43747
Checks

- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've introduced a new method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.