[train+tune] Local directory refactor (3/n): Revert to async experiment state snapshotting#43689
Conversation
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
```python
exc_info=True,
)
```

```python
if force:
```
Please correct me if I'm wrong: `self._storage.syncer.sync_up()` returns False if there is an ongoing sync; otherwise it launches a new sync and returns True. (Nit: maybe document this behavior in the docstring? Although it is an internal API.)
The way we achieve a "force sync_up" is to wait for the ongoing sync to finish, so that we can always trigger a new sync when we call `sync_up()` later.
Whether `launched_sync` is True (launched a new sync) or False (there is an ongoing sync), we never interrupt the current sync; we just let it finish before the timeout (1800 s).
Yes, this is correct. We never interrupt an existing sync, but we do a blocking wait on the existing sync if `force=True`.
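The semantics discussed above can be sketched as follows. This is a hypothetical, minimal stand-in (the class name, thread-based implementation, and sleep are illustrative assumptions, not Ray's actual `Syncer`), showing only the behavior agreed on in this thread: `sync_up()` returns False while a sync is in flight, and "force" means blocking on the in-flight sync first so a fresh one can always be launched.

```python
import threading
import time


class Syncer:
    """Hypothetical sketch of the sync-up semantics described above."""

    def __init__(self):
        self._thread = None

    def sync_up(self) -> bool:
        # Returns False if a sync is already in flight; otherwise
        # launches a new background sync and returns True.
        if self._thread is not None and self._thread.is_alive():
            return False
        self._thread = threading.Thread(target=self._do_sync, daemon=True)
        self._thread.start()
        return True

    def wait(self, timeout: float = 1800.0):
        # Blocking wait on the in-flight sync; it is never interrupted.
        if self._thread is not None:
            self._thread.join(timeout=timeout)

    def sync_up_if_needed(self, force: bool = False) -> bool:
        if force:
            # "Force" = wait for the ongoing sync to finish so the next
            # sync_up() call is guaranteed to launch a fresh sync.
            self.wait()
        return self.sync_up()

    def _do_sync(self):
        time.sleep(0.2)  # stand-in for the actual upload
```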
| "The previous sync of the experiment directory to the cloud " | ||
| f"failed with the error: {str(e)}\nSyncing will be retried." | ||
| "Experiment state snapshotting has been triggered multiple " | ||
| f"times in the last {self._excessive_sync_threshold} seconds. " |
`_excessive_sync_threshold = TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S = 5`
This is just a warning telling users not to checkpoint too frequently in general. Why do we specifically mention `num_to_keep` and the force logic here? Even with `force=False`, it is still possible to hit this warning by checkpointing too frequently.
Nit: also, why not suggest increasing the sync-up period instead of reducing the warning period?
If `force=False`, the checkpointing period is `max(10, auto-adjusted period)`, so we shouldn't run into excessive syncs in the default case. So this message is just for the forced case.
> Why not suggest increasing the sync-up period instead of reducing the warning period?

`num_to_keep` will always cause experiment snapshots to be forced, which disregards the checkpoint period, so increasing the value of `TUNE_GLOBAL_CHECKPOINT_S` doesn't actually do anything. So the only fixes are to increase `num_to_keep` or just accept this and suppress the warning. 🙁
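For concreteness, the rate check behind this warning can be sketched as follows. The class and method names here are hypothetical; the source only tells us that the threshold (`_excessive_sync_threshold`, 5 seconds by default) is compared against the spacing of consecutive experiment state snapshots.

```python
class SnapshotRateMonitor:
    """Hypothetical sketch of the excessive-snapshot check: warn when two
    experiment state snapshots land within the threshold of each other."""

    def __init__(self, threshold_s: float = 5.0):
        self._excessive_sync_threshold = threshold_s
        self._last_snapshot_time = None

    def record_snapshot(self, now: float) -> bool:
        """Record a snapshot at time `now`; return True if it should warn."""
        warn = (
            self._last_snapshot_time is not None
            and now - self._last_snapshot_time < self._excessive_sync_threshold
        )
        self._last_snapshot_time = now
        return warn
```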
```python
if (
    self._trial_num_checkpoints_since_last_sync[trial]
    >= self._sync_every_n_trial_checkpoints
):
    self._should_force_sync_up = True
```
I see, this is the workaround for `num_to_keep`:
`_sync_every_n_trial_checkpoints = CheckpointConfig.num_to_keep`
https://github.com/justinvyu/ray/blob/a2fb4423906874ed988a860291c398721d54b736/python/ray/tune/execution/tune_controller.py#L319
We only enable force sync-up every `num_to_keep` checkpoints, and that's why the excessive checkpoint warning is only raised when `num_to_keep` is set.
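Putting the pieces of this thread together, the bookkeeping can be sketched like this. The class name and reset method are hypothetical; the counter, threshold, and flag names come from the quoted diff above.

```python
from collections import defaultdict


class ForceSyncTracker:
    """Hypothetical sketch of the num_to_keep workaround: count checkpoints
    per trial since the last sync, and force a sync once any trial has
    reported num_to_keep (= _sync_every_n_trial_checkpoints) checkpoints."""

    def __init__(self, sync_every_n_trial_checkpoints: int):
        self._sync_every_n_trial_checkpoints = sync_every_n_trial_checkpoints
        self._trial_num_checkpoints_since_last_sync = defaultdict(int)
        self._should_force_sync_up = False

    def on_trial_checkpoint(self, trial: str):
        self._trial_num_checkpoints_since_last_sync[trial] += 1
        if (
            self._trial_num_checkpoints_since_last_sync[trial]
            >= self._sync_every_n_trial_checkpoints
        ):
            self._should_force_sync_up = True

    def on_sync(self):
        # After a successful sync, reset the counters and the force flag.
        self._trial_num_checkpoints_since_last_sync.clear()
        self._should_force_sync_up = False
```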
Currently the experiment `sync_up` behavior is affected by the per-trial `CheckpointManager` behavior.
One idea: in the future, each trial should maintain its own latest checkpoint path, and the experiment state should only keep the status of all the trials. Then we wouldn't have to worry about the checkpoint mismatch problem between the experiment state and the per-trial checkpoint folders.
> Currently the experiment `sync_up` behavior is affected by the per-trial `CheckpointManager` behavior.

Yes, this is a design flaw that we just have to work around for now. We definitely want to learn this lesson when designing a new system.
Overall looks good to me! Left some comments.

thomasdesr left a comment: For usage proto changes.
Why are these changes needed?
A hanging/failing driver file upload will currently block/fail the tune control loop, even though all trials may be running fine. This regression is a side-effect of #43403, which made a behavior change to increase the freshness of the experiment state files in storage. (See the "Experiment checkpoint saving and uploading now happens synchronously." bullet point in that PR description.) Prior to that PR, we would do driver syncing asynchronously -- if it failed, we'd catch and log the error; if it hung, we'd timeout after 30 minutes and log a warning.
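The pre-regression behavior described above can be sketched as follows. This is an illustrative stand-in (the helper names are hypothetical, and Ray's actual syncer is not built on `concurrent.futures`): the upload runs in the background so the control loop never blocks on it, a failure is logged and retried rather than crashing the loop, and a hung upload is only waited on up to a 30-minute timeout.

```python
import concurrent.futures
import logging

logger = logging.getLogger(__name__)

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
DEFAULT_SYNC_TIMEOUT_S = 1800  # give up waiting on a hung upload after 30 min


def launch_driver_sync(upload_fn):
    # Launch the upload in the background; the control loop keeps running.
    return _executor.submit(upload_fn)


def wait_for_driver_sync(future, timeout=DEFAULT_SYNC_TIMEOUT_S):
    # Called only when we must block (e.g. a forced sync): warn on timeout,
    # log-and-continue on failure, never crash the control loop.
    try:
        future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        logger.warning("Driver sync still running after %s seconds.", timeout)
    except Exception:
        logger.warning("Driver sync failed. Syncing will be retried.", exc_info=True)
```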
This PR switches back to asynchronous driver syncing and adds back the `num_to_keep` forceful experiment state snapshot mitigation. After this PR, the change compared to what we had in 2.9.3 is the part in strikethrough:

- Snapshot the experiment state every `TUNE_GLOBAL_CHECKPOINT_S` seconds (every ~10 seconds).
- If `CheckpointConfig(num_to_keep)` is set, then force a new sync to be launched if any trial has reported more than `num_to_keep` checkpoints since the last sync. Force a new sync by waiting on the previous one first (a blocking call) and launching a new one.
  - This is not ideal for `num_to_keep=1` since it blocks the execution loop for too long. This should be reworked soon.
- ~~If we have already synced up within the last `sync_period` (default = 5 minutes), then skip the sync up.~~

Related issue number
Closes #43746
Closes #43748
Closes #43747
Checks

- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've introduced a new method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.