[train+tune] Refactor restoration configuration to be centered around storage_path #42853
Conversation
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
This reverts commit b6a8b81.
matthewdeng left a comment:
Overall looks good to me, thanks for the much cleaner API. The main consideration is just whether we choose to expose this publicly or keep it internal.
```
Members:
    RESUME: Resume from the latest checkpoint.
    RESTART: Restart from the beginning (with no checkpoint).
    IGNORE: Ignore this trial.
```
nit: Upon first read I wasn't sure what "ignore" meant, but I'm not sure off the top of my head if there's a clearer word we can use here.
Same feeling for me. SKIP might be better?
woshiyyya left a comment:
Great simplification with the `_ResumeConfig` API! Left some comments.
python/ray/tune/tune.py (Outdated)

```python
fail_fast: bool = False,
restore: Optional[str] = None,
resume: Union[bool, str] = False,
resume_config: Optional[_ResumeConfig] = None,
```
Should we just make it a public API? Users will always use this config but cannot find it in our Ray API docs.
Decision: Keep it as a DeveloperAPI for now, and expose it as a "hidden" experimental argument of Tuner.restore. Ideally, we wouldn't have this, but there is the dependency of Trainer.restore on Tuner.restore for now 😢
Why are these changes needed?
This PR cleans up restoration logic in Tune.
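To make the direction concrete before the details below, here is a minimal standalone sketch of mapping the legacy `"+"`-joined resume strings onto a structured config. The `ResumeConfig`/`ResumeType` names and the `AUTO+RESTART_ERRORED` example come from the PR description; the `unfinished` field and the parsing logic are illustrative assumptions, not Ray's actual implementation.

```python
from dataclasses import dataclass
from enum import Enum


class ResumeType(Enum):
    RESUME = "RESUME"    # Resume from the latest checkpoint.
    RESTART = "RESTART"  # Restart from the beginning (with no checkpoint).
    IGNORE = "IGNORE"    # Leave this trial out of the restored run.


@dataclass
class ResumeConfig:
    # Assumed fields: how to treat unfinished vs. errored trials on restore.
    unfinished: ResumeType = ResumeType.RESUME
    errored: ResumeType = ResumeType.IGNORE


def from_legacy_resume_string(resume: str) -> ResumeConfig:
    """Map a legacy '+'-joined resume string onto the structured config,
    e.g. "AUTO+RESTART_ERRORED" -> ResumeConfig(errored=ResumeType.RESTART).
    """
    config = ResumeConfig()
    for token in resume.split("+"):
        if token == "RESTART_ERRORED":
            config.errored = ResumeType.RESTART
        # Other legacy tokens would map onto the config fields similarly.
    return config
```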
Context
Previously, we had `local_dir` and `upload_dir` as two separate configurations for where to store experiment results. Results would first go to `local_dir`, then to `upload_dir`. (Checkpoints were the exception, which had a special codepath to upload directly to cloud.)

Now, the `storage_path` is the only location that users can expect all their run data to go to. The local directory just serves as a staging ground to eventually copy to storage. This is an implementation detail that ideally doesn't need to exist.

At the moment, it's not actually possible to recover successfully from the "local directory" (set by the `RAY_AIR_LOCAL_CACHE_DIR` environment variable) when a storage path is set. Checkpoints are uploaded directly to the storage path, so restoring from the local staging directory will result in trials starting from scratch.

Therefore, the `resume="REMOTE"` and `resume="LOCAL"` configurations are outdated artifacts of the old (<2.7) checkpointing/storage implementation. `resume="AUTO"` is now the only codepath, which greatly simplifies the restoration code.

Change Summary
Given this context, these are the main changes from this PR:
"+". (Ex:AUTO+RESTART_ERRORED-->ResumeConfig(errored=ResumeConfig.ResumeType.RESTART).)ResumeType.IGNOREsetting that can be used to only restart/resume subsets of trials based on their statuses.resume_from_checkpointto start a new experiment is what we recommend), but it is a common user ask to be able to continue the run, while still tracking the top K checkpoints.API Change Summary
- `Tuner.restore(..., _resume_config)` is now possible, but this change is backwards compatible, since `Tuner.restore(resume_errored, ...)` etc. are still possible.
- `tune.run(resume_config)` is now the simpler alternative to `tune.run(resume)`, and some legacy resume strings have been hard-deprecated. There is no hard API change here, though.
- `TuneController(resume)` is now `TuneController(resume_config)`.

Related issue number
Checks
- …(`git commit -s`) in this PR.
- …`scripts/format.sh` to lint the changes in this PR.
- …method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.