[train+tune] Local directory refactor (1/n): Write launcher state files (tuner.pkl, trainer.pkl) directly to storage#43369
Conversation
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
```python
    self._run_config.name
    or StorageContext.get_experiment_dir_name(self.converted_trainable)
)
storage = StorageContext(
```
Should we be trying to instantiate this StorageContext only in one place and then pass it around?
This is hard to do, since we'd need to pass the context through from `BaseTrainer.fit` to the public `Tuner` interface.
One alternative we discussed is having a global storage context, but that would also require `get_or_create_storage_context` logic to account for users coming in through any of the 3 entrypoints (`Tuner`, trainer, `tune.run`).
I will clarify that the usage here is just to access the path and the filesystem, so that we don't re-implement that logic.
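For reference, the "global storage context" alternative mentioned above could look roughly like the sketch below. Everything here (the `StorageContext` fields, the `get_or_create_storage_context` helper) is a hypothetical illustration, not Ray's actual implementation:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StorageContext:
    # Hypothetical fields; Ray's real StorageContext holds more state.
    storage_path: str
    experiment_dir_name: str


_global_storage_context: Optional[StorageContext] = None


def get_or_create_storage_context(
    storage_path: str, experiment_dir_name: str
) -> StorageContext:
    """Return the shared context, creating it on first use, so that all
    three entrypoints (Tuner, Trainer, tune.run) see one instance."""
    global _global_storage_context
    if _global_storage_context is None:
        _global_storage_context = StorageContext(storage_path, experiment_dir_name)
    return _global_storage_context
```

The downside, as noted in the thread, is that every entrypoint must go through this one helper, and later callers silently inherit whatever the first caller configured.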
```diff
     return trainable_with_params

-    exp_name = "restore_with_params"
+    exp_name = f"restore_with_params-{use_function_trainable=}"
```
This change was required to avoid a bug where:
- Both tests would use the same exp name and save things to the same driver staging dir.
- `tuner.pkl` gets downloaded to the staging dir when the test runs the first time with `use_function_trainable=True`.
- Then, when the second run happens, it writes the correct `tuner.pkl` to storage, but then does another driver sync where the old `tuner.pkl` that was downloaded gets uploaded again.
- This caused an error when restoring the 2nd run.

This problem gets fixed in the follow-up PR by using unique staging directories for each new Ray Train experiment.
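As a small aside, the renamed experiment uses Python's f-string `=` specifier, which embeds both the variable name and its value, so each parametrization gets its own directory name. A minimal sketch:

```python
def experiment_name(use_function_trainable: bool) -> str:
    # The "=" specifier expands to "use_function_trainable=True" (or
    # "=False"), so the two parametrized test runs never share a name.
    return f"restore_with_params-{use_function_trainable=}"
```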
Users should never run Tune twice in the same job?
I think this will no longer be a problem after your second PR, which separates driver staging directories by timestamp.
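A hedged sketch of that follow-up idea, timestamped staging directories; the helper name and directory layout below are illustrative, not the actual follow-up PR:

```python
import tempfile
import time
from pathlib import Path


def unique_staging_dir(experiment_name: str) -> Path:
    """Create a per-run driver staging directory, so a stale tuner.pkl
    left over from an earlier run with the same experiment name can
    never be re-uploaded by a later driver sync."""
    stamp = time.strftime("%Y-%m-%d_%H-%M-%S")
    path = Path(tempfile.gettempdir()) / f"{experiment_name}_{stamp}"
    path.mkdir(parents=True, exist_ok=True)
    return path
```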
**woshiyyya** left a comment:
Looks good to me. Left some minor comments.
```python
_SELF = "self"


_TUNER_FAILED_MSG = (
```
Why do we no longer send this failure message?
This only caught `TuneError`, which only gets raised from `tune.run(raise_on_failed_trial=True)`. However, the `Tuner` always passes `False` for this flag, so this error message never actually gets printed.
(See `ray/python/ray/tune/impl/tuner_internal.py`, line 594 in 93d8d96.)
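To illustrate why the message was dead code, here is a minimal stand-in (not Ray's code) for the `raise_on_failed_trial` behavior described above:

```python
class TuneError(RuntimeError):
    """Stand-in for ray.tune.error.TuneError."""


def run(raise_on_failed_trial: bool = True, trial_failed: bool = True) -> str:
    # Mimics tune.run: only raises when the caller opts in.
    if trial_failed and raise_on_failed_trial:
        raise TuneError("Trials did not complete")
    return "finished"


# The Tuner-style call site always passes False, so a handler that
# catches TuneError around it (to print a failure message) is unreachable.
result = run(raise_on_failed_trial=False)
```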
```python
        assert not (tmp_path / "trainer").exists()
        assert both_msg not in caplog.text
    else:
        assert tuner._local_tuner.get_run_config() == RunConfig()
```
Is it because we inject a default storage context into the run config if not specified?
This is because I set the `run_config.name` in the trainer, then pass that along to the tuner. So, the tuner's run config is no longer the default `RunConfig()`. This is so that we don't generate the experiment name multiple times (which could lead to different folders being used by the trainer vs. the tuner).
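The "generate the name once" pattern described here can be sketched as follows; the helper name and fallback format are illustrative, not Ray's actual helpers:

```python
import time


def resolve_experiment_name(user_name=None) -> str:
    # Auto-generate a name only when the user did not set run_config.name.
    return user_name or f"experiment_{int(time.time())}"


# The trainer resolves the name exactly once...
trainer_name = resolve_experiment_name("my_run")
# ...and passes the resolved name along to the tuner instead of
# regenerating it, so both write to the same experiment folder.
tuner_name = trainer_name
```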
Why are these changes needed?
This PR updates Trainers and the `Tuner` to upload their state directly to `storage_path`, rather than dumping it in a local directory and relying on driver syncing to upload it.

This removes the dependency of these pkl files on `_get_defaults_results_dir`, which is the folder that defaults to `~/ray_results` and is set by environment variables.

TODOs for follow-up PRs

There's currently an issue where multiple runs with the same `RunConfig(name)` can produce conflicting versions of `tuner.pkl`. This can lead to a bug where the files needed to restore a run are overwritten by a mismatched version. See #43369 (comment) for more details.

Related issue number
Checks

- I've signed off every commit (i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I have added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.