Description
This is with regard to the instructions given here for fixing a crashed or failed evaluation (config) in a NePS run.
This is tested on NePS installed from the main branch.
Simple code example:
```python
import numpy as np
import neps

if __name__ == "__main__":
    pipeline_space = {
        "a": neps.Integer(1, 100, prior=50),
        "z": neps.Integer(1, 10, is_fidelity=True)
    }

    def _run(**config):
        ## NOTE: lines to toggle to simulate crashes
        # if np.random.uniform() > 0.95:
        #     raise RuntimeError("Random failure to test NePS robustness.")
        return -config["a"] ** config["z"]

    neps.run(
        evaluate_pipeline=_run,
        pipeline_space=pipeline_space,
        root_directory="./neps_test",
        optimizer=("priorband", {"eta": 3}),
        fidelities_to_spend=100,  # NOTE: increase budget here to `resume` a crashed run, if required
    )
# end of file
```

It appears that if `optimizer_state.pkl` is deleted as per the documentation instructions, the NePS re-run always fails with an error. If only `.trial_cache.pkl` is deleted, then NePS is able to reload the state but unable to properly resume (or I am unable to).
Here are some of the steps to recreate the issue, for a 1-worker run:
- run the above script with `fidelities_to_spend=100` and let the run finish, to test the setup
- uncomment the if-raise in `def _run()` and rerun the above (resume run), now with `fidelities_to_spend=200`; this should throw an error and kill the NePS process
- deleting `.trial_cache.pkl`, and
  - changing the failed config's `metadata.json` to be `"state": "pending"`
  - commenting the if-raise lines in `def _run()` (a scripted version of the file deletion and the `metadata.json` edit is sketched after this list)
- re-running the script --> the error still persists
- [alternatively] deleting both `.trial_cache.pkl` and `optimizer_state.pkl` results in: `RuntimeError: Failed to create or load the NePS state after 10 attempts. Bailing! Please enable debug logging to see the errors that occured.`
Two main questions:
- how does one successfully resume here (i.e. fix crashed evaluations)?
- should the documentation be updated accordingly?
Thanks!
Please let me know if any further clarification would be helpful.
PS: being able to successfully resume crashed runs with `ignore_errors=False` is quite crucial for grid_search or other finite categorical spaces, which are quite common in Deep Learning.
PPS: this small example may be too simplistic; happy to discuss the exact DL setup for which this issue shows up.
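For context, a rough sketch of the kind of finite space the PS has in mind. The `"grid_search"` optimizer string, the `ignore_errors` flag mentioned above, and the budget argument names follow the example at the top and may differ across NePS versions:

```python
# Sketch of a finite categorical space where a crashed config cannot simply be
# ignored: every grid point should be evaluated exactly once, so a crash must
# be resumable rather than dropped. Argument names are assumptions.
import neps

pipeline_space = {
    "opt": neps.Categorical(["adam", "sgd", "adamw"]),
    "batch_size": neps.Categorical([32, 64, 128]),
}


def _run(opt, batch_size):
    # stand-in for an expensive DL training run; in the real setup this is
    # where the occasional crash happens
    return {"adam": 0.1, "sgd": 0.3, "adamw": 0.2}[opt] + 1.0 / batch_size


neps.run(
    evaluate_pipeline=_run,
    pipeline_space=pipeline_space,
    root_directory="./neps_grid",
    optimizer="grid_search",
    ignore_errors=False,  # a crash should be retried on resume, not dropped
    # evaluations_to_spend=9,  # budget arg name assumed by analogy with `fidelities_to_spend` above
)
```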
✌️