
Issue with handling evaluation crashes #255

@Neeratyoy

Description


This is with regard to the instructions given here for fixing any crashing or failed evaluation (config) in a NePS run.

This is tested on NePS installed from the main branch.

Simple code example:

import numpy as np
import neps


if __name__ == "__main__":

    pipeline_space = {
        "a": neps.Integer(1, 100, prior=50),
        "z": neps.Integer(1, 10, is_fidelity=True),
    }

    def _run(**config):
        ## NOTE: lines to toggle to simulate crashes
        # if np.random.uniform() > 0.95:
        #     raise RuntimeError("Random failure to test NePS robustness.")
        return -config["a"] ** config["z"]

    neps.run(
        evaluate_pipeline=_run,
        pipeline_space=pipeline_space,
        root_directory="./neps_test",
        optimizer=("priorband", {"eta": 3}),
        fidelities_to_spend=100,  # NOTE: increase budget here to `resume` a crashed run, if required
    )
# end of file

It appears that if optimizer_state.pkl is deleted as per the documentation instructions, the NePS re-run always hits the error. If only .trial_cache.pkl is deleted, NePS is able to reload the state but not to properly resume (or at least I am unable to make it do so).

Here are the steps to recreate the issue for a 1-worker run (a small sketch of the file manipulation follows the list):

  • run the above script with fidelities_to_spend=100 and let the run finish, to verify the setup

  • uncomment the if-raise in def _run() and re-run the above (as a resume run), now with fidelities_to_spend=200

    • this should throw an error and kill the NePS process
  • delete .trial_cache.pkl, and

    • change the failed config's metadata.json to have "state": "pending"
    • comment out the if-raise lines in def _run()
  • re-run the script --> the error still persists

  • [alternatively] deleting both .trial_cache.pkl and optimizer_state.pkl results in:

    • RuntimeError: Failed to create or load the NePS state after 10 attempts. Bailing! Please enable debug logging to see the errors that occured.
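
For concreteness, the file manipulation in the steps above amounts to roughly the following sketch. The configs/config_XX path is a placeholder for the actual failed config's folder, and rewriting metadata.json this way is simply what I did by hand, not an official NePS API:

import json
from pathlib import Path

root = Path("./neps_test")                      # root_directory used in the script above

# 1) drop the trial cache so NePS rebuilds its view from the on-disk configs
(root / ".trial_cache.pkl").unlink(missing_ok=True)

# 2) mark the crashed config as pending again; "config_XX" is a placeholder --
#    point this at whichever config folder actually failed in your run
failed_meta = root / "configs" / "config_XX" / "metadata.json"  # hypothetical path
meta = json.loads(failed_meta.read_text())
meta["state"] = "pending"                       # as per the docs' resume instructions
failed_meta.write_text(json.dumps(meta, indent=2))

# 3) comment the if-raise back out in _run() and re-run the script, optionally
#    with a larger fidelities_to_spend -- this is where the error still appears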

Two main questions:

  1. how does one successfully resume here (i.e. fix crashing evaluations)?
  2. should the documentation be updated accordingly?

Thanks!
Please let me know if any further clarification would be helpful.

PS: being able to successfully resume crashed evaluations with ignore_errors=False is quite crucial for grid_search or other finite categorical spaces, which are quite common in Deep Learning (rough sketch of such a setup below)
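
For context, the kind of setup where this matters looks roughly like the sketch below. The neps.Categorical usage, the "grid_search" optimizer string, and passing ignore_errors=False to neps.run reflect how I set things up in my DL code; treat them as assumptions rather than a verified minimal example (budget arguments are omitted for brevity):

import neps

# a finite, purely categorical space: every grid point matters, so a single
# crashed evaluation that cannot be resumed leaves a permanent hole in the grid
pipeline_space = {
    "model": neps.Categorical(["resnet18", "vit_small"]),
    "opt": neps.Categorical(["sgd", "adamw"]),
    "lr": neps.Categorical([1e-3, 1e-4]),
}

def train(**config):
    # placeholder for a real DL training run; transient failures (OOM, node
    # preemption, dataloader hiccups) surface here as exceptions and crash the trial
    return 0.0

neps.run(
    evaluate_pipeline=train,
    pipeline_space=pipeline_space,
    root_directory="./neps_grid",
    optimizer="grid_search",
    ignore_errors=False,  # crashed configs must be retried later, not silently dropped
)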

PPS: this small example may be too simplistic; happy to discuss the exact DL setup in which this issue shows up

✌️

Metadata

Labels: documentation (Improvements or additions to documentation)