
Issue with handling evaluation crashes #255

@Neeratyoy

Description


This is with regard to the instructions given here for fixing any crashing or failed evaluation (config) in a NePS run.

This is tested on NePS installed from the main branch.

Simple code example:

import numpy as np
import neps


if __name__ == "__main__":

    pipeline_space = {
        "a": neps.Integer(1, 100, prior=50),
        "z": neps.Integer(1, 10, is_fidelity=True),
    }

    def _run(**config):
        ## NOTE: lines to toggle to simulate crashes
        # if np.random.uniform() > 0.95:
        #     raise RuntimeError("Random failure to test NePS robustness.")
        return -config["a"] ** config["z"]

    neps.run(
        evaluate_pipeline=_run,
        pipeline_space=pipeline_space,
        root_directory="./neps_test",
        optimizer=("priorband", {"eta": 3}),
        fidelities_to_spend=100,  # NOTE: increase budget here to `resume` a crashed run, if required
    )
# end of file

It appears that if optimizer_state.pkl is deleted as per the documentation instructions, the NePS re-run always hits the error. If only .trial_cache.pkl is deleted, NePS is able to reload the state but not to properly resume (or at least I am unable to make it do so).

Here are the steps to recreate the issue for a 1-worker run (a small sketch of the file manipulation follows the list):

  • run the above script with fidelities_to_spend=100 and let the run finish, to verify the setup

  • uncomment the if-raise in def _run() and re-run the above (as a resume run), now with fidelities_to_spend=200

    • this should throw an error and kill the NePS process
  • delete .trial_cache.pkl, and

    • change the failed config's metadata.json to have "state": "pending"
    • comment out the if-raise lines in def _run()
  • re-run the script --> the error still persists

  • [alternatively] deleting both .trial_cache.pkl and optimizer_state.pkl results in:

    • RuntimeError: Failed to create or load the NePS state after 10 attempts. Bailing! Please enable debug logging to see the errors that occured.
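
For concreteness, the file manipulation in the steps above amounts to roughly the following sketch. The configs/config_XX path is a placeholder for the actual failed config's folder, and rewriting metadata.json this way is simply what I did by hand, not an official NePS API:

import json
from pathlib import Path

root = Path("./neps_test")                      # root_directory used in the script above

# 1) drop the trial cache so NePS rebuilds its view from the on-disk configs
(root / ".trial_cache.pkl").unlink(missing_ok=True)

# 2) mark the crashed config as pending again; "config_XX" is a placeholder --
#    point this at whichever config folder actually failed in your run
failed_meta = root / "configs" / "config_XX" / "metadata.json"  # hypothetical path
meta = json.loads(failed_meta.read_text())
meta["state"] = "pending"                       # as per the docs' resume instructions
failed_meta.write_text(json.dumps(meta, indent=2))

# 3) comment the if-raise back out in _run() and re-run the script, optionally
#    with a larger fidelities_to_spend -- this is where the error still appears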

Two main questions:

  1. how does one successfully resume here (i.e. fix crashing evaluations)?
  2. should the documentation be updated accordingly?

Thanks!
Please let me know if any further clarification would be helpful.

PS: being able to successfully resume crashed evaluations with ignore_errors=False is quite crucial for grid_search or other finite categorical spaces, which are quite common in Deep Learning (rough sketch of such a setup below)
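
For context, the kind of setup where this matters looks roughly like the sketch below. The neps.Categorical usage, the "grid_search" optimizer string, and passing ignore_errors=False to neps.run reflect how I set things up in my DL code; treat them as assumptions rather than a verified minimal example (budget arguments are omitted for brevity):

import neps

# a finite, purely categorical space: every grid point matters, so a single
# crashed evaluation that cannot be resumed leaves a permanent hole in the grid
pipeline_space = {
    "model": neps.Categorical(["resnet18", "vit_small"]),
    "opt": neps.Categorical(["sgd", "adamw"]),
    "lr": neps.Categorical([1e-3, 1e-4]),
}

def train(**config):
    # placeholder for a real DL training run; transient failures (OOM, node
    # preemption, dataloader hiccups) surface here as exceptions and crash the trial
    return 0.0

neps.run(
    evaluate_pipeline=train,
    pipeline_space=pipeline_space,
    root_directory="./neps_grid",
    optimizer="grid_search",
    ignore_errors=False,  # crashed configs must be retried later, not silently dropped
)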

PPS: this small example may be too simplistic; happy to discuss the exact DL setup in which this issue shows up

✌️

Metadata

Labels: documentation (Improvements or additions to documentation)