Releases: NVIDIA-NeMo/Evaluator
NVIDIA NeMo Evaluator 0.1.77
nemo-evaluator-v0.1.77 Merge branch 'deploy-release/bef4b952-c0f3-40fa-b5fd-320d86b86e37'
NVIDIA NeMo Evaluator Launcher 0.1.78
nemo-evaluator-launcher-v0.1.78 Merge branch 'deploy-release/bef4b952-c0f3-40fa-b5fd-320d86b86e37'
NVIDIA NeMo Evaluator 0.1.76
nemo-evaluator-v0.1.76 feat(slurm): add launcher_install_cmd option for custom auto-export i…
NVIDIA NeMo Evaluator Launcher 0.1.77
nemo-evaluator-launcher-v0.1.77 feat(slurm): add launcher_install_cmd option for custom auto-export i…
NVIDIA NeMo Evaluator 0.1.75
chore: Fix max_walltime docs (#685) Signed-off-by: Wojciech Prazuch <[email protected]>
NVIDIA NeMo Evaluator Launcher 0.1.76
chore: Fix max_walltime docs (#685) Signed-off-by: Wojciech Prazuch <[email protected]>
NVIDIA NeMo Evaluator 0.1.74
nemo-evaluator-v0.1.74 ci: Fix integration test by avoid writing to read-only test directory…
NVIDIA NeMo Evaluator Launcher 0.1.75
nemo-evaluator-launcher-v0.1.75 ci: Fix integration test by avoid writing to read-only test directory…
NVIDIA NeMo Evaluator 0.1.73
fix(slurm): node_array undefined (#671)
## Summary
When running the launcher on Slurm with `deployment.type: none`, the
generated sbatch script could fail at runtime with:
- `line N: nodes_array[0]: unbound variable`
This was triggered by `set -u` (nounset) and an unconditional
`--nodelist ${nodes_array[0]}` in the evaluation client `srun`.
## Impact
- **Configs affected**: any Slurm run with `deployment.type=none` (e.g.,
“target-only” evaluation).
- **Failure mode**: sbatch script exits before launching the evaluation
client.
- **Where observed**: Slurm job log (`slurm_script` / `slurm-%A.log`).
## Direct cause
- The sbatch script enables:
  - `set -u` (treat unset variables as an error)
- The evaluation client `srun` was emitted as:
  - `srun ... --nodelist ${nodes_array[0]} ...`
- `nodes_array` was only defined inside the deployment block
  (`if cfg.deployment.type != "none": ...`).
- Therefore, for `deployment.type=none`, `nodes_array` was undefined and
  `${nodes_array[0]}` crashed under nounset.
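The failure described above can be reproduced outside Slurm. The following is a minimal sketch, not the generated sbatch script: it runs the same expansion in a child `bash` with `set -u` and never defines `nodes_array`. The `echo` stands in for the real `srun` call, and the variable names `status` and `output` are illustrative only.

```bash
#!/usr/bin/env bash
# Run the failing expansion in a child shell so the error does not abort this script.
# nodes_array is deliberately never defined, mirroring deployment.type=none.
status=0
output="$(bash -c 'set -u; echo "srun --nodelist ${nodes_array[0]}"' 2>&1)" || status=$?

# The child shell aborts on the expansion, so the echo inside it never runs.
echo "child exit status: ${status}"
echo "child output: ${output}"
```

The child shell exits with a non-zero status and an `unbound variable` message. The exact prefix of that message (shell name, line number) varies between bash versions.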
## Secondary risks (also addressed)
Even when deployment is enabled, `${nodes_array[0]}` can still fail if:
- `$SLURM_JOB_NODELIST` is unset/empty (non-standard environment) or
only `$SLURM_NODELIST` is present.
- `scontrol` is unavailable on the node or not in `PATH`.
- `scontrol show hostnames ...` returns an empty list.
Any of these can result in an empty/unset array index under `set -u`.
## Solution
### Approach
Introduce a **single, always-defined** “node pinning” variable for
single-node sruns:
- `PRIMARY_NODE`
This is resolved at runtime in the sbatch script with safe fallbacks:
1. Prefer `SLURM_JOB_NODELIST`
2. Fallback to `SLURM_NODELIST`
3. Fallback to local `hostname`
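The fallback order above could look roughly like the sketch below. This is an illustration under stated assumptions, not the code the launcher emits. The helper names `first_host` and `resolve_primary_node` are hypothetical. Expanding the nodelist with `scontrol show hostnames` is assumed from the secondary-risks list. The `uname -n` fallback is an extra safeguard added here for hosts without a `hostname` binary.

```bash
#!/usr/bin/env bash
set -u

# Hypothetical helper: print the first hostname of a Slurm nodelist,
# or print nothing if the list is empty or scontrol is unavailable.
first_host() {
    local nodelist="$1"
    [ -n "$nodelist" ] || return 0
    command -v scontrol >/dev/null 2>&1 || return 0
    scontrol show hostnames "$nodelist" 2>/dev/null | head -n 1
}

# Hypothetical helper: resolve a node name that is always non-empty.
resolve_primary_node() {
    local node=""
    # 1. Prefer SLURM_JOB_NODELIST (":-" keeps set -u from tripping when unset).
    node="$(first_host "${SLURM_JOB_NODELIST:-}")"
    # 2. Fall back to SLURM_NODELIST.
    if [ -z "$node" ]; then
        node="$(first_host "${SLURM_NODELIST:-}")"
    fi
    # 3. Fall back to the local hostname (uname -n is an added safeguard).
    if [ -z "$node" ]; then
        node="$(hostname 2>/dev/null || uname -n)"
    fi
    printf '%s\n' "$node"
}

PRIMARY_NODE="$(resolve_primary_node)"
echo "PRIMARY_NODE=${PRIMARY_NODE}"
```

Because every expansion uses a `:-` default and the last step always yields a value, `PRIMARY_NODE` is defined even when no deployment block runs. A single-node `srun` can then pass `--nodelist "${PRIMARY_NODE}"` safely under `set -u`.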
---------
Signed-off-by: Alex Gronskiy <[email protected]>
NVIDIA NeMo Evaluator 0.1.72
fix: restore support for running tasks not listed in FDF (#667)
We improved our validation in the spirit of failing early. However, this led to an unwanted side effect: we lost support for running tasks not listed in the FDF with the `harness.task` syntax. Calling an evaluation with this syntax resulted in `nemo_evaluator.core.utils.MisconfigurationError: Unknown evaluation xxx`.
It stopped working because:
* we run validation (everything passes here)
* then we prepare the config, extracting `task` from `harness.task` and using it as the evaluation `type`
* we run a second validation, which fails because we no longer use the `harness.task` syntax and there is no evaluation called `task` in the FDF
This PR uses `harness.task` as `type` to make sure it is always valid, and adds a test verifying custom task support. It also removes one redundant validation.
---------
Signed-off-by: Marta Stepniewska-Dziubinska <[email protected]>