fix: harden AutoRL-Bench RL evaluators by couragec · Pull Request #1402 · microsoft/RD-Agent

couragec · 2026-04-28T10:48:33Z

Summary

This PR hardens the AutoRL-Bench RL evaluators by backporting a small set
of targeted fixes:

- Fix ALFWorld vLLM evaluation crashes from overlong ReAct history by
  truncating prompts to fit the configured context window.
- Ensure ALFWorld restores stdout and cleans up the environment / vLLM
  backend even when evaluation fails.
- Propagate vLLM-safe environment variables to OpenCompass subprocesses
  and set enforce_eager=True in the OpenCompass template.
- Prevent failed baseline evaluations from silently becoming cached or
  returned as valid 0.0 scores.
- Improve timeout cleanup by killing the process session/tree, while
  keeping kill_process_group as a compatibility alias.
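
For reviewers, the prompt-truncation fix in the first item can be illustrated with a minimal sketch. This is not the code in this PR: the function name `truncate_history`, its parameters, and the whitespace-based token counter are assumptions made for illustration, and a real evaluator would count tokens with the model's own tokenizer. The idea is to keep the task prompt plus as many of the most recent ReAct steps as fit in the configured context window.

```python
from typing import Callable, List


def count_tokens_whitespace(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


def truncate_history(
    task_prompt: str,
    history: List[str],
    max_context_tokens: int,
    reserved_for_output: int = 0,
    count_tokens: Callable[[str], int] = count_tokens_whitespace,
) -> List[str]:
    """Return the longest suffix of `history` that fits the token budget.

    The budget is the context window minus the space reserved for the
    model's output and the tokens used by the task prompt itself.
    """
    budget = max_context_tokens - reserved_for_output - count_tokens(task_prompt)
    kept: List[str] = []
    # Walk from the newest step backwards so recent context is kept first.
    for step in reversed(history):
        cost = count_tokens(step)
        if cost > budget:
            break
        kept.append(step)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept
```

Dropping the oldest steps first is one possible policy; it keeps the most recent observations, which a ReAct agent usually needs most, at the cost of losing early context.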
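
The timeout cleanup in the last item can be sketched roughly as follows. This is a POSIX-only illustration, not the implementation in this PR: `run_with_timeout` and `kill_process_session` are hypothetical names, and only the `kill_process_group` alias name comes from the description above. The child is started in a new session so that it leads its own process group, and on timeout that group is signalled as a whole rather than only the direct child.

```python
import os
import signal
import subprocess
from typing import List, Optional


def kill_process_session(proc: subprocess.Popen) -> None:
    """Kill the process group led by `proc` (hypothetical helper, POSIX only)."""
    try:
        # With start_new_session=True the child is a session and group
        # leader, so its process group id equals its pid.
        os.killpg(proc.pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # the group has already exited
    proc.wait()  # reap the child so it does not linger as a zombie


# Old name kept as an alias so existing callers keep working.
kill_process_group = kill_process_session


def run_with_timeout(cmd: List[str], timeout: float) -> Optional[int]:
    """Run `cmd`; return its exit code, or None if it was killed on timeout."""
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        kill_process_session(proc)
        return None
```

Note that descendants which move themselves into a different process group would not be reached by this sketch; walking the full process tree would need extra bookkeeping.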

python -m py_compile rdagent/scenarios/rl/autorl_bench/benchmarks/alfworld/eval.py rdagent/scenarios/rl/autorl_bench/core/evaluator.py rdagent/scenarios/rl/autorl_bench/core/utils.py rdagent/scenarios/rl/autorl_bench/core/opencompass.py rdagent/scenarios/rl/autorl_bench/core/__init__.py rdagent/scenarios/rl/autorl_bench/test/test_fixes.py
ruff check --no-fix --select F,E9 rdagent/scenarios/rl/autorl_bench/benchmarks/alfworld/eval.py rdagent/scenarios/rl/autorl_bench/core/evaluator.py rdagent/scenarios/rl/autorl_bench/core/utils.py rdagent/scenarios/rl/autorl_bench/core/opencompass.py rdagent/scenarios/rl/autorl_bench/core/__init__.py rdagent/scenarios/rl/autorl_bench/test/test_fixes.py
/tmp/rdagent-test-venv/bin/python -m rdagent.scenarios.rl.autorl_bench.test.test_fixes

Result: 24 passed, 0 failed.

couragec added 3 commits April 28, 2026 18:45

fix: harden AutoRL-Bench RL evaluators

0524fcc

style: sort AutoRL-Bench test imports

6af6133

style: format AutoRL-Bench OpenCompass error message

922d23e

couragec merged commit 0ca1609 into main Apr 28, 2026
9 checks passed

couragec deleted the fix/autorl-bench-rl-bugfixes branch April 28, 2026 11:08