[tool, perf] feat: add reward timing metrics in agent loop#5896
guillemgt wants to merge 1 commit into verl-project:main
Conversation
Code Review
This pull request introduces timing measurements for the reward calculation process within the agent loop. It adds a reward field to the AgentLoopMetrics class, utilizes a simple_timer to capture the duration of the _compute_score method, and updates the performance metrics aggregation logic to include statistics for reward timing. A performance optimization was suggested for the _performance_metrics method to avoid redundant iterations over the metrics structure by flattening it once before extracting individual fields.
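As a rough illustration of the change described above, here is a minimal sketch of a metrics container with the new field. This uses a plain dataclass for self-containedness; the actual `AgentLoopMetrics` class in verl may be defined differently, and the field values below are illustrative.

```python
from dataclasses import dataclass


@dataclass
class AgentLoopMetrics:
    # Existing per-sample timings, in seconds
    generate_sequences: float = 0.0
    tool_calls: float = 0.0
    # Field added by this PR: time spent in _compute_score
    reward: float = 0.0


m = AgentLoopMetrics(generate_sequences=1.5, tool_calls=0.3, reward=0.2)
print(m.reward)
```

Because the field defaults to `0.0`, samples whose reward is computed elsewhere (outside the agent loop) still produce a well-formed metrics record.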
```python
t_generate_sequences = np.array([metric["generate_sequences"] for chunk in metrics for metric in chunk])
t_tool_calls = np.array([metric["tool_calls"] for chunk in metrics for metric in chunk])
t_reward = np.array([metric["reward"] for chunk in metrics for metric in chunk])
num_preempted = np.array([metric["num_preempted"] for chunk in metrics for metric in chunk])
```
The current implementation iterates over the nested metrics structure four times to create NumPy arrays. For large batches or many workers, this can be optimized by flattening the metrics once and then extracting the required fields. This improves efficiency, which aligns with the [perf] tag in the PR title.
Suggested change:

```diff
-t_generate_sequences = np.array([metric["generate_sequences"] for chunk in metrics for metric in chunk])
-t_tool_calls = np.array([metric["tool_calls"] for chunk in metrics for metric in chunk])
-t_reward = np.array([metric["reward"] for chunk in metrics for metric in chunk])
-num_preempted = np.array([metric["num_preempted"] for chunk in metrics for metric in chunk])
+flat_metrics = [metric for chunk in metrics for metric in chunk]
+t_generate_sequences = np.array([m["generate_sequences"] for m in flat_metrics])
+t_tool_calls = np.array([m["tool_calls"] for m in flat_metrics])
+t_reward = np.array([m["reward"] for m in flat_metrics])
+num_preempted = np.array([m["num_preempted"] for m in flat_metrics])
```
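A quick sanity check that the flatten-once pattern is equivalent to the four-pass version. The nested `metrics` structure here is a hypothetical stand-in (one list of per-sample dicts per worker chunk); the real structure comes from the agent loop workers.

```python
import numpy as np

# Hypothetical nested metrics: one list of per-sample dicts per worker chunk
metrics = [
    [{"generate_sequences": 1.0, "tool_calls": 0.2, "reward": 0.1, "num_preempted": 0}],
    [{"generate_sequences": 2.0, "tool_calls": 0.4, "reward": 0.3, "num_preempted": 1}],
]

# Flatten once, then extract each field from the flat list
flat_metrics = [metric for chunk in metrics for metric in chunk]
t_reward = np.array([m["reward"] for m in flat_metrics])

# Equivalent to the original version that re-iterates the nested structure
t_reward_old = np.array([metric["reward"] for chunk in metrics for metric in chunk])
assert np.array_equal(t_reward, t_reward_old)
print(t_reward.tolist())  # → [0.1, 0.3]
```

The saving is one pass over the nested structure per extracted field; for four fields that is roughly a 4x reduction in traversal work, though the arrays themselves are still built per field.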
What does this PR do?
Currently, the `timing_s/reward` metric is meaningless when the reward computation happens in the agent loop. This PR adds per-sample timing for the reward computation (`_compute_score`) in the agent loop, following the same pattern as the existing `generate_sequences` and `tool_calls` metrics. This makes it possible to identify whether generation or reward is the bottleneck when both run asynchronously for different samples.

New metrics:

- `timing_s/agent_loop/reward/(min|mean|max)`: per-sample reward computation time
- `timing_s/agent_loop/slowest/reward`: reward time for the bottleneck sample

The slowest-sample calculation now includes reward time in the total.
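The aggregation described above can be sketched as follows. All timing values are illustrative, and the metric-dict construction is a simplified stand-in for what `_performance_metrics` actually does.

```python
import numpy as np

# Hypothetical per-sample timings in seconds (values are illustrative)
t_generate = np.array([1.2, 0.8, 2.5])
t_tool_calls = np.array([0.1, 0.3, 0.2])
t_reward = np.array([0.4, 0.1, 0.6])

# min/mean/max aggregation for the new per-sample reward timing
metrics = {
    "timing_s/agent_loop/reward/min": float(t_reward.min()),
    "timing_s/agent_loop/reward/mean": float(t_reward.mean()),
    "timing_s/agent_loop/reward/max": float(t_reward.max()),
}

# The slowest (bottleneck) sample is picked by total time, which now
# includes reward time; its reward time is reported separately
total = t_generate + t_tool_calls + t_reward
slowest = int(np.argmax(total))
metrics["timing_s/agent_loop/slowest/reward"] = float(t_reward[slowest])
print(slowest)  # → 2
```

Note that including reward in the total can change which sample is considered slowest: a sample with fast generation but a slow reward call may now be the bottleneck.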
Checklist Before Starting
- Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`, like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API, add `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`

Test
This is a metrics-only change (no behavioral changes). The new timing metrics appear alongside the existing `agent_loop/generate_sequences/*` and `agent_loop/tool_calls/*` metrics when running agent loop training with async rewards.

API and Usage Example
No API changes. New metrics are automatically logged.
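To illustrate the timing pattern the PR follows, here is a minimal stand-in for verl's `simple_timer` helper wrapping a hypothetical `_compute_score`. Both names are placeholders for this sketch; the real helper and method live in the verl codebase and may differ in signature.

```python
import time
from contextlib import contextmanager


# Minimal stand-in for verl's simple_timer helper: records the elapsed
# wall-clock time of the wrapped block into timing_raw[name]
@contextmanager
def simple_timer(name, timing_raw):
    start = time.time()
    try:
        yield
    finally:
        timing_raw[name] = time.time() - start


def _compute_score():
    # Hypothetical stand-in for the real reward computation
    time.sleep(0.01)
    return 1.0


timing_raw = {}
with simple_timer("reward", timing_raw):
    score = _compute_score()
print(score, timing_raw["reward"] > 0)
```

The recorded `timing_raw["reward"]` value then feeds into the per-sample `AgentLoopMetrics` record, exactly as `generate_sequences` and `tool_calls` already do.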
Design & Code Changes
Single file change in `verl/experimental/agent_loop/agent_loop.py`:

- Added a `reward: float = 0.0` field to `AgentLoopMetrics`
- Wrapped the `_compute_score` call with `simple_timer` in `_agent_loop_postprocess`
- Updated `_performance_metrics` to aggregate the new timing

Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Apply pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Once your PR is ready for CI, send a message in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If this PR involves changes to the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`. (not applicable)