[tool, perf] feat: add reward timing metrics in agent loop#5896

Open
guillemgt wants to merge 1 commit into verl-project:main from guillemgt:upstreaming/agent-loop-reward-timing
Conversation

@guillemgt
Contributor

What does this PR do?

Currently, the timing_s/reward metric is meaningless when the reward computation happens in the agent loop.

This PR adds per-sample timing for the reward computation (_compute_score) in the agent loop, following the same pattern as existing generate_sequences and tool_calls metrics. This enables identifying whether generation or reward is the bottleneck when both run asynchronously for different samples.

New metrics:

  • timing_s/agent_loop/reward/(min|mean|max) — per-sample reward computation time
  • timing_s/agent_loop/slowest/reward — reward time for the bottleneck sample

The slowest sample calculation now includes reward time in the total.
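
The aggregation described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: the helper name aggregate_reward_timing and the exact per-sample dict layout are hypothetical, though the metric keys mirror the ones listed in this PR.

```python
import numpy as np

def aggregate_reward_timing(metrics):
    """metrics: list of per-worker chunks, each a list of per-sample timing dicts."""
    flat = [m for chunk in metrics for m in chunk]
    t_reward = np.array([m["reward"] for m in flat])
    t_generate = np.array([m["generate_sequences"] for m in flat])
    t_tools = np.array([m["tool_calls"] for m in flat])

    # Per-sample reward timing statistics.
    out = {
        "timing_s/agent_loop/reward/min": float(t_reward.min()),
        "timing_s/agent_loop/reward/mean": float(t_reward.mean()),
        "timing_s/agent_loop/reward/max": float(t_reward.max()),
    }
    # The slowest sample is the one with the largest total time
    # (generation + tool calls + reward); report its reward component.
    slowest = int(np.argmax(t_generate + t_tools + t_reward))
    out["timing_s/agent_loop/slowest/reward"] = float(t_reward[slowest])
    return out
```

Including reward time in the per-sample total is what lets the slowest/* metrics attribute the bottleneck correctly when rewards run asynchronously.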

Checklist Before Starting

  • [x] Search for similar PRs. Paste at least one query link here: https://github.com/volcengine/verl/pulls?q=is%3Apr+timing
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

This is a metrics-only change (no behavioral changes). The new timing metrics appear alongside existing agent_loop/generate_sequences/* and agent_loop/tool_calls/* metrics when running agent loop training with async rewards.

API and Usage Example

No API changes. New metrics are automatically logged.

Design & Code Changes

Single file change in verl/experimental/agent_loop/agent_loop.py:

  1. Added reward: float = 0.0 field to AgentLoopMetrics
  2. Wrapped _compute_score call with simple_timer in _agent_loop_postprocess
  3. Added min/mean/max aggregation and slowest-sample tracking in _performance_metrics
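
The timing pattern in steps 1 and 2 can be sketched as below. This is a self-contained stand-in, not the file's actual code: the simple_timer here mimics verl's helper (which records elapsed seconds into a dict under the given key), AgentLoopMetrics is reduced to its timing fields, and compute_score_stub is a hypothetical placeholder for the real _compute_score call.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass

@contextmanager
def simple_timer(name, timing_raw):
    # Stand-in for verl's simple_timer: accumulates elapsed wall time
    # into timing_raw[name].
    start = time.perf_counter()
    try:
        yield
    finally:
        timing_raw[name] = timing_raw.get(name, 0.0) + time.perf_counter() - start

@dataclass
class AgentLoopMetrics:
    generate_sequences: float = 0.0
    tool_calls: float = 0.0
    reward: float = 0.0  # field added by this PR

def compute_score_stub():
    # Placeholder for the real per-sample reward computation.
    time.sleep(0.01)
    return 1.0

timing = {}
with simple_timer("reward", timing):
    score = compute_score_stub()
metrics = AgentLoopMetrics(reward=timing["reward"])
```

Reusing the same simple_timer pattern as generate_sequences and tool_calls keeps the new field directly comparable with the existing per-sample timings.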

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces timing measurements for the reward calculation process within the agent loop. It adds a reward field to the AgentLoopMetrics class, utilizes a simple_timer to capture the duration of the _compute_score method, and updates the performance metrics aggregation logic to include statistics for reward timing. A performance optimization was suggested for the _performance_metrics method to avoid redundant iterations over the metrics structure by flattening it once before extracting individual fields.

Comment on lines 1181 to 1184
t_generate_sequences = np.array([metric["generate_sequences"] for chunk in metrics for metric in chunk])
t_tool_calls = np.array([metric["tool_calls"] for chunk in metrics for metric in chunk])
t_reward = np.array([metric["reward"] for chunk in metrics for metric in chunk])
num_preempted = np.array([metric["num_preempted"] for chunk in metrics for metric in chunk])


Severity: high

The current implementation iterates over the nested metrics structure four times to create NumPy arrays. For large batches or many workers, this can be optimized by flattening the metrics once and then extracting the required fields. This improves efficiency, which aligns with the [perf] tag in the PR title.

Suggested change

t_generate_sequences = np.array([metric["generate_sequences"] for chunk in metrics for metric in chunk])
t_tool_calls = np.array([metric["tool_calls"] for chunk in metrics for metric in chunk])
t_reward = np.array([metric["reward"] for chunk in metrics for metric in chunk])
num_preempted = np.array([metric["num_preempted"] for chunk in metrics for metric in chunk])

flat_metrics = [metric for chunk in metrics for metric in chunk]
t_generate_sequences = np.array([m["generate_sequences"] for m in flat_metrics])
t_tool_calls = np.array([m["tool_calls"] for m in flat_metrics])
t_reward = np.array([m["reward"] for m in flat_metrics])
num_preempted = np.array([m["num_preempted"] for m in flat_metrics])

