[tool, perf] feat: add reward timing metrics in agent loop#5896

Open
guillemgt wants to merge 1 commit into verl-project:main from guillemgt:upstreaming/agent-loop-reward-timing
Conversation

@guillemgt
Contributor

What does this PR do?

Currently, the timing_s/reward metric is meaningless when the reward computation happens in the agent loop.

This PR adds per-sample timing for the reward computation (_compute_score) in the agent loop, following the same pattern as existing generate_sequences and tool_calls metrics. This enables identifying whether generation or reward is the bottleneck when both run asynchronously for different samples.

New metrics:

  • timing_s/agent_loop/reward/(min|mean|max) — per-sample reward computation time
  • timing_s/agent_loop/slowest/reward — reward time for the bottleneck sample

The slowest sample calculation now includes reward time in the total.
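
The aggregation described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: the helper name aggregate_reward_timing and the exact per-sample dict layout are hypothetical, though the metric keys mirror the ones listed in this PR.

```python
import numpy as np

def aggregate_reward_timing(metrics):
    """metrics: list of per-worker chunks, each a list of per-sample timing dicts."""
    flat = [m for chunk in metrics for m in chunk]
    t_reward = np.array([m["reward"] for m in flat])
    t_generate = np.array([m["generate_sequences"] for m in flat])
    t_tools = np.array([m["tool_calls"] for m in flat])

    # Per-sample reward timing statistics.
    out = {
        "timing_s/agent_loop/reward/min": float(t_reward.min()),
        "timing_s/agent_loop/reward/mean": float(t_reward.mean()),
        "timing_s/agent_loop/reward/max": float(t_reward.max()),
    }
    # The slowest sample is the one with the largest total time
    # (generation + tool calls + reward); report its reward component.
    slowest = int(np.argmax(t_generate + t_tools + t_reward))
    out["timing_s/agent_loop/slowest/reward"] = float(t_reward[slowest])
    return out
```

Including reward time in the per-sample total is what lets the slowest/* metrics attribute the bottleneck correctly when rewards run asynchronously.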

Checklist Before Starting

  • [x] Search for similar PRs. Paste at least one query link here: https://github.com/volcengine/verl/pulls?q=is%3Apr+timing
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

This is a metrics-only change (no behavioral changes). The new timing metrics appear alongside existing agent_loop/generate_sequences/* and agent_loop/tool_calls/* metrics when running agent loop training with async rewards.

API and Usage Example

No API changes. New metrics are automatically logged.

Design & Code Changes

Single file change in verl/experimental/agent_loop/agent_loop.py:

  1. Added reward: float = 0.0 field to AgentLoopMetrics
  2. Wrapped _compute_score call with simple_timer in _agent_loop_postprocess
  3. Added min/mean/max aggregation and slowest-sample tracking in _performance_metrics
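
The timing pattern in steps 1 and 2 can be sketched as below. This is a self-contained stand-in, not the file's actual code: the simple_timer here mimics verl's helper (which records elapsed seconds into a dict under the given key), AgentLoopMetrics is reduced to its timing fields, and compute_score_stub is a hypothetical placeholder for the real _compute_score call.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass

@contextmanager
def simple_timer(name, timing_raw):
    # Stand-in for verl's simple_timer: accumulates elapsed wall time
    # into timing_raw[name].
    start = time.perf_counter()
    try:
        yield
    finally:
        timing_raw[name] = timing_raw.get(name, 0.0) + time.perf_counter() - start

@dataclass
class AgentLoopMetrics:
    generate_sequences: float = 0.0
    tool_calls: float = 0.0
    reward: float = 0.0  # field added by this PR

def compute_score_stub():
    # Placeholder for the real per-sample reward computation.
    time.sleep(0.01)
    return 1.0

timing = {}
with simple_timer("reward", timing):
    score = compute_score_stub()
metrics = AgentLoopMetrics(reward=timing["reward"])
```

Reusing the same simple_timer pattern as generate_sequences and tool_calls keeps the new field directly comparable with the existing per-sample timings.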

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces timing measurements for the reward calculation process within the agent loop. It adds a reward field to the AgentLoopMetrics class, utilizes a simple_timer to capture the duration of the _compute_score method, and updates the performance metrics aggregation logic to include statistics for reward timing. A performance optimization was suggested for the _performance_metrics method to avoid redundant iterations over the metrics structure by flattening it once before extracting individual fields.

Comment on lines 1181 to 1184
t_generate_sequences = np.array([metric["generate_sequences"] for chunk in metrics for metric in chunk])
t_tool_calls = np.array([metric["tool_calls"] for chunk in metrics for metric in chunk])
t_reward = np.array([metric["reward"] for chunk in metrics for metric in chunk])
num_preempted = np.array([metric["num_preempted"] for chunk in metrics for metric in chunk])


Severity: high

The current implementation iterates over the nested metrics structure four times to create NumPy arrays. For large batches or many workers, this can be optimized by flattening the metrics once and then extracting the required fields. This improves efficiency, which aligns with the [perf] tag in the PR title.

Suggested change

t_generate_sequences = np.array([metric["generate_sequences"] for chunk in metrics for metric in chunk])
t_tool_calls = np.array([metric["tool_calls"] for chunk in metrics for metric in chunk])
t_reward = np.array([metric["reward"] for chunk in metrics for metric in chunk])
num_preempted = np.array([metric["num_preempted"] for chunk in metrics for metric in chunk])

flat_metrics = [metric for chunk in metrics for metric in chunk]
t_generate_sequences = np.array([m["generate_sequences"] for m in flat_metrics])
t_tool_calls = np.array([m["tool_calls"] for m in flat_metrics])
t_reward = np.array([m["reward"] for m in flat_metrics])
num_preempted = np.array([m["num_preempted"] for m in flat_metrics])

