Commit 93f1831
[reward] feat: add compute_score timing metrics to agent loop (#5971)
### What does this PR do?

Add a `compute_score` timing metric to `AgentLoopMetrics` in the agent loop to track the time spent on reward score computation (`_compute_score`). This helps identify reward computation bottlenecks during training.

**Changes:**

1. Added a `compute_score: float = 0.0` field to `AgentLoopMetrics`.
2. Instrumented `_compute_score()` with `simple_timer` to measure reward computation time per sample.
3. Added `agent_loop/compute_score/min|max|mean` and `agent_loop/slowest/compute_score` to the `_performance_metrics` aggregation.

This follows the same pattern as the existing `generate_sequences` and `tool_calls` timing metrics.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: https://github.com/volcengine/verl/pulls?q=is%3Apr+compute_score+metrics
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)

### Test

The change is backward-compatible:

- `AgentLoopMetrics.compute_score` defaults to `0.0`, so existing agent loops that do not use async reward will report `0.0` without breaking.
- When `reward_loop_worker_handles` is not `None`, `_compute_score` measures the full reward computation call and writes the elapsed time into `output.metrics.compute_score`.
- The `_performance_metrics` method safely aggregates `compute_score` from all samples, consistent with how `generate_sequences` and `tool_calls` are handled.

### API and Usage Example

No API changes.
The new metric is automatically reported in the training logs alongside existing metrics:

```
agent_loop/compute_score/min: 0.12
agent_loop/compute_score/max: 2.34
agent_loop/compute_score/mean: 0.78
agent_loop/slowest/compute_score: 2.34
```

### Design & Code Changes

- `verl/experimental/agent_loop/agent_loop.py`:
  - `AgentLoopMetrics`: added `compute_score` field
  - `AgentLoopWorker._compute_score()`: wrapped reward computation with `simple_timer`
  - `AgentLoopManager._performance_metrics()`: added min/max/mean/slowest aggregation for `compute_score`

### Checklist Before Submitting

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: The timing metric follows the exact same pattern as the existing `generate_sequences` and `tool_calls` metrics. Testing requires a GPU + reward model setup, which is covered by existing integration tests in `tests/experimental/reward_loop/`.
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).
- [ ] If your PR is related to the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.
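The `simple_timer` helper used for the instrumentation lives in `verl.utils.profiler`; its observed interface (a context manager that writes elapsed seconds into a caller-supplied dict under a given key) can be sketched minimally as follows. This is an illustrative re-implementation of the pattern, not verl's actual code, which may differ in detail:

```python
import time
from contextlib import contextmanager


@contextmanager
def simple_timer(name: str, timing: dict):
    """Record elapsed wall-clock seconds for the wrapped block into timing[name].

    Minimal sketch of the timing pattern this PR relies on; verl's real
    helper in verl.utils.profiler may differ in implementation detail.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        timing[name] = time.perf_counter() - start


# Usage mirroring the instrumented _compute_score():
timing = {}
with simple_timer("compute_score", timing):
    time.sleep(0.05)  # stand-in for the reward computation call
print(f"compute_score took {timing['compute_score']:.3f}s")
```

Because the elapsed time is recorded in the `finally` clause, the metric is populated even if the wrapped reward call raises.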
--------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
1 parent fba0939 commit 93f1831

File tree

1 file changed (+35, −25 lines)

verl/experimental/agent_loop/agent_loop.py

Lines changed: 35 additions & 25 deletions
```diff
@@ -40,6 +40,7 @@
 from verl.utils.config import omega_conf_to_dataclass
 from verl.utils.dataset.rl_dataset import RLHFDataset, get_dataset_class
 from verl.utils.model import compute_position_id_with_mask
+from verl.utils.profiler import simple_timer
 from verl.utils.ray_utils import auto_await, get_event_loop
 from verl.utils.rollout_trace import (
     RolloutTraceConfig,
@@ -180,6 +181,7 @@ class AgentLoopMetrics(BaseModel):

     generate_sequences: float = 0.0
     tool_calls: float = 0.0
+    compute_score: float = 0.0
     num_preempted: int = -1  # -1 means not available


@@ -857,30 +859,33 @@ async def _compute_score(self, output, prompts, responses, attention_mask, input
         enable_async_reward = self.reward_loop_worker_handles is not None

         if output.reward_score is None and enable_async_reward:
-            batch = TensorDict(
-                {
-                    "prompts": prompts,  # [1, prompt_length]
-                    "responses": responses,  # [1, response_length]
-                    "attention_mask": attention_mask,  # [1, prompt_length + response_length]
-                    "input_ids": input_ids,  # [1, prompt_length + response_length]
-                    "position_ids": position_ids,
-                },
-                batch_size=1,
-            )
-            non_tensor_batch = {
-                **{k: np.array([v]) for k, v in kwargs.items()},
-                "__num_turns__": np.array([output.num_turns]),
-                "tool_extra_fields": np.array([output.extra_fields], dtype=object),
-            }
-
-            data = DataProto(
-                batch=batch,
-                non_tensor_batch=non_tensor_batch,
-            )
-            selected_reward_loop_worker_handle = random.choice(self.reward_loop_worker_handles)
-            result = await selected_reward_loop_worker_handle.compute_score.remote(data)
-            output.reward_score = result["reward_score"]
-            output.extra_fields["reward_extra_info"] = result["reward_extra_info"]
+            timing = {}
+            with simple_timer("compute_score", timing):
+                batch = TensorDict(
+                    {
+                        "prompts": prompts,  # [1, prompt_length]
+                        "responses": responses,  # [1, response_length]
+                        "attention_mask": attention_mask,  # [1, prompt_length + response_length]
+                        "input_ids": input_ids,  # [1, prompt_length + response_length]
+                        "position_ids": position_ids,
+                    },
+                    batch_size=1,
+                )
+                non_tensor_batch = {
+                    **{k: np.array([v]) for k, v in kwargs.items()},
+                    "__num_turns__": np.array([output.num_turns]),
+                    "tool_extra_fields": np.array([output.extra_fields], dtype=object),
+                }
+
+                data = DataProto(
+                    batch=batch,
+                    non_tensor_batch=non_tensor_batch,
+                )
+                selected_reward_loop_worker_handle = random.choice(self.reward_loop_worker_handles)
+                result = await selected_reward_loop_worker_handle.compute_score.remote(data)
+                output.reward_score = result["reward_score"]
+                output.extra_fields["reward_extra_info"] = result["reward_extra_info"]
+            output.metrics.compute_score = timing["compute_score"]

     async def _compute_teacher_logprobs(self, output: AgentLoopOutput, prompt_ids, response_ids, validate):
         """Compute teacher logprobs for single sample."""
@@ -1200,6 +1205,7 @@ def _performance_metrics(self, metrics: list[list[dict[str, str]]], output: Data
         timing = {}
         t_generate_sequences = np.array([metric["generate_sequences"] for chunk in metrics for metric in chunk])
         t_tool_calls = np.array([metric["tool_calls"] for chunk in metrics for metric in chunk])
+        t_compute_score = np.array([metric["compute_score"] for chunk in metrics for metric in chunk])
         num_preempted = np.array([metric["num_preempted"] for chunk in metrics for metric in chunk])
         timing["agent_loop/num_preempted/min"] = num_preempted.min()
         timing["agent_loop/num_preempted/max"] = num_preempted.max()
@@ -1210,12 +1216,16 @@ def _performance_metrics(self, metrics: list[list[dict[str, str]]], output: Data
         timing["agent_loop/tool_calls/min"] = t_tool_calls.min()
         timing["agent_loop/tool_calls/max"] = t_tool_calls.max()
         timing["agent_loop/tool_calls/mean"] = t_tool_calls.mean()
+        timing["agent_loop/compute_score/min"] = t_compute_score.min()
+        timing["agent_loop/compute_score/max"] = t_compute_score.max()
+        timing["agent_loop/compute_score/mean"] = t_compute_score.mean()

         # batch sequence generation is bounded by the slowest sample
-        slowest = np.argmax(t_generate_sequences + t_tool_calls)
+        slowest = np.argmax(t_generate_sequences + t_tool_calls + t_compute_score)
         prompt_length = output.batch["prompts"].shape[1]
         timing["agent_loop/slowest/generate_sequences"] = t_generate_sequences[slowest]
         timing["agent_loop/slowest/tool_calls"] = t_tool_calls[slowest]
+        timing["agent_loop/slowest/compute_score"] = t_compute_score[slowest]
         timing["agent_loop/slowest/num_preempted"] = num_preempted[slowest]

         if "attention_mask" in output.batch:
```
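The aggregation added to `_performance_metrics` can be exercised standalone. Below is a toy sketch of the min/max/mean/slowest computation over per-sample metric dicts; the nested list-of-lists mirrors the `metrics: list[list[dict]]` chunking in the diff, and the timing values are fabricated for illustration, not from a real run:

```python
import numpy as np

# Per-sample metrics as emitted by each agent loop worker chunk (toy values).
metrics = [
    [{"generate_sequences": 1.0, "tool_calls": 0.2, "compute_score": 0.5}],
    [{"generate_sequences": 2.0, "tool_calls": 0.1, "compute_score": 0.3},
     {"generate_sequences": 0.8, "tool_calls": 0.4, "compute_score": 2.0}],
]

# Flatten the chunked metrics into one array per phase.
t_generate = np.array([m["generate_sequences"] for chunk in metrics for m in chunk])
t_tool_calls = np.array([m["tool_calls"] for chunk in metrics for m in chunk])
t_compute_score = np.array([m["compute_score"] for chunk in metrics for m in chunk])

timing = {
    "agent_loop/compute_score/min": t_compute_score.min(),
    "agent_loop/compute_score/max": t_compute_score.max(),
    "agent_loop/compute_score/mean": t_compute_score.mean(),
}

# The batch is bounded by the slowest sample summed across all three phases,
# so compute_score now participates in picking the "slowest" index.
slowest = np.argmax(t_generate + t_tool_calls + t_compute_score)
timing["agent_loop/slowest/compute_score"] = t_compute_score[slowest]
```

Including `t_compute_score` in the `argmax` sum means a sample that generates quickly but has a slow reward call can now be surfaced as the batch bottleneck, which is the point of the new metric.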
