
[doc] feat: add Claude Code skills for add-dataset, add-reward, add-trainer#5844

Open
khazic wants to merge 2 commits into verl-project:main from khazic:feat/skills-add-dataset-reward-trainer

Conversation


@khazic khazic commented Apr 1, 2026

What does this PR do?

Adds three Claude Code skills for the most common veRL contribution patterns. Split out from #5843 per review from @tongyx361.

Each file under .agents/skills/ is read by Claude Code when the user invokes the corresponding slash command. No runtime behavior is affected.

| Skill | Slash command | Purpose |
| --- | --- | --- |
| add-dataset | `/add-dataset` | Preprocess + integrate a new RL training dataset |
| add-reward | `/add-reward` | Implement a `compute_score` reward function |
| add-trainer | `/add-trainer` | Add a new algorithm / trainer recipe |

Usage examples

/add-dataset — adding AQuA-RAT

Prompt: /add-dataset I want to add the openai/aqua_rat multiple-choice math dataset.

Claude generates examples/data_preprocess/aqua_rat.py:

```python
data_source = "openai/aqua_rat"

def make_map_fn(split: str):
    def process_fn(example: dict, idx: int) -> dict:
        options_str = "\n".join(example["options"])
        content = f"{example['question']}\n\nOptions:\n{options_str}"
        return {
            "data_source": data_source,
            "prompt": [
                {"role": "system", "content": "Think step by step, then select the correct option (A-E)."},
                {"role": "user", "content": content},
            ],
            "ability": "math",
            "reward_model": {"style": "rule", "ground_truth": example["correct"]},
            "extra_info": {"split": split, "index": idx},
        }
    return process_fn
```

The skill correctly applied the required schema (data_source, prompt, reward_model.ground_truth) and matched the dataset field names.
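To sanity-check the mapping offline, the generated `process_fn` can be run on a hand-written row. This is a sketch: the sample dict below is illustrative and not real AQuA-RAT data, though the field names follow the snippet above.

```python
# Offline sanity check for the generated process_fn.
# The sample row below is hand-written for illustration, not real AQuA-RAT data.
data_source = "openai/aqua_rat"

def make_map_fn(split: str):
    def process_fn(example: dict, idx: int) -> dict:
        options_str = "\n".join(example["options"])
        content = f"{example['question']}\n\nOptions:\n{options_str}"
        return {
            "data_source": data_source,
            "prompt": [
                {"role": "system", "content": "Think step by step, then select the correct option (A-E)."},
                {"role": "user", "content": content},
            ],
            "ability": "math",
            "reward_model": {"style": "rule", "ground_truth": example["correct"]},
            "extra_info": {"split": split, "index": idx},
        }
    return process_fn

sample = {
    "question": "What is 2 + 2?",
    "options": ["A)3", "B)4", "C)5", "D)6", "E)7"],
    "correct": "B",
}
row = make_map_fn("train")(sample, 0)
print(row["reward_model"])  # {'style': 'rule', 'ground_truth': 'B'}
```

In a real preprocessing script this map function would be applied to the full dataset split and the result written out as parquet.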


/add-reward — multiple-choice answer extraction

Prompt: /add-reward for openai/aqua_rat — extract the chosen option letter (A–E).

Claude generates verl/utils/reward_score/aqua_rat.py:

```python
import re

def compute_score(solution_str: str, ground_truth: str) -> float:
    matches = re.findall(r'\b([A-E])\b', solution_str.split("assistant")[-1])
    if matches and matches[-1] == ground_truth.strip().upper():
        return 1.0
    return 0.0
```

And registers it in verl/utils/reward_score/__init__.py:

```python
elif data_source == "openai/aqua_rat":
    from . import aqua_rat
    res = aqua_rat.compute_score(solution_str, ground_truth)
```

The skill correctly followed the no-exceptions, return-float contract and matched the data_source key from the preprocessing step.
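The extraction contract is easy to exercise in isolation. The block below is a self-contained restatement of the snippet above; the completion strings are made up for illustration.

```python
import re

def compute_score(solution_str: str, ground_truth: str) -> float:
    # Look only at the text after the last "assistant" marker and take the
    # final standalone A-E letter as the model's chosen option.
    matches = re.findall(r'\b([A-E])\b', solution_str.split("assistant")[-1])
    if matches and matches[-1] == ground_truth.strip().upper():
        return 1.0
    return 0.0

print(compute_score("assistant The correct option is (B).", "B"))  # 1.0
print(compute_score("assistant I will go with D.", "B"))           # 0.0
print(compute_score("no option letter here", "B"))                 # 0.0
```

Note that the no-match and wrong-answer cases both return `0.0` rather than raising, which is the contract the registration dispatch relies on.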


/add-trainer — GRPO with clipped advantages

Prompt: /add-trainer I want a GRPO variant that clips advantages to [-clip, clip] before the policy update.

Claude generates examples/grpo_clip_trainer/grpo_clip_trainer.py:

```python
from verl.trainer.ppo.ray_trainer import RayPPOTrainer
from verl.trainer.ppo.core_algos import register_adv_est

@register_adv_est("grpo_clip")
def grpo_clip_estimator(token_level_rewards, response_mask, config, **kwargs):
    # _grpo_base stands in for the stock GRPO advantage computation.
    advantages, returns = _grpo_base(token_level_rewards, response_mask, config)
    clip = config.algorithm.get("advantage_clip", 5.0)
    advantages = advantages.clamp(-clip, clip)
    return advantages, returns

class GRPOClipTrainer(RayPPOTrainer):
    pass  # uses grpo_clip via config: algorithm.adv_estimator=grpo_clip
```

The skill correctly identified register_adv_est as the extension point and showed how to wire the config key through.
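The registration mechanics can be illustrated with a tiny stand-alone registry. This is a sketch of the decorator pattern only; the names below are illustrative and not verl's actual implementation.

```python
# Minimal sketch of a decorator-based estimator registry, mirroring the
# pattern behind register_adv_est. Names here are illustrative, not verl's.
ADV_ESTIMATORS = {}

def register_adv_est(name):
    def decorator(fn):
        ADV_ESTIMATORS[name] = fn
        return fn
    return decorator

@register_adv_est("grpo_clip")
def grpo_clip_estimator(advantages, clip=5.0):
    # Clamp advantages to [-clip, clip] before the policy update.
    return [max(-clip, min(clip, a)) for a in advantages]

# Config selects the estimator by name at runtime. Importing the module is
# what populates the registry, which is why the run script must import it.
estimator = ADV_ESTIMATORS["grpo_clip"]
print(estimator([-10.0, 0.5, 7.0], clip=5.0))  # [-5.0, 0.5, 5.0]
```

The key design point is that registration is a side effect of import, so a custom estimator only takes effect if its module is loaded before training starts.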



Related

  • Remaining skills will follow in separate PRs: add-unit-tests, review-pr, create-pr, commit-conventions, debug-distributed, upgrade skills

Co-authored-by: Shawn/Yuxuan Tong <tongyuxuan361@gmail.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces three new skill guides for the veRL framework: adding datasets, adding reward functions, and adding trainers. These guides provide step-by-step instructions, schema requirements, and reference implementations for developers. The review feedback identifies several critical technical inaccuracies in the documentation that would lead to runtime errors or incorrect implementations if followed. Specifically, the feedback corrects a class name in the dataset guide, an incomplete function signature in the reward function template, and misleading instructions regarding how to override advantage computation and execute custom trainers.


1. **Preprocessing script** (`examples/data_preprocess/<name>.py`) — run once offline to
convert raw data into parquet files with a fixed schema
2. **`RLDataset`** (`verl/utils/dataset/rl_dataset.py`) — runtime dataset class that


high

The class name in verl/utils/dataset/rl_dataset.py is RLHFDataset, not RLDataset. Referring to it as RLDataset in the skill instructions will likely cause the AI to generate incorrect import statements or class references when implementing new datasets or preprocessing scripts.

Suggested change

```diff
-2. **`RLDataset`** (`verl/utils/dataset/rl_dataset.py`) — runtime dataset class that
+2. **`RLHFDataset`** (`verl/utils/dataset/rl_dataset.py`) — runtime dataset class that
```

```python
from verl.utils.reward_score.<name> import compute_score as <name>_compute_score

def default_compute_score(data_source, solution_str, ground_truth, extra_info=None):
```


high

The signature for default_compute_score provided in the template is incomplete. The actual implementation in verl/utils/reward_score/__init__.py includes several additional parameters (like sandbox_fusion_url, concurrent_semaphore, etc.) and a **kwargs catch-all. If the AI follows this template to replace the function header, it will cause a TypeError at runtime when the function is called with the full set of arguments by the RewardManager.

Suggested change

```diff
-def default_compute_score(data_source, solution_str, ground_truth, extra_info=None):
+def default_compute_score(data_source, solution_str, ground_truth, extra_info=None, **kwargs):
```
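The failure mode the review describes is easy to reproduce in isolation. The function names below are hypothetical stubs; only the calling convention matters.

```python
# Hypothetical signatures illustrating the review point; the bodies are stubs.
def strict(data_source, solution_str, ground_truth, extra_info=None):
    return 0.0

def tolerant(data_source, solution_str, ground_truth, extra_info=None, **kwargs):
    # Extra keyword arguments from the caller (e.g. a sandbox URL) are
    # absorbed by **kwargs instead of raising.
    return 0.0

try:
    strict("src", "ans", "gt", sandbox_fusion_url="http://...")
except TypeError as exc:
    print("without **kwargs:", exc)

print("with **kwargs:", tolerant("src", "ans", "gt", sandbox_fusion_url="http://..."))
```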

Comment on lines +75 to +81
```python
def _compute_advantage(self, data: DataProto) -> DataProto:
    """Override advantage computation for your algorithm."""
    rewards = data.batch["token_level_scores"]  # shape: [bs, seqlen]
    # ... your advantage computation
    data.batch["advantages"] = advantages
    data.batch["returns"] = returns
    return data
```


high

The RayPPOTrainer class in verl/trainer/ppo/ray_trainer.py does not have a _compute_advantage method. Advantage computation is handled by a standalone compute_advantage function called within the fit method. Overriding _compute_advantage in a subclass as suggested here will have no effect on the training loop. To customize advantage estimation, the recommended approach is to register a new estimator function as described in Step 6 of this guide.


```shell
set -x

python3 -m verl.trainer.main_ppo \
```


high

The run script example uses python3 -m verl.trainer.main_ppo, which defaults to using the standard RayPPOTrainer implementation. If a user implements a custom trainer class like MyTrainer (as suggested in Step 3), this command will not execute their custom logic. The guide should instead show how to create a custom entry point that instantiates and runs the new trainer class, or how to configure the system to use the custom class.
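One way to address this is a wrapper that imports the recipe module before handing off to the stock entry point. This is a sketch under the assumption that estimator registration happens at import time; the module path and invocation are illustrative, not verl's documented CLI.

```shell
# Illustrative sketch only -- module paths are assumptions, not verl's documented CLI.
# The stock entry point never imports recipe code, so the estimator module must be
# imported first for its @register_adv_est decorator to run.
set -x

python3 - "$@" <<'EOF'
import runpy
import examples.grpo_clip_trainer.grpo_clip_trainer  # noqa: F401  registers "grpo_clip"
runpy.run_module("verl.trainer.main_ppo", run_name="__main__")
EOF
```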

…ner extension points

- add-dataset: RLDataset → RLHFDataset (actual class name in rl_dataset.py)
- add-reward: add **kwargs to default_compute_score template signature
- add-trainer: replace nonexistent _compute_advantage override with register_adv_est pattern
- add-trainer: fix run script entry point — import custom module before calling main_ppo

Co-authored-by: Shawn/Yuxuan Tong <tongyuxuan361@gmail.com>

khazic commented Apr 1, 2026

Summary

This PR adds three Claude Code skills that guide AI assistants through the most common veRL contribution patterns.

What each skill does:

  • `add-dataset`: walks through creating a preprocessing script with the correct parquet schema, wiring `RLHFDataset`, and pointing the trainer config at the output files
  • `add-reward`: covers implementing a `compute_score` function and registering it in `verl/utils/reward_score/__init__.py`
  • `add-trainer`: explains the `register_adv_est` extension point for new advantage estimators, and when to subclass `RayPPOTrainer` for structural changes instead

Improvements in the latest commit (based on Gemini review):

  • Corrected class name `RLDataset` → `RLHFDataset` to match the actual implementation
  • Fixed the template signature to include `**kwargs`, preventing a `TypeError` at runtime
  • Replaced the nonexistent `_compute_advantage` override with the correct `register_adv_est` decorator pattern
  • Fixed the run script entry point: the standard `main_ppo` does not load external modules automatically, so the custom estimator module must be imported first to trigger registration

