
[doc] feat: add Claude Code skills for add-dataset, add-reward, add-trainer#5844

Open
khazic wants to merge 2 commits into verl-project:main from khazic:feat/skills-add-dataset-reward-trainer

Conversation


@khazic khazic commented Apr 1, 2026

What does this PR do?

Adds three Claude Code skills for the most common veRL contribution patterns. Split out from #5843 per review from @tongyx361.

Each file under .agents/skills/ is read by Claude Code when the user invokes the corresponding slash command. No runtime behavior is affected.

| Skill | Slash command | Purpose |
| --- | --- | --- |
| add-dataset | `/add-dataset` | Preprocess + integrate a new RL training dataset |
| add-reward | `/add-reward` | Implement a `compute_score` reward function |
| add-trainer | `/add-trainer` | Add a new algorithm / trainer recipe |

Usage examples

/add-dataset — adding AQuA-RAT

Prompt: /add-dataset I want to add the openai/aqua_rat multiple-choice math dataset.

Claude generates examples/data_preprocess/aqua_rat.py:

```python
data_source = "openai/aqua_rat"

def make_map_fn(split: str):
    def process_fn(example: dict, idx: int) -> dict:
        options_str = "\n".join(example["options"])
        content = f"{example['question']}\n\nOptions:\n{options_str}"
        return {
            "data_source": data_source,
            "prompt": [
                {"role": "system", "content": "Think step by step, then select the correct option (A-E)."},
                {"role": "user", "content": content},
            ],
            "ability": "math",
            "reward_model": {"style": "rule", "ground_truth": example["correct"]},
            "extra_info": {"split": split, "index": idx},
        }
    return process_fn
```

The skill correctly applied the required schema (data_source, prompt, reward_model.ground_truth) and matched the dataset field names.
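To sanity-check the mapping offline, the generated `process_fn` can be run on a hand-written row. This is a sketch: the sample dict below is illustrative and not real AQuA-RAT data, though the field names follow the snippet above.

```python
# Offline sanity check for the generated process_fn.
# The sample row below is hand-written for illustration, not real AQuA-RAT data.
data_source = "openai/aqua_rat"

def make_map_fn(split: str):
    def process_fn(example: dict, idx: int) -> dict:
        options_str = "\n".join(example["options"])
        content = f"{example['question']}\n\nOptions:\n{options_str}"
        return {
            "data_source": data_source,
            "prompt": [
                {"role": "system", "content": "Think step by step, then select the correct option (A-E)."},
                {"role": "user", "content": content},
            ],
            "ability": "math",
            "reward_model": {"style": "rule", "ground_truth": example["correct"]},
            "extra_info": {"split": split, "index": idx},
        }
    return process_fn

sample = {
    "question": "What is 2 + 2?",
    "options": ["A)3", "B)4", "C)5", "D)6", "E)7"],
    "correct": "B",
}
row = make_map_fn("train")(sample, 0)
print(row["reward_model"])  # {'style': 'rule', 'ground_truth': 'B'}
```

In a real preprocessing script this map function would be applied to the full dataset split and the result written out as parquet.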


/add-reward — multiple-choice answer extraction

Prompt: /add-reward for openai/aqua_rat — extract the chosen option letter (A–E).

Claude generates verl/utils/reward_score/aqua_rat.py:

```python
import re

def compute_score(solution_str: str, ground_truth: str) -> float:
    matches = re.findall(r'\b([A-E])\b', solution_str.split("assistant")[-1])
    if matches and matches[-1] == ground_truth.strip().upper():
        return 1.0
    return 0.0
```

And registers it in verl/utils/reward_score/__init__.py:

```python
elif data_source == "openai/aqua_rat":
    from . import aqua_rat
    res = aqua_rat.compute_score(solution_str, ground_truth)
```

The skill correctly followed the no-exceptions, return-float contract and matched the data_source key from the preprocessing step.
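The extraction contract is easy to exercise in isolation. The block below is a self-contained restatement of the snippet above; the completion strings are made up for illustration.

```python
import re

def compute_score(solution_str: str, ground_truth: str) -> float:
    # Look only at the text after the last "assistant" marker and take the
    # final standalone A-E letter as the model's chosen option.
    matches = re.findall(r'\b([A-E])\b', solution_str.split("assistant")[-1])
    if matches and matches[-1] == ground_truth.strip().upper():
        return 1.0
    return 0.0

print(compute_score("assistant The correct option is (B).", "B"))  # 1.0
print(compute_score("assistant I will go with D.", "B"))           # 0.0
print(compute_score("no option letter here", "B"))                 # 0.0
```

Note that the no-match and wrong-answer cases both return `0.0` rather than raising, which is the contract the registration dispatch relies on.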


/add-trainer — GRPO with clipped advantages

Prompt: /add-trainer I want a GRPO variant that clips advantages to [-clip, clip] before the policy update.

Claude generates examples/grpo_clip_trainer/grpo_clip_trainer.py:

```python
from verl.trainer.ppo.ray_trainer import RayPPOTrainer
from verl.trainer.ppo.core_algos import register_adv_est

@register_adv_est("grpo_clip")
def grpo_clip_estimator(token_level_rewards, response_mask, config, **kwargs):
    # _grpo_base stands in for the stock GRPO advantage computation.
    advantages, returns = _grpo_base(token_level_rewards, response_mask, config)
    clip = config.algorithm.get("advantage_clip", 5.0)
    advantages = advantages.clamp(-clip, clip)
    return advantages, returns

class GRPOClipTrainer(RayPPOTrainer):
    pass  # uses grpo_clip via config: algorithm.adv_estimator=grpo_clip
```

The skill correctly identified register_adv_est as the extension point and showed how to wire the config key through.
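The registration mechanics can be illustrated with a tiny stand-alone registry. This is a sketch of the decorator pattern only; the names below are illustrative and not verl's actual implementation.

```python
# Minimal sketch of a decorator-based estimator registry, mirroring the
# pattern behind register_adv_est. Names here are illustrative, not verl's.
ADV_ESTIMATORS = {}

def register_adv_est(name):
    def decorator(fn):
        ADV_ESTIMATORS[name] = fn
        return fn
    return decorator

@register_adv_est("grpo_clip")
def grpo_clip_estimator(advantages, clip=5.0):
    # Clamp advantages to [-clip, clip] before the policy update.
    return [max(-clip, min(clip, a)) for a in advantages]

# Config selects the estimator by name at runtime. Importing the module is
# what populates the registry, which is why the run script must import it.
estimator = ADV_ESTIMATORS["grpo_clip"]
print(estimator([-10.0, 0.5, 7.0], clip=5.0))  # [-5.0, 0.5, 5.0]
```

The key design point is that registration is a side effect of import, so a custom estimator only takes effect if its module is loaded before training starts.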



Related

  • Remaining skills will follow in separate PRs: add-unit-tests, review-pr, create-pr, commit-conventions, debug-distributed, upgrade skills

Co-authored-by: Shawn/Yuxuan Tong <tongyuxuan361@gmail.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces three new skill guides for the veRL framework: adding datasets, adding reward functions, and adding trainers. These guides provide step-by-step instructions, schema requirements, and reference implementations for developers. The review feedback identifies several critical technical inaccuracies in the documentation that would lead to runtime errors or incorrect implementations if followed. Specifically, the feedback corrects a class name in the dataset guide, an incomplete function signature in the reward function template, and misleading instructions regarding how to override advantage computation and execute custom trainers.


1. **Preprocessing script** (`examples/data_preprocess/<name>.py`) — run once offline to
convert raw data into parquet files with a fixed schema
2. **`RLDataset`** (`verl/utils/dataset/rl_dataset.py`) — runtime dataset class that


high

The class name in verl/utils/dataset/rl_dataset.py is RLHFDataset, not RLDataset. Referring to it as RLDataset in the skill instructions will likely cause the AI to generate incorrect import statements or class references when implementing new datasets or preprocessing scripts.

Suggested change

```diff
-2. **`RLDataset`** (`verl/utils/dataset/rl_dataset.py`) — runtime dataset class that
+2. **`RLHFDataset`** (`verl/utils/dataset/rl_dataset.py`) — runtime dataset class that
```

```python
from verl.utils.reward_score.<name> import compute_score as <name>_compute_score

def default_compute_score(data_source, solution_str, ground_truth, extra_info=None):
```


high

The signature for default_compute_score provided in the template is incomplete. The actual implementation in verl/utils/reward_score/__init__.py includes several additional parameters (like sandbox_fusion_url, concurrent_semaphore, etc.) and a **kwargs catch-all. If the AI follows this template to replace the function header, it will cause a TypeError at runtime when the function is called with the full set of arguments by the RewardManager.

Suggested change

```diff
-def default_compute_score(data_source, solution_str, ground_truth, extra_info=None):
+def default_compute_score(data_source, solution_str, ground_truth, extra_info=None, **kwargs):
```
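The failure mode the review describes is easy to reproduce in isolation. The function names below are hypothetical stubs; only the calling convention matters.

```python
# Hypothetical signatures illustrating the review point; the bodies are stubs.
def strict(data_source, solution_str, ground_truth, extra_info=None):
    return 0.0

def tolerant(data_source, solution_str, ground_truth, extra_info=None, **kwargs):
    # Extra keyword arguments from the caller (e.g. a sandbox URL) are
    # absorbed by **kwargs instead of raising.
    return 0.0

try:
    strict("src", "ans", "gt", sandbox_fusion_url="http://...")
except TypeError as exc:
    print("without **kwargs:", exc)

print("with **kwargs:", tolerant("src", "ans", "gt", sandbox_fusion_url="http://..."))
```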

Comment on lines +75 to +81
```python
def _compute_advantage(self, data: DataProto) -> DataProto:
    """Override advantage computation for your algorithm."""
    rewards = data.batch["token_level_scores"]  # shape: [bs, seqlen]
    # ... your advantage computation
    data.batch["advantages"] = advantages
    data.batch["returns"] = returns
    return data
```


high

The RayPPOTrainer class in verl/trainer/ppo/ray_trainer.py does not have a _compute_advantage method. Advantage computation is handled by a standalone compute_advantage function called within the fit method. Overriding _compute_advantage in a subclass as suggested here will have no effect on the training loop. To customize advantage estimation, the recommended approach is to register a new estimator function as described in Step 6 of this guide.


```shell
set -x

python3 -m verl.trainer.main_ppo \
```


high

The run script example uses python3 -m verl.trainer.main_ppo, which defaults to using the standard RayPPOTrainer implementation. If a user implements a custom trainer class like MyTrainer (as suggested in Step 3), this command will not execute their custom logic. The guide should instead show how to create a custom entry point that instantiates and runs the new trainer class, or how to configure the system to use the custom class.
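One way to address this is a wrapper that imports the recipe module before handing off to the stock entry point. This is a sketch under the assumption that estimator registration happens at import time; the module path and invocation are illustrative, not verl's documented CLI.

```shell
# Illustrative sketch only -- module paths are assumptions, not verl's documented CLI.
# The stock entry point never imports recipe code, so the estimator module must be
# imported first for its @register_adv_est decorator to run.
set -x

python3 - "$@" <<'EOF'
import runpy
import examples.grpo_clip_trainer.grpo_clip_trainer  # noqa: F401  registers "grpo_clip"
runpy.run_module("verl.trainer.main_ppo", run_name="__main__")
EOF
```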

…ner extension points

- add-dataset: RLDataset → RLHFDataset (actual class name in rl_dataset.py)
- add-reward: add **kwargs to default_compute_score template signature
- add-trainer: replace nonexistent _compute_advantage override with register_adv_est pattern
- add-trainer: fix run script entry point — import custom module before calling main_ppo

Co-authored-by: Shawn/Yuxuan Tong <tongyuxuan361@gmail.com>

khazic commented Apr 1, 2026

Summary

This PR adds three Claude Code skills that guide AI assistants through the most common veRL contribution patterns.

What each skill does:

  • `add-dataset`: walks through creating a preprocessing script with the correct parquet schema, wiring `RLHFDataset`, and pointing the trainer config at the output files
  • `add-reward`: covers implementing a `compute_score` function and registering it in `verl/utils/reward_score/__init__.py`
  • `add-trainer`: explains the `register_adv_est` extension point for new advantage estimators, and when to subclass `RayPPOTrainer` for structural changes instead

Improvements in the latest commit (based on Gemini review):

  • Corrected class name `RLDataset` → `RLHFDataset` to match the actual implementation
  • Fixed the template signature to include `**kwargs`, preventing a `TypeError` at runtime
  • Replaced the nonexistent `_compute_advantage` override with the correct `register_adv_est` decorator pattern
  • Fixed the run script entry point: the standard `main_ppo` does not load external modules automatically, so the custom estimator module must be imported first to trigger registration

