
Commit 7290aef
[ci] chore: migrate all rm related ci to reward loop (verl-project#4520)
### What does this PR do?

- Migrate all Reward-Model-related CI to Reward Loop (verified).
- Set the naive router as the default for the reward loop.

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI).
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`.
  - If this PR involves multiple modules, separate them with `,`, e.g. `[megatron, fsdp, doc]`.
  - `{type}` is one of `feat`, `fix`, `refactor`, `chore`, `test`.
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), prepend `[BREAKING]` to the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results such as training curve plots or evaluation results.

### API and Usage Example

> Demonstrate how the API changes, if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review; otherwise the reviewer may deprioritize this PR.

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
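At a glance, every affected CI script swaps the legacy FSDP reward-model flags for the reward-loop rollout configuration. Below is a minimal before/after sketch using only flags that appear in this commit's diffs; `COMMON_FLAGS` is a hypothetical stand-in for each script's unchanged data/actor/critic/trainer overrides, and exact prompt/response lengths and TP sizes vary per script.

```bash
COMMON_FLAGS=()  # hypothetical: fill with the script's shared overrides

# Before: legacy FSDP reward-model worker (removed by this PR)
python3 -m verl.trainer.main_ppo "${COMMON_FLAGS[@]}" \
    reward_model.enable=True \
    reward_model.model.path=sfairXC/FsfairX-LLaMA3-RM-v0.1 \
    reward_model.model.fsdp_config.param_offload=True \
    reward_model.micro_batch_size_per_gpu=32

# After: reward loop, serving the RM through a vLLM rollout
python3 -m verl.trainer.main_ppo "${COMMON_FLAGS[@]}" \
    reward_model.enable=True \
    reward_model.model.path=sfairXC/FsfairX-LLaMA3-RM-v0.1 \
    reward_model.use_reward_loop=True \
    reward_model.rollout.name=vllm \
    reward_model.rollout.gpu_memory_utilization=0.8 \
    reward_model.rollout.tensor_model_parallel_size=1 \
    reward_model.rollout.prompt_length=8192 \
    reward_model.rollout.response_length=4096 \
    reward_model.num_workers=8
```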
1 parent f7c90d6 commit 7290aef

24 files changed (+222, -102 lines)

examples/grpo_trainer/run_mistral13b_skyworkrm_hhrlhf.sh

Lines changed: 7 additions & 3 deletions
```diff
@@ -34,10 +34,14 @@ python3 -m verl.trainer.main_ppo \
     actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
     actor_rollout_ref.model.enable_gradient_checkpointing=True \
     reward_model.enable=True \
-    reward_model.model.fsdp_config.param_offload=True \
     reward_model.model.path=Skywork/Skywork-Reward-Llama-3.1-8B \
-    reward_model.model.input_tokenizer=mistralai/Mistral-Nemo-Instruct-2407 \
-    reward_model.micro_batch_size_per_gpu=4 \
+    reward_model.use_reward_loop=True \
+    reward_model.rollout.name=vllm \
+    reward_model.rollout.gpu_memory_utilization=0.8 \
+    reward_model.rollout.tensor_model_parallel_size=1 \
+    reward_model.rollout.prompt_length=8192 \
+    reward_model.rollout.response_length=4096 \
+    reward_model.num_workers=8 \
     algorithm.use_kl_in_reward=False \
     trainer.logger='["console","wandb"]' \
     trainer.val_before_train=False \
```

examples/ppo_trainer/run_deepseek_full_hh_rlhf.sh

Lines changed: 7 additions & 3 deletions
```diff
@@ -25,10 +25,14 @@ python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megat
     critic.model.path=deepseek-ai/deepseek-llm-7b-chat \
     critic.ppo_micro_batch_size_per_gpu=4 \
     reward_model.enable=True \
-    reward_model.megatron.tensor_model_parallel_size=4 \
     reward_model.model.path=deepseek-ai/deepseek-llm-7b-chat \
-    reward_model.micro_batch_size_per_gpu=4 \
-    reward_model.param_offload=False \
+    reward_model.use_reward_loop=True \
+    reward_model.rollout.name=vllm \
+    reward_model.rollout.gpu_memory_utilization=0.8 \
+    reward_model.rollout.tensor_model_parallel_size=4 \
+    reward_model.rollout.prompt_length=256 \
+    reward_model.rollout.response_length=128 \
+    reward_model.num_workers=8 \
     algorithm.use_kl_in_reward=False \
     trainer.critic_warmup=0 \
     trainer.logger='["console","wandb"]' \
```

examples/ppo_trainer/run_qwen2-7b_rm.sh

Lines changed: 7 additions & 3 deletions
```diff
@@ -55,9 +55,13 @@ python3 -m verl.trainer.main_ppo \
     critic.model.fsdp_config.optimizer_offload=False \
     reward_model.enable=True \
     reward_model.model.path="$HOME/models/FsfairX-LLaMA3-RM-v0.1" \
-    reward_model.model.use_remove_padding=True \
-    reward_model.model.fsdp_config.param_offload=True \
-    reward_model.micro_batch_size_per_gpu=32 \
+    reward_model.use_reward_loop=True \
+    reward_model.rollout.name=vllm \
+    reward_model.rollout.gpu_memory_utilization=0.8 \
+    reward_model.rollout.tensor_model_parallel_size=1 \
+    reward_model.rollout.prompt_length=2048 \
+    reward_model.rollout.response_length=1024 \
+    reward_model.num_workers=8 \
     algorithm.use_kl_in_reward=False \
     trainer.critic_warmup=0 \
     trainer.logger='["console","wandb"]' \
```

examples/ppo_trainer/run_qwen2-7b_rm_seq_balance.sh

Lines changed: 7 additions & 5 deletions
```diff
@@ -42,11 +42,13 @@ python3 -m verl.trainer.main_ppo \
     critic.model.fsdp_config.optimizer_offload=False \
     reward_model.enable=True \
     reward_model.model.path=sfairXC/FsfairX-LLaMA3-RM-v0.1\
-    reward_model.model.use_remove_padding=True \
-    reward_model.model.fsdp_config.param_offload=True \
-    reward_model.micro_batch_size_per_gpu=32 \
-    reward_model.use_dynamic_bsz=True \
-    reward_model.forward_max_token_len_per_gpu=98304 \
+    reward_model.use_reward_loop=True \
+    reward_model.rollout.name=vllm \
+    reward_model.rollout.gpu_memory_utilization=0.8 \
+    reward_model.rollout.tensor_model_parallel_size=1 \
+    reward_model.rollout.prompt_length=8192 \
+    reward_model.rollout.response_length=4096 \
+    reward_model.num_workers=8 \
     algorithm.use_kl_in_reward=False \
     trainer.critic_warmup=0 \
     trainer.logger='["console","wandb"]' \
```

examples/ppo_trainer/run_qwen2-7b_rm_seq_balance_fused_kernels.sh

Lines changed: 8 additions & 6 deletions
```diff
@@ -45,12 +45,14 @@ python3 -m verl.trainer.main_ppo \
     critic.model.fsdp_config.param_offload=False \
     critic.model.fsdp_config.optimizer_offload=False \
     reward_model.enable=True \
-    reward_model.model.path=sfairXC/FsfairX-LLaMA3-RM-v0.1\
-    reward_model.model.use_remove_padding=True \
-    reward_model.model.fsdp_config.param_offload=True \
-    reward_model.micro_batch_size_per_gpu=32 \
-    reward_model.use_dynamic_bsz=True \
-    reward_model.forward_max_token_len_per_gpu=98304 \
+    reward_model.model.path=sfairXC/FsfairX-LLaMA3-RM-v0.1 \
+    reward_model.use_reward_loop=True \
+    reward_model.rollout.name=vllm \
+    reward_model.rollout.gpu_memory_utilization=0.8 \
+    reward_model.rollout.tensor_model_parallel_size=1 \
+    reward_model.rollout.prompt_length=8192 \
+    reward_model.rollout.response_length=4096 \
+    reward_model.num_workers=8 \
     algorithm.use_kl_in_reward=False \
     trainer.critic_warmup=0 \
     trainer.logger='["console","wandb"]' \
```

examples/ppo_trainer/run_qwen2-7b_rm_seq_balance_nsys.sh

Lines changed: 7 additions & 8 deletions
```diff
@@ -55,14 +55,13 @@ python3 -m verl.trainer.main_ppo \
     critic.profiler.all_ranks=$PROFILE_RANKS_ALL \
     reward_model.enable=True \
     reward_model.model.path=sfairXC/FsfairX-LLaMA3-RM-v0.1\
-    reward_model.model.use_remove_padding=True \
-    reward_model.model.fsdp_config.param_offload=True \
-    reward_model.micro_batch_size_per_gpu=32 \
-    reward_model.use_dynamic_bsz=True \
-    reward_model.forward_max_token_len_per_gpu=98304 \
-    reward_model.profiler.enable=True \
-    reward_model.profiler.ranks=$PROFILE_RANKS \
-    reward_model.profiler.all_ranks=$PROFILE_RANKS_ALL \
+    reward_model.use_reward_loop=True \
+    reward_model.rollout.name=vllm \
+    reward_model.rollout.gpu_memory_utilization=0.8 \
+    reward_model.rollout.tensor_model_parallel_size=1 \
+    reward_model.rollout.prompt_length=8192 \
+    reward_model.rollout.response_length=4096 \
+    reward_model.num_workers=8 \
     algorithm.use_kl_in_reward=False \
     trainer.critic_warmup=0 \
     trainer.logger='["console","wandb"]' \
```
Lines changed: 63 additions & 0 deletions
```diff
@@ -0,0 +1,63 @@
+# download datasets and models
+# python3 examples/data_preprocess/gsm8k.py
+# python3 examples/data_preprocess/math_dataset.py
+# huggingface-cli download Skywork/Skywork-Reward-V2-Llama-3.2-3B --local-dir $HOME/models/Skywork-Reward-V2-Llama-3.2-3B
+# huggingface-cli download Qwen/Qwen2.5-3B-Instruct --local-dir $HOME/models/Qwen2.5-3B-Instruct
+
+gsm8k_train_path=$HOME/data/gsm8k/train.parquet
+gsm8k_test_path=$HOME/data/gsm8k/test.parquet
+math_train_path=$HOME/data/math/train.parquet
+math_test_path=$HOME/data/math/test.parquet
+
+train_files="['$gsm8k_train_path', '$math_train_path']"
+test_files="['$gsm8k_test_path', '$math_test_path']"
+
+python3 -m verl.trainer.main_ppo \
+    algorithm.adv_estimator=gae \
+    data.train_files="$train_files" \
+    data.val_files="$test_files" \
+    data.train_batch_size=1024 \
+    data.max_prompt_length=1024 \
+    data.max_response_length=2048 \
+    data.filter_overlong_prompts=True \
+    data.truncation='error' \
+    data.return_raw_chat=True \
+    actor_rollout_ref.model.path="$HOME/models/Qwen2.5-3B-Instruct" \
+    actor_rollout_ref.actor.optim.lr=1e-6 \
+    actor_rollout_ref.model.use_remove_padding=True \
+    actor_rollout_ref.actor.optim.lr_warmup_steps_ratio=0.1 \
+    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
+    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16 \
+    actor_rollout_ref.actor.use_kl_loss=False \
+    actor_rollout_ref.model.enable_gradient_checkpointing=True \
+    actor_rollout_ref.actor.fsdp_config.param_offload=False \
+    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
+    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
+    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
+    actor_rollout_ref.rollout.name=vllm \
+    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
+    critic.optim.lr=1e-5 \
+    critic.model.use_remove_padding=True \
+    critic.optim.lr_warmup_steps_ratio=0.05 \
+    critic.model.path="$HOME/models/Qwen2.5-3B-Instruct" \
+    critic.model.enable_gradient_checkpointing=True \
+    critic.ppo_micro_batch_size_per_gpu=32 \
+    critic.model.fsdp_config.param_offload=False \
+    critic.model.fsdp_config.optimizer_offload=False \
+    reward_model.enable=True \
+    reward_model.model.path="$HOME/models/Skywork-Reward-V2-Llama-3.2-3B" \
+    reward_model.use_reward_loop=False \
+    reward_model.model.use_remove_padding=True \
+    reward_model.model.fsdp_config.param_offload=True \
+    reward_model.micro_batch_size_per_gpu=32 \
+    algorithm.use_kl_in_reward=False \
+    trainer.critic_warmup=0 \
+    trainer.logger='["console","wandb"]' \
+    trainer.project_name='verl_test_qwen25_rm' \
+    trainer.val_before_train=True \
+    trainer.experiment_name='legacy_fsdp_reward_model' \
+    trainer.n_gpus_per_node=8 \
+    trainer.nnodes=1 \
+    trainer.save_freq=-1 \
+    trainer.test_freq=10 \
+    trainer.total_epochs=15 $@
```
Lines changed: 66 additions & 0 deletions
```diff
@@ -0,0 +1,66 @@
+# download datasets and models
+# python3 examples/data_preprocess/gsm8k.py
+# python3 examples/data_preprocess/math_dataset.py
+# huggingface-cli download Skywork/Skywork-Reward-V2-Llama-3.2-3B --local-dir $HOME/models/Skywork-Reward-V2-Llama-3.2-3B
+# huggingface-cli download Qwen/Qwen2.5-3B-Instruct --local-dir $HOME/models/Qwen2.5-3B-Instruct
+
+gsm8k_train_path=$HOME/data/gsm8k/train.parquet
+gsm8k_test_path=$HOME/data/gsm8k/test.parquet
+math_train_path=$HOME/data/math/train.parquet
+math_test_path=$HOME/data/math/test.parquet
+
+train_files="['$gsm8k_train_path', '$math_train_path']"
+test_files="['$gsm8k_test_path', '$math_test_path']"
+
+python3 -m verl.trainer.main_ppo \
+    algorithm.adv_estimator=gae \
+    data.train_files="$train_files" \
+    data.val_files="$test_files" \
+    data.train_batch_size=1024 \
+    data.max_prompt_length=1024 \
+    data.max_response_length=2048 \
+    data.filter_overlong_prompts=True \
+    data.truncation='error' \
+    data.return_raw_chat=True \
+    actor_rollout_ref.model.path="$HOME/models/Qwen2.5-3B-Instruct" \
+    actor_rollout_ref.actor.optim.lr=1e-6 \
+    actor_rollout_ref.model.use_remove_padding=True \
+    actor_rollout_ref.actor.optim.lr_warmup_steps_ratio=0.1 \
+    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
+    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16 \
+    actor_rollout_ref.actor.use_kl_loss=False \
+    actor_rollout_ref.model.enable_gradient_checkpointing=True \
+    actor_rollout_ref.actor.fsdp_config.param_offload=False \
+    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
+    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
+    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
+    actor_rollout_ref.rollout.name=vllm \
+    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
+    critic.optim.lr=1e-5 \
+    critic.model.use_remove_padding=True \
+    critic.optim.lr_warmup_steps_ratio=0.05 \
+    critic.model.path="$HOME/models/Qwen2.5-3B-Instruct" \
+    critic.model.enable_gradient_checkpointing=True \
+    critic.ppo_micro_batch_size_per_gpu=32 \
+    critic.model.fsdp_config.param_offload=False \
+    critic.model.fsdp_config.optimizer_offload=False \
+    reward_model.enable=True \
+    reward_model.model.path="$HOME/models/Skywork-Reward-V2-Llama-3.2-3B" \
+    reward_model.use_reward_loop=True \
+    reward_model.rollout.name=vllm \
+    reward_model.rollout.gpu_memory_utilization=0.8 \
+    reward_model.rollout.tensor_model_parallel_size=1 \
+    reward_model.rollout.prompt_length=4096 \
+    reward_model.rollout.response_length=4096 \
+    reward_model.num_workers=8 \
+    algorithm.use_kl_in_reward=False \
+    trainer.critic_warmup=0 \
+    trainer.logger='["console","wandb"]' \
+    trainer.project_name='verl_test_qwen25_rm' \
+    trainer.val_before_train=False \
+    trainer.experiment_name='reward_loop_colocate_reward_model' \
+    trainer.n_gpus_per_node=8 \
+    trainer.nnodes=1 \
+    trainer.save_freq=-1 \
+    trainer.test_freq=10 \
+    trainer.total_epochs=15 $@
```
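The two new CI scripts above share the same PPO setup and differ mainly in the reward-model block (plus `trainer.val_before_train` and the experiment name), so CI exercises both code paths against the same workload. Below is a hedged sketch of just that toggle, with flag values copied from the two scripts; `COMMON_FLAGS` is a hypothetical stand-in for the shared data/actor/critic/trainer overrides.

```bash
COMMON_FLAGS=()  # hypothetical: fill with the shared overrides from either script

# Path 1: legacy FSDP reward-model worker
legacy_rm=(
    reward_model.use_reward_loop=False
    reward_model.model.use_remove_padding=True
    reward_model.model.fsdp_config.param_offload=True
    reward_model.micro_batch_size_per_gpu=32
)

# Path 2: reward loop with a colocated vLLM rollout
reward_loop_rm=(
    reward_model.use_reward_loop=True
    reward_model.rollout.name=vllm
    reward_model.rollout.gpu_memory_utilization=0.8
    reward_model.rollout.tensor_model_parallel_size=1
    reward_model.rollout.prompt_length=4096
    reward_model.rollout.response_length=4096
    reward_model.num_workers=8
)

python3 -m verl.trainer.main_ppo "${COMMON_FLAGS[@]}" "${legacy_rm[@]}"       # legacy path
python3 -m verl.trainer.main_ppo "${COMMON_FLAGS[@]}" "${reward_loop_rm[@]}"  # reward-loop path
```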

recipe/fapo/README.md

Lines changed: 9 additions & 0 deletions
````diff
@@ -78,3 +78,12 @@ bash recipe/fapo/run_fapo_32b.sh # 32b fapo model
 We implement RewardLoop to enable efficient and flexible reward computation.
 The core implementation can be found in `verl/experimental/reward/`.
 Refer to [this official document](https://verl.readthedocs.io/en/latest/advance/reward_loop.html) for more implementation details.
+
+```bibtex
+@article{ding2025fapo,
+  title={FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning},
+  author={Ding, Yuyang and Zhang, Chi and Li, Juntao and Lin, Haibin and Liu, Xin and Zhang, Min},
+  journal={arXiv preprint arXiv:2510.22543},
+  year={2025}
+}
+```
````

recipe/fapo/run_baseline_32b.sh

Lines changed: 0 additions & 5 deletions
```diff
@@ -53,15 +53,10 @@ offload=True
 gen_tp=4
 fsdp_size=32
 
-PROJECT_DIR="$(pwd)"
-CONFIG_PATH="$PROJECT_DIR/recipe/fapo/config"
-
 ray job submit --no-wait --runtime-env="${RUNTIME_ENV}" \
     --address "${RAY_ADDRESS}" \
     --working-dir "${WORKING_DIR}" \
     -- python3 -m verl.trainer.main_ppo \
-    --config-path $CONFIG_PATH \
-    --config-name rm_config.yaml \
     data.train_files="${TRAIN_FILE}" \
     data.val_files="${TEST_FILE}" \
     data.prompt_key=prompt \
```
