Commit b178a3c

[sglang] feat: add NPU GRPO training scripts for Qwen2.5-32B (FSDP/SGLang backends) (verl-project#5062)
### What does this PR do?

Add NPU GRPO training scripts for Qwen2.5-32B (FSDP/SGLang backends). The reward curves for this scenario are also shown.

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`
  - If this PR involves multiple modules, separate them with `,`, like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

<img width="1672" height="965" alt="64b907c5ae7342249588ee2f42a461b0" src="https://github.com/user-attachments/assets/3cf7379e-31dc-4113-8398-ad0381744468" />
<img width="1668" height="962" alt="6a1371943e3847e4b0435c64fd6866da" src="https://github.com/user-attachments/assets/5d2bf9ad-8729-4e1e-9e11-0cf3b46fd47e" />
<img width="1667" height="958" alt="9cf5a7b8f2624822a53ba6b3d6df775b" src="https://github.com/user-attachments/assets/fa285df8-5d7d-4737-b5f6-64f0ee66a8e7" />

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.
### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
- [ ] If your PR is related to the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.
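For reviewers, the resource and batch arithmetic implied by the script's defaults can be sanity-checked with a short sketch. The values mirror the added script; the engine-group formula is an assumption about how devices are partitioned for rollout, not something stated in this PR:

```shell
# Defaults copied from the added script.
NNODES=2
NPUS_PER_NODE=8
gen_tp=4            # rollout tensor parallel size
gen_dp=1            # rollout data parallel size
train_prompt_bsz=32
n_resp_per_prompt=8

world=$((NNODES * NPUS_PER_NODE))
echo "world size: ${world} NPUs"

# The tensor-parallel group size must divide the world size;
# with tp=4 on 16 NPUs this yields 4 engine groups (assuming dp=1).
if (( world % (gen_tp * gen_dp) != 0 )); then
    echo "invalid rollout parallelism" >&2
    exit 1
fi
echo "rollout engine groups: $((world / (gen_tp * gen_dp)))"

# GRPO samples n responses per prompt, so each training step
# generates train_prompt_bsz * n_resp_per_prompt sequences.
echo "sequences per step: $((train_prompt_bsz * n_resp_per_prompt))"
```

With the defaults above this reports a 16-NPU world, 4 rollout engine groups, and 256 generated sequences per step.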
1 parent cc7e283 commit b178a3c

File tree

1 file changed: +182 −0 lines
@@ -0,0 +1,182 @@
```bash
#!/bin/bash
set -xeuo pipefail
mkdir -p logs

# Project Configuration
project_name='GRPO-Qwen2.5-32B-BASE-SGLang'
exp_name='GRPO-Qwen2.5-32B-BASE-FSDP-SGLang'

# Necessary env
export HCCL_CONNECT_TIMEOUT=1500
export HCCL_HOST_SOCKET_PORT_RANGE=60000-60050
export HCCL_NPU_SOCKET_PORT_RANGE=61000-61050

export RAY_EXPERIMENTAL_NOSET_ASCEND_RT_VISIBLE_DEVICES=1
# If each node has 16 NPUs, set ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

export DISABLE_L2_CACHE=1
export TASK_QUEUE_ENABLE=1

# Node Info
NNODES=${NNODES:-2}
NPUS_PER_NODE=${NPUS_PER_NODE:-8}

# Model Weights Paths
MODEL_PATH=Qwen/Qwen2.5-32B
RAY_DATA_HOME=${RAY_DATA_HOME:-"${HOME}/verl"}
CKPTS_DIR=${CKPTS_DIR:-"${RAY_DATA_HOME}/ckpts/${project_name}/${exp_name}"}

# File System Paths
TRAIN_FILE=${TRAIN_FILE:-"${RAY_DATA_HOME}/datasets/deepscaler/train.parquet"}
TEST_FILE=${TEST_FILE:-"${RAY_DATA_HOME}/datasets/deepscaler/test.parquet"}

# Data Configuration
max_prompt_length=$((1024 * 2))
max_response_length=$((1024 * 8))

# Training Batch Configuration
train_prompt_bsz=32
train_prompt_mini_bsz=32
n_resp_per_prompt=8

# Algorithm Configuration
adv_estimator=grpo
use_kl_in_reward=False
kl_coef=0.0
use_kl_loss=True
kl_loss_coef=0.001

# Performance and Memory Management Configuration
all_offload=True
use_dynamic_bsz=False

# SGLang Configuration
gen_tp=4
gen_sp=1
gen_dp=1
gen_ep=1
gpu_memory_utilization=0.5

# Data Configuration
DATA_CONFIG=(
    # File Paths
    data.train_files="${TRAIN_FILE}"
    data.val_files="${TEST_FILE}"
    # Data Structure
    data.prompt_key=prompt
    # Batch and Length Configuration
    data.train_batch_size=${train_prompt_bsz}
    data.max_prompt_length=${max_prompt_length}
    data.max_response_length=${max_response_length}
    # Preprocessing
    data.filter_overlong_prompts=False
    data.truncation='left'
)

# Model Configuration
MODEL_CONFIG=(
    # Model Path
    actor_rollout_ref.model.path="${MODEL_PATH}"
    # Model Processing
    actor_rollout_ref.model.use_remove_padding=True
    actor_rollout_ref.model.enable_gradient_checkpointing=True
)

# Reinforcement Learning Algorithm Configuration
ALGORITHM_CONFIG=(
    # Advantage Estimation
    algorithm.adv_estimator=${adv_estimator}
    # KL Divergence Control
    algorithm.use_kl_in_reward=${use_kl_in_reward}
)

# Actor Model Configuration
ACTOR_CONFIG=(
    # Core Runtime Settings
    actor_rollout_ref.actor.use_torch_compile=False
    actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz}
    # Loss Function Configuration
    actor_rollout_ref.actor.use_kl_loss=${use_kl_loss}
    actor_rollout_ref.actor.kl_loss_type=low_var_kl
    actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef}
    actor_rollout_ref.actor.entropy_coeff=0
    # PPO Training Parameters
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1
    actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz}
    # Optimizer Settings
    actor_rollout_ref.actor.optim.lr=1e-6
    actor_rollout_ref.actor.fsdp_config.param_offload=${all_offload}
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=${all_offload}
)

# Reference Model Configuration
REF_CONFIG=(
    # Core Runtime Settings
    actor_rollout_ref.ref.use_torch_compile=False
    # Log Probability Inference
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1
    actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz}
    # Memory Optimization
    actor_rollout_ref.ref.fsdp_config.param_offload=${all_offload}
)

# Rollout Configuration
ROLLOUT_CONFIG=(
    # Rollout Engine
    actor_rollout_ref.rollout.name=sglang
    +actor_rollout_ref.rollout.engine_kwargs.sglang.attention_backend="ascend"
    # Generation Parameters
    actor_rollout_ref.rollout.n=${n_resp_per_prompt}
    actor_rollout_ref.rollout.top_p=1.0
    actor_rollout_ref.rollout.top_k=-1
    actor_rollout_ref.rollout.temperature=1.0
    # Log Probability Inference
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1
    actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz}
    # Memory Management
    actor_rollout_ref.rollout.gpu_memory_utilization=${gpu_memory_utilization}
    actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp}
    actor_rollout_ref.rollout.data_parallel_size=${gen_dp}
    actor_rollout_ref.rollout.expert_parallel_size=${gen_ep}
    actor_rollout_ref.rollout.enable_chunked_prefill=False
    actor_rollout_ref.rollout.multi_stage_wake_up=True
    # Validation Generation
    actor_rollout_ref.rollout.val_kwargs.n=1
    actor_rollout_ref.rollout.val_kwargs.do_sample=True
    actor_rollout_ref.rollout.val_kwargs.top_p=1.0
    actor_rollout_ref.rollout.val_kwargs.top_k=-1
    actor_rollout_ref.rollout.val_kwargs.temperature=1.0
    actor_rollout_ref.nccl_timeout=1800
)

# Trainer Configuration
TRAINER_CONFIG=(
    trainer.logger='["console"]'
    trainer.project_name="${project_name}"
    trainer.experiment_name="${exp_name}"
    trainer.nnodes="${NNODES}"
    trainer.n_gpus_per_node="${NPUS_PER_NODE}"
    trainer.total_epochs=5
    trainer.val_before_train=False
    trainer.test_freq=-1
    trainer.save_freq=100
    trainer.default_local_dir="${CKPTS_DIR}"
    trainer.critic_warmup=0
)

# Main GRPO Training Command
# The custom reward function for the DeepScaler dataset is configured below
python3 -m verl.trainer.main_ppo \
    --config-path=config \
    --config-name='ppo_trainer.yaml' \
    custom_reward_function.path=recipe/r1_ascend/deepscaler.py \
    custom_reward_function.name=compute_score \
    "${DATA_CONFIG[@]}" \
    "${MODEL_CONFIG[@]}" \
    "${ACTOR_CONFIG[@]}" \
    "${REF_CONFIG[@]}" \
    "${ROLLOUT_CONFIG[@]}" \
    "${ALGORITHM_CONFIG[@]}" \
    "${TRAINER_CONFIG[@]}" \
    "$@" | tee logs/run_qwen2_5-32b_grpo_fsdp_sglang_npu.log
```
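The script groups its configuration into bash arrays that are expanded into Hydra overrides on the final command line. A minimal sketch of that pattern (array name and values are illustrative, not from the PR):

```shell
# Each element is one Hydra override; quoting "${ARR[@]}" keeps
# every element intact, even ones containing spaces or brackets.
DEMO_CONFIG=(
    data.train_batch_size=32
    'trainer.logger=["console"]'
)
printf '%s\n' "${DEMO_CONFIG[@]}"
```

Passing `"$@"` last, as the script does, lets a caller append or override any of these keys at launch time without editing the file.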

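`NNODES`, `NPUS_PER_NODE`, and the dataset/checkpoint paths all use the `${VAR:-default}` idiom, so a launcher can override them from the environment. A small demonstration of the fallback behavior (variable name reused from the script for illustration):

```shell
# ${VAR:-default} keeps a caller-supplied value and otherwise
# falls back to the default written in the script.
unset NNODES
echo "unset -> ${NNODES:-2}"   # prints the default, 2

NNODES=4
echo "set   -> ${NNODES:-2}"   # prints the caller's value, 4
```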
0 commit comments