
Commit e9405d7

[BREAKING][worker, rollout, vllm] feat: implement vLLM colocated training-inference rollout with process separation (verl-project#4280)
### What does this PR do?

Refactor the vLLM co-located training-inference rollout from a single-process to a multi-process architecture. This separates training and inference into different processes, enabling better resource isolation and paving the way for future checkpoint-engine integration (roadmap: verl-project#3624).

**Key Changes:**

- Transform `vLLMAsyncRollout` into `ServerAdapter`, a client-side adapter that communicates with the inference executor
- Remove `ExternalZeroMQDistributedExecutor` and use `MultiprocExecutor` as the inference backend
- Implement CUDA IPC-based weight updates via ZeroMQ for efficient parameter synchronization between training and inference processes

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,`, e.g. `[megatron, fsdp, doc]`
  - `{type}` is one of `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results such as training curve plots, evaluation results, etc.

### API and Usage Example

This refactoring maintains full backward compatibility with existing vLLM rollout APIs. No changes are required to user code.
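To illustrate the CUDA IPC idea mentioned in the key changes, here is a minimal sketch using POSIX shared memory as a stand-in for CUDA device memory: the training side exports a small handle rather than shipping the tensor bytes, and the inference side attaches to the same memory through that handle. The function names are illustrative placeholders, not verl's actual API.

```python
# Handle-passing sketch: shared memory stands in for CUDA IPC.
# Only the handle (the segment name here, a cudaIpcMemHandle_t in the
# real system) would travel over ZeroMQ; the weight bytes never do.
from multiprocessing import shared_memory

def export_weights(data: bytes):
    """Training side: place weights in shared memory, return (segment, handle)."""
    shm = shared_memory.SharedMemory(create=True, size=len(data))
    shm.buf[: len(data)] = data
    return shm, shm.name

def import_weights(handle: str, size: int) -> bytes:
    """Inference side: attach via the handle; no bulk copy over the wire."""
    shm = shared_memory.SharedMemory(name=handle)
    data = bytes(shm.buf[:size])
    shm.close()
    return data

segment, handle = export_weights(b"\x01\x02" * 512)  # pretend parameter bytes
received = import_weights(handle, 1024)              # only the handle crossed
assert received == b"\x01\x02" * 512
segment.close()
segment.unlink()  # free the segment once the update is complete
```

The real mechanism uses `cudaIpcGetMemHandle`/`cudaIpcOpenMemHandle` on device memory, but the control flow — export a handle, send it over ZeroMQ, attach on the inference side — is the same shape.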
**Key API Components:**

- **ServerAdapter** (replaces `vLLMAsyncRollout`):
  - Acts as a client-side adapter for communicating with the inference executor
  - Manages CUDA IPC-based weight updates
  - Provides the same interface as the previous `vLLMAsyncRollout` class

### Design

#### Architecture Overview

**1. Before (Single-Process Architecture)**

- Single-Process Design

  In the original `AsyncActorRolloutRefWorker`, the training engine and inference engine shared the same process. The vLLM inference engine received weight updates directly through parameter passing.

  ![single](https://github.com/user-attachments/assets/c3ff858f-f33e-4eb7-98c5-083c5b679d62)

- Communication Architecture

  `ExternalZeroMQDistributedExecutor` acted as a client, sending instructions to all `AsyncActorRolloutRefWorker` inference engines via ZMQ to execute operations such as `init_worker`, `load_model`, `init_device`, and `generate`. Operations like `wake_up`, `sleep`, and weight updates were executed directly in `vLLMAsyncRollout` without going through `ExternalZeroMQDistributedExecutor`.

  ![single_comm](https://github.com/user-attachments/assets/2be913c0-9b87-4281-bac2-1460e946b702)

**2. After (Multi-Process Architecture)**

- Multi-Process Design

  `vLLMAsyncRollout` becomes `ServerAdapter`, serving as a client that communicates with the inference engine (AsyncLLM). Weight updates are based on CUDA IPC, with handles passed through ZeroMQ to the inference engine.

  ![multi](https://github.com/user-attachments/assets/51102b97-f74b-4cda-8a56-5effd2c64539)

- Communication Architecture

  Deprecate the original `ExternalZeroMQDistributedExecutor` class and directly use vLLM's `MultiprocExecutor` by passing `distributed_executor_backend="mp"`. All inference engine operations are uniformly broadcast to all inference workers through `MultiprocExecutor`'s RPC broadcast MQ.
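The client/server split above can be sketched in a few lines: an adapter on the training side turns method calls into RPC messages, and an executor fans each message out to every inference worker, mimicking the RPC broadcast MQ. All class and method names here are illustrative placeholders, not verl's or vLLM's actual interfaces.

```python
# Sketch of the adapter/broadcast-executor split (single-process simulation).
from dataclasses import dataclass, field

@dataclass
class InferenceWorker:
    """Stands in for one vLLM worker process."""
    rank: int
    log: list = field(default_factory=list)

    def handle(self, op: str, payload: dict) -> None:
        self.log.append((op, payload))

@dataclass
class BroadcastExecutor:
    """Fans every RPC out to all workers, like the 'mp' backend's broadcast MQ."""
    workers: list

    def collective_rpc(self, op: str, payload: dict) -> None:
        for w in self.workers:
            w.handle(op, payload)

class ServerAdapterSketch:
    """Client-side adapter: same surface as the old in-process rollout,
    but every call is forwarded to the executor instead of run locally."""

    def __init__(self, executor: BroadcastExecutor):
        self._executor = executor

    def wake_up(self) -> None:
        self._executor.collective_rpc("wake_up", {})

    def sleep(self) -> None:
        self._executor.collective_rpc("sleep", {})

    def update_weights(self, ipc_handles: dict) -> None:
        # In the real system the payload would carry CUDA IPC handles.
        self._executor.collective_rpc("update_weights", ipc_handles)

workers = [InferenceWorker(rank=r) for r in range(4)]
adapter = ServerAdapterSketch(BroadcastExecutor(workers))
adapter.wake_up()
adapter.update_weights({"bucket": 0})
```

The point of the shape is that `wake_up`, `sleep`, and weight updates now take the same broadcast path as every other operation, instead of some bypassing the executor as in the old design.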
![multi_comm](https://github.com/user-attachments/assets/4a98cba4-89d0-432e-94dd-040a20877363)

### Convergence test

- model: Qwen3-VL-30B-A3B-Instruct
- dataset: geo3k
- GPU: 4*8 H100

<img width="660" height="618" alt="image" src="https://github.com/user-attachments/assets/6e3e7dbd-03f9-471a-b8d5-bc0344dba299" />

### Performance test: update weights

- CUDA IPC bucket_size: 2GB
- GPU: H100, ConnectX-7 400 Gbps (InfiniBand)

| Model | Parallelism | #GPU | Time |
|---|---|---|---|
| Qwen3-VL-30B-A3B-Instruct | TP2, EP8 | 4*8 | 5s |
| DeepSeek-V3.1-Terminus | TP8, PP16, EP8 | 16*8 | 120s |
| DeepSeek-V3.1-Terminus | TP16, PP16 | 32*8 | 80s |

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
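The performance test above uses a 2 GB CUDA IPC bucket size: parameters are packed into fixed-capacity buckets and synchronized one bucket at a time rather than tensor-by-tensor, amortizing per-transfer overhead. A greedy packing sketch (sizes in arbitrary byte units; the function name is hypothetical, not verl's implementation):

```python
# Greedy bucket packing: group named parameters into buckets whose total
# size stays under a cap. A single parameter larger than the cap gets
# its own bucket.
def pack_buckets(param_sizes: dict, bucket_bytes: int) -> list:
    buckets, current, used = [], [], 0
    for name, size in param_sizes.items():
        if current and used + size > bucket_bytes:
            buckets.append(current)  # flush the full bucket
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        buckets.append(current)      # flush the trailing partial bucket
    return buckets

sizes = {"embed": 3, "layer0": 2, "layer1": 2, "head": 1}  # toy sizes
print(pack_buckets(sizes, bucket_bytes=4))  # → [['embed'], ['layer0', 'layer1'], ['head']]
```

A larger bucket means fewer IPC handle exchanges but a bigger staging allocation on the inference side, which is the trade-off the 2 GB setting balances.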
---------

Signed-off-by: jianjunzhong <jianjunzhong@foxmail.com>
Co-authored-by: wuxibin <wuxibin@bytedance.com>
1 parent f31df34 commit e9405d7

37 files changed

Lines changed: 527 additions & 520 deletions

.github/workflows/e2e_ascend.yml

Lines changed: 6 additions & 5 deletions
```diff
@@ -257,8 +257,9 @@ jobs:
       - name: Preprocess gsm8k dataset
         run: |
           python examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/.cache/datasets/openai/gsm8k
-      - name: Running the E2E test with one_step_off_policy algorithm on ASCEND NPU (FSDP2)
-        run: |
-          ray stop --force
-          bash tests/special_npu/run_one_step_off_policy.sh
-          rm -rf $HOME/ckpts
+      # TODO(wuxibin): temporary disable until we refactor with checkpoint engine
+      # - name: Running the E2E test with one_step_off_policy algorithm on ASCEND NPU (FSDP2)
+      #   run: |
+      #     ray stop --force
+      #     bash tests/special_npu/run_one_step_off_policy.sh
+      #     rm -rf $HOME/ckpts
```

.github/workflows/npu_unit_tests.yml

Lines changed: 4 additions & 4 deletions
```diff
@@ -12,7 +12,7 @@
 # - `special_sanity`: a suite of quick sanity tests
 # - `special_standalone`: a set of test that are designed to run in dedicated environments

-# Accelerators for tests
+# Accelerators for tests
 # - By default tests are run with GPU available, except for the ones under `special_npu`, and any test script whose name ends with `on_cpu.py`.
 # - For test scripts with `on_cpu.py` name suffix would be tested on CPU resources in linux environment.

@@ -67,7 +67,7 @@ concurrency:
   cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

 # Declare permissions just read content.
-permissions:
+permissions:
   contents: read

 jobs:
@@ -109,7 +109,7 @@ jobs:
       - name: Run all NPU unit tests
         run: |
           export PYTHONPATH=$PYTHONPATH:/Megatron-LM
-          pytest -s -x --ignore-glob="*test_special_*.py" --ignore-glob="*on_cpu.py" --ignore-glob="*test_vllm*" --ignore-glob="*_sglang*" --ignore-glob="*_hf_rollout*" --ignore-glob="tests/models/" --ignore-glob="tests/special*" --ignore-glob="tests/experimental" --ignore-glob="tests/workers/reward_model" --ignore-glob="*test_rvdz*" --ignore-glob="*test_ray_collectives*" --ignore-glob="*test_nvtx_profile*" --ignore-glob="*test_nccl*" --ignore-glob="*test_nixl*" tests/
+          pytest -s -x --ignore-glob="*test_special_*.py" --ignore-glob="*on_cpu.py" --ignore-glob="*test_vllm*" --ignore-glob="*_sglang*" --ignore-glob="*_hf_rollout*" --ignore-glob="tests/models/" --ignore-glob="tests/special*" --ignore-glob="tests/experimental" --ignore-glob="tests/workers/reward_model" --ignore-glob="*test_rvdz*" --ignore-glob="*test_ray_collectives*" --ignore-glob="*test_nvtx_profile*" --ignore-glob="tests/checkpoint_engine" tests/
       - name: Testing FSDP2 actor functionality
         run: |
           torchrun --standalone --nnodes=1 --nproc-per-node=2 tests/workers/actor/test_special_dp_actor.py
@@ -118,4 +118,4 @@ jobs:
           torchrun --standalone --nnodes=1 --nproc-per-node=2 tests/workers/critic/test_special_dp_critic.py
       - name: Running NPU profiling unit tests
         run: |
-          pytest -s -x tests/utils/test_special_mstx_profile.py
+          pytest -s -x tests/utils/test_special_mstx_profile.py
```
File renamed without changes.
File renamed without changes.

examples/sglang_multiturn/run_qwen2.5-3b_gsm8k_multiturn.sh

Lines changed: 1 addition & 2 deletions
```diff
@@ -63,6 +63,5 @@ python3 -m verl.trainer.main_ppo \
     data.train_files=$HOME/data/gsm8k/train.parquet \
     data.val_files=$HOME/data/gsm8k/test.parquet \
     actor_rollout_ref.rollout.multi_turn.tool_config_path="$PROJECT_DIR/examples/sglang_multiturn/config/tool_config/gsm8k_tool_config.yaml" \
-    trainer.total_epochs=15 \
-    actor_rollout_ref.rollout.update_weights_bucket_megabytes=512 $@
+    trainer.total_epochs=15 $@
```

examples/sglang_multiturn/run_qwen2.5-3b_gsm8k_multiturn_server.sh

Lines changed: 1 addition & 2 deletions
```diff
@@ -58,6 +58,5 @@ python3 -m verl.trainer.main_ppo \
     data.train_files=$HOME/data/gsm8k/train.parquet \
     data.val_files=$HOME/data/gsm8k/test.parquet \
     actor_rollout_ref.rollout.multi_turn.tool_config_path="$PROJECT_DIR/examples/sglang_multiturn/config/tool_config/gsm8k_tool_config.yaml" \
-    trainer.total_epochs=15 \
-    actor_rollout_ref.rollout.update_weights_bucket_megabytes=512 $@
+    trainer.total_epochs=15 $@
```

examples/sglang_multiturn/run_qwen2.5-3b_gsm8k_multiturn_vllm_fsdp.sh

Lines changed: 0 additions & 1 deletion
```diff
@@ -50,7 +50,6 @@ python3 -m verl.trainer.main_ppo \
     data.train_files=$HOME/data/gsm8k/train.parquet \
     data.val_files=$HOME/data/gsm8k/test.parquet \
     trainer.total_epochs=15 \
-    actor_rollout_ref.rollout.update_weights_bucket_megabytes=512 \
     actor_rollout_ref.rollout.trace.token2text=False \
     actor_rollout_ref.rollout.mode=async \
     actor_rollout_ref.rollout.multi_turn.enable=true \
```

examples/sglang_multiturn/run_qwen3_4b_dapo_multiturn.sh

Lines changed: 0 additions & 1 deletion
```diff
@@ -76,7 +76,6 @@ python3 -m verl.trainer.main_ppo \
     actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
-    actor_rollout_ref.rollout.update_weights_bucket_megabytes=512 \
     actor_rollout_ref.rollout.gpu_memory_utilization=0.85 \
     actor_rollout_ref.rollout.multi_stage_wake_up=True \
     actor_rollout_ref.rollout.multi_turn.enable=True \
```

tests/experimental/agent_loop/test_standalone_rollout.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -52,6 +52,7 @@ async def test_standalone_rollout(init_config, tp_size):
             "NCCL_DEBUG": "WARN",
             "VLLM_LOGGING_LEVEL": "INFO",
             "VLLM_USE_V1": "1",
+            "NCCL_P2P_DISABLE": "1",  # disable p2p in L20
         }
     }
 )
```

tests/special_e2e/ppo_trainer/run_function_reward.sh

Lines changed: 1 addition & 1 deletion
```diff
@@ -21,7 +21,7 @@ ROLLOUT_MODE="async"
 RETURN_RAW_CHAT="True"
 SKIP_TOKENIZER_INIT="True"

-GPU_MEMORY_UTILIZATION=${GPU_MEMORY_UTILIZATION:-0.8}
+GPU_MEMORY_UTILIZATION=${GPU_MEMORY_UTILIZATION:-0.7}
 ACTOR_FSDP_PARAM_OFFLOAD=${ACTOR_FSDP_PARAM_OFFLOAD:-False}
 ACTOR_FSDP_OPTIMIZER_OFFLOAD=${ACTOR_FSDP_OPTIMIZER_OFFLOAD:-False}
 REF_FSDP_PARAM_OFFLOAD=${REF_FSDP_PARAM_OFFLOAD:-True}
```
