
Commit e9405d7

[BREAKING][worker, rollout, vllm] feat: implement vLLM colocated training-inference rollout with process separation (verl-project#4280)
### What does this PR do?

Refactor the vLLM co-located training-inference rollout from a single-process to a multi-process architecture. This separates training and inference into different processes, enabling better resource isolation and paving the way for future checkpoint-engine integration (roadmap: verl-project#3624).

**Key Changes:**

- Transform `vLLMAsyncRollout` into `ServerAdapter`, a client-side adapter that communicates with the inference executor
- Remove `ExternalZeroMQDistributedExecutor` and use `MultiprocExecutor` as the inference backend
- Implement CUDA IPC-based weight updates via ZeroMQ for efficient parameter synchronization between training and inference processes

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,`, e.g. `[megatron, fsdp, doc]`
  - `{type}` is one of `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results such as training curve plots, evaluation results, etc.

### API and Usage Example

This refactoring maintains full backward compatibility with existing vLLM rollout APIs. No changes are required to user code.
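To illustrate the CUDA IPC idea mentioned in the key changes, here is a minimal sketch using POSIX shared memory as a stand-in for CUDA device memory: the training side exports a small handle rather than shipping the tensor bytes, and the inference side attaches to the same memory through that handle. The function names are illustrative placeholders, not verl's actual API.

```python
# Handle-passing sketch: shared memory stands in for CUDA IPC.
# Only the handle (the segment name here, a cudaIpcMemHandle_t in the
# real system) would travel over ZeroMQ; the weight bytes never do.
from multiprocessing import shared_memory

def export_weights(data: bytes):
    """Training side: place weights in shared memory, return (segment, handle)."""
    shm = shared_memory.SharedMemory(create=True, size=len(data))
    shm.buf[: len(data)] = data
    return shm, shm.name

def import_weights(handle: str, size: int) -> bytes:
    """Inference side: attach via the handle; no bulk copy over the wire."""
    shm = shared_memory.SharedMemory(name=handle)
    data = bytes(shm.buf[:size])
    shm.close()
    return data

segment, handle = export_weights(b"\x01\x02" * 512)  # pretend parameter bytes
received = import_weights(handle, 1024)              # only the handle crossed
assert received == b"\x01\x02" * 512
segment.close()
segment.unlink()  # free the segment once the update is complete
```

The real mechanism uses `cudaIpcGetMemHandle`/`cudaIpcOpenMemHandle` on device memory, but the control flow — export a handle, send it over ZeroMQ, attach on the inference side — is the same shape.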
**Key API Components:**

- **ServerAdapter** (replaces `vLLMAsyncRollout`):
  - Acts as a client-side adapter for communicating with the inference executor
  - Manages CUDA IPC-based weight updates
  - Provides the same interface as the previous `vLLMAsyncRollout` class

### Design

#### Architecture Overview

**1. Before (Single-Process Architecture)**

- Single-Process Design

  In the original `AsyncActorRolloutRefWorker`, the training engine and inference engine shared the same process. The vLLM inference engine received weight updates directly through parameter passing.

  ![single](https://github.com/user-attachments/assets/c3ff858f-f33e-4eb7-98c5-083c5b679d62)

- Communication Architecture

  `ExternalZeroMQDistributedExecutor` acted as a client, sending instructions to all `AsyncActorRolloutRefWorker` inference engines via ZMQ to execute operations such as `init_worker`, `load_model`, `init_device`, and `generate`. Operations like `wake_up`, `sleep`, and weight updates were executed directly in `vLLMAsyncRollout` without going through `ExternalZeroMQDistributedExecutor`.

  ![single_comm](https://github.com/user-attachments/assets/2be913c0-9b87-4281-bac2-1460e946b702)

**2. After (Multi-Process Architecture)**

- Multi-Process Design

  `vLLMAsyncRollout` becomes `ServerAdapter`, serving as a client that communicates with the inference engine (AsyncLLM). Weight updates are based on CUDA IPC, with handles passed through ZeroMQ to the inference engine.

  ![multi](https://github.com/user-attachments/assets/51102b97-f74b-4cda-8a56-5effd2c64539)

- Communication Architecture

  Deprecate the original `ExternalZeroMQDistributedExecutor` class and directly use vLLM's `MultiprocExecutor` by passing `distributed_executor_backend="mp"`. All inference engine operations are uniformly broadcast to all inference workers through `MultiprocExecutor`'s RPC broadcast MQ.
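The client/server split above can be sketched in a few lines: an adapter on the training side turns method calls into RPC messages, and an executor fans each message out to every inference worker, mimicking the RPC broadcast MQ. All class and method names here are illustrative placeholders, not verl's or vLLM's actual interfaces.

```python
# Sketch of the adapter/broadcast-executor split (single-process simulation).
from dataclasses import dataclass, field

@dataclass
class InferenceWorker:
    """Stands in for one vLLM worker process."""
    rank: int
    log: list = field(default_factory=list)

    def handle(self, op: str, payload: dict) -> None:
        self.log.append((op, payload))

@dataclass
class BroadcastExecutor:
    """Fans every RPC out to all workers, like the 'mp' backend's broadcast MQ."""
    workers: list

    def collective_rpc(self, op: str, payload: dict) -> None:
        for w in self.workers:
            w.handle(op, payload)

class ServerAdapterSketch:
    """Client-side adapter: same surface as the old in-process rollout,
    but every call is forwarded to the executor instead of run locally."""

    def __init__(self, executor: BroadcastExecutor):
        self._executor = executor

    def wake_up(self) -> None:
        self._executor.collective_rpc("wake_up", {})

    def sleep(self) -> None:
        self._executor.collective_rpc("sleep", {})

    def update_weights(self, ipc_handles: dict) -> None:
        # In the real system the payload would carry CUDA IPC handles.
        self._executor.collective_rpc("update_weights", ipc_handles)

workers = [InferenceWorker(rank=r) for r in range(4)]
adapter = ServerAdapterSketch(BroadcastExecutor(workers))
adapter.wake_up()
adapter.update_weights({"bucket": 0})
```

The point of the shape is that `wake_up`, `sleep`, and weight updates now take the same broadcast path as every other operation, instead of some bypassing the executor as in the old design.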
![multi_comm](https://github.com/user-attachments/assets/4a98cba4-89d0-432e-94dd-040a20877363)

### Convergence test

- model: Qwen3-VL-30B-A3B-Instruct
- dataset: geo3k
- GPU: 4*8 H100

<img width="660" height="618" alt="image" src="https://github.com/user-attachments/assets/6e3e7dbd-03f9-471a-b8d5-bc0344dba299" />

### Performance test: update weights

- CUDA IPC bucket_size: 2GB
- GPU: H100, ConnectX-7 400 Gbps (InfiniBand)

| Model | Parallelism | #GPU | Time |
|---|---|---|---|
| Qwen3-VL-30B-A3B-Instruct | TP2, EP8 | 4*8 | 5s |
| DeepSeek-V3.1-Terminus | TP8, PP16, EP8 | 16*8 | 120s |
| DeepSeek-V3.1-Terminus | TP16, PP16 | 32*8 | 80s |

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
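The performance test above uses a 2 GB CUDA IPC bucket size: parameters are packed into fixed-capacity buckets and synchronized one bucket at a time rather than tensor-by-tensor, amortizing per-transfer overhead. A greedy packing sketch (sizes in arbitrary byte units; the function name is hypothetical, not verl's implementation):

```python
# Greedy bucket packing: group named parameters into buckets whose total
# size stays under a cap. A single parameter larger than the cap gets
# its own bucket.
def pack_buckets(param_sizes: dict, bucket_bytes: int) -> list:
    buckets, current, used = [], [], 0
    for name, size in param_sizes.items():
        if current and used + size > bucket_bytes:
            buckets.append(current)  # flush the full bucket
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        buckets.append(current)      # flush the trailing partial bucket
    return buckets

sizes = {"embed": 3, "layer0": 2, "layer1": 2, "head": 1}  # toy sizes
print(pack_buckets(sizes, bucket_bytes=4))  # → [['embed'], ['layer0', 'layer1'], ['head']]
```

A larger bucket means fewer IPC handle exchanges but a bigger staging allocation on the inference side, which is the trade-off the 2 GB setting balances.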
---------

Signed-off-by: jianjunzhong <jianjunzhong@foxmail.com>
Co-authored-by: wuxibin <wuxibin@bytedance.com>
1 parent f31df34 commit e9405d7

37 files changed

Lines changed: 527 additions & 520 deletions

.github/workflows/e2e_ascend.yml

Lines changed: 6 additions & 5 deletions
```diff
@@ -257,8 +257,9 @@ jobs:
       - name: Preprocess gsm8k dataset
         run: |
           python examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/.cache/datasets/openai/gsm8k
-      - name: Running the E2E test with one_step_off_policy algorithm on ASCEND NPU (FSDP2)
-        run: |
-          ray stop --force
-          bash tests/special_npu/run_one_step_off_policy.sh
-          rm -rf $HOME/ckpts
+      # TODO(wuxibin): temporary disable until we refactor with checkpoint engine
+      # - name: Running the E2E test with one_step_off_policy algorithm on ASCEND NPU (FSDP2)
+      #   run: |
+      #     ray stop --force
+      #     bash tests/special_npu/run_one_step_off_policy.sh
+      #     rm -rf $HOME/ckpts
```

.github/workflows/npu_unit_tests.yml

Lines changed: 4 additions & 4 deletions
```diff
@@ -12,7 +12,7 @@
 # - `special_sanity`: a suite of quick sanity tests
 # - `special_standalone`: a set of test that are designed to run in dedicated environments

-# Accelerators for tests
+# Accelerators for tests
 # - By default tests are run with GPU available, except for the ones under `special_npu`, and any test script whose name ends with `on_cpu.py`.
 # - For test scripts with `on_cpu.py` name suffix would be tested on CPU resources in linux environment.

@@ -67,7 +67,7 @@ concurrency:
   cancel-in-progress: ${{ github.ref != 'refs/heads/main' }}

 # Declare permissions just read content.
-permissions:
+permissions:
   contents: read

 jobs:
@@ -109,7 +109,7 @@ jobs:
       - name: Run all NPU unit tests
         run: |
           export PYTHONPATH=$PYTHONPATH:/Megatron-LM
-          pytest -s -x --ignore-glob="*test_special_*.py" --ignore-glob="*on_cpu.py" --ignore-glob="*test_vllm*" --ignore-glob="*_sglang*" --ignore-glob="*_hf_rollout*" --ignore-glob="tests/models/" --ignore-glob="tests/special*" --ignore-glob="tests/experimental" --ignore-glob="tests/workers/reward_model" --ignore-glob="*test_rvdz*" --ignore-glob="*test_ray_collectives*" --ignore-glob="*test_nvtx_profile*" --ignore-glob="*test_nccl*" --ignore-glob="*test_nixl*" tests/
+          pytest -s -x --ignore-glob="*test_special_*.py" --ignore-glob="*on_cpu.py" --ignore-glob="*test_vllm*" --ignore-glob="*_sglang*" --ignore-glob="*_hf_rollout*" --ignore-glob="tests/models/" --ignore-glob="tests/special*" --ignore-glob="tests/experimental" --ignore-glob="tests/workers/reward_model" --ignore-glob="*test_rvdz*" --ignore-glob="*test_ray_collectives*" --ignore-glob="*test_nvtx_profile*" --ignore-glob="tests/checkpoint_engine" tests/
       - name: Testing FSDP2 actor functionality
         run: |
           torchrun --standalone --nnodes=1 --nproc-per-node=2 tests/workers/actor/test_special_dp_actor.py
@@ -118,4 +118,4 @@ jobs:
           torchrun --standalone --nnodes=1 --nproc-per-node=2 tests/workers/critic/test_special_dp_critic.py
       - name: Running NPU profiling unit tests
         run: |
-          pytest -s -x tests/utils/test_special_mstx_profile.py
+          pytest -s -x tests/utils/test_special_mstx_profile.py
```
File renamed without changes.
File renamed without changes.

examples/sglang_multiturn/run_qwen2.5-3b_gsm8k_multiturn.sh

Lines changed: 1 addition & 2 deletions
```diff
@@ -63,6 +63,5 @@ python3 -m verl.trainer.main_ppo \
     data.train_files=$HOME/data/gsm8k/train.parquet \
     data.val_files=$HOME/data/gsm8k/test.parquet \
     actor_rollout_ref.rollout.multi_turn.tool_config_path="$PROJECT_DIR/examples/sglang_multiturn/config/tool_config/gsm8k_tool_config.yaml" \
-    trainer.total_epochs=15 \
-    actor_rollout_ref.rollout.update_weights_bucket_megabytes=512 $@
+    trainer.total_epochs=15 $@
```

examples/sglang_multiturn/run_qwen2.5-3b_gsm8k_multiturn_server.sh

Lines changed: 1 addition & 2 deletions
```diff
@@ -58,6 +58,5 @@ python3 -m verl.trainer.main_ppo \
     data.train_files=$HOME/data/gsm8k/train.parquet \
     data.val_files=$HOME/data/gsm8k/test.parquet \
     actor_rollout_ref.rollout.multi_turn.tool_config_path="$PROJECT_DIR/examples/sglang_multiturn/config/tool_config/gsm8k_tool_config.yaml" \
-    trainer.total_epochs=15 \
-    actor_rollout_ref.rollout.update_weights_bucket_megabytes=512 $@
+    trainer.total_epochs=15 $@
```

examples/sglang_multiturn/run_qwen2.5-3b_gsm8k_multiturn_vllm_fsdp.sh

Lines changed: 0 additions & 1 deletion
```diff
@@ -50,7 +50,6 @@ python3 -m verl.trainer.main_ppo \
     data.train_files=$HOME/data/gsm8k/train.parquet \
     data.val_files=$HOME/data/gsm8k/test.parquet \
     trainer.total_epochs=15 \
-    actor_rollout_ref.rollout.update_weights_bucket_megabytes=512 \
     actor_rollout_ref.rollout.trace.token2text=False \
     actor_rollout_ref.rollout.mode=async \
     actor_rollout_ref.rollout.multi_turn.enable=true \
```

examples/sglang_multiturn/run_qwen3_4b_dapo_multiturn.sh

Lines changed: 0 additions & 1 deletion
```diff
@@ -76,7 +76,6 @@ python3 -m verl.trainer.main_ppo \
     actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
     actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
-    actor_rollout_ref.rollout.update_weights_bucket_megabytes=512 \
     actor_rollout_ref.rollout.gpu_memory_utilization=0.85 \
     actor_rollout_ref.rollout.multi_stage_wake_up=True \
     actor_rollout_ref.rollout.multi_turn.enable=True \
```

tests/experimental/agent_loop/test_standalone_rollout.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -52,6 +52,7 @@ async def test_standalone_rollout(init_config, tp_size):
             "NCCL_DEBUG": "WARN",
             "VLLM_LOGGING_LEVEL": "INFO",
             "VLLM_USE_V1": "1",
+            "NCCL_P2P_DISABLE": "1",  # disable p2p in L20
         }
     }
 )
```

tests/special_e2e/ppo_trainer/run_function_reward.sh

Lines changed: 1 addition & 1 deletion
```diff
@@ -21,7 +21,7 @@ ROLLOUT_MODE="async"
 RETURN_RAW_CHAT="True"
 SKIP_TOKENIZER_INIT="True"

-GPU_MEMORY_UTILIZATION=${GPU_MEMORY_UTILIZATION:-0.8}
+GPU_MEMORY_UTILIZATION=${GPU_MEMORY_UTILIZATION:-0.7}
 ACTOR_FSDP_PARAM_OFFLOAD=${ACTOR_FSDP_PARAM_OFFLOAD:-False}
 ACTOR_FSDP_OPTIMIZER_OFFLOAD=${ACTOR_FSDP_OPTIMIZER_OFFLOAD:-False}
 REF_FSDP_PARAM_OFFLOAD=${REF_FSDP_PARAM_OFFLOAD:-True}
```
