
Commit 365df24

xiaohong42root and root authored
[fully_async, rollout, trainer, tool, cfg] fix: ROCm async training compatibility for AMD MI300X (#6062)
## What does this PR do?

Fix multiple issues that prevent fully async FSDP2 training from working on AMD ROCm platforms (MI300X series).

**Environment:**

- AMD Instinct MI3xx (8× GPU, 192 GB HBM each), ROCm 7.2, PyTorch 2.10+rocm7.2, vLLM v0.18.1rc1
- Cross-validated on NVIDIA H20 with CUDA (no regression observed)

**Training curves (MI3xx vs H20) and the training script will be attached in PR comments.**

[dapo_7b_fully_async.sh](https://github.com/user-attachments/files/26700697/dapo_7b_fully_async.sh)

<img width="2234" height="1181" alt="qwen2 5_7b_fully_async_1" src="https://github.com/user-attachments/assets/81bd6651-9f1e-4450-b8b9-68149264536f" />

### Checklist Before Starting

- [x] Search for similar PRs: https://github.com/verl-project/verl/pulls?q=is%3Apr+rocm+async
- [x] Format: `[{modules}] {type}: {description}`

### Test

Validated by fully async FSDP2 DAPO/GRPO RL + ReTool training on AMD MI3xx:

- 140+ training steps completed without errors, deadlocks, or OOM
- Reward improved from -0.8 to 0 over 12+ hours of training
- Cross-validated on NVIDIA H20: all changes are platform-safe (AMD-specific env vars are ignored on NVIDIA/Ascend; ZMQ handle changes use platform-independent rank logic)

### API and Usage Example

No API changes. All fixes are internal implementation details.

### Design & Code Changes

1. **Add `HSA_NO_SCRATCH_RECLAIM` env var** (`constants_ppo.py`)
   - Required by AMD RCCL on MI300X; without it, FSDP initialization fails with `ncclSystemError`
   - Added alongside the existing platform-specific vars (HCCL for Ascend); ignored on non-AMD platforms
2. **Fix `numpy.bool_` JSON serialization** (`ray_trainer.py`)
   - Add a `default=str` fallback to `json.dumps`, since numpy 2.x `bool_` is not a Python `bool` subclass
3. **ZMQ IPC handle: use `(replica_rank, local_rank)` instead of GPU UUID** (`vllm_rollout.py`, `utils.py`, `vllm_async_server.py`)
   - On ROCm, `CheckpointEngineWorker` and the vLLM worker see different GPU UUIDs due to different `CUDA_VISIBLE_DEVICES`/`HIP_VISIBLE_DEVICES` settings
   - The sender uses `rollout_rank % local_world_size` to derive the node-local rank, matching the vLLM worker's `local_rank` on every node (fixes the multi-node mismatch)
   - The `replica_rank` prefix avoids socket collisions when multiple replicas share a node
   - The `VERL_REPLICA_RANK` env var is set in `vLLMHttpServer.__init__` and inherited by vLLM worker subprocesses
   - Both `vLLMColocateWorkerExtension` and `vLLMOmniColocateWorkerExtension` are updated
4. **Clean up stale ZMQ IPC socket files** (`bucketed_weight_transfer.py`)
   - Remove leftover `.sock` files before `bind()` and after `_cleanup()` to prevent `Address already in use` on restart
5. **Fix Hydra searchpath** (`fully_async_ppo_trainer.yaml`)
   - Use `pkg://verl.trainer.config` instead of `file://verl/trainer/config` so the config resolves under editable installs
6. **Sandbox Ray actor reuse** (`sandbox_fusion_tools.py`)
   - Add `name` and `get_if_exists=True` to prevent duplicate `ExecutionWorker` actor creation

### Platform Compatibility

All changes are safe for non-AMD platforms:

- `HSA_NO_SCRATCH_RECLAIM`: AMD-specific env var, silently ignored on NVIDIA/Ascend
- ZMQ handle changes use platform-independent rank arithmetic; `VERL_REPLICA_RANK` defaults to `"0"` for single-replica setups
- The other changes (json `default=str`, IPC cleanup, Hydra `pkg://`, actor reuse) are pure logic improvements

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / update [the documentation](https://github.com/volcengine/verl/tree/main/docs). The official documents will be compiled after the merge.
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: these fixes target ROCm-specific runtime behavior (HIP memory management, RCCL env vars, GPU UUID mismatch) that cannot be reproduced in CI without AMD GPU hardware.
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
- [x] If your PR is related to the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.

---------

Co-authored-by: root <root@mi308-ccs-aus-e04-40.prov.aus.ccs.cpe.ice.amd.com>
1 parent f2b1c98 commit 365df24

8 files changed

Lines changed: 39 additions & 12 deletions


verl/experimental/fully_async_policy/config/fully_async_ppo_trainer.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,6 +1,6 @@
 hydra:
   searchpath:
-    - file://verl/trainer/config
+    - pkg://verl.trainer.config
 
 defaults:
   - ppo_trainer
```

verl/tools/sandbox_fusion_tools.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -90,7 +90,7 @@ def init_execution_pool(
     if mode == PoolMode.ThreadMode:
         return (
             ray.remote(ExecutionWorker)
-            .options(max_concurrency=num_workers)
+            .options(name="sandbox-execution-pool", get_if_exists=True, max_concurrency=num_workers)
             .remote(enable_global_rate_limit=enable_global_rate_limit, rate_limit=rate_limit)
         )
     else:
```
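The `name` + `get_if_exists=True` combination makes actor creation idempotent: the first caller creates the named actor, later callers get a handle to the same one. A minimal sketch of that get-or-create semantics, using a plain dict registry in place of Ray's internal actor name registry (this is an illustration of the contract, not Ray's implementation):

```python
# Hypothetical stand-in for Ray's named-actor registry.
_registry: dict[str, object] = {}

def get_or_create(name: str, factory):
    """Return the object registered under `name`, creating it exactly once."""
    if name not in _registry:
        _registry[name] = factory()
    return _registry[name]

calls = []
pool_a = get_or_create("sandbox-execution-pool", lambda: calls.append(1) or object())
pool_b = get_or_create("sandbox-execution-pool", lambda: calls.append(1) or object())
assert pool_a is pool_b   # second caller reuses the first instance
assert len(calls) == 1    # the factory ran exactly once
```

Without this, every tool instance spawning its own `ExecutionWorker` would duplicate rate-limit state and waste resources.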

verl/trainer/constants_ppo.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -31,6 +31,7 @@
         # https://www.hiascend.com/document/detail/zh/canncommercial/83RC1/maintenref/envvar/envref_07_0143.html
         "HCCL_HOST_SOCKET_PORT_RANGE": "auto",
         "HCCL_NPU_SOCKET_PORT_RANGE": "auto",
+        "HSA_NO_SCRATCH_RECLAIM": "1",
     },
 }
```
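These platform-specific entries work as harmless defaults: a key only matters on the platform whose runtime reads it, and user-set values should win. A hedged sketch of that merge behavior (the `merged_env` helper is illustrative, not verl's actual runtime-env plumbing):

```python
# Illustrative dict mirroring the platform-specific entries in constants_ppo.py.
PLATFORM_ENV = {
    "HCCL_HOST_SOCKET_PORT_RANGE": "auto",   # Ascend HCCL, ignored elsewhere
    "HCCL_NPU_SOCKET_PORT_RANGE": "auto",    # Ascend HCCL, ignored elsewhere
    "HSA_NO_SCRATCH_RECLAIM": "1",           # AMD RCCL on MI300X, ignored elsewhere
}

def merged_env(base: dict) -> dict:
    """Hypothetical merge: apply platform defaults without clobbering user settings."""
    env = dict(base)
    for key, value in PLATFORM_ENV.items():
        env.setdefault(key, value)
    return env

assert merged_env({})["HSA_NO_SCRATCH_RECLAIM"] == "1"
assert merged_env({"HSA_NO_SCRATCH_RECLAIM": "0"})["HSA_NO_SCRATCH_RECLAIM"] == "0"
```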

verl/trainer/ppo/ray_trainer.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -422,7 +422,7 @@ def _dump_generations(self, inputs, outputs, gts, scores, reward_extra_infos_dic
         lines = []
         for i in range(n):
             entry = {k: v[i] for k, v in base_data.items()}
-            lines.append(json.dumps(entry, ensure_ascii=False))
+            lines.append(json.dumps(entry, ensure_ascii=False, default=str))
 
         with open(filename, "w") as f:
             f.write("\n".join(lines) + "\n")
```
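To see why the fallback is needed: `numpy.bool_` is not a subclass of Python's `bool`, so the stock JSON encoder rejects it, while `default=str` stringifies any value the encoder cannot handle natively. A minimal reproduction (the `entry` dict is made up for illustration):

```python
import json
import numpy as np

entry = {"prompt": "2+2=", "correct": np.bool_(True)}

# Without a fallback, the numpy scalar raises TypeError
# ("Object of type bool_ is not JSON serializable").
try:
    json.dumps(entry)
    raised = False
except TypeError:
    raised = True
assert raised

# default=str routes any non-serializable value through str().
line = json.dumps(entry, ensure_ascii=False, default=str)
assert line == '{"prompt": "2+2=", "correct": "True"}'
```

The trade-off is that such values land in the dump as strings (`"True"` rather than `true`), which is acceptable for human-readable generation logs.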

verl/workers/rollout/vllm_rollout/bucketed_weight_transfer.py

Lines changed: 12 additions & 0 deletions
```diff
@@ -155,6 +155,12 @@ async def async_send_weights(self, weights):
 
     def _init_socket(self):
         """Initialize ZMQ REQ socket and bind."""
+        if self.zmq_handle.startswith("ipc://"):
+            ipc_path = self.zmq_handle[len("ipc://") :]
+            try:
+                os.remove(ipc_path)
+            except OSError:
+                pass
         self.socket = self.zmq_context.socket(zmq.REQ)
         self.socket.bind(self.zmq_handle)
 
@@ -185,6 +191,12 @@ def _cleanup(self):
         if self.socket is not None:
             self.socket.close()
             self.socket = None
+        if self.zmq_handle.startswith("ipc://"):
+            ipc_path = self.zmq_handle[len("ipc://") :]
+            try:
+                os.remove(ipc_path)
+            except OSError:
+                pass
         del self.buffer
         self.buffer = None
         if self.shm is not None:
```
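ZMQ `ipc://` transports are backed by a filesystem socket file, and a crashed run can leave that file behind, so the next `bind()` fails with `Address already in use`. The cleanup logic above can be isolated as a small standalone helper (the function name and demo path here are illustrative, not verl identifiers):

```python
import os
import tempfile

def remove_stale_ipc_socket(zmq_handle: str) -> None:
    """Best-effort removal of a leftover ipc:// socket file before bind()."""
    if zmq_handle.startswith("ipc://"):
        try:
            os.remove(zmq_handle[len("ipc://"):])
        except OSError:
            pass  # no stale file, or already removed: both are fine

# Simulate a stale socket file left by a crashed run.
stale = os.path.join(tempfile.gettempdir(), "rl-colocate-demo.sock")
open(stale, "w").close()
remove_stale_ipc_socket(f"ipc://{stale}")
assert not os.path.exists(stale)
remove_stale_ipc_socket(f"ipc://{stale}")  # second call is a harmless no-op
```

Running it both before `bind()` and after `_cleanup()` makes restarts idempotent from either direction.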

verl/workers/rollout/vllm_rollout/utils.py

Lines changed: 14 additions & 8 deletions
```diff
@@ -264,10 +264,13 @@ def _update_weights(self, weights: list[tuple[str, torch.Tensor]], peft_config:
         self.model_runner.model.load_weights(weights)
 
     def _get_zmq_handle(self) -> str:
-        """Get ZMQ handle for communication."""
-        if not hasattr(self, "device_uuid") or not self.device_uuid:
-            self.device_uuid = get_device_uuid(self.device.index)
-        return f"ipc:///tmp/rl-colocate-zmq-{self.device_uuid}.sock"
+        """Get ZMQ handle for communication.
+        Uses replica_rank + local_rank to form handle so it matches the sender side
+        regardless of CUDA_VISIBLE_DEVICES differences, and avoids collisions
+        when multiple replicas share the same node.
+        """
+        replica_rank = os.environ.get("VERL_REPLICA_RANK", "0")
+        return f"ipc:///tmp/rl-colocate-zmq-replica-{replica_rank}-rank-{self.local_rank}.sock"
 
 
 class vLLMOmniColocateWorkerExtension(_OmniWorkerBase):
@@ -330,10 +333,13 @@ def _update_weights(self, weights: list[tuple[str, torch.Tensor]], peft_config:
         self.load_weights(weights)
 
     def _get_zmq_handle(self) -> str:
-        """Get ZMQ handle for communication."""
-        if not hasattr(self, "device_uuid") or not self.device_uuid:
-            self.device_uuid = get_device_uuid(self.device.index)
-        return f"ipc:///tmp/rl-colocate-zmq-{self.device_uuid}.sock"
+        """Get ZMQ handle for communication.
+        Uses replica_rank + local_rank to form handle so it matches the sender side
+        regardless of CUDA_VISIBLE_DEVICES differences, and avoids collisions
+        when multiple replicas share the same node.
+        """
+        replica_rank = os.environ.get("VERL_REPLICA_RANK", "0")
+        return f"ipc:///tmp/rl-colocate-zmq-replica-{replica_rank}-rank-{self.local_rank}.sock"
 
 
 class SuppressSignalInThread:
```

verl/workers/rollout/vllm_rollout/vllm_async_server.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -108,6 +108,7 @@ def __init__(
             cuda_visible_devices (str): cuda visible devices.
         """
         os.environ[get_visible_devices_keyword()] = cuda_visible_devices
+        os.environ["VERL_REPLICA_RANK"] = str(replica_rank)
 
         self.config = self._init_config(config)
         self.model_config = self._init_model_config(model_config)
```

verl/workers/rollout/vllm_rollout/vllm_rollout.py

Lines changed: 8 additions & 1 deletion
```diff
@@ -95,7 +95,14 @@ def __init__(
         self.sleep_level = VLLM_SLEEP_LEVEL
 
         self.device_uuid = get_device_uuid(get_device_id())
-        self.zmq_handle = f"ipc:///tmp/rl-colocate-zmq-{self.device_uuid}.sock"
+        # Use replica_rank + node-local rank to form ZMQ handle instead of GPU UUID,
+        # because CheckpointEngineWorker and vLLM worker may see different GPU UUIDs
+        # when CUDA_VISIBLE_DEVICES differs between processes (common on ROCm/AMD).
+        # Must use node-local rank (not rollout_rank) so it matches vLLM worker's
+        # local_rank on every node. Include replica_rank to avoid collisions when
+        # multiple replicas share a node.
+        local_rank = self.rollout_rank % local_world_size
+        self.zmq_handle = f"ipc:///tmp/rl-colocate-zmq-replica-{self.replica_rank}-rank-{local_rank}.sock"
 
         self.use_shm = not is_support_ipc()
         if self.use_shm:
```
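The fix works because both endpoints now derive the socket path from pure rank arithmetic, which is identical across processes even when their device visibility differs. A self-contained sketch of the two derivations (function names are illustrative; the path format and the `VERL_REPLICA_RANK` default of `"0"` match the diffs above):

```python
import os

def sender_zmq_handle(replica_rank: int, rollout_rank: int, local_world_size: int) -> str:
    # Node-local rank: the value the vLLM worker sees as local_rank on that node.
    local_rank = rollout_rank % local_world_size
    return f"ipc:///tmp/rl-colocate-zmq-replica-{replica_rank}-rank-{local_rank}.sock"

def worker_zmq_handle(local_rank: int) -> str:
    # The worker reads its replica rank from the env var set by vLLMHttpServer.
    replica_rank = os.environ.get("VERL_REPLICA_RANK", "0")
    return f"ipc:///tmp/rl-colocate-zmq-replica-{replica_rank}-rank-{local_rank}.sock"

# Second node of an 8-GPU-per-node replica: global rollout rank 9 maps to
# node-local rank 1, so sender and worker agree on the socket path even though
# their CUDA_VISIBLE_DEVICES (and hence visible GPU UUIDs) differ.
os.environ["VERL_REPLICA_RANK"] = "0"
assert sender_zmq_handle(0, 9, 8) == worker_zmq_handle(1)
```

GPU-UUID-based paths break exactly here: each process's UUID lookup depends on its own `CUDA_VISIBLE_DEVICES`/`HIP_VISIBLE_DEVICES`, while rank arithmetic is visibility-independent.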
